Basic Info

This package considers the estimation and statistical inference of high-dimensional vector autoregression with measurement error, also known as linear gaussian state-space model. A sparse expectation-maximization (EM) algorithm is provided for parameter estimation. For transition matrix inference, both global testing and simultaneous testing are implemented, with consistent size and false discovery rate (FDR) control. The methods are proposed in Lyu et al. (2020).

Problem setup

The model of interest is high-dimensional vector autoregression (VAR) with measurement error, y_t = x_t + ϵ_t, x_t + 1 = A_*x_t + η_t, where y_t ∈ ℝ^p is the observed multivariate time series, x_t ∈ ℝ^p is the multivariate latent signal that admits an autoregressive structure, ϵ_t ∈ ℝ^p is the measurement error for the observed time series, η_t ∈ ℝ^p is the white noise of the latent signal, and A_* ∈ ℝ^p × p is the sparse transition matrix that encodes the directional relations among the latent signal variables of x_t. Furthermore, we focus on the scenario ∥A_*∥₂ < 1 such that the VAR model of x_t is stationary. The error terms ϵ_t and η_t are i.i.d. multivariate normal with mean zero and covariance σ_ϵ, *²I_p and σ_η, *²I_p, respectively, and are independent of x_t. This package can handle high-dimensional setting where p² exceeds the length of series T.

Estimation aims to recover {A_*, σ_η, *², σ_ϵ, *²} from observation y_t’s. The statistical inference goal is the transition matrix A_*. The global hypotheses is H₀ : A_*, ij = A_0, ij, for all (i, j) ∈ 𝒮 versus H₁ : A_*, ij ≠ A_0, ij, for some (i, j) ∈ 𝒮, for a given A₀ = (A_0, ij) ∈ ℝ^p × p and 𝒮 ⊆ [p] × [p], where [p] = {1, …, p}. The most common choice is A₀ = 0_p × p and 𝒮 = [p] × [p]. The simultaneous hypotheses are H_0; ij : A_*, ij = A_0, ij, versus H_1; ij : A_*, ij ≠ A_0, ij, for all (i, j) ∈ 𝒮.

Methodology

1. Estimation: sparse EM algorithm

Let {y_t, x_t}_t = 1^T denote the complete data, where T is the total number of observations, y_t is observed but x_t is latent. Let Θ = {A, σ_η², σ_ϵ²} collect all the parameters of interest in model , and Θ_* = {A_*, σ_η, *², σ_ϵ, *²} denote the true parameters. The goal is to estimate Θ_* by maximizing the log-likelihood function of the observed data, ℓ(Θ|{y_t}_t = 1^T), with respect to Θ. The computation of ℓ(Θ|{y_t}_t = 1^T), however, is highly nontrivial. Sparse EM algorithm then turns to an auxiliary function, named the finite-sample Q-function, Q_y(Θ|Θ′) = 𝔼[ℓ(Θ|{y_t, x_t}_t = 1^T)|{y_t}_t = 1^T, Θ′], which is defined as the expectation of the log-likelihood function for the complete data ℓ(Θ|{y_t, x_t}_t = 1^T), conditioning on a parameter set Θ′ and the observed data y_t, and the expectation is taken with respect to the latent data x_t. The Q-function can be computed efficiently by Kalman filter and smoothing, and provides a lower bound of the target log-likelihood function ℓ(Θ|{y_t}_t = 1^T) for any Θ. The equality ℓ(Θ′|{y_t}_t = 1^T) = Q_y(Θ′|Θ′) holds if Θ = Θ′. Maximizing Q-function provides an uphill step of the likelihood. Starting from an initial set of parameters Θ̂₀, sparse EM algorithm then alternates between the expectation step (E-step), where the Q-function Q_y(Θ|Θ̂_k) conditioning on the parameters Θ̂_k of the kth iteration is computed, and the maximization step (M-step), where the parameters are updated by maximizing the Q-function Θ̂_k + 1 = arg max_ΘQ_y(Θ|Θ̂_k).

For the M-step, the maximizer of Q_y(Θ|Θ̂_k) satisfies that $\frac{1}{T-1} \sum_{t=1}^{T-1} \mathbf{E}_{t,t+1;k} = \{\frac{1}{T-1}\sum_{t=1}^{T-1} \mathbf{E}_{t,t;k} \}\mathbf{A}^\top$, where E_t, s; k = 𝔼{x_tx_s^⊤|{y_t′}_t′ = 1^T, Θ̂_k − 1} for s, t ∈ [T] is obtained from the E-step. Instead of directly inverting the matrix involving E_t, t; k’s, which is computationally challenging when the dimension p is high and yields a dense estimator of A_* leading to a divergent statistical error, sparse EM algorithm implements generalized Dantzig selector for Yule-Walker equation,
$$ \hat{\mathbf{A}}_{k} = \arg\min_{\mathbf{A} \in \mathbb{R}^{p\times p}} \|\mathbf{A}\|_1, \;\; \textrm{such that} \; \left\| \frac{1}{T-1} \sum_{t=1}^{T-1} \mathbf{E}_{t,t+1;k} -\frac{1}{T-1} \sum_{t=1}^{T-1} \mathbf{E}_{t,t;k} \mathbf{A}^\top \right\|_{\max} \le \tau_k, $$ where τ_k is the tolerance parameter that is tuned via cross-validation each iteration (in the observed time series, first Ti_train time points serve as training set, then gap Ti_gap time points, and use the remain as test set). The optimization problem is solved using linear programming in a row-by-row parallel fashion. In the package, an option of further hard thresholding $\hat{\mathbf{A}}_{k}$ is provided to improve model selection performance. Hard thresholding sets entries of magnitude less than threshold level as zero. The variance estimates are next updated as, $$ \begin{align} \label{eqn: epsilon} \begin{split} \hat\sigma_{\eta,k}^2 & = \frac{1}{p(T-1)} \sum_{t=1}^{T-1} \left\{ \mathrm{tr}( \mathbf{E}_{t+1,t+1;k}) - \mathrm{tr}\left ( \hat{\mathbf{A}}_{k} \mathbf{E}_{t,t+1;k} \right) \right\} , \\ \hat\sigma^2_{\epsilon,k} & = \frac{1}{pT } \sum_{t=1}^{T} \left\{ \mathbf{y}_{t}^\top \mathbf{y}_{t} - 2 \mathbf{y}_{t}^\top \mathbf{E}_{t;k} + \mathrm{tr} (\mathbf{E}_{t,t;k}) \right\}, \end{split} \end{align} $$ where E_t; k = 𝔼{x_t|{y_t′}_t′ = 1^T, Θ̂_k − 1} for t ∈ [T], and comes from taking derivative on Q_y(Θ|Θ̂_k). Sparse EM algorithm terminates when reaches the maximal number of iterations or the estimates are close enough in two consecutive iterations, e.g., $\min \left\{ \|\hat{\mathbf{A}}_{k} -\hat{\mathbf{A}}_{k-1} \|_F , | \hat{\sigma}_{\eta,k }-\hat{\sigma}_{\eta,k-1}| ,| \hat{\sigma}_{\epsilon, k}-\hat{\sigma}_{\epsilon, k-1}| \right\} \le 10^{-3}$.

2. Statistical inference

2.a Gaussian test statistic matrix

The fundamental tools of testing is a gausisan test statistic matrix whose entries marginally follow standard normal under null.

The test statistic is constructed as follows. Observation y_t follows an autoregressive structure, y_t + 1 = A_*y_t + e_t, with the error term e_t = −A_*ϵ_t + ϵ_t + 1 + η_t. Then the lag-1 auto-covariance of the error e_t is of the form, Σ_e = Cov(e_t, e_t − 1) = −σ_ϵ, *²A_*. This suggests that we can apply the covariance testing methods on Σ_e to infer transition matrix A_*. However, e_t is not directly observed. Define generic estimates of Θ_* by $\left \{\hat{\mathbf{A}},\hat\sigma_{\epsilon}^2, \hat\sigma_{\eta}^2 \right\}$ (sparse EM estimates also work). We use them to reconstruct this error, and obtain the sample lag-1 auto-covariance estimator, $$\hat{\mathbf{\Sigma}}_e = \frac{1}{T-2} \sum_{t=2}^{T-1} \hat{\mathbf{e}}_{t}\hat{\mathbf{e}}_{t-1}^\top, \ \text{where} \ \ \hat{\mathbf{e}}_{t} = \mathbf{y}_{t+1} - \hat{\mathbf{A}} \mathbf{y}_{t} - \frac{1}{T-1}\sum_{t'=1}^{T-1} (\mathbf{y}_{t'+1} - \hat{\mathbf{A}}\mathbf{y}_{t'}).$$

This sample estimator $\hat{\mathbf{\Sigma}}_e$, nevertheless, involves some bias due to the reconstruction of the error term, and also an inflated variance due to the temporal dependence of the time series data. Bias and variance correction lead to the Gaussian matrix test statistic H, whose (i, j)th entry is,
$$ H_{ij} = \frac{ \sum_{t=2}^{T-1} \{ \hat{e}_{ t,i}\hat{e}_{ t-1,j} + \left( \hat{\sigma}_{\eta}^2 +\hat{\sigma}_{\epsilon}^2 \right) \hat{A}_{ij} - \hat{\sigma}_\eta^2 A_{0,ij} \} }{\sqrt{T-2} \; \hat{\sigma}_{ij}}, \quad i,j \in [p]. $$ Lyu et al. (2020) proves that, under mild assumptions,
$$ \frac{ \sum_{t=2}^{T-1} \{\hat{e}_{ t,i}\hat{e}_{ t-1,j} + \left( \hat{\sigma}_{\eta}^2 +\hat{\sigma}_{\epsilon}^2 \right) \hat{A}_{ij} - \hat{\sigma}_\eta^2 A_{*,ij} \}}{\sqrt{T-2} \; \hat{\sigma}_{ij}}\rightarrow_{d} \mathrm{N}(0, 1) $$ uniformly for i, j ∈ [p] as p, T → ∞.

2.b Global testing

The key insight of global testing is that the squared maximum entry of a zero mean normal vector converges to a Gumbel distribution. Specifically, the global test statistic is
G_𝒮 = max_{(i, j) ∈ 𝒮}H_ij². Lyu et al. (2020) justifies that the asymptotic null distribution of G_𝒮 is Gumbel, $$ \lim_{|\mathcal{S}| \rightarrow \infty} \mathbb{P} \Big( G_\mathcal{S} -2 \log |\mathcal{S}| + \log \log |\mathcal{S}| \le x \Big) = \exp \left\{- \exp (-x/2) / \sqrt{\pi} \right\}. $$ It leads to an asymptotic α-level test, The global null is rejected if Ψ_α = 1.

2.c Simultaneous testing

Let ℋ₀ = {(i, j) : A_*, ij = A_0, ij, (i, j) ∈ 𝒮} denote the set of true null hypotheses, and ℋ₁ = {(i, j) : (i, j) ∈ 𝒮, (i, j) ∉ ℋ₀} denote the set of true alternatives. The test statistic H_ij follows a standard normal distribution when H_0; ij holds, and as such, we reject H_0; ij if |H_ij| > t for some thresholding value t > 0. Let R_𝒮(t) = ∑_{(i, j) ∈ 𝒮}𝟙{|H_ij| > t} denote the number of rejections at t. Then the false discovery proportion (FDP) and the false discovery rate (FDR) in the simultaneous testing problem are, An ideal choice of the threshold t is to reject as many true positives as possible, while controlling the false discovery at the pre-specified level β, that is inf {t > 0 : FDP_𝒮(t) ≤ β}. However, ℋ₀ in FDP_𝒮(t) is unknown. Observing that ℙ(|H_ij| > t) ≈ 2{1 − Φ(t)}, where Φ(⋅) is the cumulative distribution function of a standard normal distribution, the false rejections ∑_{(i, j) ∈ ℋ₀}𝟙{|H_ij| > t} in FDP_𝒮(t) can be approximated by {2 − 2Φ(t)}|𝒮|. Moreover, the search of t is restricted to the range $\left(0, \sqrt{2\log |\mathcal{S}|} \right]$, since $\mathbb{P}\left( \hat{t} \text{ exists in } \left(0, \sqrt{2\log |\mathcal{S}|}\right] \right) \to 1$ as shown in the theoretical justification of Lyu et al. (2020). The simultaneous testing procedure is justified that consistently control FDR, $$ \lim_{|\mathcal{S}| \to \infty} \frac{\text{FDR}_{\mathcal{S}} (\, \hat{t} \; )}{\beta |\mathcal{H}_0|/|\mathcal{S}|} = 1, \quad \textrm{ and } \quad \frac{\text{FDP}_{\mathcal{S}} (\, \hat{t} \; )}{\beta | \mathcal{H}_0|/|\mathcal{S}|}\rightarrow_{p} 1 \;\; \textrm{ as } \; |\mathcal{S}| \to \infty. $$

Quick start

The purpose of this section is to show users the basic usage of this package. We will briefly go through main functions, see what they can do and have a look at outputs. An detailed example of complete procedures of estimation and inference is be presented to give users a general sense of the pakcage.

We first generate observations from the model.

library(hdiVAR)

set.seed(123)
p=3; Ti=400  # dimension and time
A=diag(1,p) # transition matrix
sig_eta=sig_epsilon=0.2 # error std
Y=array(0,dim=c(p,Ti)) #observation t=1, ...., Ti
X=array(0,dim=c(p,Ti)) #latent t=1, ...., T
Ti_burnin=400 # time for burn-in to stationarity
for (t in 1:(Ti+Ti_burnin)) {
 if (t==1){
   x1=rnorm(p)
 } else if (t<=Ti_burnin) { # burn in
   x1=A%*%x1+rnorm(p,mean=0,sd=sig_eta)
 } else if (t==(Ti_burnin+1)){ # time series used for learning
   X[,t-Ti_burnin]=x1
   Y[,t-Ti_burnin]=X[,t-Ti_burnin]+rnorm(p,mean=0,sd=sig_epsilon)
 } else {
   X[,t- Ti_burnin]=A%*%X[,t-1- Ti_burnin]+rnorm(p,mean=0,sd=sig_eta)
   Y[,t- Ti_burnin]=X[,t- Ti_burnin]+rnorm(p,mean=0,sd=sig_epsilon)
 }
}

The first example is sparse EM algorithm.

# cross-validation grid of tolerance parameter \tau_k in Dantzig selector.
tol_seq=c(0.0001,0.0003,0.0005) 

# cross-validation grid of hard thresholding levels in transition matrix estimate. 
# Set as zero to avoid thresholding. The output is \hat{A}_k.
ht_seq=0 


A_init=diag(0.1,p) # initial estimate of A 
# initial estimates of error variances 
sig2_eta_init=sig2_epsilon_init=0.1 

# the first half time points are training data 
Ti_train=Ti*0.5

# The latter 3/10 time points are test data (drop out train (1/2) and gap (1/5) sets).
Ti_gap=Ti*0.2 

# sparse EM algorithm 
sEM_fit=sEM(Y,A_init,sig2_eta_init,sig2_epsilon_init,Ti_train,Ti_gap,tol_seq,ht_seq,is_echo = TRUE)

## [1] "CV-tuned (tol, ht) is in (1, 1)/(3, 1) of the parameter grid."
## [1] "CV-tuned (tol, ht) is in (1, 1)/(3, 1) of the parameter grid."
## [1] "CV-tuned (tol, ht) is in (1, 1)/(3, 1) of the parameter grid."
## [1] "CV-tuned (tol, ht) is in (1, 1)/(3, 1) of the parameter grid."
## [1] "CV-tuned (tol, ht) is in (1, 1)/(3, 1) of the parameter grid."
## sparse EM is terminated due to vanishing updates

# estimate of A 
sEM_fit$A_est

##            [,1]         [,2]        [,3]
## [1,] 0.96419601 -0.026679552 0.012691775
## [2,] 0.01440378  0.985762599 0.006283281
## [3,] 0.01642908  0.009632632 0.995205126

# estimate of error variances 
c(sEM_fit$sig2_epsilon_hat,sEM_fit$sig2_eta_hat)

## [1] 0.06429002 0.05989768

The second example is statistical inference.

# use sparse EM estimates to construct test. Alternative consistent estimators can also be adopted if any. 
# test the entire matrix.


# FDR control levels for simultaneous testing 
FDR_levels=c(0.05,0.1)


# if null hypotheses are true (null hypothesis is true A): 
# p-value should > 0.05, and simultaneous testing selects no entries. 
true_null=hdVARtest(Y,sEM_fit$A_est,sEM_fit$sig2_eta_hat,sEM_fit$sig2_epsilon_hat,
                    global_H0=A,global_idx=NULL,simul_H0=A,
                    simul_idx=NULL,FDR_levels=FDR_levels)

# global pvalue: 
true_null$pvalue

## [1] 0.4223181

# selection at FDR=0.05 control level 
true_null$selected[,,FDR_levels==0.05]

##      [,1] [,2] [,3]
## [1,]    0    0    0
## [2,]    0    0    0
## [3,]    0    0    0

# if null hypotheses are false (null hypothesis is zero matrix): 
# p-value should < 0.05, and simultaneous testing selects diagnoal entries. 
false_null=hdVARtest(Y,sEM_fit$A_est,sEM_fit$sig2_eta_hat,sEM_fit$sig2_epsilon_hat,
                     global_H0=matrix(0,p,p),global_idx=NULL,simul_H0=matrix(0,p,p),
                     simul_idx=NULL,FDR_levels=c(0.05,0.1))

# global pvalue: 
false_null$pvalue

## [1] 3.477663e-12

# selection at FDR=0.05 control level 
false_null$selected[,,FDR_levels==0.05]

##      [,1] [,2] [,3]
## [1,]    1    0    0
## [2,]    0    1    0
## [3,]    0    0    1

Reference

Xiang Lyu, Jian Kang, and Lexin Li. Statistical Inference for High-Dimensional Vector Autoregression with Measurement Error. arXiv preprint arXiv:2009.08011 (2020).

Vignette of R package hdiVAR