Vignette of R package hdiVAR

Basic info

Problem setup

Methodology

Quick Start

Reference

Basic Info

This package considers the estimation and statistical inference of high-dimensional vector autoregression with measurement error, also known as linear gaussian state-space model. A sparse expectation-maximization (EM) algorithm is provided for parameter estimation. For transition matrix inference, both global testing and simultaneous testing are implemented, with consistent size and false discovery rate (FDR) control. The methods are proposed in Lyu et al. (2020).

Problem setup

The model of interest is high-dimensional vector autoregression (VAR) with measurement error, yt = xt + ϵt,    xt + 1 = A*xt + ηt, where yt ∈ ℝp is the observed multivariate time series, xt ∈ ℝp is the multivariate latent signal that admits an autoregressive structure, ϵt ∈ ℝp is the measurement error for the observed time series, ηt ∈ ℝp is the white noise of the latent signal, and A* ∈ ℝp × p is the sparse transition matrix that encodes the directional relations among the latent signal variables of xt. Furthermore, we focus on the scenario A*2 < 1 such that the VAR model of xt is stationary. The error terms ϵt and ηt are i.i.d. multivariate normal with mean zero and covariance σϵ, *2Ip and ση, *2Ip, respectively, and are independent of xt. This package can handle high-dimensional setting where p2 exceeds the length of series T.

Estimation aims to recover {A*, ση, *2, σϵ, *2} from observation yt’s. The statistical inference goal is the transition matrix A*. The global hypotheses is H0 : A*, ij = A0, ij,  for all (i, j) ∈ 𝒮  versus  H1 : A*, ij ≠ A0, ij,  for some (i, j) ∈ 𝒮, for a given A0 = (A0, ij) ∈ ℝp × p and 𝒮 ⊆ [p] × [p], where [p] = {1, …, p}. The most common choice is A0 = 0p × p and 𝒮 = [p] × [p]. The simultaneous hypotheses are H0; ij : A*, ij = A0, ij,  versus  H1; ij : A*, ij ≠ A0, ij,  for all (i, j) ∈ 𝒮.

Methodology

1. Estimation: sparse EM algorithm

Let {yt, xt}t = 1T denote the complete data, where T is the total number of observations, yt is observed but xt is latent. Let Θ = {A, ση2, σϵ2} collect all the parameters of interest in model , and Θ* = {A*, ση, *2, σϵ, *2} denote the true parameters. The goal is to estimate Θ* by maximizing the log-likelihood function of the observed data, ℓ(Θ|{yt}t = 1T), with respect to Θ. The computation of ℓ(Θ|{yt}t = 1T), however, is highly nontrivial. Sparse EM algorithm then turns to an auxiliary function, named the finite-sample Q-function, Qy(Θ|Θ′) = 𝔼[ℓ(Θ|{yt, xt}t = 1T)|{yt}t = 1T, Θ′], which is defined as the expectation of the log-likelihood function for the complete data ℓ(Θ|{yt, xt}t = 1T), conditioning on a parameter set Θ and the observed data yt, and the expectation is taken with respect to the latent data xt. The Q-function can be computed efficiently by Kalman filter and smoothing, and provides a lower bound of the target log-likelihood function ℓ(Θ|{yt}t = 1T) for any Θ. The equality ℓ(Θ′|{yt}t = 1T) = Qy(Θ′|Θ′) holds if Θ = Θ. Maximizing Q-function provides an uphill step of the likelihood. Starting from an initial set of parameters Θ̂0, sparse EM algorithm then alternates between the expectation step (E-step), where the Q-function Qy(Θ|Θ̂k) conditioning on the parameters Θ̂k of the kth iteration is computed, and the maximization step (M-step), where the parameters are updated by maximizing the Q-function Θ̂k + 1 = arg maxΘQy(Θ|Θ̂k).

For the M-step, the maximizer of Qy(Θ|Θ̂k) satisfies that $\frac{1}{T-1} \sum_{t=1}^{T-1} \mathbf{E}_{t,t+1;k} = \{\frac{1}{T-1}\sum_{t=1}^{T-1} \mathbf{E}_{t,t;k} \}\mathbf{A}^\top$, where Et, s; k = 𝔼{xtxs|{yt}t′ = 1T, Θ̂k − 1} for s, t ∈ [T] is obtained from the E-step. Instead of directly inverting the matrix involving Et, t; k’s, which is computationally challenging when the dimension p is high and yields a dense estimator of A* leading to a divergent statistical error, sparse EM algorithm implements generalized Dantzig selector for Yule-Walker equation,
$$ \hat{\mathbf{A}}_{k} = \arg\min_{\mathbf{A} \in \mathbb{R}^{p\times p}} \|\mathbf{A}\|_1, \;\; \textrm{such that} \; \left\| \frac{1}{T-1} \sum_{t=1}^{T-1} \mathbf{E}_{t,t+1;k} -\frac{1}{T-1} \sum_{t=1}^{T-1} \mathbf{E}_{t,t;k} \mathbf{A}^\top \right\|_{\max} \le \tau_k, $$ where τk is the tolerance parameter that is tuned via cross-validation each iteration (in the observed time series, first Ti_train time points serve as training set, then gap Ti_gap time points, and use the remain as test set). The optimization problem is solved using linear programming in a row-by-row parallel fashion. In the package, an option of further hard thresholding $\hat{\mathbf{A}}_{k}$ is provided to improve model selection performance. Hard thresholding sets entries of magnitude less than threshold level as zero. The variance estimates are next updated as, $$ \begin{align} \label{eqn: epsilon} \begin{split} \hat\sigma_{\eta,k}^2 & = \frac{1}{p(T-1)} \sum_{t=1}^{T-1} \left\{ \mathrm{tr}( \mathbf{E}_{t+1,t+1;k}) - \mathrm{tr}\left ( \hat{\mathbf{A}}_{k} \mathbf{E}_{t,t+1;k} \right) \right\} , \\ \hat\sigma^2_{\epsilon,k} & = \frac{1}{pT } \sum_{t=1}^{T} \left\{ \mathbf{y}_{t}^\top \mathbf{y}_{t} - 2 \mathbf{y}_{t}^\top \mathbf{E}_{t;k} + \mathrm{tr} (\mathbf{E}_{t,t;k}) \right\}, \end{split} \end{align} $$ where Et; k = 𝔼{xt|{yt}t′ = 1T, Θ̂k − 1} for t ∈ [T], and comes from taking derivative on Qy(Θ|Θ̂k). Sparse EM algorithm terminates when reaches the maximal number of iterations or the estimates are close enough in two consecutive iterations, e.g., $\min \left\{ \|\hat{\mathbf{A}}_{k} -\hat{\mathbf{A}}_{k-1} \|_F , | \hat{\sigma}_{\eta,k }-\hat{\sigma}_{\eta,k-1}| ,| \hat{\sigma}_{\epsilon, k}-\hat{\sigma}_{\epsilon, k-1}| \right\} \le 10^{-3}$.

2. Statistical inference

2.a Gaussian test statistic matrix

The fundamental tools of testing is a gausisan test statistic matrix whose entries marginally follow standard normal under null.

The test statistic is constructed as follows. Observation yt follows an autoregressive structure, yt + 1 = A*yt + et, with the error term et = −A*ϵt + ϵt + 1 + ηt. Then the lag-1 auto-covariance of the error et is of the form, Σe = Cov(et, et − 1) = −σϵ, *2A*. This suggests that we can apply the covariance testing methods on Σe to infer transition matrix A*. However, et is not directly observed. Define generic estimates of Θ* by $\left \{\hat{\mathbf{A}},\hat\sigma_{\epsilon}^2, \hat\sigma_{\eta}^2 \right\}$ (sparse EM estimates also work). We use them to reconstruct this error, and obtain the sample lag-1 auto-covariance estimator, $$\hat{\mathbf{\Sigma}}_e = \frac{1}{T-2} \sum_{t=2}^{T-1} \hat{\mathbf{e}}_{t}\hat{\mathbf{e}}_{t-1}^\top, \ \text{where} \ \ \hat{\mathbf{e}}_{t} = \mathbf{y}_{t+1} - \hat{\mathbf{A}} \mathbf{y}_{t} - \frac{1}{T-1}\sum_{t'=1}^{T-1} (\mathbf{y}_{t'+1} - \hat{\mathbf{A}}\mathbf{y}_{t'}).$$

This sample estimator $\hat{\mathbf{\Sigma}}_e$, nevertheless, involves some bias due to the reconstruction of the error term, and also an inflated variance due to the temporal dependence of the time series data. Bias and variance correction lead to the Gaussian matrix test statistic H, whose (i, j)th entry is,
$$ H_{ij} = \frac{ \sum_{t=2}^{T-1} \{ \hat{e}_{ t,i}\hat{e}_{ t-1,j} + \left( \hat{\sigma}_{\eta}^2 +\hat{\sigma}_{\epsilon}^2 \right) \hat{A}_{ij} - \hat{\sigma}_\eta^2 A_{0,ij} \} }{\sqrt{T-2} \; \hat{\sigma}_{ij}}, \quad i,j \in [p]. $$ Lyu et al. (2020) proves that, under mild assumptions,
$$ \frac{ \sum_{t=2}^{T-1} \{\hat{e}_{ t,i}\hat{e}_{ t-1,j} + \left( \hat{\sigma}_{\eta}^2 +\hat{\sigma}_{\epsilon}^2 \right) \hat{A}_{ij} - \hat{\sigma}_\eta^2 A_{*,ij} \}}{\sqrt{T-2} \; \hat{\sigma}_{ij}}\rightarrow_{d} \mathrm{N}(0, 1) $$ uniformly for i, j ∈ [p] as p, T → ∞.

2.b Global testing

The key insight of global testing is that the squared maximum entry of a zero mean normal vector converges to a Gumbel distribution. Specifically, the global test statistic is
G𝒮 = max(i, j) ∈ 𝒮Hij2. Lyu et al. (2020) justifies that the asymptotic null distribution of G𝒮 is Gumbel, $$ \lim_{|\mathcal{S}| \rightarrow \infty} \mathbb{P} \Big( G_\mathcal{S} -2 \log |\mathcal{S}| + \log \log |\mathcal{S}| \le x \Big) = \exp \left\{- \exp (-x/2) / \sqrt{\pi} \right\}. $$ It leads to an asymptotic α-level test, The global null is rejected if Ψα = 1.

2.c Simultaneous testing

Let 0 = {(i, j) : A*, ij = A0, ij, (i, j) ∈ 𝒮} denote the set of true null hypotheses, and 1 = {(i, j) : (i, j) ∈ 𝒮, (i, j) ∉ ℋ0} denote the set of true alternatives. The test statistic Hij follows a standard normal distribution when H0; ij holds, and as such, we reject H0; ij if |Hij| > t for some thresholding value t > 0. Let R𝒮(t) = ∑(i, j) ∈ 𝒮𝟙{|Hij| > t} denote the number of rejections at t. Then the false discovery proportion (FDP) and the false discovery rate (FDR) in the simultaneous testing problem are, An ideal choice of the threshold t is to reject as many true positives as possible, while controlling the false discovery at the pre-specified level β, that is inf {t > 0 : FDP𝒮(t) ≤ β}. However, 0 in FDP𝒮(t) is unknown. Observing that ℙ(|Hij| > t) ≈ 2{1 − Φ(t)}, where Φ(⋅) is the cumulative distribution function of a standard normal distribution, the false rejections (i, j) ∈ ℋ0𝟙{|Hij| > t} in FDP𝒮(t) can be approximated by {2 − 2Φ(t)}|𝒮|. Moreover, the search of t is restricted to the range $\left(0, \sqrt{2\log |\mathcal{S}|} \right]$, since $\mathbb{P}\left( \hat{t} \text{ exists in } \left(0, \sqrt{2\log |\mathcal{S}|}\right] \right) \to 1$ as shown in the theoretical justification of Lyu et al. (2020). The simultaneous testing procedure is justified that consistently control FDR, $$ \lim_{|\mathcal{S}| \to \infty} \frac{\text{FDR}_{\mathcal{S}} (\, \hat{t} \; )}{\beta |\mathcal{H}_0|/|\mathcal{S}|} = 1, \quad \textrm{ and } \quad \frac{\text{FDP}_{\mathcal{S}} (\, \hat{t} \; )}{\beta | \mathcal{H}_0|/|\mathcal{S}|}\rightarrow_{p} 1 \;\; \textrm{ as } \; |\mathcal{S}| \to \infty. $$

Quick start

The purpose of this section is to show users the basic usage of this package. We will briefly go through main functions, see what they can do and have a look at outputs. An detailed example of complete procedures of estimation and inference is be presented to give users a general sense of the pakcage.

We first generate observations from the model.

library(hdiVAR)

set.seed(123)
p=3; Ti=400  # dimension and time
A=diag(1,p) # transition matrix
sig_eta=sig_epsilon=0.2 # error std
Y=array(0,dim=c(p,Ti)) #observation t=1, ...., Ti
X=array(0,dim=c(p,Ti)) #latent t=1, ...., T
Ti_burnin=400 # time for burn-in to stationarity
for (t in 1:(Ti+Ti_burnin)) {
 if (t==1){
   x1=rnorm(p)
 } else if (t<=Ti_burnin) { # burn in
   x1=A%*%x1+rnorm(p,mean=0,sd=sig_eta)
 } else if (t==(Ti_burnin+1)){ # time series used for learning
   X[,t-Ti_burnin]=x1
   Y[,t-Ti_burnin]=X[,t-Ti_burnin]+rnorm(p,mean=0,sd=sig_epsilon)
 } else {
   X[,t- Ti_burnin]=A%*%X[,t-1- Ti_burnin]+rnorm(p,mean=0,sd=sig_eta)
   Y[,t- Ti_burnin]=X[,t- Ti_burnin]+rnorm(p,mean=0,sd=sig_epsilon)
 }
}

The first example is sparse EM algorithm.

# cross-validation grid of tolerance parameter \tau_k in Dantzig selector.
tol_seq=c(0.0001,0.0003,0.0005) 

# cross-validation grid of hard thresholding levels in transition matrix estimate. 
# Set as zero to avoid thresholding. The output is \hat{A}_k.
ht_seq=0 


A_init=diag(0.1,p) # initial estimate of A 
# initial estimates of error variances 
sig2_eta_init=sig2_epsilon_init=0.1 

# the first half time points are training data 
Ti_train=Ti*0.5

# The latter 3/10 time points are test data (drop out train (1/2) and gap (1/5) sets).
Ti_gap=Ti*0.2 

# sparse EM algorithm 
sEM_fit=sEM(Y,A_init,sig2_eta_init,sig2_epsilon_init,Ti_train,Ti_gap,tol_seq,ht_seq,is_echo = TRUE)
## [1] "CV-tuned (tol, ht) is in (1, 1)/(3, 1) of the parameter grid."
## [1] "CV-tuned (tol, ht) is in (1, 1)/(3, 1) of the parameter grid."
## [1] "CV-tuned (tol, ht) is in (1, 1)/(3, 1) of the parameter grid."
## [1] "CV-tuned (tol, ht) is in (1, 1)/(3, 1) of the parameter grid."
## [1] "CV-tuned (tol, ht) is in (1, 1)/(3, 1) of the parameter grid."
## sparse EM is terminated due to vanishing updates
# estimate of A 
sEM_fit$A_est 
##            [,1]         [,2]        [,3]
## [1,] 0.96419601 -0.026679552 0.012691775
## [2,] 0.01440378  0.985762599 0.006283281
## [3,] 0.01642908  0.009632632 0.995205126
# estimate of error variances 
c(sEM_fit$sig2_epsilon_hat,sEM_fit$sig2_eta_hat) 
## [1] 0.06429002 0.05989768

The second example is statistical inference.

# use sparse EM estimates to construct test. Alternative consistent estimators can also be adopted if any. 
# test the entire matrix.


# FDR control levels for simultaneous testing 
FDR_levels=c(0.05,0.1)


# if null hypotheses are true (null hypothesis is true A): 
# p-value should > 0.05, and simultaneous testing selects no entries. 
true_null=hdVARtest(Y,sEM_fit$A_est,sEM_fit$sig2_eta_hat,sEM_fit$sig2_epsilon_hat,
                    global_H0=A,global_idx=NULL,simul_H0=A,
                    simul_idx=NULL,FDR_levels=FDR_levels)

# global pvalue: 
true_null$pvalue
## [1] 0.4223181
# selection at FDR=0.05 control level 
true_null$selected[,,FDR_levels==0.05]
##      [,1] [,2] [,3]
## [1,]    0    0    0
## [2,]    0    0    0
## [3,]    0    0    0
# if null hypotheses are false (null hypothesis is zero matrix): 
# p-value should < 0.05, and simultaneous testing selects diagnoal entries. 
false_null=hdVARtest(Y,sEM_fit$A_est,sEM_fit$sig2_eta_hat,sEM_fit$sig2_epsilon_hat,
                     global_H0=matrix(0,p,p),global_idx=NULL,simul_H0=matrix(0,p,p),
                     simul_idx=NULL,FDR_levels=c(0.05,0.1))

# global pvalue: 
false_null$pvalue
## [1] 3.477663e-12
# selection at FDR=0.05 control level 
false_null$selected[,,FDR_levels==0.05]
##      [,1] [,2] [,3]
## [1,]    1    0    0
## [2,]    0    1    0
## [3,]    0    0    1

Back to Top

Reference

  1. Xiang Lyu, Jian Kang, and Lexin Li. Statistical Inference for High-Dimensional Vector Autoregression with Measurement Error. arXiv preprint arXiv:2009.08011 (2020).