--- title: 'Testing for the number of interesting components in SIR' author: "Klaus Nordhausen, Hannu Oja, David E. Tyler, Joni Virta" date: "`r Sys.Date()`" output: html_document: toc: yes rmarkdown::html_vignette: fig_caption: yes toc: yes toc_depth: 2 pdf_document: fig_caption: yes toc: yes toc_depth: 2 vignette: | %\VignetteIndexEntry{SIR} %\usepackage[utf8]{inputenc} %\VignetteEngine{knitr::rmarkdown} --- ## Sliced inverse regression Denote the $p$-variate predictors $x_i$, $i=1,...,n$ with corresponding responses $y_i$. The predictors are assumed to follow the model $$ x_i = A s_i + b, $$ where $A$ is a non-singular $p \times p$ matrix, $b$ a $p$-vector and the random vector $s$ can be decomposed into $(s_1^T,s_2^T)^T$ with $E(s)=0$ and $Cov(s)=I_p$ where $s_1$ has dimension $k$ and $s_2$ dimension $p-k$. It is then assumed that $s_1$ is the signal part, the interesting components, which are relevant to model $y$, whereas $s_2$ is the noise part. The working model assumption is then that $$ (y,s_1^T)^T \ is \ independent \ of \ s_2. $$ Defining $S_1=E((x-b)(x-b)^T)$ and $S_2=E(E(x-b|y)E(x-b|y)^T)$ the sliced inverse regression (SIR) can be interpreted as finding the transformation matrix $W$ such that $$ W S_1 W^T = I_p \ and \ W S_2 W^T = D, $$ where $D$ is a diagonal matrix with diagonal elements $d_1 \geq d_2 \geq ... \geq d_k > d_{k+1} = ... = d_p = 0$. In practice $S_2$ is estimated by approximating $E(x-b|y)$ by dividing $y$ into $h$ slices where in this package $y$ is divided into $h$ intervals containing an equal number of observations. The practical problem is to decide then the value $k$. ### Asymptotic test a specific value of $k$ For an asymptotic test using the test statistic $$ T= n\sum_{i=k+1}^p d_i, $$ the limiting distribution under the null is then $T \sim \chi^2_{(p-k)(h-k-1)}$. Therefore for the hypothetical value $k$ and the number of slices $h$ is required that $k \geq h-1$. ### Bootstrapping test a specific value of $k$ Bootstrap tests can be constructed as follows 1. Compute $\hat s_i = \hat W(x_i - \bar x_i)$, and decompose into $\hat s_{i,1}$ and $\hat s_{i,2}$. 2. Sample a bootstrap sample $(y_i^*, \hat s_{i,1}^*)$ of size $n$ from $(y_i, \hat s_{i,1})$. 3. Sample a bootstrap sample $\hat s_{i,2}^*$ of size $n$ from $\hat s_{i,2}$. 4. Compute $x_i^* = \hat W^{-1} ((\hat s_{i,1}^*)^T,(\hat s_{i,2}^*)^T)^T$. 5. Recompute the test statistic $T^*$ based on the bootstrap data $x_i^*$, $i=,1,...,n$. 6. Repeat the steps above $m$ times to obtain bootstrap test statistics $T^*_1,...,T^*_m$. The p-value is then $$ \frac{\#(T_i^* \geq T)+1}{m+1}. $$ ### Example Some simulated data with true $k=2$: ```{r, fig.show = "hold", fig.width = 7, fig.height = 7} set.seed(1234) n <- 200 p <- 10 X <- matrix(rnorm(p*n), ncol = p) eps <- rnorm(n, sd=0.25) y <- X[, 1]/ (0.5+(X[, 2]+1.5)^2) pairs(cbind(y,X)) ``` First performing the asymptotic test ```{r, message = FALSE, warning = FALSE, fig.show = "hold", fig.width = 7, fig.height = 7} library(ICtest) SIRasympk2 <- SIRasymp(X,y,2) screeplot(SIRasympk2) SIRasympk2 ``` Then the bootstrap test ```{r, message = FALSE, warning = FALSE, fig.show = "hold"} SIRbootk2 <- SIRboot(X,y,2) SIRbootk2 ``` Looking at the first two components and their relationship with the response ```{r, fig.show = "hold", fig.width = 7, fig.height = 7} pairs(cbind(y, components(SIRbootk2, which="k"))) ```