--- title: "fRLR: Fit Repeated Linear Regressions" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{fRLR: Fit Repeated Linear Regressions} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ```{r setup} library(fRLR) ``` ## Introduction This R package aims to fit **Repeated Linear Regressions** in which there are some same terms. ## An Example Let's start with the simplest situation, we want to fit a set of regressions which only differ in one variable. Specifically, denote the response variable as $y$, and these regressions are as follows. $$ \begin{array}{ll} y&\sim x_1 + cov_1 + cov_2+\ldots+cov_m\\ y&\sim x_2 + cov_1 +cov_2+\ldots+cov_m\\ \cdot &\sim \cdots\\ y&\sim x_n + cov_1 +cov_2+\ldots+cov_m\\ \end{array} $$ where $cov_i, i=1,\ldots, m$ are the same variables among these regressions. ## Ideas Intuitively, we can finish this task by using a simple loop. ```{r eval=FALSE} model = vector(mode='list', length=n) for (i in 1:n) { ... model[[i]] = lm(y~x) ... } ``` However, it is not efficient in that situation. As we all know, in the linear regression, the main goal is to estimate the parameter $\beta$. And we have $$ \hat\beta = (X'X)^{-1}X'Y $$ where $X$ is the design matrix and $Y$ is the observation of response variable. It is obvious that there are some same elements in the design matrix, and the larger $m$ is, the more elements are the same. So I want to reduce the cost of computation by separating the same part in the design matrix. ## Method For the above example, the design matrix can be denoted as $X=(x, cov)$. If we consider intercept, it also can be seen as the same variable among these regression, so it can be included in $cov$ naturally. Then we have $$ (X'X)^{-1}= \left[ \begin{array}{cc} x'x & x'cov \\ cov'x & cov'cov \end{array} \right]= \left[ \begin{array}{ll} a& v'\\ v& B \end{array} \right] $$ **Woodbury formula** tells us $$ (A+UCV)^{-1}=A^{-1}-A^{-1}U(C^{-1}+VA^{-1}U)^{-1}VA^{-1} $$ Let $$ A=\left[ \begin{array}{ll} a&O\\ O&B \end{array}\right],\; U=\left[ \begin{array}{ll} 1 & 0\\ O & v \end{array} \right],\; V= \left[ \begin{array}{ll} 0& v'\\ 1& O \end{array} \right] $$ and $C=I_{2\times 2}$. Then we can apply woodbury formula, $$ (X'X)^{-1}=(A+UCV)^{-1}=A^{-1}-A^{-1}U(C^{-1}+VA^{-1}U)^{-1}VA^{-1} $$ where $$ A^{-1}=\left[ \begin{array}{cc} a^{-1}&O\\ O&B^{-1} \end{array} \right] $$ We can do further calculations to simplify and obtain the following result $$ (X'X)^{-1}=\left[ \begin{array}{cc} 1/a+\frac{a}{a-v'B^{-1}v}v'B^{-1}v & -\frac{v'B^{-1}}{a-v'B^{-1}v}\\ -\frac{B^{-1}v}{a-v'B^{-1}v} & B^{-1}+\frac{-B^{-1}vv'B^{-1}}{a-v'B^{-1}v} \end{array} \right] $$ Notice that matrix $B$ is the same for all regression, the identical terms for each regression are just $a$ and $v$, which are very easy to calculate. So theoretically, we can reduce the cost of computation significantly. ## Test Now test two simulation examples by using the functions in this package. 
```{r}
## use fRLR package
set.seed(123)
X = matrix(rnorm(50), 10, 5)
Y = rnorm(10)
COV = matrix(rnorm(40), 10, 4)
frlr1(X, Y, COV, num_threads = 1)

## use a simple loop
res = matrix(nrow = 0, ncol = 2)
for (i in 1:ncol(X)) {
  mat = cbind(X[, i], COV)
  df = as.data.frame(mat)
  model = lm(Y ~ ., data = df)
  tmp = c(i, summary(model)$coefficients[2, 4])
  res = rbind(res, tmp)
}
res
```

As the output above shows, the p-value for the regression-specific variable in each regression is equal between the two methods.

Similarly, we can test another example, in which each regression contains a pair of regression-specific variables.

```{r}
set.seed(123)
X = matrix(rnorm(50), 10, 5)
Y = rnorm(10)
COV = matrix(rnorm(40), 10, 4)
idx1 = c(1, 2, 3, 4, 1, 1, 1, 2, 2, 3)
idx2 = c(2, 3, 4, 5, 3, 4, 5, 4, 5, 5)
frlr2(X, idx1, idx2, Y, COV, num_threads = 1)

## use a simple loop
res = matrix(nrow = 0, ncol = 4)
for (i in 1:length(idx1)) {
  mat = cbind(X[, idx1[i]], X[, idx2[i]], COV)
  df = as.data.frame(mat)
  model = lm(Y ~ ., data = df)
  tmp = c(idx1[i], idx2[i],
          summary(model)$coefficients[2, 4],
          summary(model)$coefficients[3, 4])
  res = rbind(res, tmp)
}
res
```

Again, the two methods give the same results.

## Computation Performance

The main aim of this method is to reduce the computation cost, so let's compare its speed with the simple-loop approach. The following timings cover all $\binom{100}{2} = 99 \times 100/2 = 4950$ pairwise regressions.

```{r}
set.seed(123)
n = 100
X = matrix(rnorm(10 * n), 10, n)
Y = rnorm(10)
COV = matrix(rnorm(40), 10, 4)

## all pairs of the n variables
id = combn(n, 2)
idx1 = id[1, ]
idx2 = id[2, ]
system.time(frlr2(X, idx1, idx2, Y, COV, num_threads = 1))

simpleLoop <- function() {
  res = matrix(nrow = 0, ncol = 4)
  for (i in 1:length(idx1)) {
    mat = cbind(X[, idx1[i]], X[, idx2[i]], COV)
    df = as.data.frame(mat)
    model = lm(Y ~ ., data = df)
    tmp = c(idx1[i], idx2[i],
            summary(model)$coefficients[2, 4],
            summary(model)$coefficients[3, 4])
    res = rbind(res, tmp)
  }
  res
}
system.time(simpleLoop())
```

We can speed this up further by passing `num_threads = -1`, which uses all available threads.
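For example, reusing the objects created above, the multithreaded call looks like this (not evaluated here, since the timing depends on the machine and the number of available cores):

```{r eval=FALSE}
## use all available threads
system.time(frlr2(X, idx1, idx2, Y, COV, num_threads = -1))
```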