--- title: "fRLR: Fit Repeated Linear Regressions" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{fRLR: Fit Repeated Linear Regressions} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ```{r setup} library(fRLR) ``` ## Introduction This R package aims to fit **Repeated Linear Regressions** in which there are some same terms. ## An Example Let's start with the simplest situation, we want to fit a set of regressions which only differ in one variable. Specifically, denote the response variable as $y$, and these regressions are as follows. $$ \begin{array}{ll} y&\sim x_1 + cov_1 + cov_2+\ldots+cov_m\\ y&\sim x_2 + cov_1 +cov_2+\ldots+cov_m\\ \cdot &\sim \cdots\\ y&\sim x_n + cov_1 +cov_2+\ldots+cov_m\\ \end{array} $$ where $cov_i, i=1,\ldots, m$ are the same variables among these regressions. ## Ideas Intuitively, we can finish this task by using a simple loop. ```{r eval=FALSE} model = vector(mode='list', length=n) for (i in 1:n) { ... model[[i]] = lm(y~x) ... } ``` However, it is not efficient in that situation. As we all know, in the linear regression, the main goal is to estimate the parameter $\beta$. And we have $$ \hat\beta = (X'X)^{-1}X'Y $$ where $X$ is the design matrix and $Y$ is the observation of response variable. It is obvious that there are some same elements in the design matrix, and the larger $m$ is, the more elements are the same. So I want to reduce the cost of computation by separating the same part in the design matrix. ## Method For the above example, the design matrix can be denoted as $X=(x, cov)$. If we consider intercept, it also can be seen as the same variable among these regression, so it can be included in $cov$ naturally. Then we have $$ (X'X)^{-1}= \left[ \begin{array}{cc} x'x & x'cov \\ cov'x & cov'cov \end{array} \right]= \left[ \begin{array}{ll} a& v'\\ v& B \end{array} \right] $$ **Woodbury formula** tells us $$ (A+UCV)^{-1}=A^{-1}-A^{-1}U(C^{-1}+VA^{-1}U)^{-1}VA^{-1} $$ Let $$ A=\left[ \begin{array}{ll} a&O\\ O&B \end{array}\right],\; U=\left[ \begin{array}{ll} 1 & 0\\ O & v \end{array} \right],\; V= \left[ \begin{array}{ll} 0& v'\\ 1& O \end{array} \right] $$ and $C=I_{2\times 2}$. Then we can apply woodbury formula, $$ (X'X)^{-1}=(A+UCV)^{-1}=A^{-1}-A^{-1}U(C^{-1}+VA^{-1}U)^{-1}VA^{-1} $$ where $$ A^{-1}=\left[ \begin{array}{cc} a^{-1}&O\\ O&B^{-1} \end{array} \right] $$ We can do further calculations to simplify and obtain the following result $$ (X'X)^{-1}=\left[ \begin{array}{cc} 1/a+\frac{a}{a-v'B^{-1}v}v'B^{-1}v & -\frac{v'B^{-1}}{a-v'B^{-1}v}\\ -\frac{B^{-1}v}{a-v'B^{-1}v} & B^{-1}+\frac{-B^{-1}vv'B^{-1}}{a-v'B^{-1}v} \end{array} \right] $$ Notice that matrix $B$ is the same for all regression, the identical terms for each regression are just $a$ and $v$, which are very easy to calculate. So theoretically, we can reduce the cost of computation significantly. ## Test Now test two simulation examples by using the functions in this package. 
```{r}
## use fRLR package
set.seed(123)
X = matrix(rnorm(50), 10, 5)
Y = rnorm(10)
COV = matrix(rnorm(40), 10, 4)
frlr1(X, Y, COV, num_threads = 1)

## use a simple loop
res = matrix(nrow = 0, ncol = 2)
for (i in 1:ncol(X)) {
  mat = cbind(X[, i], COV)
  df = as.data.frame(mat)
  model = lm(Y ~ ., data = df)
  tmp = c(i, summary(model)$coefficients[2, 4])
  res = rbind(res, tmp)
}
res
```

As the output above shows, the p-value for the regression-specific variable in each regression is equal between the two methods.

Similarly, we can test another example, in which each regression contains a pair of regression-specific variables.

```{r}
set.seed(123)
X = matrix(rnorm(50), 10, 5)
Y = rnorm(10)
COV = matrix(rnorm(40), 10, 4)
idx1 = c(1, 2, 3, 4, 1, 1, 1, 2, 2, 3)
idx2 = c(2, 3, 4, 5, 3, 4, 5, 4, 5, 5)
frlr2(X, idx1, idx2, Y, COV, num_threads = 1)

## use a simple loop
res = matrix(nrow = 0, ncol = 4)
for (i in 1:length(idx1)) {
  mat = cbind(X[, idx1[i]], X[, idx2[i]], COV)
  df = as.data.frame(mat)
  model = lm(Y ~ ., data = df)
  tmp = c(idx1[i], idx2[i],
          summary(model)$coefficients[2, 4],
          summary(model)$coefficients[3, 4])
  res = rbind(res, tmp)
}
res
```

Again, the two methods give the same results.

## Computation Performance

The main aim of this method is to reduce the computation cost, so let's compare its speed with the simple-loop approach. The following timings cover all $\binom{100}{2} = 99 \times 100/2 = 4950$ pairwise regressions.

```{r}
set.seed(123)
n = 100
X = matrix(rnorm(10 * n), 10, n)
Y = rnorm(10)
COV = matrix(rnorm(40), 10, 4)

## all pairs of the n variables
id = combn(n, 2)
idx1 = id[1, ]
idx2 = id[2, ]
system.time(frlr2(X, idx1, idx2, Y, COV, num_threads = 1))

simpleLoop <- function() {
  res = matrix(nrow = 0, ncol = 4)
  for (i in 1:length(idx1)) {
    mat = cbind(X[, idx1[i]], X[, idx2[i]], COV)
    df = as.data.frame(mat)
    model = lm(Y ~ ., data = df)
    tmp = c(idx1[i], idx2[i],
            summary(model)$coefficients[2, 4],
            summary(model)$coefficients[3, 4])
    res = rbind(res, tmp)
  }
  res
}
system.time(simpleLoop())
```

We can speed this up further by passing `num_threads = -1`, which uses all available threads.
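For example, reusing the objects created above, the multithreaded call looks like this (not evaluated here, since the timing depends on the machine and the number of available cores):

```{r eval=FALSE}
## use all available threads
system.time(frlr2(X, idx1, idx2, Y, COV, num_threads = -1))
```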