This R package fits Repeated Linear Regressions in which the regressions share a common set of terms.
Let’s start with the simplest situation: we want to fit a set of regressions that differ in only one variable. Specifically, denote the response variable by y; the regressions are as follows.
$$ \begin{array}{ll} y&\sim x_1 + cov_1 + cov_2+\ldots+cov_m\\ y&\sim x_2 + cov_1 +cov_2+\ldots+cov_m\\ &\;\;\vdots\\ y&\sim x_n + cov_1 +cov_2+\ldots+cov_m\\ \end{array} $$
where $cov_i,\; i = 1, \ldots, m$ are the variables shared by all of these regressions.
Intuitively, we can finish this task by using a simple loop.
However, a plain loop is inefficient in this situation, because the design matrices of these regressions overlap substantially. Recall that in linear regression the main goal is to estimate the parameter β, and the least-squares estimate is
$$ \hat{\beta} = (X'X)^{-1}X'Y $$
where X is the design matrix and Y is the vector of observed responses.
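As a quick sanity check (plain base R, independent of this package; the variable names are just for illustration), the closed-form estimate above reproduces the coefficients returned by `lm()`:

```r
## Closed-form least squares vs. lm() -- a minimal base-R check
set.seed(1)
X = cbind(1, matrix(rnorm(20), 10, 2))    # design matrix with an intercept column
Y = rnorm(10)
beta_hat = solve(t(X) %*% X, t(X) %*% Y)  # (X'X)^{-1} X'Y via one linear solve
fit = lm(Y ~ X[, -1])                     # lm() adds its own intercept
all.equal(drop(beta_hat), unname(coef(fit)))
#> [1] TRUE
```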
Clearly the design matrices share many identical elements, and the larger m is, the larger the shared part. So we can reduce the computational cost by separating out the shared part of the design matrix.
For the above example, the design matrix can be written as X = (x, cov). If we include an intercept, it can also be treated as a variable shared by all regressions, so it is naturally absorbed into cov. Then we have
$$ X'X= \left[ \begin{array}{cc} x'x & x'cov \\ cov'x & cov'cov \end{array} \right]= \left[ \begin{array}{cc} a& v'\\ v& B \end{array} \right] $$
The Woodbury formula tells us

$$ (A + UCV)^{-1} = A^{-1} - A^{-1}U(C^{-1} + VA^{-1}U)^{-1}VA^{-1} $$
Let
$$ A=\left[ \begin{array}{ll} a&O\\ O&B \end{array}\right],\; U=\left[ \begin{array}{ll} 1 & 0\\ O & v \end{array} \right],\; V= \left[ \begin{array}{ll} 0& v'\\ 1& O \end{array} \right] $$
and $C = I_{2\times 2}$. Then we can apply the Woodbury formula:

$$ (X'X)^{-1} = (A + UCV)^{-1} = A^{-1} - A^{-1}U(C^{-1} + VA^{-1}U)^{-1}VA^{-1} $$
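The Woodbury identity itself is easy to verify numerically in base R (a standalone check with arbitrary small random matrices, not part of the package):

```r
## Numerical check of the Woodbury identity
set.seed(1)
A = diag(3) + crossprod(matrix(rnorm(9), 3, 3))  # a well-conditioned 3x3 matrix
U = matrix(rnorm(6), 3, 2)
C = diag(2)
V = matrix(rnorm(6), 2, 3)
lhs = solve(A + U %*% C %*% V)
rhs = solve(A) - solve(A) %*% U %*%
  solve(solve(C) + V %*% solve(A) %*% U) %*% V %*% solve(A)
all.equal(lhs, rhs)
#> [1] TRUE
```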
where
$$ A^{-1}=\left[ \begin{array}{cc} a^{-1}&O\\ O&B^{-1} \end{array} \right] $$
Carrying out the algebra and simplifying, we obtain the following result
$$ (X'X)^{-1}=\left[ \begin{array}{cc} \frac{1}{a}+\frac{v'B^{-1}v}{a(a-v'B^{-1}v)} & -\frac{v'B^{-1}}{a-v'B^{-1}v}\\ -\frac{B^{-1}v}{a-v'B^{-1}v} & B^{-1}+\frac{B^{-1}vv'B^{-1}}{a-v'B^{-1}v} \end{array} \right] $$
Notice that the matrix B is the same for all regressions; only a and v change from one regression to the next, and they are very cheap to compute. So, in theory, we can reduce the computational cost significantly.
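The shared-part trick can be sketched in a few lines of base R (an illustration of the algebra above, not the actual fRLR internals; the intercept is added to COV explicitly here): precompute $B^{-1} = (cov'cov)^{-1}$ and $cov'Y$ once, then recover the coefficient of each $x_i$ from a and v alone, since the first row of $(X'X)^{-1}$ gives $\hat\beta_1 = (x'Y - v'B^{-1}cov'Y)/(a - v'B^{-1}v)$.

```r
## Reuse B^{-1} across regressions -- a base-R sketch of the shared-part trick
set.seed(123)
X = matrix(rnorm(50), 10, 5)
Y = rnorm(10)
COV = cbind(1, matrix(rnorm(40), 10, 4))  # shared covariates, intercept included
Binv = solve(crossprod(COV))              # (cov'cov)^{-1}, computed once
covY = crossprod(COV, Y)                  # cov'Y, computed once
beta_fast = sapply(seq_len(ncol(X)), function(i) {
  x = X[, i]
  a = sum(x * x)                          # x'x
  v = crossprod(COV, x)                   # cov'x
  s = a - drop(crossprod(v, Binv %*% v))  # Schur complement a - v'B^{-1}v
  (sum(x * Y) - drop(crossprod(v, Binv %*% covY))) / s  # coefficient of x
})
## Reference: fit each regression from scratch with lm()
beta_lm = sapply(seq_len(ncol(X)), function(i)
  coef(lm(Y ~ X[, i] + COV - 1))[1])
all.equal(unname(beta_fast), unname(beta_lm))
#> [1] TRUE
```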
Now let’s test two simulated examples using the functions in this package.
```r
## use fRLR package
library(fRLR)
set.seed(123)
X = matrix(rnorm(50), 10, 5)
Y = rnorm(10)
COV = matrix(rnorm(40), 10, 4)
frlr1(X, Y, COV, num_threads = 1)
#>   r r.p.value
#> 1 0 0.4380128
#> 2 1 0.7791076
#> 3 2 0.2212869
#> 4 3 0.9495018
#> 5 4 0.6729983
```
```r
## use simple loop
res = matrix(nrow = 0, ncol = 2)
for (i in 1:ncol(X))
{
  mat = cbind(X[, i], COV)
  df = as.data.frame(mat)
  model = lm(Y ~ ., data = df)
  tmp = c(i, summary(model)$coefficients[2, 4])
  res = rbind(res, tmp)
}
res
#>     [,1]      [,2]
#> tmp    1 0.4380128
#> tmp    2 0.7791076
#> tmp    3 0.2212869
#> tmp    4 0.9495018
#> tmp    5 0.6729983
```
As the above output shows, the p-values for the varying variable in each regression agree between the two methods.
Similarly, we can test another example:
```r
set.seed(123)
X = matrix(rnorm(50), 10, 5)
Y = rnorm(10)
COV = matrix(rnorm(40), 10, 4)
idx1 = c(1, 2, 3, 4, 1, 1, 1, 2, 2, 3)
idx2 = c(2, 3, 4, 5, 3, 4, 5, 4, 5, 5)
frlr2(X, idx1, idx2, Y, COV, num_threads = 1)
#>    r1 r2 r1.p.value  r2.p.value
#> 1   1  2 0.53021406 0.895719578
#> 2   2  3 0.01812006 0.009833047
#> 3   3  4 0.29895922 0.963995969
#> 4   4  5 0.91749181 0.712075464
#> 5   1  3 0.33761507 0.210331456
#> 6   1  4 0.51074586 0.966484642
#> 7   1  5 0.12479380 0.152802911
#> 8   2  4 0.79302893 0.902402294
#> 9   2  5 0.73153760 0.663392258
#> 10  3  5 0.32367303 0.877154122
```
```r
res = matrix(nrow = 0, ncol = 4)
for (i in 1:length(idx1))
{
  mat = cbind(X[, idx1[i]], X[, idx2[i]], COV)
  df = as.data.frame(mat)
  model = lm(Y ~ ., data = df)
  tmp = c(idx1[i], idx2[i], summary(model)$coefficients[2, 4], summary(model)$coefficients[3, 4])
  res = rbind(res, tmp)
}
res
```
Again, the two methods give the same results.
The main aim of this new method is to reduce the computational cost, so let’s compare its speed with the simple-loop approach.
We obtain the following timings for 99 × 100/2 = 4950 linear regressions.
```r
set.seed(123)
n = 100
X = matrix(rnorm(10*n), 10, n)
Y = rnorm(10)
COV = matrix(rnorm(40), 10, 4)
id = combn(n, 2)
idx1 = id[1, ]
idx2 = id[2, ]
system.time(frlr2(X, idx1, idx2, Y, COV, num_threads = 1))
#>    user  system elapsed
#>   0.017   0.000   0.016
```
```r
simpleLoop <- function()
{
  res = matrix(nrow = 0, ncol = 4)
  for (i in 1:length(idx1))
  {
    mat = cbind(X[, idx1[i]], X[, idx2[i]], COV)
    df = as.data.frame(mat)
    model = lm(Y ~ ., data = df)
    tmp = c(idx1[i], idx2[i], summary(model)$coefficients[2, 4], summary(model)$coefficients[3, 4])
    res = rbind(res, tmp)
  }
}
system.time(simpleLoop())
#>    user  system elapsed
#>   4.794   0.002   4.797
```
We can speed things up even further by passing `num_threads = -1`, which uses all available threads.