rags2ridges is an R-package for fast and proper L2-penalized estimation of precision (and covariance) matrices also called ridge estimation. Its L2-penalty features the ability to shrink towards a target matrix, allowing for incorporation of prior knowledge. Likewise, it also features a fused L2 ridge penalty allows for simultaneous estimation of multiple matrices. The package also contains additional functions for post-processing the L2-penalized estimates — useful for feature selection and when doing graphical modelling. The fused ridge estimation is useful when dealing with grouped data as when doing meta or integrative analysis.
This vignette provides a light introduction on how to get started with regular ridge estimation of precision matrices and further steps.
The README details how to install the rags2ridges package. When installed, the package is loaded as seen below where we also define a function for adding pretty names to a matrix.
The sample variance-covariance matrix, or simply covariance matrix, is well-known and ubiquitous. It is given by
$$ S = \frac{1}{n - 1}XX^T $$
where X is the n × p data matrix that is zero-centered with each p-dimensional observations in the rows. I.e. each row of X is an observation and each column is feature. Often high-dimensional data is organised this way (or transposed).
That X is zero-centered simply means that the column means has been subtracted the columns. The very similar estimate $S = \frac{1}{n}XX^T$ without Bessel’s correction is the maximum likelihood estimate in a multivariate normal model with mean 0 and covariance Σ. The likelihood function in this case is given by
ℓ(Ω; S) = ln |Ω| − tr(SΩ)
where Ω = Σ−1 is the so-called precision matrix (also sometimes called the concentration matrix). It is precisely this Ω for which we seek an estimate we will denote P. Indeed, one can naturally try to use the inverse of S for this:
P = S−1
Let’s try.
The createS()
function can easily simulate covariance
matrices. But we go a more verbose route for illustration:
p <- 6
n <- 20
X <- createS(n = n, p = p, dataset = TRUE)
head(X, n = 4) # Show 4 first of the n rows
## A B C D E F
## [1,] 0.542 0.2898 0.684 -0.6500 -1.31 -0.2182
## [2,] 1.827 0.5769 -0.373 -0.2907 -1.70 2.0908
## [3,] 0.292 -1.0528 0.142 0.5760 1.29 0.0588
## [4,] -0.137 0.0583 -1.351 0.0101 1.74 -0.3099
Here the columns corresponds to features A, B, C, and so on.
When can then arrive a the MLE using covML()
which
centers X (subtracting the column means) and then computes the
estimate:
## A B C D E F
## A 0.4060 0.08218 -0.1071 0.0508 -0.2825 0.14448
## B 0.0822 0.72761 -0.2293 -0.2630 -0.1714 -0.00281
## C -0.1071 -0.22929 1.2783 0.3648 -0.0942 0.10021
## D 0.0508 -0.26304 0.3648 0.8303 0.2383 0.26542
## E -0.2825 -0.17138 -0.0942 0.2383 1.0140 -0.27921
## F 0.1445 -0.00281 0.1002 0.2654 -0.2792 1.22654
Using cov2cor()
the well-known correlation matrix could
be obtained.
By default, createS()
simulates zero-mean i.i.d. normal
variables (corresponding to Σ = Ω = I being
the identity matrix), but it has plenty of possibilities for more
intricate covariance structures. The S
matrix could have
been obtained directly had we omitted the dataset
argument,
leaving it to be the default FALSE
. The
rmvnormal()
function is utilized by createS()
to generate the normal sample.
We can obtain the precision estimate P
using
solve()
to invert S
:
## A B C D E F
## A 3.67523 -0.2638 0.6233 -0.947 1.2621 0.00785
## B -0.26384 1.6404 0.1523 0.459 0.0946 -0.05551
## C 0.62328 0.1523 1.0661 -0.617 0.4654 0.07935
## D -0.94734 0.4595 -0.6173 2.083 -0.8668 -0.48502
## E 1.26209 0.0946 0.4654 -0.867 1.7081 0.38993
## F 0.00785 -0.0555 0.0794 -0.485 0.3899 1.00148
That’s it! Everything goes well here only because n < p. However, when p is close to n, the estimate become unstable and varies wildly and when p exceeds n one can no longer invert S and this strategy fails:
## Error in solve.default(S2): system is computationally singular: reciprocal condition number = 1.38653e-18
Note that this is now a 25 × 25 precision matrix we are trying to estimate. Datasets where p > n are starting to be common, so what now?
To solve the problem, rags2ridges adds a so-called ridge penalty to the likelihood above — this method is also called L2 shrinkage and works by “shrinking” the eigenvalues of S in a particular manner to combat that they “explode” when p ≥ n.
The core problem that rags2ridges solves is that
$$ \ell(\Omega; S) = \ln|\Omega| - \text{tr}(S\Omega) - \frac{\lambda}{2}|| \Omega - T||^2_2 $$ where λ > 0 is the ridge penalty parameter, T is a p × p known target matrix (which we will get back to) and || ⋅ ||2 is the L2-norm. The maximizing solution here is surprisingly on closed form, but it is rather complicated1. Assume for now the target matrix is an all zero matrix and thus out of the equation.
The core function of rags2ridges is
ridgeP
which computes this estimate in a fast manner.
## A B C D E F G
## A 3.08473 -0.1530 0.0885 0.0821 0.12822 0.00221 0.02508
## B -0.15299 2.6738 0.0848 0.0218 0.03395 0.39181 0.23523
## C 0.08855 0.0848 3.2549 -0.2992 -0.03865 0.06315 0.09717
## D 0.08206 0.0218 -0.2992 2.8650 -0.16244 0.13629 0.03174
## E 0.12822 0.0340 -0.0387 -0.1624 2.73440 -0.05099 0.00638
## F 0.00221 0.3918 0.0631 0.1363 -0.05099 2.58199 0.02748
## G 0.02508 0.2352 0.0972 0.0317 0.00638 0.02748 3.05446
And voilà, we have our estimate. We will now discuss the penalty parameters and target matrix and how to choose them.
The penalty parameter λ
(lambda
) shrinks the values of P such toward 0 (when T = 0) — i.e. very larges values of
λ makes P “small” and more stable whereas
smaller values of λ makes the
P tend toward the (possibly
non-existent) S−1.
So what lambda
should you choose? One strategy for choosing
λ is selecting it to be stable
yet precise (a bias-variance trade-off). Automatic k-fold
cross-validation can be done with optPenalty.kCVauto()
is
well suited for this:
Y <- createS(n, p, dataset = TRUE)
opt <- optPenalty.kCVauto(Y, lambdaMin = 0.001, lambdaMax = 100)
str(opt)
## List of 2
## $ optLambda: num 0.33
## $ optPrec : 'ridgeP' num [1:25, 1:25] 3.4337 0.4002 0.193 -0.1459 -0.0785 ...
## ..- attr(*, "lambda")= num 0.33
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:25] "A" "B" "C" "D" ...
## .. ..$ : chr [1:25] "A" "B" "C" "D" ...
As seen, the function returns a list with the optimal penalty parameter and corresponding ridge precision estimate. By default, the the functions performs leave-one-out cross validation. See ?optPenalty.kCVauto` for more information.
The target matrix T is a matrix the same size as P which the estimate is “shrunken” toward — i.e. for large values of λ the estimate goes toward T. The choice of the target is another subject. While one might first think that the all-zeros T = [0] would be a default it is intuitively not a good target. This is because we’d like an estimate that is positive definite (the matrix-equivalent to at positive number) and the null-matrix is not positive definite.
If one has a very good prior estimate or some other information this
might used to construct the target. E.g. the function
kegg.target()
utilizes the Kyoto Encyclopedia of Genes
and Genomes (KEGG) database of gene and gene-networks together with
pilot data to construct a target.
In the absence of such knowledge, the default could be a data-driven
diagonal matrix. The function default.target()
offers some
different approaches to selecting this. A good choice here is often the
diagonal matrix times the reciprocal mean of the eigenvalues of the
sample covariance as entries. See ?default.target
for more
choices.
What is so interesting with the precision matrix anyway? I’m always interested in correlations and thus the correlation matrix.
As you may know, correlation does not imply causation. Nor does covariance imply causation. However, precision matrix provides stronger hints at causation. A relatively simple transformation of P maps it to partial correlations—much like how the sample covariance S easily maps to the correlation matrix. More precisely, the ijth partial correlation is given by
$$ \rho_{ij|\text{all others}} = \frac{- p_{ij}}{\sqrt{p_{ii}p_{jj}}} $$
where pij is the ijth entry of P.
Partial correlations measure the linear association between two random variables whilst removing the effect of other random variables; in this case, it is all the remaining variables. This somewhat absolves the issue in “regular” correlations where are often correlated but only indirectly; either by sort of ‘transitivity of correlations’ (which does not hold generally and is not2 so3 simple4) or by common underlying variables.
OK, but what is graphical about the graphical ridge estimate?
In a multivariate normal model, pij = pji = 0 if and only if Xi and Xj are conditionally independent when condition on all other variables. I.e. Xi and Xj are conditionally independent given all Xk where k ≠ i and k ≠ j if and when the ijth and jith elements of P are zero. In real world applications, this means that P is often relatively sparse (lots of zeros). This also points to the close relationship between P and the partial correlations.
The non-zero entries of the a symmetric PD matrix can them be interpreted the edges of a graph where nodes correspond to the variables.
Graphical ridge estimation? Why not graphical Lasso?
The graphical lasso (gLasso) is the L1-equivalent to graphical ridge. A nice feature of the L1 penalty automatically induces sparsity and thus also select the edges in the underlying graph. The L2 penalty of rags2ridges relies on an extra step that selects the edges after P is estimated. While some may argue this as a drawback (typically due to a lack of perceived simplicity), it is often beneficial to separate the “variable selection” and estimation.
First, a separate post-hoc selection step allows for greater flexibility.
Secondly, when co-linearity is present the L1 penalty is “unstable” in the selection between the items. I.e. if 2 covariances are co-linear only one of them will typically be selected in a unpredictable way whereas the L2 will put equal weight on both and “average” their effect. Ultimately, this means that the L2 estimate is typically more stable than the L1.
At last point to mention here is also that the true underlying graph might not always be very sparse (or sparse at all).
How do I select the edges then?
The sparsify()
functions lets you select the non-zero
entries of P corresponding to edges. It supports a handful different
approaches ranging from simple thresholding to false discovery rate
based selection.
After edge select GGMnetworkStats()
can be utilized to
get summary statistics of the resulting graph topology.
The fullMontyS()
function is a convenience wrapper
getting from the data through the penalized estimate to the
corresponding conditional independence graph and topology summaries.
For a full introduction to the theoretical properties as well as more context to the problem, see van Wieringen & Peeters (2016).
rags2ridges also comes with functionality for
targeted and grouped (or, fused) graphical
ridge regression called the fused graphical ridge. See [2] below. The functions in
this rags2ridges
module are generally post-fixed with
.fused
.
1.. van Wieringen, W.N. and Peeters, C.F.W. (2016). Ridge Estimation of Inverse Covariance Matrices from High-Dimensional Data. Computational Statistics & Data Analysis, vol. 103: 284-303.
2. Bilgrau, A.E., Peeters, C.F.W., Eriksen, P.S., Boegsted, M., and van Wieringen, W.N. (2015). Targeted Fused Ridge Estimation of Inverse Covariance Matrices from Multiple High-Dimensional Data Classes. arXiv:1509.07982 [stat.ME].
Solution for the graphical ridge problem: $$ P(\lambda) = \Bigg\{ \bigg[ \lambda I_{p\times p} + \frac{1}{4}(S - \lambda T)^{2} \bigg]^{1/2} + \frac{1}{2}(S -\lambda T) \Bigg\}^{-1} $$↩︎
https://stats.stackexchange.com/questions/181376/is-correlation-transitive↩︎
https://emilkirkegaard.dk/en/2016/02/causality-transitivity-and-correlation/↩︎
https://terrytao.wordpress.com/2014/06/05/when-is-correlation-transitive/↩︎