Title: | Eigenvectors from Eigenvalues Sparse Principal Component Analysis (EESPCA) |
---|---|
Description: | Contains logic for computing sparse principal components via the EESPCA method, which is based on an approximation of the eigenvector/eigenvalue identity. Includes logic to support execution of the TPower and rifle sparse PCA methods, as well as logic to estimate the sparsity parameters used by EESPCA, TPower and rifle via cross-validation to minimize the out-of-sample reconstruction error. H. Robert Frost (2021) <doi:10.1080/10618600.2021.1987254>. |
Authors: | H. Robert Frost |
Maintainer: | H. Robert Frost <[email protected]> |
License: | GPL (>= 2) |
Version: | 0.7.0 |
Built: | 2024-12-11 06:54:21 UTC |
Source: | CRAN |
Implementation of Eigenvectors from Eigenvalues Sparse Principal Component Analysis (EESPCA).
Package: | EESPCA |
Type: | Package |
Version: | 0.7.0 |
Date: | 2021 |
License: | GPL-2 |
This work was supported by the National Institutes of Health grants K01LM012426, R21CA253408, P20GM130454,and P30CA023108.
H. Robert Frost
Frost, H. R. (2022). Eigenvectors from Eigenvalues Sparse Principal Component Analysis (EESPCA). Journal of Computational and Graphical Statistics.
Approximates the normed squared eigenvector loadings using a simplified version of the formula associating normed squared eigenvector loadings with the eigenvalues of the full matrix and sub-matrices.
computeApproxNormSquaredEigenvector(cov.X, v1, lambda1, max.iter=5, lambda.diff.threshold=1e-6, trace=FALSE)
computeApproxNormSquaredEigenvector(cov.X, v1, lambda1, max.iter=5, lambda.diff.threshold=1e-6, trace=FALSE)
cov.X |
Covariance matrix. |
v1 |
Principal eigenvector of |
lambda1 |
Largest eigenvalue of |
max.iter |
Maximum number of iterations for power iteration method when computing sub-matrix eigenvalues.
See description |
lambda.diff.threshold |
Threshold for exiting the power iteration calculation.
See description |
trace |
True if debugging messages should be displayed during execution. |
Vector of approximate normed squared eigenvector loadings.
set.seed(1) # Simulate 10x5 MVN data matrix X=matrix(rnorm(50), nrow=10) # Estimate covariance matrix cov.X = cov(X) # Compute eigenvectors/values eigen.out = eigen(cov.X) v1 = eigen.out$vectors[,1] lambda1 = eigen.out$values[1] # Print true squared loadings v1^2 # Compute approximate normed squared eigenvector loadings computeApproxNormSquaredEigenvector(cov.X=cov.X, v1=v1, lambda1=lambda1)
set.seed(1) # Simulate 10x5 MVN data matrix X=matrix(rnorm(50), nrow=10) # Estimate covariance matrix cov.X = cov(X) # Compute eigenvectors/values eigen.out = eigen(cov.X) v1 = eigen.out$vectors[,1] lambda1 = eigen.out$values[1] # Print true squared loadings v1^2 # Compute approximate normed squared eigenvector loadings computeApproxNormSquaredEigenvector(cov.X=cov.X, v1=v1, lambda1=lambda1)
Utility function for computing the residual matrix formed by
subtracting from X
a reduced rank approximation of matrix X
generated from
the top k principal components contained in matrix V
.
computeResidualMatrix(X,V,center=TRUE)
computeResidualMatrix(X,V,center=TRUE)
X |
An n-by-p data matrix whose top k principal components are contained in the p-by-k matrix |
V |
A p-by-k matrix containing the loadings for the top k principal components of |
center |
If true (the default), |
Residual matrix.
set.seed(1) # Simulate 10x5 MVN data matrix X=matrix(rnorm(50), nrow=10) # Perform PCA prcomp.out = prcomp(X) # Get rank 2 residual matrix computeResidualMatrix(X=X, V=prcomp.out$rotation[,1:2])
set.seed(1) # Simulate 10x5 MVN data matrix X=matrix(rnorm(50), nrow=10) # Perform PCA prcomp.out = prcomp(X) # Get rank 2 residual matrix computeResidualMatrix(X=X, V=prcomp.out$rotation[,1:2])
Computes the first sparse principal component of the specified data matrix using the Eigenvectors from Eigenvalues Sparse Principal Component Analysis (EESPCA) method.
eespca(X, max.iter=20, sparse.threshold, lambda.diff.threshold=1e-6, compute.sparse.lambda=FALSE, sub.mat.max.iter=5, trace=FALSE)
eespca(X, max.iter=20, sparse.threshold, lambda.diff.threshold=1e-6, compute.sparse.lambda=FALSE, sub.mat.max.iter=5, trace=FALSE)
X |
An n-by-p data matrix for which the first sparse PC will be computed. |
max.iter |
Maximum number of iterations for power iteration method. See |
sparse.threshold |
Threshold on loadings used to induce sparsity.
Loadings below this value are set to 0. If not specified, defaults to |
lambda.diff.threshold |
Threshold for exiting the power iteration calculation.
If the absolute relative difference in lambda is less than this threshold between subsequent iterations,
the power iteration method is terminated. See |
compute.sparse.lambda |
If true, the sparse loadings will be used to compute the sparse eigenvalue. |
sub.mat.max.iter |
Maximum iterations for computation of sub-matrix eigenvalues using the power iteration method. To maximize performance, set to 1. Uses the same lambda.diff.threshold. |
trace |
True if debugging messages should be displayed during execution. |
A list
with the following elements:
"v1": The first non-sparse PC as calculated via power iteration.
"lambda1": The variance of the first non-sparse PC as calculated via power iteration.
"v1.sparse": First sparse PC.
"lambda1.sparse": Variance of the first sparse PC. NA if compute.sparse.lambda is FALSE.
"ratio": Vector of ratios of the sparse to non-sparse PC loadings.
Frost, H. R. (2021). Eigenvectors from Eigenvalues Sparse Principal Component Analysis (EESPCA). arXiv e-prints. https://arxiv.org/abs/2006.01924
eespcaForK
,computeApproxNormSquaredEigenvector
, powerIteration
set.seed(1) # Simulate 10x5 MVN data matrix X=matrix(rnorm(50), nrow=10) # Compute first sparse PC loadings using default threshold eespca(X=X)
set.seed(1) # Simulate 10x5 MVN data matrix X=matrix(rnorm(50), nrow=10) # Compute first sparse PC loadings using default threshold eespca(X=X)
Performs cross-validation of EESPCA to determine the optimal
sparsity threshold. Selection is based on the minimization of reconstruction error.
Based on the cross-validation approach of Witten et al. as implemented by the SPC.cv
method in the PMA package.
eespcaCV(X, max.iter=20, sparse.threshold.values, nfolds=5, lambda.diff.threshold=1e-6, compute.sparse.lambda=FALSE, sub.mat.max.iter=5, trace=FALSE)
eespcaCV(X, max.iter=20, sparse.threshold.values, nfolds=5, lambda.diff.threshold=1e-6, compute.sparse.lambda=FALSE, sub.mat.max.iter=5, trace=FALSE)
X |
See description for |
max.iter |
See description for |
sparse.threshold.values |
Vector of threshold values to evaluate via cross-validation.
See description for |
nfolds |
Number of cross-validation folds. |
lambda.diff.threshold |
See description for |
compute.sparse.lambda |
See description for |
sub.mat.max.iter |
See description for |
trace |
See description for |
A list
with the following elements:
"cv": The mean of the out-of-sample reconstruction error computed for each threshold.
"cv.error": The standard deviations of the means of the out-of-sample reconstruction error computed for each threshold.
"best.sparsity": Threshold value with the lowest mean reconstruction error.
"best.sparsity.1se": Threshold value whose mean reconstruction error is within 1 standard error of the lowest.
"nonzerovs": Mean number of nonzero values for each threshold.
"sparse.threshold.values": Tested threshold values.
"nfolds": Number of cross-validation folds.
Frost, H. R. (2021). Eigenvectors from Eigenvalues Sparse Principal Component Analysis (EESPCA). arXiv e-prints. https://arxiv.org/abs/2006.01924
Witten, D. M., Tibshirani, R., and Hastie, T. (2009). A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics, 10(3), 515-534.
set.seed(1) # Simulate 10x5 MVN data matrix X=matrix(rnorm(50), nrow=10) # Generate range of threshold values to evaluate default.threshold = 1/sqrt(5) threshold.values = seq(from=.5*default.threshold, to=1.5*default.threshold, length.out=10) # Use 5-fold cross-validation to estimate optimal sparsity threshold eespcaCV(X=X, sparse.threshold.values=threshold.values)
set.seed(1) # Simulate 10x5 MVN data matrix X=matrix(rnorm(50), nrow=10) # Generate range of threshold values to evaluate default.threshold = 1/sqrt(5) threshold.values = seq(from=.5*default.threshold, to=1.5*default.threshold, length.out=10) # Use 5-fold cross-validation to estimate optimal sparsity threshold eespcaCV(X=X, sparse.threshold.values=threshold.values)
Computes multiple sparse principal components of the specified data matrix via sequential application of
the Eigenvectors from Eigenvalues Sparse Principal Component Analysis (EESPCA) algorithm.
After computing the first sparse PC via the eespca
function,
subsequent sparse PCs are computing by repeatedly applying eespca
to the residual matrix formed
by subtracting the reconstruction of X
from the original X
.
Multiple sparse PCs are not guaranteed to be orthogonal.
Note that the accuracy of the sparse approximation declines substantially for PCs with very small
variances. To avoid this issue, k
should not be set higher than the number of statistically
significant PCs according to a Tracey-Widom test.
eespcaForK(X, k=2, max.iter=20, sparse.threshold, lambda.diff.threshold=1e-6, compute.sparse.lambda=FALSE, sub.mat.max.iter=5, trace=FALSE)
eespcaForK(X, k=2, max.iter=20, sparse.threshold, lambda.diff.threshold=1e-6, compute.sparse.lambda=FALSE, sub.mat.max.iter=5, trace=FALSE)
X |
An n-by-p data matrix for which the first |
k |
The number of sparse PCs to compute. The specified k must be 2 or greater (for k=1, use
the |
max.iter |
See description for |
sparse.threshold |
See description for |
lambda.diff.threshold |
See description for |
compute.sparse.lambda |
See description for |
sub.mat.max.iter |
See description for |
trace |
See description for |
A list
with the following elements:
"V": Matrix of sparse loadings for the first k PCs.
"lambdas": Vector of variances of the first k sparse PCs.
Frost, H. R. (2021). Eigenvectors from Eigenvalues Sparse Principal Component Analysis (EESPCA). arXiv e-prints. https://arxiv.org/abs/2006.01924
set.seed(1) # Simulate 10x5 MVN data matrix X=matrix(rnorm(50), nrow=10) # Get first two sparse PCs eespcaForK(X=X, sparse.threshold=1/sqrt(5), k=2)
set.seed(1) # Simulate 10x5 MVN data matrix X=matrix(rnorm(50), nrow=10) # Get first two sparse PCs eespcaForK(X=X, sparse.threshold=1/sqrt(5), k=2)
Computes the principal eigenvector and eigenvalue of the specified matrix using the power iteration method. Includes support for truncating the estimated eigenvector on each iteration to retain just the k eigenvector loadings with the largest absolute values with all other values set to 0, i.e., the the TPower method by Yuan & Zhang.
powerIteration(X, k, v1.init, max.iter=10, lambda.diff.threshold=1e-6, trace=FALSE)
powerIteration(X, k, v1.init, max.iter=10, lambda.diff.threshold=1e-6, trace=FALSE)
X |
Matrix for which the largest eigenvector and eigenvalue will be computed. |
k |
If specified, the estimated eigenvector is truncated on each iteration to retain only the k loadings with the largest absolute values, all other loadings are set to 0. Must be an integer between 1 and ncol(X). |
v1.init |
If specified, the power iteration calculation will be initialized using this vector, otherwise, the calculation will be initialized using a unit vector with equal values. |
max.iter |
Maximum number of iterations for power iteration method. |
lambda.diff.threshold |
Threshold for exiting the power iteration calculation. If the absolute relative difference in computed eigenvalue is less than this threshold between subsequent iterations, the power iteration method is terminated. |
trace |
True if debugging messages should be displayed during execution. |
A list
with the following elements:
"v1": The principal eigenvector of X
.
"lambda": The largest eigenvalue of X
.
"num.iter": Number of iterations of the power iteration method before termination.
Yuan, X.-T. and Zhang, T. (2013). Truncated power method for sparse eigenvalue problems. J. Mach. Learn. Res., 14(1), 899-925.
set.seed(1) # Simulate 10x5 MVN data matrix X=matrix(rnorm(50), nrow=10) # Compute sample covariance matrix cov.X = cov(X) # Use power iteration to get first PC loadings using default initial vector powerIteration(X=cov.X)
set.seed(1) # Simulate 10x5 MVN data matrix X=matrix(rnorm(50), nrow=10) # Compute sample covariance matrix cov.X = cov(X) # Use power iteration to get first PC loadings using default initial vector powerIteration(X=cov.X)
Utility function for computing the reduced rank reconstruction of X
using the PC
loadings in V
.
reconstruct(X,V,center=TRUE)
reconstruct(X,V,center=TRUE)
X |
An n-by-p data matrix whose top k principal components are contained the p-by-k matrix |
V |
A p-by-k matrix containing the loadings for the top k principal components of |
center |
If true (the default), |
Reduced rank reconstruction of X.
set.seed(1) # Simulate 10x5 MVN data matrix X=matrix(rnorm(50), nrow=10) # Perform PCA prcomp.out = prcomp(X) # Get rank 2 reconstruction reconstruct(X, prcomp.out$rotation[,1:2])
set.seed(1) # Simulate 10x5 MVN data matrix X=matrix(rnorm(50), nrow=10) # Perform PCA prcomp.out = prcomp(X) # Get rank 2 reconstruction reconstruct(X, prcomp.out$rotation[,1:2])
Utility function for computing the squared Frobenius norm of the residual matrix formed by
subtracting from X
a reduced rank approximation of matrix X
generated from
the top k principal components contained in matrix V
.
reconstructionError(X,V,center=TRUE)
reconstructionError(X,V,center=TRUE)
X |
An n-by-p data matrix whose top k principal components are contained the p-by-k matrix |
V |
A p-by-k matrix containing the loadings for the top k principal components of |
center |
If true (the default), |
The squared Frobenius norm of the residual matrix.
set.seed(1) # Simulate 10x5 MVN data matrix X=matrix(rnorm(50), nrow=10) # Perform PCA prcomp.out = prcomp(X) # Get rank 2 reconstruction error, which will be the minimum since the first 2 PCs are used reconstructionError(X, prcomp.out$rotation[,1:2]) # Use all PCs to get approximately 0 reconstruction error reconstructionError(X, prcomp.out$rotation)
set.seed(1) # Simulate 10x5 MVN data matrix X=matrix(rnorm(50), nrow=10) # Perform PCA prcomp.out = prcomp(X) # Get rank 2 reconstruction error, which will be the minimum since the first 2 PCs are used reconstructionError(X, prcomp.out$rotation[,1:2]) # Use all PCs to get approximately 0 reconstruction error reconstructionError(X, prcomp.out$rotation)
Computes the initial eigenvector for the rifle method of Tan et al. (as implemented by the
rifle
method in the rifle R package) using the initial.convex
method
from the rifle package with lambda=sqrt(log(p)/n) and K=1.
rifleInit(X)
rifleInit(X)
X |
n-by-p data matrix to be evaluated via PCA. |
Initial eigenvector to use with rifle method.
Tan, K. M., Wang, Z., Liu, H., and Zhang, T. (2018). Sparse generalized eigenvalue problem: optimal statistical rates via truncated rayleigh flow. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 80(5), 1057-1086.
riflePCACV
, rifle{rifle}
, rifle{initial.convex}
set.seed(1) # Simulate 10x5 MVN data matrix X=matrix(rnorm(50), nrow=10) # Compute initial eigenvector to use with rifle method v1.init = rifleInit(X) # Use with rifle method to get first PC loadings with 2 non-zero elements rifle(A=cov(X), B=diag(5), init=v1.init, k=2)
set.seed(1) # Simulate 10x5 MVN data matrix X=matrix(rnorm(50), nrow=10) # Compute initial eigenvector to use with rifle method v1.init = rifleInit(X) # Use with rifle method to get first PC loadings with 2 non-zero elements rifle(A=cov(X), B=diag(5), init=v1.init, k=2)
Sparsity parameter selection for PCA-based rifle (as implemented by the
rifle
method in the rifle package) using the cross-validation
approach of Witten et al. as implemented by the SPC.cv
method in the PMA package.
riflePCACV(X, k.values, nfolds=5)
riflePCACV(X, k.values, nfolds=5)
X |
n-by-p data matrix being evaluated via PCA. |
k.values |
Set of truncation parameter values to evaluate via cross-validation. Values must be between 1 and p. |
nfolds |
Number of folds for cross-validation |
k value that generated the smallest cross-validation error.
Tan, K. M., Wang, Z., Liu, H., and Zhang, T. (2018). Sparse generalized eigenvalue problem: optimal statistical rates via truncated rayleigh flow. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 80(5), 1057-1086.
Witten, D. M., Tibshirani, R., and Hastie, T. (2009). A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics, 10(3), 515-534.
rifleInit
, rifle{rifle}
, PMA{SPC.cv}
set.seed(1) # Simulate 10x5 MVN data matrix X=matrix(rnorm(50), nrow=10) # Generate range of k values to evaluate k.values = 1:5 # Use 5-fold cross-validation to estimate optimal k value riflePCACV(X=X, k.values=k.values)
set.seed(1) # Simulate 10x5 MVN data matrix X=matrix(rnorm(50), nrow=10) # Generate range of k values to evaluate k.values = 1:5 # Use 5-fold cross-validation to estimate optimal k value riflePCACV(X=X, k.values=k.values)
Implements the TPower method by Yuan and Zhang. Specifically, it computes the sparse principal eigenvector using power iteration method where the estimated eigenvector is truncated on each iteration to retain just the k eigenvector loadings with the largest absolute values with all other values set to 0.
tpower(X, k, v1.init, max.iter=10, lambda.diff.threshold=1e-6, trace=FALSE)
tpower(X, k, v1.init, max.iter=10, lambda.diff.threshold=1e-6, trace=FALSE)
X |
Matrix for which the largest eigenvector and eigenvalue will be computed. |
k |
Must be an integer between 1 and ncol(X). The estimated eigenvector is truncated on each iteration to retain only the k loadings with the largest absolute values, all other loadings are set to 0. |
v1.init |
If specified, the power iteration calculation will be initialized using this vector, otherwise, the calculation will be initialized using a unit vector with equal values. |
max.iter |
Maximum number of iterations for power iteration method. |
lambda.diff.threshold |
Threshold for exiting the power iteration calculation. If the absolute relative difference in computed eigenvalues is less than this threshold between subsequent iterations, the power iteration method is terminated. |
trace |
True if debugging messages should be displayed during execution. |
The estimated sparse principal eigenvector.
Yuan, X.-T. and Zhang, T. (2013). Truncated power method for sparse eigenvalue problems. J. Mach. Learn. Res., 14(1), 899-925.
set.seed(1) # Simulate 10x5 MVN data matrix X=matrix(rnorm(50), nrow=10) # Compute first sparse PC loadings with 2 non-zero elements tpower(X=cov(X), k=2)
set.seed(1) # Simulate 10x5 MVN data matrix X=matrix(rnorm(50), nrow=10) # Compute first sparse PC loadings with 2 non-zero elements tpower(X=cov(X), k=2)
Sparsity parameter selection for PCA-based TPower using the cross-validation
approach of Witten et al. as implemented by the SPC.cv
method in the PMA package.
tpowerPCACV(X, k.values, nfolds=5)
tpowerPCACV(X, k.values, nfolds=5)
X |
n-by-p data matrix being evaluated via PCA. |
k.values |
Set of truncation parameter values to evaluate via cross-validation. Values must be between 1 and p. |
nfolds |
Number of folds for cross-validation |
k value that generated the smallest cross-validation error.
Yuan, X.-T. and Zhang, T. (2013). Truncated power method for sparse eigenvalue problems. J. Mach. Learn. Res., 14(1), 899-925.
Witten, D. M., Tibshirani, R., and Hastie, T. (2009). A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics, 10(3), 515-534.
set.seed(1) # Simulate 10x5 MVN data matrix X=matrix(rnorm(50), nrow=10) # Generate range of k values to evaluate k.values = 1:5 # Use 5-fold cross-validation to estimate optimal k value tpowerPCACV(X=X, k.values=k.values)
set.seed(1) # Simulate 10x5 MVN data matrix X=matrix(rnorm(50), nrow=10) # Generate range of k values to evaluate k.values = 1:5 # Use 5-fold cross-validation to estimate optimal k value tpowerPCACV(X=X, k.values=k.values)