sparsepca and amanpg find sparse loadings in principal component analysis (PCA) via an alternating manifold proximal gradient method (A-ManPG). Seeking a sparse basis makes the leading principal components easier to interpret when modeling with high-dimensional data. PCA is formulated as a regularized regression problem under elastic-net constraints (linearized L1 and L2 norms) in order to induce sparsity. Because the resulting problem is both nonsmooth and nonconvex, which creates numerical difficulties, A-ManPG is implemented to guarantee convergence.
Each package provides a function for performing sparse PCA and a function for normalizing data.
The authors of A-ManPG are Shixiang Chen, Shiqian Ma, Lingzhou Xue, and Hui Zou. The Python and R packages are maintained by Justin Huang and Benjamin Jochem. A MATLAB implementation is maintained by Shixiang Chen.
The A-ManPG algorithm can be applied to solve the general manifold optimization problem

$$\min_{A,B} \; F(A, B) = H(A, B) + f(A) + g(B) \quad \text{s.t.} \quad A \in \mathcal{M}_1, \; B \in \mathcal{M}_2,$$
where H(A, B) is a smooth function of A, B with a Lipschitz continuous gradient, f(⋅) and g(⋅) are lower semicontinuous (possibly nonsmooth) convex functions, and ℳ1, ℳ2 are two embedded submanifolds in the Euclidean space.
For sparse PCA, the following function definitions are used:

$$H(A, B) = \|X - XBA^\top\|_F^2 + \lambda_2 \|B\|_F^2, \qquad f(A) = 0, \qquad g(B) = \sum_{j=1}^{k} \lambda_{1,j} \|B_j\|_1,$$

with $\mathcal{M}_1 = \{A \in \mathbb{R}^{p \times k} : A^\top A = I_k\}$ (the Stiefel manifold) and $\mathcal{M}_2 = \mathbb{R}^{p \times k}$,
where X is the n × p data matrix or the p × p sample covariance matrix, A holds the scores and B holds the loadings, k is the rank of the matrices (in other words, how many principal components are desired), λ_{1,j} is the L1-norm penalty for the j-th component, and λ2 is the L2-norm penalty. The L1 and L2 norms together act as elastic-net regularization that imposes sparseness on the loadings. Note that a different L1-norm penalty may be used for every principal component, and that the algorithm operates differently when the L2-norm penalty is set to a large constant (np.inf in Python, Inf in R).
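To make these definitions concrete, the objective can be written out in a few lines of R; this is a minimal sketch of the data-matrix case (the name spca_objective is illustrative and not part of either package):

```r
# Elastic-net sparse PCA objective (data-matrix case):
# H(A, B) = ||X - X B A^T||_F^2 + lambda2 * ||B||_F^2
# g(B)    = sum_j lambda1[j] * ||B[, j]||_1
spca_objective <- function(X, A, B, lambda1, lambda2) {
  H <- norm(X - X %*% B %*% t(A), "F")^2 + lambda2 * norm(B, "F")^2
  g <- sum(lambda1 * colSums(abs(B)))  # one L1 penalty per component
  H + g
}
```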
The A-ManPG algorithm uses the following subproblems with an alternating updating scheme to solve sparse PCA, computed in a Gauss-Seidel manner for faster convergence:

$$D_A = \operatorname*{arg\,min}_{D \in \mathrm{T}_{A_k}\mathcal{M}_1} \; \langle \nabla_A H(A_k, B_k), D \rangle + \frac{1}{2t_1}\|D\|_F^2 + f(A_k + D),$$

$$D_B = \operatorname*{arg\,min}_{D \in \mathrm{T}_{B_k}\mathcal{M}_2} \; \langle \nabla_B H(A_{k+1}, B_k), D \rangle + \frac{1}{2t_2}\|D\|_F^2 + g(B_k + D),$$
where $A_{k+1}$ is obtained via a retraction operation (in this case, polar decomposition), $t_1 \le 1/L_A$, and $t_2 \le 1/L_B$. Here $L_A$ and $L_B$ are the smallest Lipschitz constants of $\nabla_A H(A, B)$ and $\nabla_B H(A, B)$, respectively. The subproblems are solved using an adaptive semismooth Newton method.
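For illustration, the polar-decomposition retraction can be sketched in a few lines of R (the helper polar_retract is ours, not a package function):

```r
# Polar retraction on the Stiefel manifold: map A + D back to a matrix
# with orthonormal columns via the SVD-based polar decomposition.
polar_retract <- function(A, D) {
  s <- svd(A + D)
  s$u %*% t(s$v)
}

# quick check: the retracted point has orthonormal columns
A <- qr.Q(qr(matrix(rnorm(20), 5, 4)))  # a point on St(5, 4)
D <- 0.1 * matrix(rnorm(20), 5, 4)      # a small tangent step
max(abs(crossprod(polar_retract(A, D)) - diag(4)))  # ~ 0
```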
Let ϵ represent a tolerance level used to detect convergence. An ϵ-stationary point is defined as a point (A, B) whose corresponding D_A and D_B satisfy

$$\frac{1}{t_1}\|D_A\|_F \le \epsilon \quad \text{and} \quad \frac{1}{t_2}\|D_B\|_F \le \epsilon.$$
The algorithm reaches an ϵ-stationary point in at most
$$\frac{2\left(F(A_0, B_0) - F^*\right)}{\gamma\left(\bar{\alpha}_1 t_1 + \bar{\alpha}_2 t_2\right)\epsilon^2}$$
iterations, where F* denotes the minimum of F, γ ∈ (0, 1) is the step-size shrinkage factor used in the line searches, and ᾱ1 and ᾱ2 are lower bounds on the accepted line-search step sizes for the A- and B-updates, respectively.
The following describes the algorithm used for solving the general manifold optimization problem using A-ManPG. For solving sparse PCA, the algorithm is implemented with the aforementioned definitions.
Input: initial point (A0, B0) and the parameters of the problem
for k = 0, 1, ... do
    Solve the first subproblem for Da
    Set alpha = 1
    while F(Retr_A(alpha * Da), B) > F(A, B) - alpha * norm(Da)^2 / (2 * t1) do
        alpha = gamma * alpha
    end while
    Set A = Retr_A(alpha * Da)
    Solve the second subproblem for Db
    Set alpha = 1
    while F(A, Retr_B(alpha * Db)) > F(A, B) - alpha * norm(Db)^2 / (2 * t2) do
        alpha = gamma * alpha
    end while
    Set B = Retr_B(alpha * Db)
end for
Return A as the scores and B as the sparse loadings
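For concreteness, the backtracking line search for the A-update might look as follows in R, reusing the polar_retract sketch from above. This is a sketch under the stated assumptions, not the package's internal implementation:

```r
# Backtracking line search for the A-update: shrink alpha by gamma
# until the sufficient-decrease condition from the pseudocode holds.
# fobj(A, B) is the objective F; Da is the subproblem solution; t1 is
# the step-size bound.
line_search_A <- function(fobj, A, B, Da, t1, gamma = 0.5) {
  alpha <- 1
  while (fobj(polar_retract(A, alpha * Da), B) >
         fobj(A, B) - alpha * norm(Da, "F")^2 / (2 * t1)) {
    alpha <- gamma * alpha
  }
  polar_retract(A, alpha * Da)  # accepted next iterate for A
}
```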
To install the R package, install amanpg directly from CRAN:
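```r
install.packages("amanpg")
```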
To install the Python package, use pip to obtain sparsepca from PyPI:
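```
pip install sparsepca
```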
Name | Python Type | R Type | Description |
---|---|---|---|
`z` | numpy.ndarray | matrix | Either the data matrix or the sample covariance matrix |
`lambda1` | float list | numeric vector | Vector of length k of L1-norm penalty parameters, one per principal component |
`lambda2` | float or numpy.inf | numeric or Inf | L2-norm penalty term |
`x0` | numpy.ndarray | matrix | Initial x-values for the gradient method; default is the first k right singular vectors |
`y0` | numpy.ndarray | matrix | Initial y-values for the gradient method; default is the first k right singular vectors |
`k` | int | int | Number of principal components desired; default is 0 (returns min(n-1, p) principal components) |
`gamma` | float | numeric | Parameter controlling how quickly the step size shrinks in each line search; default is 0.5 |
`type` | int | int | If 0, z is expected to be a data matrix; otherwise z is expected to be a covariance matrix; default is 0 |
`maxiter` | int | int | Maximum number of iterations allowed in the gradient method; default is 1e4 |
`tol` | float | numeric | Tolerance required to declare convergence (calculated as the difference between successive F-values); default is 1e-5 |
`f_palm` | float | numeric | Upper bound for the F-value to reach convergence; default is 1e5 |
`normalize` | bool | logical | Center and normalize rows to Euclidean length 1 if True; default is True |
`verbose` | bool | logical | Print progress between iterations if True; default is False |
Python returns a dictionary with the following key-value pairs, while R returns a list with the following elements:
Key | Python Value Type | R Value Type | Value |
---|---|---|---|
`loadings` | numpy.ndarray | matrix | Loadings of the sparse principal components |
`f_manpg` | float | numeric | Final F-value |
`x` | numpy.ndarray | matrix | Matrix from the subproblems corresponding to the loadings (the scores) |
`iter` | int | numeric | Total number of iterations executed |
`sparsity` | float | numeric | Number of zero loadings (`loadings == 0`) divided by the total number of loadings |
`time` | float | numeric | Execution time in seconds |
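As a quick sanity check, the reported sparsity can be recomputed from the returned loadings; in R, with a result object named res for illustration:

```r
# fraction of exactly-zero entries in the loadings matrix
mean(res$loadings == 0)
```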
Consider the two examples below for running sparse PCA on randomly-generated data: one using a finite λ2, and the other using a large constant λ2.
As with other libraries, begin by loading amanpg in R:
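```r
library(amanpg)
```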
Before proceeding, it is helpful to set a few parameters. Let the rank of the sparse loadings matrix be k = 4 (returning four principal components), let the input data matrix be n × p with n = 1000 and p = 500, let λ1 be a 4 × 1 matrix with every entry λ_{i,1} = 0.1, and let λ2 = 1.
# parameter initialization
k <- 4
n <- 1000
p <- 500
lambda1 <- matrix(data=0.1, nrow=k, ncol=1)
lambda2 <- 1
For this example, the data matrix z is randomly generated from the normal distribution. Although it should be centered to mean 0 and normalized to Euclidean length 1, the function automatically preprocesses the input matrix when normalize=TRUE.
# data matrix generation
set.seed(10)
z <- matrix(rnorm(n * p), n, p)
# only show a subset of the data matrix for brevity
knitr::kable(as.data.frame(z)[1:10,1:4])
V1 | V2 | V3 | V4 |
---|---|---|---|
0.0187462 | 1.0500137 | -0.3078650 | 0.4605151 |
-0.1842525 | 0.2860926 | 0.7580856 | 0.2350253 |
-1.3713305 | 0.2405648 | -0.5738634 | 0.6432573 |
-0.5991677 | 0.8327052 | -0.9387445 | 0.9131981 |
0.2945451 | -0.2229832 | -0.0276993 | 0.9882860 |
0.3897943 | 0.2883442 | -1.0662487 | 0.1127413 |
-1.2080762 | -0.3403921 | -1.3503703 | -1.4900499 |
-0.3636760 | 1.0613346 | 0.0754557 | -0.4432356 |
-1.6266727 | -1.2090489 | -0.9022730 | 1.3623441 |
-0.2564784 | 1.0524069 | 3.6667710 | 1.0452357 |
Alternatively, the data can be normalized beforehand, with normalize=FALSE then passed in the function call. This example does not do so, but the output of the package's normalizing function is displayed below.
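A call along the following lines reproduces the table; invoking the helper as normalize(z) is an assumption here, so consult the package documentation for the exact signature.

```r
# display the first rows/columns of the normalized data
# (normalize(z) is assumed; see the package docs for the exact signature)
knitr::kable(as.data.frame(normalize(z))[1:10, 1:4])
```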
V1 | V2 | V3 | V4 |
---|---|---|---|
0.0003430 | 0.0481267 | -0.0132514 | 0.0219345 |
-0.0089624 | 0.0123909 | 0.0357869 | 0.0112677 |
-0.0647869 | 0.0105393 | -0.0258082 | 0.0306516 |
-0.0295190 | 0.0395046 | -0.0442725 | 0.0446800 |
0.0121050 | -0.0102003 | -0.0001986 | 0.0427144 |
0.0180233 | 0.0129888 | -0.0496851 | 0.0058898 |
-0.0571080 | -0.0166729 | -0.0621594 | -0.0692690 |
-0.0166791 | 0.0465040 | 0.0043809 | -0.0192257 |
-0.0708954 | -0.0530047 | -0.0380530 | 0.0594355 |
-0.0118741 | 0.0459610 | 0.1635723 | 0.0468201 |
Now the function is called, passing the matrix z, lambda1, lambda2, and our desired rank k. The output is stored as a list in fin_sprout. Note that if a different initial point is desired, x0 and y0 should be modified, but the default of the first k right singular vectors is sufficient for this example.

If further printout is desired, set verbose=TRUE for progress updates (time, difference for convergence, F-value) per iteration.
# function call
fin_sprout <- spca.amanpg(z, lambda1, lambda2, k=4)
print(paste(fin_sprout$iter, "iterations,", fin_sprout$sparsity, "sparsity,", fin_sprout$time))
## [1] "280 iterations, 0.491 sparsity, 0.43851900100708"
The loadings can be viewed from fin_sprout$loadings. Note that many entries are set to zero as a result of the induced sparsity.
# View loadings. Only first 10 rows for brevity
knitr::kable(as.data.frame(fin_sprout$loadings)[1:10,])
V1 | V2 | V3 | V4 |
---|---|---|---|
0.0880974 | 0.0000000 | 0.0000000 | 0.0000000 |
0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 |
0.0285272 | 0.0324180 | 0.0000000 | 0.0784247 |
0.0000000 | -0.0456984 | 0.0000000 | 0.0906653 |
0.0103526 | -0.0010865 | 0.0000000 | -0.2210455 |
0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 |
0.0586876 | 0.0000000 | 0.0000000 | 0.0000000 |
0.0012639 | 0.0000000 | 0.0000000 | 0.0996718 |
0.0046010 | 0.0000000 | 0.0000000 | 0.0000000 |
0.0000000 | 0.0012141 | -0.0019561 | 0.0000000 |
The resulting scree plot (Figure 1) is produced with the following code.
# variance of each sparse PC score and the proportion of variance explained
pr.var <- (apply(fin_sprout$x, 2, sd))^2
pve <- pr.var / sum(pr.var)
par(mfrow=c(1,2))
plot(pve,
xlab="Sparse PC",
ylab="Proportion of Variance Explained",
ylim=c(0,1),
type="b")
plot(cumsum(pve),
xlab="Sparse PC",
ylab="Cumulative Proportion of Variance Explained",
ylim=c(0,1),
type="b")
The resulting biplot (Figure 2), with the zero loadings filtered out, can be obtained using the following:
# keep only the variables whose loadings are nonzero in every component
y_sub <- apply(fin_sprout$loadings, 1, function(row) all(row != 0))
loadings <- fin_sprout$loadings[y_sub, ]
par(mfrow=c(1,1))
biplot(fin_sprout$x, loadings, xlab="PC 1", ylab="PC 2")
Now consider an alternative situation where λ2 is set to the large constant Inf. In this case the algorithm retracts B directly, without using a while loop to determine an appropriate retraction step size, and only iterates on A.
# infinite lambda2
inf_sprout <- spca.amanpg(z, lambda1, lambda2=Inf, k=4)
print(paste(inf_sprout$iter, "iterations,", inf_sprout$sparsity, "sparsity,", inf_sprout$time))
## [1] "344 iterations, 0.253 sparsity, 0.371917009353638"
# extract loadings. Only first 10 rows for brevity
knitr::kable(as.data.frame(inf_sprout$loadings)[1:10,])
V1 | V2 | V3 | V4 |
---|---|---|---|
0.0612782 | 0.0070632 | 0.0148628 | 0.0000000 |
0.0000000 | 0.0080497 | 0.0000000 | 0.0000000 |
0.0000000 | 0.0670739 | -0.0796881 | 0.0616913 |
0.0000000 | 0.0000000 | 0.0050157 | 0.1115940 |
0.0674123 | -0.0194809 | 0.0490605 | -0.1250086 |
0.0000000 | 0.0000000 | 0.0000000 | -0.0207933 |
0.0600291 | 0.0475967 | 0.0000000 | 0.0000000 |
0.0000000 | 0.0428026 | -0.0131852 | 0.0699779 |
0.0156305 | 0.0448186 | 0.0000000 | 0.0000000 |
0.0222292 | 0.0000000 | -0.0138957 | 0.0000000 |
We obtain the scree plot (Figure 3) and the biplot (Figure 4) using the same method.
# variance of each sparse PC score and the proportion of variance explained
pr.var <- (apply(inf_sprout$x, 2, sd))^2
pve <- pr.var / sum(pr.var)
par(mfrow=c(1,2))
plot(pve,
xlab="Sparse PC",
ylab="Proportion of Variance Explained",
ylim=c(0,1),
type="b")
plot(cumsum(pve),
xlab="Sparse PC",
ylab="Cumulative Proportion of Variance Explained",
ylim=c(0,1),
type="b")
For data with high dimensionality, the biplot is still much harder to read due to the lower sparsity value.
# keep only the variables whose loadings are nonzero in every component
y_sub <- apply(inf_sprout$loadings, 1, function(row) all(row != 0))
loadings <- inf_sprout$loadings[y_sub, ]
par(mfrow=c(1,1))
biplot(inf_sprout$x, loadings, xlab="PC 1", ylab="PC 2")
Note that the Python package depends on numpy. The following example carries out the same analysis in Python; because R and numpy use different random number generators, the generated data is analogous rather than identical.
import numpy as np
from sparsepca import spca
k = 4 # rank
p = 500 # dimensions
n = 1000 # sample size
lambda1 = 0.1 * np.ones((k, 1))
lambda2 = 1
np.random.seed(10)
z = np.random.normal(0, 1, size=(n, p)) # generate random normal 1000x500 matrix
fin_sprout = spca(z, lambda1, lambda2, k=k)
print(f"Finite: {fin_sprout['iter']} iterations with final value
{fin_sprout['f_manpg']}, sparsity {fin_sprout['sparsity']},
timediff {fin_sprout['time']}.")
fin_sprout['loadings']
inf_sprout = spca(z, lambda1, np.inf, k=k)
print(f"Infinite: {inf_sprout['iter']} iterations with final value
{inf_sprout['f_manpg']}, sparsity {inf_sprout['sparsity']},
timediff {inf_sprout['time']}.")
inf_sprout['loadings']
Chen, S., Ma, S., Xue, L., & Zou, H. (2020). An alternating manifold proximal gradient method for sparse principal component analysis and sparse canonical correlation analysis. INFORMS Journal on Optimization, 2(3), 192-208 <doi:10.1287/ijoo.2019.0032>.
Zou, H., Hastie, T., & Tibshirani, R. (2006). Sparse principal component analysis. Journal of Computational and Graphical Statistics, 15(2), 265-286 <doi:10.1198/106186006X113430>.
Zou, H., & Xue, L. (2018). A selective overview of sparse principal component analysis. Proceedings of the IEEE, 106(8), 1311-1320 <doi:10.1109/JPROC.2018.2846588>.