Title: | Principal Component Analysis in High-Dimensional Data |
---|---|
Description: | In high-dimensional settings: Estimate the number of distant spikes based on the Generalized Spiked Population (GSP) model. Estimate the population eigenvalues, angles between the sample and population eigenvectors, correlations between the sample and population PC scores, and the asymptotic shrinkage factors. Adjust the shrinkage bias in the predicted PC scores. Dey, R. and Lee, S. (2019) <doi:10.1016/j.jmva.2019.02.007>. |
Authors: | Rounak Dey, Seunggeun Lee |
Maintainer: | Rounak Dey <[email protected]> |
License: | GPL (>= 2) |
Version: | 1.1.5 |
Built: | 2024-10-17 06:35:56 UTC |
Source: | CRAN |
The example dataset is from the HapMap Phase III project (https://www.ncbi.nlm.nih.gov/variation/news/NCBI_retiring_HapMap/). Our training sample consisted of unrelated individuals from two different populations: a) Utah residents with Northern and Western European ancestry (CEU), and b) Toscani in Italia (TSI). We present the eigenvalues and PC scores obtained from performing PCA on the SNPs on chromosome 7.
This example dataset is a list containing the following elements:
train.eval |
Sample eigenvalues of the training sample. |
trainscore |
PC scores of the training sample. This has PC1 and PC2 scores for 198 observations. |
testscore |
Predicted PC scores, obtained by leaving one observation out at a time, applying PCA to the rest of the data, and then predicting the PC score of the left-out observation. This has PC1 and PC2 scores for 198 observations. |
nSamp |
Number of observations in the training set = 198. |
nSNP |
Number of SNPs on chromosome 7. |
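The elements described above could be produced along the following lines. This is a minimal base-R sketch on simulated data, not the HapMap pipeline: the sample sizes, the mean shift between the two groups, and the sign-alignment step are assumptions made for illustration. Only the names train.eval, trainscore, and testscore follow the dataset.

```r
set.seed(1)
n <- 40; p <- 500                       # hypothetical sizes, far smaller than hapmap
group <- rep(c(-1, 1), each = n / 2)    # two simulated populations
X <- matrix(rnorm(n * p), n, p)
X[, 1:50] <- X[, 1:50] + 0.5 * group    # assumed mean shift in the first 50 features

# PCA on the full training sample: sample eigenvalues and PC1/PC2 scores
pca <- prcomp(X, center = TRUE, scale. = FALSE)
train.eval <- pca$sdev^2
trainscore <- pca$x[, 1:2]

# Leave-one-out predicted scores: refit PCA without observation i,
# then project observation i onto the first two eigenvectors
testscore <- matrix(NA_real_, n, 2)
for (i in seq_len(n)) {
  fit <- prcomp(X[-i, , drop = FALSE], center = TRUE, scale. = FALSE)
  pred <- as.numeric((X[i, ] - fit$center) %*% fit$rotation[, 1:2])
  # Eigenvector signs are arbitrary, so align each PC with the full-sample fit
  s <- sign(colSums(fit$rotation[, 1:2] * pca$rotation[, 1:2]))
  testscore[i, ] <- pred * s
}

# Spread of the PC1 scores: training versus leave-one-out predicted
c(train = sd(trainscore[, 1]), predicted = sd(testscore[, 1]))
```

When p is much larger than n, as here, the predicted PC1 scores typically have a smaller spread than the training scores, which is the shrinkage bias that the adjustment function in this package targets.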
Estimates the population eigenvalues, angles between the sample and population eigenvectors, correlations between the sample and population PC scores, and the asymptotic shrinkage factors. Three different estimation methods can be used.
hdpc_est(samp.eval, p, n, method = c("d.gsp", "l.gsp", "osp"), n.spikes, n.spikes.max, n.spikes.out, nonspikes.out = FALSE, smooth = TRUE)
samp.eval |
Numeric vector containing the sample eigenvalues. The vector must have dimension |
p |
The number of features. |
n |
The number of samples. |
method |
String specifying the estimation method. Possible values are "d.gsp", "l.gsp", and "osp". |
n.spikes |
Number of distant spikes in the population (Optional). |
n.spikes.max |
Upper bound of the number of distant spikes in the population. Optional, but needed if n.spikes is not specified. |
n.spikes.out |
Number of distant spikes to be returned in the output (Optional). If not specified, all the estimated distant spikes are returned. |
nonspikes.out |
Logical. If |
smooth |
Logical. If |
The different choices for method are:
"d.gsp": d-estimation method based on the Generalized Spiked Population (GSP) model.
"l.gsp": λ-estimation method based on the GSP model.
"osp": Estimation method based on the Ordinary Spiked Population (OSP) model.
At least one of n.spikes and n.spikes.max must be provided. If n.spikes is provided, n.spikes.max is ignored; otherwise n.spikes.max is used to determine the number of distant spikes via select.nspike.
The argument nonspikes.out is ignored if method="d.gsp".
The argument smooth is useful when the user assumes the population spectral distribution to be continuous.
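To give a feel for the quantities this function estimates, the sketch below codes the standard asymptotic relations for the simplest case, the ordinary spiked population model. It assumes all non-spiked population eigenvalues equal 1 and writes gamma = p / n. The function names are hypothetical, and the package's own estimators (in particular the GSP-based ones) may differ from these textbook formulas.

```r
# Textbook OSP relations for a distant spike lambda > 1 + sqrt(gamma),
# assuming non-spiked population eigenvalues equal 1 (an assumption of this sketch).

# Limit of the sample eigenvalue for a population spike lambda
osp_sample_eval <- function(lambda, gamma) lambda * (1 + gamma / (lambda - 1))

# Inverse map: recover the population spike from a sample eigenvalue d
# by solving lambda^2 - (d + 1 - gamma) * lambda + d = 0 for the larger root
osp_population_eval <- function(d, gamma) {
  stopifnot(all(d > (1 + sqrt(gamma))^2))  # only valid above the bulk edge
  b <- d + 1 - gamma
  (b + sqrt(b^2 - 4 * d)) / 2
}

# Cosine of the angle between sample and population eigenvectors
osp_cos_angle <- function(lambda, gamma) {
  sqrt((1 - gamma / (lambda - 1)^2) / (1 + gamma / (lambda - 1)))
}

# Asymptotic shrinkage factor of the predicted PC scores
osp_shrinkage <- function(lambda, gamma) (lambda - 1) / (lambda + gamma - 1)

gamma <- 2                                   # e.g. p = 2n
d <- osp_sample_eval(10, gamma)              # sample eigenvalue for lambda = 10
lambda_hat <- osp_population_eval(d, gamma)  # recovers 10
shrink <- osp_shrinkage(lambda_hat, gamma)   # 9 / 11
```

The sample eigenvalue overshoots the population spike, and the shrinkage factor is below 1, so both the eigenvalue and the predicted scores need correcting when p is comparable to or larger than n.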
spikes |
An array of estimated distant spikes. If |
n.spikes |
Number of distant spikes. If |
angles |
An array of estimated cosines of angles between the sample and population eigenvectors corresponding to the distant spikes. The |
correlations |
An array of estimated correlations between the sample and population PC scores corresponding to the distant spikes. The |
shrinkage |
An array of estimated asymptotic shrinkage factors corresponding to the distant spikes. If |
loss |
If |
nonspikes |
If |
Rounak Dey, [email protected]
Dey, R. and Lee, S. (2019). Asymptotic properties of principal component analysis and shrinkage-bias adjustment under the generalized spiked population model. Journal of Multivariate Analysis, Vol 173, 145-164.
data(hapmap)
#n = 198, p = 75435 for this data
####################################################
## Not run:
train.eval<-hapmap$train.eval
n<-hapmap$nSamp
p<-hapmap$nSNP
#Estimate the number of distant spikes first
m<-select.nspike(train.eval,p,n,n.spikes.max=10,evals.out=FALSE)$n.spikes
out<-hdpc_est(train.eval, p, n, method = "d.gsp", n.spikes=m, n.spikes.out=2, nonspikes.out = FALSE) #Output 2 spikes, no non-spike
out<-hdpc_est(train.eval, p, n, method = "l.gsp", n.spikes=m, nonspikes.out = FALSE) #Output m many spikes, no non-spike
out<-hdpc_est(train.eval, p, n, method = "l.gsp", n.spikes.max=10, nonspikes.out = TRUE) #Output all eigenvalues
out<-hdpc_est(train.eval, p, n, method = "osp", n.spikes=m, n.spikes.out=2, nonspikes.out = TRUE) #Output 2 spikes, no non-spike
## End(Not run)
Adjusts the shrinkage bias in the predicted PC scores based on the estimated shrinkage factors.
pc_adjust(train.eval, p, n, test.scores, method = c("d.gsp", "l.gsp", "osp"), n.spikes, n.spikes.max, smooth = TRUE)
train.eval |
Numeric vector containing the sample eigenvalues. The vector must have dimension |
p |
The number of features. |
n |
The number of training samples. |
test.scores |
An |
method |
String specifying the estimation method. Possible values are "d.gsp", "l.gsp", and "osp". |
n.spikes |
Number of distant spikes in the population (Optional). |
n.spikes.max |
Upper bound of the number of distant spikes in the population. Optional, but needed if n.spikes is not specified. |
smooth |
Logical. If |
The different choices for method are:
"d.gsp": d-estimation method based on the Generalized Spiked Population (GSP) model.
"l.gsp": λ-estimation method based on the GSP model.
"osp": Estimation method based on the Ordinary Spiked Population (OSP) model.
The (i, j)-th element of test.scores should denote the j-th predicted PC score for the i-th subject in the test sample.
At least one of n.spikes and n.spikes.max must be provided. If n.spikes is provided, n.spikes.max is ignored; otherwise n.spikes.max is used to determine the number of distant spikes via select.nspike.
The argument nonspikes.out is ignored if method="d.gsp" or "osp".
The argument smooth is useful when the user assumes the population spectral distribution to be continuous.
A matrix containing the bias-adjusted PC scores. The dimension of the matrix is the same as the dimension of test.scores.
A printed message shows the number of top PCs that were adjusted for shrinkage bias.
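Conceptually, the adjustment rescales each of the top predicted PC scores by its estimated shrinkage factor. The sketch below illustrates that idea in base R. It assumes the adjustment is a column-wise division by factors such as those in the shrinkage element returned by hdpc_est; the helper name adjust_scores and the toy numbers are hypothetical, and the package's own implementation may differ in detail.

```r
# Hypothetical helper: divide the first m columns of a score matrix by their
# shrinkage factors (one factor per adjusted PC); other columns are left as-is
adjust_scores <- function(test.scores, shrinkage) {
  m <- length(shrinkage)
  stopifnot(is.matrix(test.scores), m <= ncol(test.scores), all(shrinkage > 0))
  adjusted <- test.scores
  adjusted[, seq_len(m)] <- sweep(test.scores[, seq_len(m), drop = FALSE],
                                  2, shrinkage, "/")
  adjusted
}

# Toy predicted scores for 4 subjects and 2 PCs (made-up values)
scores <- matrix(c(1, -2, 0.5, 3,
                   0.2, -0.4, 0.1, 0.6), ncol = 2)

# Adjust only PC1, with an assumed shrinkage factor of 0.5
adj <- adjust_scores(scores, shrinkage = 0.5)
```

Because a shrinkage factor lies below 1, dividing by it stretches the predicted scores back toward the scale of the training scores, and the output keeps the dimension of the input, as described above.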
Rounak Dey, [email protected]
Dey, R. and Lee, S. (2019). Asymptotic properties of principal component analysis and shrinkage-bias adjustment under the generalized spiked population model. Journal of Multivariate Analysis, Vol 173, 145-164.
data(hapmap)
#n = 198, p = 75435 for this data
####################################################
## Not run:
#First estimate the number of spikes and then adjust test scores based on that
train.eval<-hapmap$train.eval
n<-hapmap$nSamp
p<-hapmap$nSNP
trainscore<-hapmap$trainscore
testscore<-hapmap$testscore
m<-select.nspike(train.eval,p,n,n.spikes.max=10,evals.out=FALSE)$n.spikes
score.adj.o1<-pc_adjust(train.eval,p,n,testscore,method="osp",n.spikes=m)
score.adj.d1<-pc_adjust(train.eval,p,n,testscore,method="d.gsp",n.spikes=m)
score.adj.l1<-pc_adjust(train.eval,p,n,testscore,method="l.gsp",n.spikes=m)
#Or you can provide an upper bound n.spikes.max
score.adj.o2<-pc_adjust(train.eval,p,n,testscore,method="osp",n.spikes.max=10)
score.adj.d2<-pc_adjust(train.eval,p,n,testscore,method="d.gsp",n.spikes.max=10)
score.adj.l2<-pc_adjust(train.eval,p,n,testscore,method="l.gsp",n.spikes.max=10)
#Plot the training score, test score, and adjusted scores
plot(trainscore,pch=19)
points(testscore,col='blue',pch=19)
points(score.adj.o1,col='red',pch=19)
points(score.adj.d2,col='green',pch=19)
## End(Not run)
Estimates the number of distant spikes in the population based on the Generalized Spiked Population model. A finite upper bound (n.spikes.max) on the number of distant spikes must be provided.
select.nspike(samp.eval, p, n, n.spikes.max, evals.out = FALSE, smooth = TRUE)
samp.eval |
Numeric vector containing the sample eigenvalues. The vector must have dimension |
p |
The number of features. |
n |
The number of samples. |
n.spikes.max |
Upper bound of the number of distant spikes in the population. |
evals.out |
Logical. If |
smooth |
Logical. If |
The function searches over candidate values up to n.spikes.max to determine the number of distant spikes in the population. It also estimates both the non-spiked and spiked eigenvalues based on the λ-estimation method.
The argument smooth is useful when the user assumes the population spectral distribution to be continuous.
n.spikes |
Estimated number of distant spikes. |
spikes |
If |
nonspikes |
If |
loss |
If |
Rounak Dey, [email protected]
Dey, R. and Lee, S. (2019). Asymptotic properties of principal component analysis and shrinkage-bias adjustment under the generalized spiked population model. Journal of Multivariate Analysis, Vol 173, 145-164.
data(hapmap)
#n = 198, p = 75435 for this data
####################################################
## Not run:
#If you just want the estimated number of spikes
train.eval<-hapmap$train.eval
n<-hapmap$nSamp
p<-hapmap$nSNP
select.nspike(train.eval,p,n,n.spikes.max=10,evals.out=FALSE)
#If you want the estimated spikes and non-spikes
out<-select.nspike(train.eval,p,n,n.spikes.max=10,evals.out=TRUE)
## End(Not run)