Title: | Finding the Number of Significant Principal Components |
---|---|
Description: | Implements methods to automate the Auer-Gervini graphical Bayesian approach for determining the number of significant principal components. Automation uses clustering, change points, or simple statistical models to distinguish "long" from "short" steps in a graph showing the posterior number of components as a function of a prior parameter. See <doi:10.1101/237883>. |
Authors: | Kevin R. Coombes, Min Wang |
Maintainer: | Kevin R. Coombes <[email protected]> |
License: | Apache License (== 2.0) |
Version: | 1.1.13 |
Built: | 2025-01-06 06:38:41 UTC |
Source: | CRAN |
Auer and Gervini developed a Bayesian graphical method to determine
the number of significant principal components; a brief
overview is included in the help for the
AuerGervini
class. The output of their method is a step function that displays
the maximum a posteriori (MAP) choice of as a step function of
a one-parameter family of prior distributions, and they recommend
choosing the highest "long" step. The functions described here help
automate the process of dividing the step lengths into "long" and
"short" classes.
agDimTwiceMean(stepLength) agDimKmeans(stepLength) agDimKmeans3(stepLength) agDimSpectral(stepLength) agDimTtest(stepLength, extra=0) agDimTtest2(stepLength) agDimCPT(stepLength) makeAgCpmFun(method)
agDimTwiceMean(stepLength) agDimKmeans(stepLength) agDimKmeans3(stepLength) agDimSpectral(stepLength) agDimTtest(stepLength, extra=0) agDimTtest2(stepLength) agDimCPT(stepLength) makeAgCpmFun(method)
stepLength |
A numeric vector |
method |
A character string describing a method supported by the
|
extra |
Just ignore this. Don't use it. It's a hack to avoid having to maintain two different versions of the same code. |
The agDimTwiceMean
function implements a simple and naive rule:
a step is considered long if it as least twice the mean length.
The agDimKmeans
uses the kmeans
algorithm with
to divide the step lengths into two classes. Starting
centers for the groups are taken to be the minimum and maximum
values.
The agDimKmeans3
function uses kmeans
with ,
using the median as the third center. Only one of the three groups is
considered "short".
The agDimSpectral
applies spectral clustering (as implemented
by the specc
function from the kernlab
package)
to divide the steps lengths into two groups.
The agDimTtest
and agDimTtest2
functions implement two
variants of a novel algorithm specialized for this particular task.
The idea is to start by sorting the step lengths so that
Then, for each , we
compute the mean and standard deviation of the first
step
lengths. Finally, one computes the likelhood that
comes
from the normal distribution defined by the first
lengths. If
the probability that
is larger is less than
,
then it is chosen as the "smallest long step".
The novel method just described can also be viewed as a way to detect
a particular kind of change point. So, we also provide the
agDimCPT
function that uses the changepoint detection
algorithm implement by the cpt.mean
function in the
changepoint
package. More generally, the makeAgCpmFun
allows you to use any of the changepoint models implemented as part
of the detectChangePointBatch
function in the cpm
package.
Each of the functions agDimTwiceMean
, agDimKmeans
,
agDimKmeans3
, agDimSpectral
, agDimTtest
,
agDimTtest2
, and agDimCPT
returns a logical vector whose
length is equal to the input stepLength
. TRUE
values
identify "long" steps and FALSE
values identify "short"
steps.
The makeAgCpmFun
returns a function that takes one argument (a
numeric stepLength
vector) and returns a logical vector of the
same length.
Note: Our simulations suggest that "TwiceMean" and "CPM" give the best results.
Kevin R. Coombes <[email protected]>, Min Wang <[email protected]>.
P Auer, D Gervini. Choosing principal components: a new graphical method based on Bayesian model selection. Communications in Statistics-Simulation and Computation 37 (5), 962-977
The functions described here implerment different algorithms that can
be used by the agDimension
function to automatically
compute the number of significant principal components based on the
AuerGervini
approach. Several of these functions are
wrappers around functions defined in other packages, including
specc
in the kernlab
package,
cpt.mean
in the changepoint
package, and
detectChangePointBatch
in the cpm
package.
# simulate variances lambda <- rev(sort(diff(sort(c(0, 1, runif(9)))))) # apply the Auer-Gervini method ag <- AuerGervini(lambda, dd=c(3,10)) # Review the results summary(ag) agDimension(ag) agDimension(ag, agDimKmeans) agDimension(ag, agDimSpectral) f <- makeAgCpmFun("Exponential") agDimension(ag, f)
# simulate variances lambda <- rev(sort(diff(sort(c(0, 1, runif(9)))))) # apply the Auer-Gervini method ag <- AuerGervini(lambda, dd=c(3,10)) # Review the results summary(ag) agDimension(ag) agDimension(ag, agDimKmeans) agDimension(ag, agDimSpectral) f <- makeAgCpmFun("Exponential") agDimension(ag, f)
Auer and Gervini [1] described a graphical Bayesian method for
estimating the number of statistically significant principal
components. We have implemented their method in the AuerGervini
class, and enhanced it by automating the final selection.
AuerGervini(Lambda, dd=NULL, epsilon = 2e-16) agDimension(object, agfun=agDimTwiceMean)
AuerGervini(Lambda, dd=NULL, epsilon = 2e-16) agDimension(object, agfun=agDimTwiceMean)
Lambda |
Either a |
dd |
A vector of length 2 containing the dimensions of the data
used to created the Auer-Gervini object. If |
epsilon |
A numeric value. Used to remove any variances that are
less than |
object |
An object of the |
agfun |
A function that takes one argument (a vector of step lengths) and returns a logical vector of the same length (where true indicates "long" as opposed to "short" steps). |
The Auer-Gervini method for determining the number of principal components is based on a Bayesian model that assaerts that the vector of explained variances (eigenvalues) should have the form
with the goal being to find the true dimension . They consider
a set of prior distributions on
that decay
exponentially, with the rate of decay controlled by a parameter
. For each value of
, one selects the value
of
that has the maximum a posteriori (MAP) probability. Auer
and Gervini show that the dimensions selected by this procedure write
as a non-increasing step function of
. The values
of
where the steps change are stored in the
changePoints
slot, and the corresponding -values are
stored in the
dLevels
slot.
Auer and Gervini go on to advise using their method as a graphical
approach, manually (or visually?) selecting the highest step that is
"long". Our implementation provides several different algorithms for
automatically deciding what is "long" enough. The simplest (but
fairly naive) approach is to take anything that is longer than twice
the mean; other algorithms are described in
agDimFunction
.
The AuerGervini
function constructs and returns an object of
the AuerGervini
class.
The agDimension
function computes the number of significant
principal components. The general idea is that one starts by
computing the length of each step in the Auer-Gerivni plot, and must
then separate these into "long" and "short" classes. We provide a
variety of different algorithms to carry out this process; the
default algorithm in the function agDimTwiceMean
defines
a step as "long" if it more than twice the mean step length.
Objects should be created using the AuerGervini
constructor.
Lambda
:A numeric
vector containing
the explained variances in decreasing order.
dimensions
Numeric vector of length 2 containing the dimnesions of the underlying data matrix.
dLevels
:Object of class numeric
; see details
changePoints
:Object of class numeric
; see details
signature(x = "AuerGervini", y = "missing")
: ...
signature(object = "AuerGervini")
: ...
Kevin R. Coombes <[email protected]>
[1] P Auer, D Gervini. Choosing principal components: a new graphical method based on Bayesian model selection. Communications in Statistics-Simulation and Computation 37 (5), 962-977.
[2] Wang M, Kornbla SM, Coombes KR. Decomposing the Apoptosis Pathway Into Biologically Interpretable Principal Components. Preprint: bioRxiv, 2017. <doi://10.1101/237883>.
agDimFunction
to get a complete list of the functions
implementing different algorithms to separate the step lengths into
two classes.
showClass("AuerGervini") # simulate variances lambda <- rev(sort(diff(sort(c(0, 1, runif(9)))))) # apply the Auer-Gervini method ag <- AuerGervini(lambda, dd=c(3,10)) # Review the results summary(ag) agDimension(ag) agDimension(ag, agDimKmeans) # Look at the results graphically plot(ag, agfun=list(agDimTwiceMean, agDimKmeans))
showClass("AuerGervini") # simulate variances lambda <- rev(sort(diff(sort(c(0, 1, runif(9)))))) # apply the Auer-Gervini method ag <- AuerGervini(lambda, dd=c(3,10)) # Review the results summary(ag) agDimension(ag) agDimension(ag, agDimKmeans) # Look at the results graphically plot(ag, agfun=list(agDimTwiceMean, agDimKmeans))
The Broken Stick model is one proposed method for estimating the number of statistically significant principal components.
brokenStick(k, n) bsDimension(lambda, FUZZ = 0.005)
brokenStick(k, n) bsDimension(lambda, FUZZ = 0.005)
k |
An integer between 1 and |
n |
An integer; the total number of principal components. |
lambda |
The set of variances from each component from a principal
components analysis. These are assumed to be already sorted in
decreasing order. You can also supply a |
FUZZ |
A real number; anything smaller than |
The Broken Stick model is one proposed method for estimating the
number of statistically significant principal components. The idea is
to model variances by taking a stick of unit length and breaking it
into
pieces by randomly (and simultaneously) selecting break
points from a uniform distribution.
The brokenStick
function returns, as a real number, the
expected value of the k
-th longest piece when breaking a
stick of length one into n
total pieces. Most commonly used
via the idiom brokenStick(1:N, N)
to get the entire vector of
lengths at one time.
The bsDimension
function returns an integer, the number of
significant components under this model. This is computed by finding
the last point at which the observed variance is bugger than the
expected value under the broken stick model by at least FUZZ
.
Kevin R. Coombes <[email protected]>
Jackson, D. A. (1993). Stopping rules in principal components analysis: a comparison of heuristical and statistical approaches. Ecology 74, 2204–2214.
Legendre, P. and Legendre, L. (1998) Numerical Ecology. 2nd English ed. Elsevier.
Better methods to address this question are based on the Auer-Gervini
method; see AuerGervini
.
brokenStick(1:10, 10) sum( brokenStick(1:10, 10) ) fakeVar <- c(30, 20, 8, 4, 3, 2, 1) bsDimension(fakeVar)
brokenStick(1:10, 10) sum( brokenStick(1:10, 10) ) fakeVar <- c(30, 20, 8, 4, 3, 2, 1) bsDimension(fakeVar)
Auer and Gervini developed a Bayesian graphical method to determine
the number of significant principal components; a brief
overview is included in the help for the
AuerGervini
class. The output of their method is a step function that displays
the maximum a posteriori (MAP) choice of as a step function of
a one-parameter family of prior distributions, and they recommend
choosing the highest "long" step. The functions described here help
automate the process of dividing the step lengths into "long" and
"short" classes.
compareAgDimMethods(object, agfuns)
compareAgDimMethods(object, agfuns)
object |
An object of the |
agfuns |
A list of functions |
This method simply iterates over the list of functions that implement different algorithms/methods to determine the PC dimension.
Returns an integer vector of te same length as the list of
agfuns
, containing the number of significant principal
components computed by each method.
Kevin R. Coombes <[email protected]>, Min Wang <[email protected]>.
P Auer, D Gervini. Choosing principal components: a new graphical method based on Bayesian model selection. Communications in Statistics-Simulation and Computation 37 (5), 962-977
# simulate variances lambda <- rev(sort(diff(sort(c(0, 1, runif(9)))))) # apply the Auer-Gervini method ag <- AuerGervini(lambda, dd=c(3,10)) # try different methods agfuns <- list(twice=agDimTwiceMean, km=agDimKmeans, cpt=agDimCPT) compareAgDimMethods(ag, agfuns)
# simulate variances lambda <- rev(sort(diff(sort(c(0, 1, runif(9)))))) # apply the Auer-Gervini method ag <- AuerGervini(lambda, dd=c(3,10)) # try different methods agfuns <- list(twice=agDimTwiceMean, km=agDimKmeans, cpt=agDimCPT) compareAgDimMethods(ag, agfuns)
Implements randomization-based procedures to estimate the number of principal components.
rndLambdaF(data, B = 1000, alpha = 0.05)
rndLambdaF(data, B = 1000, alpha = 0.05)
data |
A numeric data matrix. |
B |
An integer; the number of times to scramble the data columns. |
alpha |
A real number between 0 and 1; the significance level. |
The randomization procedures implemented here were first developed by ter Brack [1,2]. In a simulation study, Peres-Neto and colleagues concluded that these methods were among the best [3]. Our own simulations on larger data matrices find that rnd-Lambda performs well (comparably to Auer-Gervini, though slower), but that rnd-F works poorly.
The test procedure is: (1) randomize the values with all the attribute
columns of the data matrix; (2) perform PCA on the scrambled data
matrix; and (3) compute the test statistics. All three steps are
repeated a total of (B - 1) times, where B is large enough to
guarantee accuracy when estimating p-values; in practice, B is usually
set to 1000. In each randomization, two test statistics are computed:
(1) the eigenvalue for the k-th principal component; and
(2) a pseudo F-ratio computed as
.
Finally, the p-value for each k and each statistic of
interest is estimated to be the proportion of the test statistics in
all data sets that are greater than or equal to the one in the
observed data matrix.
A named vector of length two, containing the predicted number of principal components based on the rnd-Lambda and rnd-F statistics.
Kevin R. Coombes <[email protected]>, Min Wang <[email protected]>.
[1] ter Braak CFJ. CANOCO – a Fortran program for canonical community ordination by [partial] [detrended] [canonical] correspondence analysis, principal component analysis and redundancy analysis (version 2.1). Agricultural Mathematics Group, Report LWA-88- 02, Wageningen, 1988.
[2] ter Braak CFJ. Update notes: CANOCO (version 3.1). Agricultural Mathematics Group, Wageningen, 1990.
[3] Peres-Neto PR, Jackson DA and Somers KM. How many principal components? Stopping rules for determining the number of non-trivial axes revisited. Computational Statistics and Data Analysis 2005; 49: 974–997.
dataset <- matrix(rnorm(200*15, 6), ncol=15) rndLambdaF(dataset)
dataset <- matrix(rnorm(200*15, 6), ncol=15) rndLambdaF(dataset)
This data set contains an object of the class SamplePCA
.
This object results from performing a principal components analysis on
a simulated data set.
data(spca)
data(spca)
A SamplePCA
object based on a simulated data matrix with
204 rows and 14 columns, with true "principal component dimension"
equal to one. That is, there should be one significant principal
component.
Simulations are described in detail in the Thresher
package,
which depends on the PCDimension
package.
The ClassDiscovery package contains the SamplePCA
class and functions.