Title: | Robust PCA by Projection Pursuit |
---|---|
Description: | Provides functions for robust PCA by projection pursuit. The methods are described in Croux et al. (2006) <doi:10.2139/ssrn.968376>, Croux et al. (2013) <doi:10.1080/00401706.2012.727746>, Todorov and Filzmoser (2013) <doi:10.1007/978-3-642-33042-1_31>. |
Authors: | Peter Filzmoser [aut], Heinrich Fritz [aut], Klaudius Kalcher [aut], Valentin Todorov [cre] |
Maintainer: | Valentin Todorov <[email protected]> |
License: | GPL (>= 3) |
Version: | 2.0-5 |
Built: | 2024-10-31 21:20:55 UTC |
Source: | CRAN |
Calculates Kendall's tau rank correlation coefficient in O (n log (n)) rather than O (n^2) as in the current R implementation.
cor.fk (x, y = NULL)
cor.fk (x, y = NULL)
x |
A vector, a matrix or a data frame of data. |
y |
A vector of data. |
The code of this implementation of the fast Kendall's tau correlation algorithm has
originally been published by David Simcha.
Due to it's runtime (O(n log n)
) it's essentially faster than the
current R implementation (O (n\^2)
), especially for large numbers of
observations.
The algorithm goes back to Knight (1966) and has been described more detailed
by Abrevaya (1999) and Christensen (2005).
The estimated correlation coefficient.
David Simcha, Heinrich Fritz, Christophe Croux, Peter Filzmoser <[email protected]>
Knight, W. R. (1966). A Computer Method for Calculating Kendall's Tau with Ungrouped Data.
Journal of the American Statistical Association, 314(61) Part 1, 436-439.
Christensen D. (2005). Fast algorithms for the calculation of Kendall's Tau.
Journal of Computational Statistics 20, 51-62.
Abrevaya J. (1999). Computation of the Maximum Rank Correlation Estimator.
Economic Letters 62, 279-285.
set.seed (100) ## creating test data n <- 1000 x <- rnorm (n) y <- x+ rnorm (n) tim <- proc.time ()[1] ## applying cor.fk cor.fk (x, y) cat ("cor.fk runtime [s]:", proc.time ()[1] - tim, "(n =", length (x), ")\n") tim <- proc.time ()[1] ## applying cor (standard R implementation) cor (x, y, method = "kendall") cat ("cor runtime [s]:", proc.time ()[1] - tim, "(n =", length (x), ")\n") ## applying cor and cor.fk on data containing Xt <- cbind (c (x, as.integer (x)), c (y, as.integer (y))) tim <- proc.time ()[1] ## applying cor.fk cor.fk (Xt) cat ("cor.fk runtime [s]:", proc.time ()[1] - tim, "(n =", nrow (Xt), ")\n") tim <- proc.time ()[1] ## applying cor (standard R implementation) cor (Xt, method = "kendall") cat ("cor runtime [s]:", proc.time ()[1] - tim, "(n =", nrow (Xt), ")\n")
set.seed (100) ## creating test data n <- 1000 x <- rnorm (n) y <- x+ rnorm (n) tim <- proc.time ()[1] ## applying cor.fk cor.fk (x, y) cat ("cor.fk runtime [s]:", proc.time ()[1] - tim, "(n =", length (x), ")\n") tim <- proc.time ()[1] ## applying cor (standard R implementation) cor (x, y, method = "kendall") cat ("cor runtime [s]:", proc.time ()[1] - tim, "(n =", length (x), ")\n") ## applying cor and cor.fk on data containing Xt <- cbind (c (x, as.integer (x)), c (y, as.integer (y))) tim <- proc.time ()[1] ## applying cor.fk cor.fk (Xt) cat ("cor.fk runtime [s]:", proc.time ()[1] - tim, "(n =", nrow (Xt), ")\n") tim <- proc.time ()[1] ## applying cor (standard R implementation) cor (Xt, method = "kendall") cat ("cor runtime [s]:", proc.time ()[1] - tim, "(n =", nrow (Xt), ")\n")
computes the covariance matrix from a princomp object. The number of components k can be given as input.
covPC(x, k, method)
covPC(x, k, method)
x |
an object of class princomp. |
k |
number of PCs to use for covariance estimation (optional). |
method |
method how the PCs have been estimated (optional). |
There are several possibilities to estimate the principal components (PCs)
from an input data matrix, including the functions PCAproj
and
PCAgrid
. This function uses the estimated PCs to reconstruct
the covariance matrix. Not all PCs have to be used, the number k of
PCs (first k PCs) can be given as input to the function.
cov |
the estimated covariance matrix |
center |
the center of the data, as provided from the princomp object. |
method |
a string describing the method that was used to calculate the PCs. |
Heinrich Fritz, Peter Filzmoser <[email protected]>
C. Croux, P. Filzmoser, M. Oliveira, (2007). Algorithms for Projection-Pursuit Robust Principal Component Analysis, Chemometrics and Intelligent Laboratory Systems, Vol. 87, pp. 218-225.
# multivariate data with outliers library(mvtnorm) x <- rbind(rmvnorm(200, rep(0, 6), diag(c(5, rep(1,5)))), rmvnorm( 15, c(0, rep(20, 5)), diag(rep(1, 6)))) pc <- princomp(x) covPC(pc, k=2)
# multivariate data with outliers library(mvtnorm) x <- rbind(rmvnorm(200, rep(0, 6), diag(c(5, rep(1,5)))), rmvnorm( 15, c(0, rep(20, 5)), diag(rep(1, 6)))) pc <- princomp(x) covPC(pc, k=2)
computes the robust covariance matrix using the PCAgrid
and
PCAproj
functions.
covPCAproj(x, control) covPCAgrid(x, control)
covPCAproj(x, control) covPCAgrid(x, control)
x |
a numeric matrix or data frame which provides the data. |
control |
a list whose elements must be the same as (or a subset of)
the parameters of the appropriate PCA function ( |
The functions covPCAproj
and covPCAgrid
use the functions
PCAproj
and PCAgrid
respectively to estimate
the covariance matrix of the data matrix x
.
cov |
the actual covariance matrix estimated from |
center |
the center of the data |
method |
a string describing the method that was used to calculate the covariance matrix estimation |
Heinrich Fritz, Peter Filzmoser <[email protected]>
C. Croux, P. Filzmoser, M. Oliveira, (2007). Algorithms for Projection-Pursuit Robust Principal Component Analysis, Chemometrics and Intelligent Laboratory Systems, Vol. 87, pp. 218-225.
# multivariate data with outliers library(mvtnorm) x <- rbind(rmvnorm(200, rep(0, 6), diag(c(5, rep(1,5)))), rmvnorm( 15, c(0, rep(20, 5)), diag(rep(1, 6)))) covPCAproj(x) # compare with classical covariance matrix: cov(x)
# multivariate data with outliers library(mvtnorm) x <- rbind(rmvnorm(200, rep(0, 6), diag(c(5, rep(1,5)))), rmvnorm( 15, c(0, rep(20, 5)), diag(rep(1, 6)))) covPCAproj(x) # compare with classical covariance matrix: cov(x)
Draws a sample data set, as introduced by Zou et al. (2006).
data.Zou (n = 250, p = c(4, 4, 2), ...)
data.Zou (n = 250, p = c(4, 4, 2), ...)
n |
The required number of observations. |
p |
A vector of length 3, specifying how many variables shall be constructed using the three factors V1, V2 and V3. |
... |
Further arguments passed to or from other functions. |
This data set has been introduced by Zou et al. (2006), and then been referred to several times, e.g. by Farcomeni (2009), Guo et al. (2010) and Croux et al. (2011).
The data set contains two latent factors V1 ~ N(0, 290) and V2 ~ N(0, 300) and
a third mixed component V3 = -0.3 V1 + 0.925V2 + e; e ~ N(0, 1).
The ten variables Xi of the original data set are constructed the following
way:
Xi = V1 + ei; i = 1, 2, 3, 4
Xi = V2 + ei; i = 5, 6, 7, 8
Xi = V3 + ei; i = 9, 10
whereas ei ~ N(0, 1) is indepependent for i = 1 , ..., 10
A matrix of dimension n x sum (p)
containing the generated sample data
set.
Heinrich Fritz, Peter Filzmoser <[email protected]>
C. Croux, P. Filzmoser, H. Fritz (2011). Robust Sparse Principal Component Analysis Based on Projection-Pursuit, ?? To appear.
A. Farcomeni (2009). An exact approach to sparse principal component analysis, Computational Statistics, Vol. 24(4), pp. 583-604.
J. Guo, G. James, E. Levina, F. Michailidis, and J. Zhu (2010). Principal component analysis with sparse fused loadings, Journal of Computational and Graphical Statistics. To appear.
H. Zou, T. Hastie, R. Tibshirani (2006). Sparse principal component analysis, Journal of Computational and Graphical Statistics, Vol. 15(2), pp. 265-286.
## data generation set.seed (0) x <- data.Zou () ## applying PCA pc <- princomp (x) ## the corresponding non-sparse loadings unclass (pc$load[,1:3]) pc$sdev[1:3] ## lambda as calculated in the opt.TPO - example lambda <- c (0.23, 0.34, 0.005) ## applying sparse PCA spc <- sPCAgrid (x, k = 3, lambda = lambda, method = "sd") unclass (spc$load) spc$sdev[1:3] ## comparing the non-sparse and sparse biplot par (mfrow = 1:2) biplot (pc, main = "non-sparse PCs") biplot (spc, main = "sparse PCs")
## data generation set.seed (0) x <- data.Zou () ## applying PCA pc <- princomp (x) ## the corresponding non-sparse loadings unclass (pc$load[,1:3]) pc$sdev[1:3] ## lambda as calculated in the opt.TPO - example lambda <- c (0.23, 0.34, 0.005) ## applying sparse PCA spc <- sPCAgrid (x, k = 3, lambda = lambda, method = "sd") unclass (spc$load) spc$sdev[1:3] ## comparing the non-sparse and sparse biplot par (mfrow = 1:2) biplot (pc, main = "non-sparse PCs") biplot (spc, main = "sparse PCs")
Computes the multivariate L1 median (also called spatial median) of a data matrix.
l1median(X, MaxStep = 200, ItTol = 10^-8, trace = 0, m.init = .colMedians (X))
l1median(X, MaxStep = 200, ItTol = 10^-8, trace = 0, m.init = .colMedians (X))
X |
A matrix containing the values whose multivariate L1 median is to be computed. |
MaxStep |
The maximum number of iterations. |
ItTol |
Tolerance for convergence of the algorithm. |
trace |
The tracing level. |
m.init |
An initial estimate. |
returns the vector of the coordinates of the L1 median.
Heinrich Fritz, Peter Filzmoser <[email protected]>
C. Croux, P. Filzmoser, M. Oliveira, (2007). Algorithms for Projection-Pursuit Robust Principal Component Analysis, Chemometrics and Intelligent Laboratory Systems, Vol. 87, pp. 218-225.
l1median(rnorm(100), trace = -1) # this returns the median of the sample # multivariate data with outliers library(mvtnorm) x <- rbind(rmvnorm(200, rep(0, 4), diag(c(1, 1, 2, 2))), rmvnorm( 50, rep(3, 4), diag(rep(2, 4)))) l1median(x, trace = -1) # compare with coordinate-wise median: apply(x,2,median)
l1median(rnorm(100), trace = -1) # this returns the median of the sample # multivariate data with outliers library(mvtnorm) x <- rbind(rmvnorm(200, rep(0, 4), diag(c(1, 1, 2, 2))), rmvnorm( 50, rep(3, 4), diag(rep(2, 4)))) l1median(x, trace = -1) # compare with coordinate-wise median: apply(x,2,median)
Computes the multivariate L1 median (also called spatial median) of a data matrix X
.
l1median_NM (X, maxit = 200, tol = 10^-8, trace = 0, m.init = .colMedians (X), ...) l1median_CG (X, maxit = 200, tol = 10^-8, trace = 0, m.init = .colMedians (X), ...) l1median_BFGS (X, maxit = 200, tol = 10^-8, trace = 0, m.init = .colMedians (X), REPORT = 10, ...) l1median_NLM (X, maxit = 200, tol = 10^-8, trace = 0, m.init = .colMedians (X), ...) l1median_HoCr (X, maxit = 200, tol = 10^-8, zero.tol = 1e-15, trace = 0, m.init = .colMedians (X), ...) l1median_VaZh (X, maxit = 200, tol = 10^-8, zero.tol = 1e-15, trace = 0, m.init = .colMedians (X), ...)
l1median_NM (X, maxit = 200, tol = 10^-8, trace = 0, m.init = .colMedians (X), ...) l1median_CG (X, maxit = 200, tol = 10^-8, trace = 0, m.init = .colMedians (X), ...) l1median_BFGS (X, maxit = 200, tol = 10^-8, trace = 0, m.init = .colMedians (X), REPORT = 10, ...) l1median_NLM (X, maxit = 200, tol = 10^-8, trace = 0, m.init = .colMedians (X), ...) l1median_HoCr (X, maxit = 200, tol = 10^-8, zero.tol = 1e-15, trace = 0, m.init = .colMedians (X), ...) l1median_VaZh (X, maxit = 200, tol = 10^-8, zero.tol = 1e-15, trace = 0, m.init = .colMedians (X), ...)
X |
a matrix of dimension |
maxit |
The maximum number of iterations to be performed. |
tol |
The convergence tolerance. |
trace |
The tracing level. Set |
m.init |
A vector of length |
REPORT |
A parameter internally passed to |
zero.tol |
The zero-tolerance level used in |
... |
Further parameters passed from other functions. |
The L1-median is computed using the built-in functions
optim
and nlm
.
These functions are a transcript of the L1median
method of package robustX
, porting as much code as possible into C++.
par |
A vector of length |
value |
The value of the objective function |
code |
The return code of the optimization algorithm. See |
iterations |
The number of iterations performed. |
iterations_gr |
When using a gradient function this value holds the number of times the gradient had to be computed. |
time |
The algorithms runtime in milliseconds. |
See the vignette "Compiling pcaPP for Matlab" which comes with this package to compile and use some of these functions in Matlab.
Heinrich Fritz, Peter Filzmoser <[email protected]>
# multivariate data with outliers library(mvtnorm) x <- rbind(rmvnorm(200, rep(0, 4), diag(c(1, 1, 2, 2))), rmvnorm( 50, rep(3, 4), diag(rep(2, 4)))) l1median_NM (x)$par l1median_CG (x)$par l1median_BFGS (x)$par l1median_NLM (x)$par l1median_HoCr (x)$par l1median_VaZh (x)$par # compare with coordinate-wise median: apply(x,2,median)
# multivariate data with outliers library(mvtnorm) x <- rbind(rmvnorm(200, rep(0, 4), diag(c(1, 1, 2, 2))), rmvnorm( 50, rep(3, 4), diag(rep(2, 4)))) l1median_NM (x)$par l1median_CG (x)$par l1median_BFGS (x)$par l1median_NLM (x)$par l1median_HoCr (x)$par l1median_VaZh (x)$par # compare with coordinate-wise median: apply(x,2,median)
Plots an objective function (TPO or BIC) of one or more sparse PCs for a series of lambdas.
objplot (x, k, ...)
objplot (x, k, ...)
x |
|
k |
This function displays the objective function's values of the
|
... |
Further arguments passed to or from other functions. |
This function operates on the return object of function
opt.TPO
or opt.BIC
.
The model (lambda
) selected by the minimization of the corresponding
criterion is highlighted by a dashed vertical line.
The component the argument k
refers to, corresponds to the
$pc.noord
item of argument x
.
For more info on the order of sparse PCs see the details section of
opt.TPO
.
Heinrich Fritz, Peter Filzmoser <[email protected]>
C. Croux, P. Filzmoser, H. Fritz (2011). Robust Sparse Principal Component Analysis Based on Projection-Pursuit, ?? To appear.
set.seed (0) ## generate test data x <- data.Zou (n = 250) k.max <- 3 ## max number of considered sparse PCs ## arguments for the sPCAgrid algorithm maxiter <- 25 ## the maximum number of iterations method <- "sd" ## using classical estimations ## Optimizing the TPO criterion oTPO <- opt.TPO (x, k.max = k.max, method = method, maxiter = maxiter) ## Optimizing the BIC criterion oBIC <- opt.BIC (x, k.max = k.max, method = method, maxiter = maxiter) ## Objective function vs. lambda par (mfrow = c (2, k.max)) for (i in 1:k.max) objplot (oTPO, k = i) for (i in 1:k.max) objplot (oBIC, k = i)
set.seed (0) ## generate test data x <- data.Zou (n = 250) k.max <- 3 ## max number of considered sparse PCs ## arguments for the sPCAgrid algorithm maxiter <- 25 ## the maximum number of iterations method <- "sd" ## using classical estimations ## Optimizing the TPO criterion oTPO <- opt.TPO (x, k.max = k.max, method = method, maxiter = maxiter) ## Optimizing the BIC criterion oBIC <- opt.BIC (x, k.max = k.max, method = method, maxiter = maxiter) ## Objective function vs. lambda par (mfrow = c (2, k.max)) for (i in 1:k.max) objplot (oTPO, k = i) for (i in 1:k.max) objplot (oBIC, k = i)
These functions compute a suggestion for the sparseness parameter
lambda
which is required by function sPCAgrid
.
A range of different values for lambda is tested and
according to an objective function, the best solution is selected.
Two different approaches (TPO and BIC) are available, which is further
discussed in the details section.
A graphical summary of the optimization can be obtained by plotting
the function's return value (plot.opt.TPO
,
plot.opt.BIC
for tradeoff curves or objplot
for an objective function plot).
opt.TPO (x, k.max = ncol (x), n.lambda = 30, lambda.max, ...) opt.BIC (x, k.max = ncol (x), n.lambda = 30, lambda.max, ...)
opt.TPO (x, k.max = ncol (x), n.lambda = 30, lambda.max, ...) opt.BIC (x, k.max = ncol (x), n.lambda = 30, lambda.max, ...)
x |
a numerical matrix or data frame of dimension ( |
k.max |
the maximum number of components which shall be considered for optimizing an objective function (optional). |
n.lambda |
the number of lambdas to be checked for each component (optional). |
lambda.max |
the maximum value of lambda to be checked (optional). If omitted, the lambda which yields "full sparseness" (i.e. loadings of only zeros and ones) is computed and used as default value. |
... |
further arguments passed to |
The choice for a particular lambda is done by optimizing an objective function,
which is calculated for a set of n.lambda
models with different
lambdas, ranging from 0 to lambda.max
. If lambda.max
is not
specified, the minimum lambda yielding "full sparseness" is used.
"Full sparseness" refers to a model with minimum possible absolute sum of
loadings, which in general implies only zeros and ones in the loadings matrix.
The user can choose between two optimization methods:
TPO (Tradeoff Product Optimization; see below), or the
BIC (see Guo et al., 2011; Croux et al., 2011).
The main difference is, that optimization based on the BIC always chooses the
same lambda for all PCs, and refers to a particular choice of k
,
the number of considered components.
TPO however is optimized separately for each component, and so delivers
different lambdas within a model and does not depend on a decision on k
.
This characteristic can be noticed in the return value of the function:
opt.TPO
returns a single model with k.max
PCs and
different values of lambda
for each PC.
On the contrary opt.BIC
returns k.max
distinct models
with k.max
different lambdas, whereas for each model a different
number of components k
has been considered for the optimization.
Applying the latter method, the user finally has to select one of these
k.max
models manually,
e.g. by considering the cumulated explained variance,
whereas the TPO method does not require any further decisions.
The tradeoff made in the context of sparse PCA refers to the loss of explained
variance vs. the gain of sparseness. TPO (Tradeoff Product Optimization)
maximizes the area under the tradeoff curve (see plot.opt.TPO
),
in particular it maximizes the explained variance multiplied by the number of
zero loadings of a particular component. As in this context the according
criterion is minimized, the negative product is considered.
Note that in the context of sparse PCA, there are two sorting orders of PCs,
which must be considered: Either according to the objective function's value,
(item $pc.noord
)or the variance of each PC(item $pc
).
As in none-sparse PCA the objective function is identical to the PCs'
variance, this is not an issue there.
The sPCAgrid algorithm delivers the components in decreasing order, according
to the objective function (which apart from the variance also includes sparseness
terms), whereas the method sPCAgrid
subsequently re-orders the
components according to their explained variance.
The functions return an S3 object of type "opt.TPO" or "opt.BIC" respectively, containing the following items:
pc |
An S3 object of type |
pc.noord |
An S3 object of type |
x |
The input data matrix as provided by the user. |
k.ini , opt
|
These items contain optimization information, as used in
functions |
Heinrich Fritz, Peter Filzmoser <[email protected]>
C. Croux, P. Filzmoser, H. Fritz (2011). Robust Sparse Principal Component Analysis Based on Projection-Pursuit, ?? To appear.
set.seed (0) ## generate test data x <- data.Zou (n = 250) k.max <- 3 ## max number of considered sparse PCs ## arguments for the sPCAgrid algorithm maxiter <- 25 ## the maximum number of iterations method <- "sd" ## using classical estimations ## Optimizing the TPO criterion oTPO <- opt.TPO (x, k.max = k.max, method = method, maxiter = maxiter) oTPO$pc ## the model selected by opt.TPO oTPO$pc$load ## and the according sparse loadings. ## Optimizing the BIC criterion oBIC <- opt.BIC (x, k.max = k.max, method = method, maxiter = maxiter) oBIC$pc[[1]] ## the first model selected by opt.BIC (k = 1) ## Tradeoff Curves: Explained Variance vs. sparseness par (mfrow = c (2, k.max)) for (i in 1:k.max) plot (oTPO, k = i) for (i in 1:k.max) plot (oBIC, k = i) ## Tradeoff Curves: Explained Variance vs. lambda par (mfrow = c (2, k.max)) for (i in 1:k.max) plot (oTPO, k = i, f.x = "lambda") for (i in 1:k.max) plot (oBIC, k = i, f.x = "lambda") ## Objective function vs. lambda par (mfrow = c (2, k.max)) for (i in 1:k.max) objplot (oTPO, k = i) for (i in 1:k.max) objplot (oBIC, k = i)
set.seed (0) ## generate test data x <- data.Zou (n = 250) k.max <- 3 ## max number of considered sparse PCs ## arguments for the sPCAgrid algorithm maxiter <- 25 ## the maximum number of iterations method <- "sd" ## using classical estimations ## Optimizing the TPO criterion oTPO <- opt.TPO (x, k.max = k.max, method = method, maxiter = maxiter) oTPO$pc ## the model selected by opt.TPO oTPO$pc$load ## and the according sparse loadings. ## Optimizing the BIC criterion oBIC <- opt.BIC (x, k.max = k.max, method = method, maxiter = maxiter) oBIC$pc[[1]] ## the first model selected by opt.BIC (k = 1) ## Tradeoff Curves: Explained Variance vs. sparseness par (mfrow = c (2, k.max)) for (i in 1:k.max) plot (oTPO, k = i) for (i in 1:k.max) plot (oBIC, k = i) ## Tradeoff Curves: Explained Variance vs. lambda par (mfrow = c (2, k.max)) for (i in 1:k.max) plot (oTPO, k = i, f.x = "lambda") for (i in 1:k.max) plot (oBIC, k = i, f.x = "lambda") ## Objective function vs. lambda par (mfrow = c (2, k.max)) for (i in 1:k.max) objplot (oTPO, k = i) for (i in 1:k.max) objplot (oBIC, k = i)
Computes a desired number of (sparse) (robust) principal components using the grid search algorithm in the plane. The global optimum of the objective function is searched in planes, not in the p-dimensional space, using regular grids in these planes.
PCAgrid (x, k = 2, method = c ("mad", "sd", "qn"), maxiter = 10, splitcircle = 25, scores = TRUE, zero.tol = 1e-16, center = l1median, scale, trace = 0, store.call = TRUE, control, ...) sPCAgrid (x, k = 2, method = c ("mad", "sd", "qn"), lambda = 1, maxiter = 10, splitcircle = 25, scores = TRUE, zero.tol = 1e-16, center = l1median, scale, trace = 0, store.call = TRUE, control, ...)
PCAgrid (x, k = 2, method = c ("mad", "sd", "qn"), maxiter = 10, splitcircle = 25, scores = TRUE, zero.tol = 1e-16, center = l1median, scale, trace = 0, store.call = TRUE, control, ...) sPCAgrid (x, k = 2, method = c ("mad", "sd", "qn"), lambda = 1, maxiter = 10, splitcircle = 25, scores = TRUE, zero.tol = 1e-16, center = l1median, scale, trace = 0, store.call = TRUE, control, ...)
x |
a numerical matrix or data frame of dimension ( |
k |
the desired number of components to compute |
method |
the scale estimator used to detect the direction with the
largest variance. Possible values are |
lambda |
the sparseness constraint's strength( |
maxiter |
the maximum number of iterations. |
splitcircle |
the number of directions in which the algorithm should search for the largest variance. The direction with the largest variance is searched for in the directions defined by a number of equally spaced points on the unit circle. This argument determines, how many such points are used to split the unit circle. |
scores |
A logical value indicating whether the scores of the principal component should be calculated. |
zero.tol |
the zero tolerance used internally for checking convergence, etc. |
center |
this argument indicates how the data is to be centered. It
can be a function like |
scale |
this argument indicates how the data is to be rescaled. It
can be a function like |
trace |
an integer value >= 0, specifying the tracing level. |
store.call |
a logical variable, specifying whether the function call shall be stored in the result structure. |
control |
a list which elements must be the same as (or a subset of) the parameters above. If the control object is supplied, the parameters from it will be used and any other given parameters are overridden. |
... |
further arguments passed to or from other functions. |
In contrast to PCAgrid
, the function sPCAgrid
computes sparse
principal components. The strength of the applied sparseness constraint is
specified by argument lambda
.
Similar to the function princomp
, there is a print
method
for the these objects that prints the results in a nice format and the
plot
method produces a scree plot (screeplot
). There is
also a biplot
method.
Angle halving is an extension of the original algorithm. In the original
algorithm, the search directions are determined by a number of points on the
unit circle in the interval [-pi/2 ; pi/2). Angle halving means this angle is
halved in each iteration, eg. for the first approximation, the above mentioned
angle is used, for the second approximation, the angle is halved to
[-pi/4 ; pi/4) and so on. This usually gives better results with less
iterations needed.
NOTE: in previous implementations angle halving could be suppressed by the
former argument "anglehalving
". This still can be done by setting
argument maxiter = 0
.
The function returns an object of class "princomp"
, i.e. a list
similar to the output of the function princomp
.
sdev |
the (robust) standard deviations of the principal components. |
loadings |
the matrix of variable loadings (i.e., a matrix whose columns
contain the eigenvectors). This is of class |
center |
the means that were subtracted. |
scale |
the scalings applied to each variable. |
n.obs |
the number of observations. |
scores |
if |
call |
the matched call. |
obj |
A vector containing the objective functions values. For function
|
lambda |
The lambda each component has been calculated with
( |
See the vignette "Compiling pcaPP for Matlab" which comes with this package to compile and use these functions in Matlab.
Heinrich Fritz, Peter Filzmoser <[email protected]>
C. Croux, P. Filzmoser, M. Oliveira, (2007). Algorithms for Projection-Pursuit Robust Principal Component Analysis, Chemometrics and Intelligent Laboratory Systems, Vol. 87, pp. 218-225.
C. Croux, P. Filzmoser, H. Fritz (2011). Robust Sparse Principal Component Analysis Based on Projection-Pursuit, ?? To appear.
# multivariate data with outliers library(mvtnorm) x <- rbind(rmvnorm(200, rep(0, 6), diag(c(5, rep(1,5)))), rmvnorm( 15, c(0, rep(20, 5)), diag(rep(1, 6)))) # Here we calculate the principal components with PCAgrid pc <- PCAgrid(x) # we could draw a biplot too: biplot(pc) # now we want to compare the results with the non-robust principal components pc <- princomp(x) # again, a biplot for comparison: biplot(pc) ## Sparse loadings set.seed (0) x <- data.Zou () ## applying PCA pc <- princomp (x) ## the corresponding non-sparse loadings unclass (pc$load[,1:3]) pc$sdev[1:3] ## lambda as calculated in the opt.TPO - example lambda <- c (0.23, 0.34, 0.005) ## applying sparse PCA spc <- sPCAgrid (x, k = 3, lambda = lambda, method = "sd") unclass (spc$load) spc$sdev[1:3] ## comparing the non-sparse and sparse biplot par (mfrow = 1:2) biplot (pc, main = "non-sparse PCs") biplot (spc, main = "sparse PCs")
# multivariate data with outliers library(mvtnorm) x <- rbind(rmvnorm(200, rep(0, 6), diag(c(5, rep(1,5)))), rmvnorm( 15, c(0, rep(20, 5)), diag(rep(1, 6)))) # Here we calculate the principal components with PCAgrid pc <- PCAgrid(x) # we could draw a biplot too: biplot(pc) # now we want to compare the results with the non-robust principal components pc <- princomp(x) # again, a biplot for comparison: biplot(pc) ## Sparse loadings set.seed (0) x <- data.Zou () ## applying PCA pc <- princomp (x) ## the corresponding non-sparse loadings unclass (pc$load[,1:3]) pc$sdev[1:3] ## lambda as calculated in the opt.TPO - example lambda <- c (0.23, 0.34, 0.005) ## applying sparse PCA spc <- sPCAgrid (x, k = 3, lambda = lambda, method = "sd") unclass (spc$load) spc$sdev[1:3] ## comparing the non-sparse and sparse biplot par (mfrow = 1:2) biplot (pc, main = "non-sparse PCs") biplot (spc, main = "sparse PCs")
Computes a desired number of (robust) principal components using the algorithm of Croux and Ruiz-Gazen (JMVA, 2005).
PCAproj(x, k = 2, method = c("mad", "sd", "qn"), CalcMethod = c("eachobs", "lincomb", "sphere"), nmax = 1000, update = TRUE, scores = TRUE, maxit = 5, maxhalf = 5, scale = NULL, center = l1median_NLM, zero.tol = 1e-16, control)
PCAproj(x, k = 2, method = c("mad", "sd", "qn"), CalcMethod = c("eachobs", "lincomb", "sphere"), nmax = 1000, update = TRUE, scores = TRUE, maxit = 5, maxhalf = 5, scale = NULL, center = l1median_NLM, zero.tol = 1e-16, control)
x |
a numeric matrix or data frame which provides the data for the principal components analysis. |
k |
desired number of components to compute |
method |
scale estimator used to detect the direction with the largest
variance. Possible values are |
CalcMethod |
the variant of the algorithm to be used. Possible values are
|
nmax |
maximum number of directions to search in each step (only when
using |
update |
a logical value indicating whether an update algorithm should be used. |
scores |
a logical value indicating whether the scores of the principal component should be calculated. |
maxit |
maximim number of iterations. |
maxhalf |
maximum number of steps for angle halving. |
scale |
this argument indicates how the data is to be rescaled. It
can be a function like |
center |
this argument indicates how the data is to be centered. It
can be a function like |
zero.tol |
the zero tolerance used internally for checking convergence, etc. |
control |
a list which elements must be the same as (or a subset of) the parameters above. If the control object is supplied, the parameters from it will be used and any other given parameters are overridden. |
Basically, this algrithm considers the directions of each observation
through the origin of the centered data as possible projection directions.
As this algorithm has some drawbacks, especially if ncol(x) > nrow(x)
in the data matrix, there are several improvements that can be used with this
algorithm.
update - An updating step basing on the algorithm for finding the
eigenvectors is added to the algorithm. This can be used with any
CalcMethod
sphere - Additional search directions are added using random directions. The random directions are determined using random data points generated from a p-dimensional multivariate standard normal distribution. These new data points are projected to the unit sphere, giving the new search directions.
lincomb - Additional search directions are added using linear
combinations of the observations. It is similar to the
"sphere"
- algorithm, but the new data points are generated using linear
combinations of the original data b_1*x_1 + ... + b_n*x_n
where the
coefficients b_i
come from a uniform distribution in the interval
[0, 1]
.
Similar to the function princomp
, there is a print
method
for the these objects that prints the results in a nice format and the plot
method produces a scree plot (screeplot
). There is also a
biplot
method.
The function returns a list of class "princomp"
, i.e. a list similar to the
output of the function princomp
.
sdev |
the (robust) standard deviations of the principal components. |
loadings |
the matrix of variable loadings (i.e., a matrix whose columns
contain the eigenvectors). This is of class |
center |
the means that were subtracted. |
scale |
the scalings applied to each variable. |
n.obs |
the number of observations. |
scores |
if |
call |
the matched call. |
Heinrich Fritz, Peter Filzmoser <[email protected]>
C. Croux, P. Filzmoser, M. Oliveira, (2007). Algorithms for Projection-Pursuit Robust Principal Component Analysis, Chemometrics and Intelligent Laboratory Systems, Vol. 87, pp. 218-225.
# multivariate data with outliers library(mvtnorm) x <- rbind(rmvnorm(200, rep(0, 6), diag(c(5, rep(1,5)))), rmvnorm( 15, c(0, rep(20, 5)), diag(rep(1, 6)))) # Here we calculate the principal components with PCAgrid pc <- PCAproj(x, 6) # we could draw a biplot too: biplot(pc) # we could use another calculation method and another objective function, and # maybe only calculate the first three principal components: pc <- PCAproj(x, 3, "qn", "sphere") biplot(pc) # now we want to compare the results with the non-robust principal components pc <- princomp(x) # again, a biplot for comparision: biplot(pc)
# multivariate data with outliers library(mvtnorm) x <- rbind(rmvnorm(200, rep(0, 6), diag(c(5, rep(1,5)))), rmvnorm( 15, c(0, rep(20, 5)), diag(rep(1, 6)))) # Here we calculate the principal components with PCAgrid pc <- PCAproj(x, 6) # we could draw a biplot too: biplot(pc) # we could use another calculation method and another objective function, and # maybe only calculate the first three principal components: pc <- PCAproj(x, 3, "qn", "sphere") biplot(pc) # now we want to compare the results with the non-robust principal components pc <- princomp(x) # again, a biplot for comparision: biplot(pc)
Computes Orthogonal Distances (OD) and Score Distances (SD) for already computed principal components using the projection pursuit technique.
PCdiagplot(x, PCobj, crit = c(0.975, 0.99, 0.999), ksel = NULL, plot = TRUE, plotbw = TRUE, raw = FALSE, colgrid = "black", ...)
PCdiagplot(x, PCobj, crit = c(0.975, 0.99, 0.999), ksel = NULL, plot = TRUE, plotbw = TRUE, raw = FALSE, colgrid = "black", ...)
x |
a numeric matrix or data frame which provides the data for the principal components analysis. |
PCobj |
|
crit |
quantile(s) used for the critical value(s) for OD and SD |
ksel |
range for the number of PCs to be used in the plot; if NULL all PCs provided are used |
plot |
if TRUE a plot is generated, otherwise only the values are returned |
plotbw |
if TRUE the plot uses gray, otherwise color representation |
raw |
if FALSE, the distribution of the SD will be transformed to approach chisquare distribution, otherwise the raw values are reported and used for plotting |
colgrid |
the color used for the grid lines in the plot |
... |
additional graphics parameters as used in |
Based on (robust) principal components, a diagnostics plot is made using Orthogonal Distance (OD) and Score Distance (SD). This plot can provide important information about the multivariate data structure.
ODist |
matrix with OD for each observation (rows) and each selected PC (cols) |
SDist |
matrix with SD for each observation (rows) and each selected PC (cols) |
critOD |
matrix with critical values for OD for each selected PC (rows) and each critical value (cols) |
critSD |
matrix with critical values for SD for each selected PC (rows) and each critical value (cols) |
Peter Filzmoser <[email protected]>
P. Filzmoser and H. Fritz (2007). Exploring high-dimensional data with robust principal components. In S. Aivazian, P. Filzmoser, and Yu. Kharin, editors, Proceedings of the Eighth International Conference on Computer Data Analysis and Modeling, volume 1, pp. 43-50, Belarusian State University, Minsk.
M. Hubert, P.J. Rousseeuwm, K. Vanden Branden (2005). ROBCA: a new approach to robust principal component analysis Technometrics 47, pp. 64-79.
# multivariate data with outliers library(mvtnorm) x <- rbind(rmvnorm(85, rep(0, 6), diag(c(5, rep(1,5)))), rmvnorm( 15, c(0, rep(20, 5)), diag(rep(1, 6)))) # Here we calculate the principal components with PCAgrid pcrob <- PCAgrid(x, k=6) resrob <- PCdiagplot(x,pcrob,plotbw=FALSE) # compare with classical method: pcclass <- PCAgrid(x, k=6, method="sd") resclass <- PCdiagplot(x,pcclass,plotbw=FALSE)
# multivariate data with outliers library(mvtnorm) x <- rbind(rmvnorm(85, rep(0, 6), diag(c(5, rep(1,5)))), rmvnorm( 15, c(0, rep(20, 5)), diag(rep(1, 6)))) # Here we calculate the principal components with PCAgrid pcrob <- PCAgrid(x, k=6) resrob <- PCdiagplot(x,pcrob,plotbw=FALSE) # compare with classical method: pcclass <- PCAgrid(x, k=6, method="sd") resclass <- PCdiagplot(x,pcclass,plotbw=FALSE)
Tradeoff curves of one or more sparse PCs for a series of lambdas, which contrast the loss of explained variance and the gain of sparseness.
## S3 method for class 'opt.TPO' plot(x, k, f.x = c ("l0", "pl0", "l1", "pl1", "lambda"), f.y = c ("var", "pvar"), ...) ## S3 method for class 'opt.BIC' plot(x, k, f.x = c ("l0", "pl0", "l1", "pl1", "lambda"), f.y = c ("var", "pvar"), ...)
## S3 method for class 'opt.TPO' plot(x, k, f.x = c ("l0", "pl0", "l1", "pl1", "lambda"), f.y = c ("var", "pvar"), ...) ## S3 method for class 'opt.BIC' plot(x, k, f.x = c ("l0", "pl0", "l1", "pl1", "lambda"), f.y = c ("var", "pvar"), ...)
x |
|
k |
This function plots the tradeoff curve of the
|
f.x , f.y
|
A string, specifying which information shall be plotted on the x and y - axis. See the details section for more information. |
... |
Further arguments passed to or from other functions. |
The argument f.x
can obtain the following values:
"l0"
: l0 - sparseness, which corresponds to the number of
zero loadings of the considered component(s).
"pl0"
: l0 - sparseness in percent (l0 - sparseness
ranges from 0
to p-1
for each component).
"l1"
: l1 - sparseness, which corresponds to
the negative sum of absolute
loadings of the considered component(s).
(The exact value displayed for a single component is
sqrt (p) - S
, with S
as the the absolute sum of loadings.)
As this value is a part of the objective function which selects
the candidate directions within the sPCAgrid
function,
this option is provided here.
"pl1"
The "l1 - sparseness" in percent (l1 - sparseness
ranges from 0
to sqrt (p-1)
for each component).
"lambda"
: The lambda used for computing a particular model.
The argument f.y
can obtain the following values:
"var"
: The (cumulated) explained variance of the considered
component(s). The value shown here is calculated using the variance
estimator specified via the method
argument of function
sPCAgrid
.
"pvar"
: The (cumulated) explained variance of the considered
component(s) in percent. The 100%-level is assumed as the sum of variances
of all columns of argument x
.
Again the same variance estimator is
used as specified via the method
argument of function
sPCAgrid
.
The subtitle summarizes the result of the applied criterion for selecting a value of lambda:
The name of the applied method (TPO/BIC).
The selected value of lambda
for the k
-th component
(opt.TPO
) or all computed components (opt.BIC
).
The empirical cumulated variance (ECV) of the first k
components
in percent.
The obtained l0-sparseness of the first k
components.
This function operates on the return object of function
opt.TPO
or opt.BIC
.
The model (lambda
) selected by the minimization of the corresponding
criterion is highlighted by a dashed vertical line.
The component the argument k
refers to, corresponds to the
$pc.noord
item of argument x
.
For more info on the order of sparse PCs see the details section of
opt.TPO
.
Heinrich Fritz, Peter Filzmoser <[email protected]>
C. Croux, P. Filzmoser, H. Fritz (2011). Robust Sparse Principal Component Analysis Based on Projection-Pursuit, ?? To appear.
set.seed (0) ## generate test data x <- data.Zou (n = 250) k.max <- 3 ## max number of considered sparse PCs ## arguments for the sPCAgrid algorithm maxiter <- 25 ## the maximum number of iterations method <- "sd" ## using classical estimations ## Optimizing the TPO criterion oTPO <- opt.TPO (x, k.max = k.max, method = method, maxiter = maxiter) ## Optimizing the BIC criterion oBIC <- opt.BIC (x, k.max = k.max, method = method, maxiter = maxiter) ## Tradeoff Curves: Explained Variance vs. sparseness par (mfrow = c (2, k.max)) for (i in 1:k.max) plot (oTPO, k = i) for (i in 1:k.max) plot (oBIC, k = i) ## Explained Variance vs. lambda par (mfrow = c (2, k.max)) for (i in 1:k.max) plot (oTPO, k = i, f.x = "lambda") for (i in 1:k.max) plot (oBIC, k = i, f.x = "lambda")
set.seed (0) ## generate test data x <- data.Zou (n = 250) k.max <- 3 ## max number of considered sparse PCs ## arguments for the sPCAgrid algorithm maxiter <- 25 ## the maximum number of iterations method <- "sd" ## using classical estimations ## Optimizing the TPO criterion oTPO <- opt.TPO (x, k.max = k.max, method = method, maxiter = maxiter) ## Optimizing the BIC criterion oBIC <- opt.BIC (x, k.max = k.max, method = method, maxiter = maxiter) ## Tradeoff Curves: Explained Variance vs. sparseness par (mfrow = c (2, k.max)) for (i in 1:k.max) plot (oTPO, k = i) for (i in 1:k.max) plot (oBIC, k = i) ## Explained Variance vs. lambda par (mfrow = c (2, k.max)) for (i in 1:k.max) plot (oTPO, k = i, f.x = "lambda") for (i in 1:k.max) plot (oBIC, k = i, f.x = "lambda")
allows a direct comparison of two estimations of the covariance matrix (e.g. resulting from covPC) in a plot.
plotcov(cov1, cov2, method1, labels1, method2, labels2, ndigits, ...)
plotcov(cov1, cov2, method1, labels1, method2, labels2, ndigits, ...)
cov1 |
a covariance matrix (from cov, covMcd, covPC, covPCAgrid, covPCAproj, etc. |
cov2 |
a covariance matrix (from cov, covMcd, covPC, covPCAgrid, covPCAproj, etc. |
method1 |
legend for ellipses of estimation method1 |
method2 |
legend for ellipses of estimation method2 |
labels1 |
legend for numbers of estimation method1 |
labels2 |
legend for numbers of estimation method2 |
ndigits |
number of digits to use for printing covariances, by default ndigits=4 |
... |
additional arguments for text or plot |
Since (robust) PCA can be used to re-compute the (robust) covariance matrix,
one might be interested to compare two different methods of covariance
estimation visually. This routine takes as input objects for the covariances
to compare the output of cov
, but also the return objects
from covPCAgrid
, covPCAproj
, covPC
,
and covMcd
. The comparison of the two covariance matrices
is done by numbers (the covariances) and by ellipses.
only the plot is generated
Heinrich Fritz, Peter Filzmoser <[email protected]>
C. Croux, P. Filzmoser, M. Oliveira, (2007). Algorithms for Projection-Pursuit Robust Principal Component Analysis, Chemometrics and Intelligent Laboratory Systems, Vol. 87, pp. 218-225.
# multivariate data with outliers library(mvtnorm) x <- rbind(rmvnorm(200, rep(0, 6), diag(c(5, rep(1,5)))), rmvnorm( 15, c(0, rep(20, 5)), diag(rep(1, 6)))) plotcov(covPCAproj(x),covPCAgrid(x))
# multivariate data with outliers library(mvtnorm) x <- rbind(rmvnorm(200, rep(0, 6), diag(c(5, rep(1,5)))), rmvnorm( 15, c(0, rep(20, 5)), diag(rep(1, 6)))) plotcov(covPCAproj(x),covPCAgrid(x))
Returns a scale estimation as calculated by the (robust) Qn estimator.
qn(x, corrFact)
qn(x, corrFact)
x |
a vector of data |
corrFact |
the finite sample bias correction factor. By default a value of ~ 2.219144 is used (assuming normality). |
The Qn estimator computes the first quartile of the pairwise absolute differences of all data values.
The estimated scale of the data.
Earlier implementations used a wrong correction factor for the final result. Thus qn estimations computed with package pcaPP version > 1.8-1 differ about 0.12% from earlier estimations (version <= 1.8-1).
See the vignette "Compiling pcaPP for Matlab" which comes with this package to compile and use this function in Matlab.
Heinrich Fritz, Peter Filzmoser <[email protected]>
P.J. Rousseeuw, C. Croux (1993) Alternatives to the Median Absolute Deviation, JASA, 88, 1273-1283.
# data with outliers x <- c(rnorm(100), rnorm(10, 10)) qn(x)
# data with outliers x <- c(rnorm(100), rnorm(10, 10)) qn(x)
Data is centered and rescaled (to have mean 0 and a standard deviation of 1).
ScaleAdv(x, center = mean, scale = sd)
ScaleAdv(x, center = mean, scale = sd)
x |
matrix containing the observations. If this is not a matrix, but
a data frame, it is automatically converted into a matrix using the function
|
center |
this argument indicates how the data is to be centered. It
can be a function like |
scale |
this argument indicates how the data is to be rescaled. It
can be a function like |
The default scale
being NULL
means that no rescaling is done.
The function returns a list containing
x |
centered and rescaled data matrix. |
center |
a vector of the centers of each column x. If you add to
each column of |
scale |
a vector of the scale factors of each column x. If you multiply
each column of |
Heinrich Fritz, Peter Filzmoser <[email protected]>
C. Croux, P. Filzmoser, M. Oliveira, (2007). Algorithms for Projection-Pursuit Robust Principal Component Analysis, Chemometrics and Intelligent Laboratory Systems, Vol. 87, pp. 218-225.
x <- rnorm(100, 10, 5) x <- ScaleAdv(x)$x # can be used with multivariate data too library(mvtnorm) x <- rmvnorm(100, 3:7, diag((7:3)^2)) res <- ScaleAdv(x, center = l1median, scale = mad) res # instead of using an estimator, you could specify the center and scale yourself too x <- rmvnorm(100, 3:7, diag((7:3)^2)) res <- ScaleAdv(x, 3:7, 7:3) res
x <- rnorm(100, 10, 5) x <- ScaleAdv(x)$x # can be used with multivariate data too library(mvtnorm) x <- rmvnorm(100, 3:7, diag((7:3)^2)) res <- ScaleAdv(x, center = l1median, scale = mad) res # instead of using an estimator, you could specify the center and scale yourself too x <- rmvnorm(100, 3:7, diag((7:3)^2)) res <- ScaleAdv(x, 3:7, 7:3) res