Title: Whitening and High-Dimensional Canonical Correlation Analysis
Description: Implements the whitening methods (ZCA, PCA, Cholesky, ZCA-cor, and PCA-cor) discussed in Kessy, Lewin, and Strimmer (2018) "Optimal whitening and decorrelation", <doi:10.1080/00031305.2016.1277159>, as well as the whitening approach to canonical correlation analysis allowing negative canonical correlations described in Jendoubi and Strimmer (2019) "A whitening approach to probabilistic canonical correlation analysis for omics data integration", <doi:10.1186/s12859-018-2572-9>. The package also offers functions to simulate random orthogonal matrices and to compute (correlation) loadings and explained variation. It also contains four example data sets (extended UCI wine data, TCGA LUSC data, nutrimouse data, extended pitprops data).
Authors: Korbinian Strimmer, Takoua Jendoubi, Agnan Kessy, Alex Lewin
Maintainer: Korbinian Strimmer <[email protected]>
License: GPL (>= 3)
Version: 1.4.0
Built: 2024-10-29 06:25:43 UTC
Source: CRAN
The "whitening" package implements the whitening methods (ZCA, PCA, Cholesky, ZCA-cor, and PCA-cor) discussed in Kessy, Lewin, and Strimmer (2018) as well as the whitening approach to canonical correlation analysis allowing negative canonical correlations described in Jendoubi and Strimmer (2019).
Korbinian Strimmer (https://strimmerlab.github.io/) with Takoua Jendoubi, Agnan Kessy, and Alex Lewin.
Kessy, A., A. Lewin, and K. Strimmer. 2018. Optimal whitening and decorrelation. The American Statistician. 72: 309-314. <DOI:10.1080/00031305.2016.1277159>
Jendoubi, T., and K. Strimmer 2019. A whitening approach to probabilistic canonical correlation analysis for omics data integration. BMC Bioinformatics 20: 15. <DOI:10.1186/s12859-018-2572-9>
Website: https://strimmerlab.github.io/software/whitening/
The package provides the functions whiteningMatrix, whiten, whiteningLoadings, explainedVariation, cca, and scca.
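For a quick start, here is a minimal sketch (mirroring the iris example from the whiten help page) that whitens a small data matrix and checks the result:

# quick-start sketch; assumes only base R and the whitening package
library("whitening")
data("iris")
X = as.matrix(iris[, 1:4])       # 150 x 4 data matrix
Z = whiten(X, method="ZCA-cor")  # ZCA-cor whitened data
zapsmall( cov(Z) )               # identity covariance after whitening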
corplot computes the correlation within and between X and Y and displays the three corresponding matrices visually.
loadplot computes the squared loadings for X and Y and plots the resulting matrices.
corplot(cca.out, X, Y)
loadplot(cca.out, numScores)
cca.out | output of the cca or scca function. |
X, Y | input data matrices. |
numScores | number of CCA scores shown in the plot. |
A plot.
Korbinian Strimmer (https://strimmerlab.github.io).
Part of the plot code was adapted from the img.matcor function in the CCA package and from the image.plot function in the fields package.
See also scca.
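For illustration, a minimal sketch reusing the LifeCycleSavings example from the cca/scca help page:

library("whitening")
data(LifeCycleSavings)
X = as.matrix( LifeCycleSavings[, 2:3] )    # pop15, pop75
Y = as.matrix( LifeCycleSavings[, -(2:3)] ) # sr, dpi, ddpi
cca.out = cca(X, Y, scale=TRUE)
corplot(cca.out, X, Y)  # within- and between-set correlations
loadplot(cca.out, 2)    # squared loadings for the first two CCA scores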
explainedVariation computes the explained variation for each whitened variable from the loadings (both covariance loadings and correlation loadings).
explainedVariation(Phi)
Phi | Loading matrix (with columns referring to whitened variables). |
explainedVariation computes for each column of the loading matrix the sum of squares of the elements in that column.
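In other words, the result is simply the column sums of the squared loadings; a minimal sketch using iris-based loadings (as in the example below):

library("whitening")
S = cov(as.matrix(iris[, 1:4]))
ldgs = whiteningLoadings(S, method="PCA-cor")
explainedVariation( ldgs$Psi )  # explained variation per whitened variable
colSums( ldgs$Psi^2 )           # the same quantity computed directly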
explainedVariation returns a vector with the explained variation contributed by each whitened variable.
Korbinian Strimmer (https://strimmerlab.github.io).
Kessy, A., A. Lewin, and K. Strimmer. 2018. Optimal whitening and decorrelation. The American Statistician. 72: 309-314. <DOI:10.1080/00031305.2016.1277159>
# load whitening library
library("whitening")

######

# example data set
# E. Anderson. 1935. The irises of the Gaspe Peninsula.
# Bull. Am. Iris Soc. 59: 2--5
data("iris")
X = as.matrix(iris[,1:4])
d = ncol(X) # 4
n = nrow(X) # 150
colnames(X) # "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"

# estimate covariance
S = cov(X)

# PCA-cor loadings
ldgs = whiteningLoadings(S, method="PCA-cor")

# Explained variation from correlation loadings
explainedVariation( ldgs$Psi )
The forina1986 dataset describes 27 properties of 178 samples of wine from three grape varieties (59 Barolo, 71 Grignolino, 48 Barbera) as reported in Forina et al. (1986).
data(forina1986)
A list containing the following components:
attrib collects measurements for 27 attributes of 178 wine samples.
type describes the variety ("Barolo", "Grignolino", or "Barbera").
This data set contains the full set of covariates described in Forina et al. (1986) except for Sulphate (variable 15 in Forina et al. 1986). These are: 1) Alcohol, 2) Sugar-free extract, 3) Fixed acidity, 4) Tartaric acid, 5) Malic acid, 6) Uronic acids, 7) pH, 8) Ash, 9) Alcalinity of ash, 10) Potassium, 11) Calcium, 12) Magnesium, 13) Phosphate, 14) Chloride, 15) Total phenols, 16) Flavanoids, 17) Nonflavanoid phenols, 18) Proanthocyanins, 19) Color intensity, 20) Hue, 21) OD280/OD315 of diluted wines, 22) OD280/OD315 of flavonoids, 23) Glycerol, 24) 2-3-butanediol, 25) Total nitrogen, 26) Proline, and 27) Methanol.
The UCI wine data set (https://archive.ics.uci.edu/ml/datasets/wine) is a subset of the Forina et al. (1986) data set comprising only 13 variables.
The original data matrix is available from https://www.researchgate.net/publication/271908647_Wines_MForina_CArmanino_MCastino_MUbigli_Multivariate_data_analysis_as_discriminating_method_of_the_origin_of_wines_Vitis_25_189-201_1986.
Forina, M., Armanino, C., Castino, M., and Ubigli, M. Multivariate data analysis as a discriminating method of the origin of wines. Vitis 25:189-201 (1986). https://ojs.openagrar.de/index.php/VITIS/article/view/5950.
# load whitening library
library("whitening")

# load Forina 1986 wine data set
data(forina1986)

table(forina1986$type)
#     Barolo Grignolino    Barbera
#         59         71         48

dim(forina1986$attrib) # 178 27
colnames(forina1986$attrib)
# [1] "Alcohol"                      "Sugar-free extract"
# [3] "Fixed acidity"                "Tartaric acid"
# [5] "Malic acid"                   "Uronic acids"
# [7] "pH"                           "Ash"
# [9] "Alcalinity of ash"            "Potassium"
#[11] "Calcium"                      "Magnesium"
#[13] "Phosphate"                    "Chloride"
#[15] "Total phenols"                "Flavanoids"
#[17] "Nonflavanoid phenols"         "Proanthocyanins"
#[19] "Color intensity"              "Hue"
#[21] "OD280/OD315 of diluted wines" "OD280/OD315 of flavonoids"
#[23] "Glycerol"                     "2-3-butanediol"
#[25] "Total nitrogen"               "Proline"
#[27] "Methanol"

# PCA-cor whitened data
Z = whiten(forina1986$attrib, method="PCA-cor")
wt = as.integer(forina1986$type)
plot(Z[,1], Z[,2], xlab=expression(paste(Z[1])), ylab=expression(paste(Z[2])),
  main="Forina 1986 Wine Data", sub="PCA-cor Whitening", col=wt, pch=wt+14)
legend("topright", levels(forina1986$type)[1:3], col=1:3, pch=(1:3)+14)

## relationship to UCI wine data

# UCI wine data is a subset
uciwine.attrib = forina1986$attrib[, c("Alcohol", "Malic acid", "Ash",
  "Alcalinity of ash", "Magnesium", "Total phenols", "Flavanoids",
  "Nonflavanoid phenols", "Proanthocyanins", "Color intensity", "Hue",
  "OD280/OD315 of diluted wines", "Proline")]

# two small differences compared to UCI wine data matrix
uciwine.attrib[172, "Color intensity"] # 9.9 but 9.899999 in UCI matrix
uciwine.attrib[71, "Hue"]              # 0.91 but 0.906 in UCI matrix
A preprocessed sample of gene expression and methylation data as well as selected clinical covariates for 130 patients with lung squamous cell carcinoma (LUSC) as available from The Cancer Genome Atlas (TCGA) database (Kandoth et al. 2013).
data(lusc)
lusc$rnaseq2 is a 130 x 206 matrix containing the calibrated gene expression levels of 206 genes for 130 patients.
lusc$methyl is a 130 x 234 matrix containing the methylation levels of 234 probes for 130 patients.
sex is a vector recording the sex (male vs. female) of the 130 patients.
packs is the number of cigarette packs per year smoked by each patient.
survivalTime is the number of days to last follow-up or the days to death.
censoringStatus is the vital status (0=alive, 1=dead).
This data set is used to illustrate CCA-based data integration in Jendoubi and Strimmer (2019) and is also described in Wan et al. (2016).
The data were retrieved from TCGA (Kandoth et al. 2013) using the TCGA2STAT tool, following the guidelines and preprocessing steps detailed in Wan et al. (2016).
Jendoubi, T., Strimmer, K.: A whitening approach to probabilistic canonical correlation analysis for omics data integration. BMC Bioinformatics 20:15 <DOI:10.1186/s12859-018-2572-9>
Kandoth, C., McLellan, M.D., Vandin, F., Ye, K., Niu, B., Lu, C., Xie, M., Zhang, Q., McMichael, J.F., Wyczalkowski, M.A., Leiserson, M.D.M., Miller, C.A., Welch, J.S., Walter, M.J., Wendl, M.C., Ley, T.J., Wilson, R.K., Raphael, B.J., Ding, L.: Mutational landscape and significance across 12 major cancer types. Nature 502, 333-339 (2013). <DOI:10.1038/nature12634>
Wan, Y.-W., Allen, G.I., Liu, Z.: TCGA2STAT: simple TCGA data access for integrated statistical analysis in R. Bioinformatics 32, 952–954 (2016). <DOI:10.1093/bioinformatics/btv677>
# load whitening library
library("whitening")

# load TCGA LUSC data set
data(lusc)

names(lusc)
#"rnaseq2"        "methyl"         "sex"            "packs"
#"survivalTime"   "censoringStatus"

dim(lusc$rnaseq2) # 130 206  gene expression
dim(lusc$methyl)  # 130 234  methylation level

## Not run:
library("survival")
s = Surv(lusc$survivalTime, lusc$censoringStatus)
plot(survfit(s ~ lusc$sex), xlab = "Years",
  ylab = "Probability of survival", lty=c(2,1), lwd=2)
legend("topright", legend = c("male", "female"), lty=c(1,2), lwd=2)
## End(Not run)
The nutrimouse dataset is a collection of gene expression and lipid measurements from a nutrigenomic study of 40 mice by Martin et al. (2007).
data(nutrimouse)
A list containing the following components:
gene collects gene expression of 120 genes in liver tissue for 40 mice.
lipid collects concentrations of 21 lipids for 40 mice.
diet describes the diet of each mouse ("coc", "fish", "lin", "ref", or "sun").
genotype describes the genotype of each mouse: wild type ("wt") or PPARalpha deficient ("ppar").
This data set is used to illustrate CCA-based data integration in Jendoubi and Strimmer (2019) and is also described in Gonzalez et al. (2008).
The original data are available in the CCA package by Gonzalez et al. (2008); see the nutrimouse data set in that package.
Gonzalez, I., Dejean, S., Martin, P.G.P, Baccini, A. CCA: an R package to extend canonical correlation analysis. J. Statist. Software 23:1–13 (2008)
Jendoubi, T., Strimmer, K.: A whitening approach to probabilistic canonical correlation analysis for omics data integration. BMC Bioinformatics 20:15 <DOI:10.1186/s12859-018-2572-9>
Martin, P.G.P., Guillou, H., Lasserre, F., Dejean, S., Lan, A., Pascussi, J.-M., Cristobal, M.S., Legrand, P., Besse, P., Pineau, T.: Novel aspects of PPARalpha-mediated regulation of lipid and xenobiotic metabolism revealed through a multigenomic study. Hepatology 54, 767–777 (2007) <DOI:10.1002/hep.21510>
# load whitening library
library("whitening")

# load nutrimouse data set
data(nutrimouse)

dim(nutrimouse$gene)  # 40 120
dim(nutrimouse$lipid) # 40 21

levels( nutrimouse$diet )     # "coc" "fish" "lin" "ref" "sun"
levels( nutrimouse$genotype ) # "wt" "ppar"
Pit prop timber is used in construction to build mining tunnels.
The pitprops14 data set is described in Jeffers (1967) and is a correlation matrix calculated from 14 physical properties measured on 180 pit props made from the wood of Corsican pines grown in East Anglia, UK.
data(pitprops14)
A correlation matrix of dimension 14 times 14.
The 14 variables are described in Jeffers (1967) as follows:
topdiam: the top diameter of the prop in inches;
length: the length of the prop in inches;
moist: the moisture content of the prop, expressed as a percentage of the dry weight;
testsg: the specific gravity of the timber at the time of the test;
ovensg: the oven-dry specific gravity of the timber;
ringtop: the number of annual rings at the top of the prop;
ringbut: the number of annual rings at the base of the prop;
bowmax: the maximum bow in inches;
bowdist: the distance of the point of maximum bow from the top of the prop in inches;
whorls: the number of knot whorls;
clear: the length of clear prop from the top of the prop in inches;
knots: the average number of knots per whorl;
diaknot: the average diameter of the knots in inches;
maxcs: the maximum compressive strength in lb per square inch.
The data set is printed in Jeffers (1967) in Table 2 and Table 5.
Jeffers, J. N. R. 1967. Two case studies in the application of principal component analysis. JRSS C (Applied Statistics) 16: 225-236. <DOI:10.2307/2985919>
# load whitening library
library("whitening")

# load pitprops14 data set
data(pitprops14)
colnames(pitprops14)

# correlation matrix for the first 13 variables
pitprops13 = pitprops14[1:13, 1:13]

# correlation loadings for PCA whitening
Psi = whiteningLoadings(pitprops13, "PCA")$Psi

# corresponding explained variation
Psi.explained = explainedVariation(Psi)

# the first six whitened variables account for 87% of the variation
cumsum(Psi.explained)/13*100
scca computes canonical correlations and directions using a shrinkage estimate of the joint correlation matrix of X and Y.
cca computes canonical correlations and directions based on empirical correlations.
scca(X, Y, lambda.cor, scale=TRUE, verbose=TRUE)
cca(X, Y, scale=TRUE)
X | First data matrix, with samples in rows and variables in columns. |
Y | Second data matrix, with samples in rows and variables in columns. |
lambda.cor | Shrinkage intensity for estimating the joint correlation matrix - see cor.shrink in the corpcor package. If not specified it is estimated from the data. |
scale | Determines whether canonical directions are computed for standardized or raw data. Note that if data are not standardized the canonical directions contain the scale of the variables. |
verbose | Report shrinkage intensities. |
The canonical directions in this function are scaled in such a way that they correspond to whitening matrices - see Jendoubi and Strimmer (2019) for details. Note that the sign convention for the canonical directions employed here purposely allows for both positive and negative canonical correlations.
The function scca uses some clever matrix algebra to avoid the computation of full correlation matrices, and hence can be applied to high-dimensional data sets - see Jendoubi and Strimmer (2019) for details.
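As an illustration, here is a hedged sketch applying scca to the nutrimouse data included in this package (assuming that lambda.cor, if not specified, is estimated from the data):

library("whitening")
data(nutrimouse)
X = as.matrix( nutrimouse$gene )   # 40 x 120
Y = as.matrix( nutrimouse$lipid )  # 40 x 21
scca.out = scca(X, Y, scale=TRUE)  # shrinkage intensity estimated automatically
scca.out$lambda                    # shrinkage-based canonical correlations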
cca is a shortcut for running scca with lambda.cor=0 and verbose=FALSE.
If scale=FALSE the standard deviations needed for the canonical directions are estimated by apply(X, 2, sd) and apply(Y, 2, sd).
If X or Y contains only a single variable the correlation-adjusted cross-correlations reduce to the CAR score (see the function carscore in the care package) described in Zuber and Strimmer (2011).
scca and cca return a list with the following components:

K - the correlation-adjusted cross-correlations.

lambda - the canonical correlations.

WX and WY - the whitening matrices for X and Y, with the canonical directions in the rows. If scale=FALSE the canonical directions include the scale of the data; if scale=TRUE only correlations are needed to compute the canonical directions.

PhiX and PhiY - the loadings for X and Y. If scale=TRUE these are the correlation loadings, i.e. the correlations between the whitened variables and the original variables.

scale - whether the data were standardized (if scale=FALSE the canonical directions include the scale of the data).

lambda.cor - the shrinkage intensity used for estimating the correlations (0 for the empirical estimator).

lambda.cor.estimated - indicates whether the shrinkage intensity was specified or estimated.
Korbinian Strimmer (https://strimmerlab.github.io) with Takoua Jendoubi.
Jendoubi, T., and K. Strimmer 2019. A whitening approach to probabilistic canonical correlation analysis for omics data integration. BMC Bioinformatics 20: 15. <DOI:10.1186/s12859-018-2572-9>
Zuber, V., and K. Strimmer. 2011. High-dimensional regression and variable selection using CAR scores. Statist. Appl. Genet. Mol. Biol. 10: 34. <DOI:10.2202/1544-6115.1730>
See also cancor and whiteningMatrix.
# load whitening library
library("whitening")

# example data set
data(LifeCycleSavings)
X = as.matrix( LifeCycleSavings[, 2:3] )
Y = as.matrix( LifeCycleSavings[, -(2:3)] )
n = nrow(X)
colnames(X) # "pop15" "pop75"
colnames(Y) # "sr" "dpi" "ddpi"

# CCA
cca.out = cca(X, Y, scale=TRUE)
cca.out$lambda # canonical correlations
cca.out$WX     # whitening matrix / canonical directions X
cca.out$WY     # whitening matrix / canonical directions Y
cca.out$K      # correlation-adjusted cross-correlations
cca.out$PhiX   # correlation loadings X
cca.out$PhiY   # correlation loadings Y

corplot(cca.out, X, Y)
loadplot(cca.out, 2)

# column sums of squared correlation loadings add to 1
colSums(cca.out$PhiX^2)

# CCA whitened data
CCAX = tcrossprod( scale(X), cca.out$WX )
CCAY = tcrossprod( scale(Y), cca.out$WY )
zapsmall(cov(CCAX))
zapsmall(cov(CCAY))
zapsmall(cov(CCAX,CCAY)) # canonical correlations

# compare with built-in function cancor
# note different signs in correlations and directions!
cancor.out = cancor(scale(X), scale(Y))
cancor.out$cor # canonical correlations
t(cancor.out$xcoef)*sqrt(n-1) # canonical directions X
t(cancor.out$ycoef)*sqrt(n-1) # canonical directions Y

## see "User guides, package vignettes and other documentation"
## for examples with high-dimensional data using the scca function
simOrtho generates a random orthogonal matrix.
simOrtho(d, nonNegDiag = FALSE)
d | The dimension of the orthogonal matrix. |
nonNegDiag | Force the elements on the diagonal to be nonnegative (default: FALSE). |
The algorithm is based on QR decomposition of a random matrix, see Section 3 page 404 in Stewart (1980).
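As a rough sketch of this kind of construction (not necessarily the package's exact implementation): take the QR decomposition of a standard normal random matrix and fix the signs using the diagonal of R, which yields a uniformly (Haar) distributed orthogonal matrix.

d = 4
M = matrix(rnorm(d*d), d, d)                 # random matrix with standard normal entries
QR = qr(M)
Q = qr.Q(QR) %*% diag(sign(diag(qr.R(QR))))  # sign correction via the diagonal of R
zapsmall( crossprod(Q) )                     # identity, so Q is orthogonal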
simOrtho returns a real matrix of size d times d.
Korbinian Strimmer (https://strimmerlab.github.io).
G.W. Stewart. 1980. The efficient generation of random orthogonal matrices with an application to condition estimators. SIAM J. Numer. Anal. 17:403-409. <DOI:10.1137/0717034>
# load whitening library
library("whitening")

# simulate random orthogonal matrix
Q = simOrtho(4) # matrix of dimension 4x4
Q

zapsmall( crossprod(Q) )
zapsmall( tcrossprod(Q) )
whiten whitens a data matrix using the empirical covariance matrix as the basis for computing the whitening transformation.
whiten(X, center=FALSE, method=c("ZCA", "ZCA-cor", "PCA", "PCA-cor", "Cholesky"))
X | Data matrix, with samples in rows and variables in columns. |
center | Center columns to mean zero. |
method | Determines the type of whitening transformation (see Details). |
The following whitening approaches can be selected:
method="ZCA"
and method="ZCA-cov"
: ZCA whitening, also known as Mahalanobis whitening, ensures that the average covariance between whitened and orginal variables is maximal.
method="ZCA-cor"
: Likewise, ZCA-cor whitening leads to whitened variables that are maximally correlated (on average) with the original variables.
method="PCA"
and method="PCA-cov"
: In contrast, PCA whitening lead to maximally compressed whitened variables, as measured by squared covariance.
method="PCA-cor"
: PCA-cor whitening is similar to PCA whitening but uses squared correlations.
method="Cholesky"
: computes a whitening matrix by applying Cholesky decomposition. This yields both a lower triangular positive diagonal whitening matrix and lower triangular positive diagonal loadings (cross-covariance and cross-correlation).
Note that Cholesky whitening depends on the ordering of input variables. In the convention used here the first input variable is linked with the first latent variable only, the second input variable is linked to the first and second latent variable only, and so on, and the last variable is linked to all latent variables.
ZCA-cor whitening is implicitly employed in computing CAT and CAR scores used for variable selection in classification and regression, see the functions catscore in the sda package and carscore in the care package.
In both PCA and PCA-cor whitening there is a sign-ambiguity in the eigenvector matrices. In order to resolve the sign-ambiguity we use eigenvector matrices with a positive diagonal so that PCA and PCA-cor cross-correlations and cross-covariances have a positive diagonal for the given ordering of the original variables.
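A short sketch checking this sign convention on the iris data (an illustrative example, not part of the package's own examples):

library("whitening")
S = cov(as.matrix(iris[, 1:4]))
ldgs.pca = whiteningLoadings(S, method="PCA")
diag(ldgs.pca$Phi) > 0  # cross-covariances: positive diagonal
diag(ldgs.pca$Psi) > 0  # cross-correlations: positive diagonal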
For details see Kessy, Lewin, and Strimmer (2018).
Canonical correlation analysis (CCA) can also be understood as a special form of whitening (also implemented in this package).
whiten returns the whitened data matrix Z.
Korbinian Strimmer (https://strimmerlab.github.io) with Agnan Kessy and Alex Lewin.
Kessy, A., A. Lewin, and K. Strimmer. 2018. Optimal whitening and decorrelation. The American Statistician. 72: 309-314. <DOI:10.1080/00031305.2016.1277159>
See also whiteningMatrix, whiteningLoadings, and scca.
# load whitening library
library("whitening")

######

# example data set
# E. Anderson. 1935. The irises of the Gaspe Peninsula.
# Bull. Am. Iris Soc. 59: 2--5
data("iris")
X = as.matrix(iris[,1:4])
d = ncol(X) # 4
n = nrow(X) # 150
colnames(X) # "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"

# whitened data
Z.ZCAcor = whiten(X, method="ZCA-cor")

# check covariance matrix
zapsmall( cov(Z.ZCAcor) )
whiteningLoadings computes the loadings (= cross-covariance matrix Phi) between the original and the whitened variables as well as the correlation loadings (= cross-correlation matrix Psi). The original variables are in the rows and the whitened variables in the columns.
whiteningLoadings(Sigma, method=c("ZCA", "ZCA-cor", "PCA", "PCA-cor", "Cholesky"))
Sigma | Covariance matrix. |
method | Determines the type of whitening transformation. |
Phi is the cross-covariance matrix between the original and the whitened variables. It satisfies the constraint Phi Phi' = Sigma. The cross-covariance matrix is also the inverse of the whitening matrix, Phi = W^-1, and is therefore relevant in inverse whitening transformations (= coloring transformations) x = Phi z.
Psi is the cross-correlation matrix between the original and the whitened variables.
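A brief sketch of the inverse (coloring) transformation x = Phi z, using centered iris data (an illustrative example, not from the package documentation):

library("whitening")
X = scale(as.matrix(iris[, 1:4]), center=TRUE, scale=FALSE)  # centered data
S = cov(X)
W   = whiteningMatrix(S, method="ZCA-cor")
Phi = whiteningLoadings(S, method="ZCA-cor")$Phi
Z = tcrossprod(X, W)       # whitened data:  z = W x
Xrec = tcrossprod(Z, Phi)  # colored back:   x = Phi z
all.equal(Xrec, X, check.attributes=FALSE)  # TRUE (up to numerical error)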
The following different whitening approaches can be selected:
method="ZCA"
: ZCA whitening, also known as Mahalanobis whitening, ensures that the average covariance between whitened and orginal variables is maximal.
method="ZCA-cor"
: Likewise, ZCA-cor whitening leads to whitened variables that are maximally correlated (on average) with the original variables.
method="PCA"
: In contrast, PCA whitening lead to maximally compressed whitened variables, as measured by squared covariance.
method="PCA-cor"
: PCA-cor whitening is similar to PCA whitening but uses squared correlations.
method="Cholesky"
: computes a whitening matrix by applying Cholesky decomposition. This yields both a lower triangular positive diagonal whitening matrix and lower triangular positive diagonal loadings (cross-covariance and cross-correlation).
Note that Cholesky whitening depends on the ordering of input variables. In the convention used here the first input variable is linked with the first latent variable only, the second input variable is linked to the first and second latent variable only, and so on, and the last variable is linked to all latent variables.
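A small sketch illustrating the triangular structure of the Cholesky loadings (illustrative only, not from the package documentation):

library("whitening")
S = cov(as.matrix(iris[, 1:4]))
ldgs.chol = whiteningLoadings(S, method="Cholesky")
ldgs.chol$Phi  # lower triangular loadings with positive diagonal
ldgs.chol$Psi  # lower triangular correlation loadings with positive diagonal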
ZCA-cor whitening is implicitly employed in computing CAT and CAR scores used for variable selection in classification and regression, see the functions catscore in the sda package and carscore in the care package.
In both PCA and PCA-cor whitening there is a sign-ambiguity in the eigenvector matrices. In order to resolve the sign-ambiguity we use eigenvector matrices with a positive diagonal so that PCA and PCA-cor cross-correlations and cross-covariances have a positive diagonal for the given ordering of the original variables.
For details see Kessy, Lewin, and Strimmer (2018).
whiteningLoadings returns a list with the following items:
Phi - the cross-covariance matrix (the loadings).
Psi - the cross-correlation matrix (the correlation loadings).
Korbinian Strimmer (https://strimmerlab.github.io).
Kessy, A., A. Lewin, and K. Strimmer. 2018. Optimal whitening and decorrelation. The American Statistician. 72: 309-314. <DOI:10.1080/00031305.2016.1277159>
See also explainedVariation, whiten, and whiteningMatrix.
# load whitening library
library("whitening")

######

# example data set
# E. Anderson. 1935. The irises of the Gaspe Peninsula.
# Bull. Am. Iris Soc. 59: 2--5
data("iris")
X = as.matrix(iris[,1:4])
d = ncol(X) # 4
n = nrow(X) # 150
colnames(X) # "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"

# estimate covariance
S = cov(X)

# ZCA-cor whitening matrix
W.ZCAcor = whiteningMatrix(S, method="ZCA-cor")

# ZCA-cor loadings
ldgs = whiteningLoadings(S, method="ZCA-cor")
ldgs

# cross-covariance matrix
Phi.ZCAcor = ldgs$Phi

# check constraint of cross-covariance matrix
tcrossprod(Phi.ZCAcor)
S

# cross-covariance matrix aka loadings is equal to the inverse whitening matrix
Phi.ZCAcor
solve(W.ZCAcor)
whiteningMatrix computes the whitening matrix W.
whiteningMatrix(Sigma, method=c("ZCA", "ZCA-cor", "PCA", "PCA-cor", "Cholesky"))
Sigma | Covariance matrix. |
method | Determines the type of whitening transformation. |
Whitening is a linear transformation z = W x, where the whitening matrix W satisfies the constraint W' W = Sigma^-1, with Sigma = var(x) the covariance matrix of the original variables.
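As a sketch, this constraint can be verified numerically for every method (an illustrative check, not part of the original examples):

library("whitening")
S = cov(as.matrix(iris[, 1:4]))
for (m in c("ZCA", "ZCA-cor", "PCA", "PCA-cor", "Cholesky")) {
  W = whiteningMatrix(S, method=m)
  # W' W should equal the inverse covariance matrix
  print( all.equal(crossprod(W), solve(S), check.attributes=FALSE) )
}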
This function implements various natural whitening transformations discussed in Kessy, Lewin, and Strimmer (2018).
The following different whitening approaches can be selected:
method="ZCA"
: ZCA whitening, also known as Mahalanobis whitening, ensures that the average covariance between whitened and orginal variables is maximal.
method="ZCA-cor"
: Likewise, ZCA-cor whitening leads to whitened variables that are maximally correlated (on average) with the original variables.
method="PCA"
: In contrast, PCA whitening lead to maximally compressed whitened variables, as measured by squared covariance.
method="PCA-cor"
: PCA-cor whitening is similar to PCA whitening but uses squared correlations.
method="Cholesky"
: computes a whitening matrix by applying Cholesky decomposition. This yields both a lower triangular positive diagonal whitening matrix and lower triangular positive diagonal loadings (cross-covariance and cross-correlation).
Note that Cholesky whitening depends on the ordering of input variables. In the convention used here the first input variable is linked with the first latent variable only, the second input variable is linked to the first and second latent variable only, and so on, and the last variable is linked to all latent variables.
ZCA-cor whitening is implicitly employed in computing CAT and CAR scores used for variable selection in classification and regression, see the functions catscore in the sda package and carscore in the care package.
In both PCA and PCA-cor whitening there is a sign-ambiguity in the eigenvector matrices. In order to resolve the sign-ambiguity we use eigenvector matrices with a positive diagonal so that PCA and PCA-cor cross-correlations and cross-covariances have a positive diagonal for the given ordering of the original variables.
For details see Kessy, Lewin, and Strimmer (2018).
Canonical correlation analysis (CCA) can also be understood as a special form of whitening (also implemented in this package).
whiteningMatrix returns a square whitening matrix W.
Korbinian Strimmer (https://strimmerlab.github.io) with Agnan Kessy and Alex Lewin.
Kessy, A., A. Lewin, and K. Strimmer. 2018. Optimal whitening and decorrelation. The American Statistician. 72: 309-314. <DOI:10.1080/00031305.2016.1277159>
See also whiten, whiteningLoadings, and scca.
# load whitening library
library("whitening")

# example data set
# E. Anderson. 1935. The irises of the Gaspe Peninsula.
# Bull. Am. Iris Soc. 59: 2--5
data("iris")
X = as.matrix(iris[,1:4])
d = ncol(X) # 4
n = nrow(X) # 150
colnames(X) # "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"

# estimate covariance
S = cov(X)

# ZCA-cor whitening matrix
W.ZCAcor = whiteningMatrix(S, method="ZCA-cor")

# check constraint on the whitening matrix
crossprod(W.ZCAcor)
solve(S)

# whitened data
Z = tcrossprod(X, W.ZCAcor)
Z
zapsmall( cov(Z) )