Title: | Shrinkage Discriminant Analysis and CAT Score Variable Selection |
---|---|
Description: | Provides an efficient framework for high-dimensional linear and diagonal discriminant analysis with variable selection. The classifier is trained using James-Stein-type shrinkage estimators and predictor variables are ranked using correlation-adjusted t-scores (CAT scores). Variable selection error is controlled using false non-discovery rates or higher criticism. |
Authors: | Miika Ahdesmaki, Verena Zuber, Sebastian Gibb, and Korbinian Strimmer |
Maintainer: | Korbinian Strimmer <[email protected]> |
License: | GPL (>= 3) |
Version: | 1.3.8 |
Built: | 2024-10-13 07:22:45 UTC |
Source: | CRAN |
This package performs linear discriminant analysis (LDA) and diagonal discriminant analysis (DDA) with variable selection using correlation-adjusted t (CAT) scores.
The classifier is trained using James-Stein-type shrinkage estimators. Variable selection is based on ranking predictors by CAT scores (LDA) or t-scores (DDA). A cutoff is chosen by false non-discovery rate (FNDR) or higher criticism (HC) thresholding.
This approach is particularly suited for high-dimensional classification with correlation among predictors. For details see Zuber and Strimmer (2009) and Ahdesm\"aki and Strimmer (2010).
Typically the functions in this package are applied in three steps:
Step 1:feature selection with sda.ranking
,
Step 2:training the classifier with sda
, and
Step 3:classification using predict.sda
.
The accompanying web site (see below) provides example R scripts to illustrate the functionality of this package.
Miika Ahdesm\"aki, Verena Zuber and Korbinian Strimmer (https://strimmerlab.github.io/)
Ahdesm\"aki, A., and K. Strimmer. 2010. Feature selection in omics prediction problems using cat scores and false non-discovery rate control. Ann. Appl. Stat. 4: 503-519. <DOI:10.1214/09-AOAS277>
Zuber, V., and K. Strimmer. 2009. Gene ranking and biomarker discovery under correlation. Bioinformatics 25: 2700-2707. <DOI:10.1093/bioinformatics/btp460>
See website: https://strimmerlab.github.io/software/sda/
catscore
, sda.ranking
, sda
, predict.sda
.
catscore
computes CAT scores
(correlation-adjusted t-scores)
between the group centroids and the pooled mean.
catscore(Xtrain, L, lambda, lambda.var, lambda.freqs, diagonal=FALSE, verbose=TRUE)
catscore(Xtrain, L, lambda, lambda.var, lambda.freqs, diagonal=FALSE, verbose=TRUE)
Xtrain |
A matrix containing the training data set. Note that the rows correspond to observations and the columns to variables. |
L |
A factor with the class labels of the training samples. |
lambda |
Shrinkage intensity for the correlation matrix. If not specified it is
estimated from the data. |
lambda.var |
Shrinkage intensity for the variances. If not specified it is
estimated from the data. |
lambda.freqs |
Shrinkage intensity for the frequencies. If not specified it is
estimated from the data. |
diagonal |
for |
verbose |
Print out some info while computing. |
CAT scores generalize conventional t-scores to account for correlation among predictors (Zuber and Strimmer 2009). If there is no correlation then CAR scores reduce to t-scores. The squared CAR scores provide a decomposition of Hotelling's T^2 statistic.
CAT scores for two classes are described in Zuber and Strimmer (2009), for the multi-class case see Ahdesm\"aki and Strimmer (2010).
The scale factors for t-scores and CAT-scores are computed from the estimated frequencies
(for empirical scale factors set lambda.freqs=0
).
catscore
returns a matrix containing the cat score (or t-score) between
each group centroid and the pooled mean for each feature.
Verena Zuber, Miika Ahdesm\"aki and Korbinian Strimmer (https://strimmerlab.github.io).
Ahdesm\"aki, A., and K. Strimmer. 2010. Feature selection in omics prediction problems using cat scores and false non-discovery rate control. Ann. Appl. Stat. 4: 503-519. <DOI:10.1214/09-AOAS277>
Zuber, V., and K. Strimmer. 2009. Gene ranking and biomarker discovery under correlation. Bioinformatics 25: 2700-2707. <DOI:10.1093/bioinformatics/btp460>
# load sda library library("sda") ################# # training data # ################# # prostate cancer set data(singh2002) # training data Xtrain = singh2002$x Ytrain = singh2002$y dim(Xtrain) #################################################### # shrinkage t-score (DDA setting - no correlation) # #################################################### tstat = catscore(Xtrain, Ytrain, diagonal=TRUE) dim(tstat) tstat[1:10,] ######################################################## # shrinkage CAT score (LDA setting - with correlation) # ######################################################## cat = catscore(Xtrain, Ytrain, diagonal=FALSE) dim(cat) cat[1:10,]
# load sda library library("sda") ################# # training data # ################# # prostate cancer set data(singh2002) # training data Xtrain = singh2002$x Ytrain = singh2002$y dim(Xtrain) #################################################### # shrinkage t-score (DDA setting - no correlation) # #################################################### tstat = catscore(Xtrain, Ytrain, diagonal=TRUE) dim(tstat) tstat[1:10,] ######################################################## # shrinkage CAT score (LDA setting - with correlation) # ######################################################## cat = catscore(Xtrain, Ytrain, diagonal=FALSE) dim(cat) cat[1:10,]
centroids
computes group centroids, the pooled mean
and pooled variance, and optionally the group specific variances.
centroids(x, L, lambda.var, lambda.freqs, var.groups=FALSE, centered.data=FALSE, verbose=TRUE)
centroids(x, L, lambda.var, lambda.freqs, var.groups=FALSE, centered.data=FALSE, verbose=TRUE)
x |
A matrix containing the data set. Note that the rows are sample observations and the columns are variables. |
L |
A factor with the group labels. |
lambda.var |
Shrinkage intensity for the variances. If not specified it is
estimated from the data, see details below. |
lambda.freqs |
Shrinkage intensity for the frequencies. If not specified it is
estimated from the data. |
var.groups |
Estimate group-specific variances. |
centered.data |
Return column-centered data matrix. |
verbose |
Provide some messages while computing. |
As estimator of the variance we employ
var.shrink
as described in Opgen-Rhein and Strimmer (2007).
For the estimates of frequencies we rely on
freqs.shrink
as described in Hausser and Strimmer (2009).
Note that the pooled mean is computed using the estimated frequencies.
centroids
returns a list
with the following components:
samples |
a vector containing the samples sizes in each group, |
freqs |
a vector containing the estimated frequency in each group, |
means |
the group means and the pooled mean, |
variances |
the group-specific and the pooled variances, and |
centered.data |
a matrix containing the centered data. |
Korbinian Strimmer (https://strimmerlab.github.io).
# load sda library library("sda") ## prepare data set data(iris) # good old iris data X = as.matrix(iris[,1:4]) Y = iris[,5] ## estimate centroids and empirical pooled variances centroids(X, Y, lambda.var=0) ## also compute group-specific variances centroids(X, Y, var.groups=TRUE, lambda.var=0) ## use shrinkage estimator for the variances centroids(X, Y, var.groups=TRUE) ## return centered data xc = centroids(X, Y, centered.data=TRUE)$centered.data apply(xc, 2, mean) ## useful, e.g., to compute the inverse pooled correlation matrix powcor.shrink(xc, alpha=-1)
# load sda library library("sda") ## prepare data set data(iris) # good old iris data X = as.matrix(iris[,1:4]) Y = iris[,5] ## estimate centroids and empirical pooled variances centroids(X, Y, lambda.var=0) ## also compute group-specific variances centroids(X, Y, var.groups=TRUE, lambda.var=0) ## use shrinkage estimator for the variances centroids(X, Y, var.groups=TRUE) ## return centered data xc = centroids(X, Y, centered.data=TRUE)$centered.data apply(xc, 2, mean) ## useful, e.g., to compute the inverse pooled correlation matrix powcor.shrink(xc, alpha=-1)
Gene expression data (2308 genes for 88 samples) from the microarray study of Khan et al. (2001).
data(khan2001)
data(khan2001)
khan2001$x
is a 88 x 2308 matrix containing the expression levels. Note that
rows correspond to samples, and columns to genes. The row names are the original
image IDs, and the column names the orginal probe labels.
khan2001$y
is a factor containing the diagnosis for each sample ("BL", "EWS", "NB", "non-SRBCT", "RMS").
khan2001$descr
provides some annotation for each gene.
This data set contains measurements of the gene expression of 2308 genes for 88 observations: 29 cases of Ewing sarcoma (EWS), 11 cases of Burkitt lymphoma (BL), 18 cases of neuroblastoma (NB), 25 cases of rhabdomyosarcoma (RMS), and 5 other (non-SRBCT) samples.
The data are described in Khan et al. (2001). Note that the values in
khan.data$x
are logarithmized (using natural log
) for normalization.
Khan et al. 2001. Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nature Medicine 7:673–679.
# load sda library library("sda") # load full Khan et al (2001) data set data(khan2001) dim(khan2001$x) # 88 2308 hist(khan2001$x) khan2001$y # 5 levels # data set containing the SRBCT samples get.srbct = function() { data(khan2001) idx = which( khan2001$y == "non-SRBCT" ) x = khan2001$x[-idx,] y = factor(khan2001$y[-idx]) descr = khan2001$descr[-idx] list(x=x, y=y, descr=descr) } srbct = get.srbct() dim(srbct$x) # 83 2308 hist(srbct$x) srbct$y # 4 levels
# load sda library library("sda") # load full Khan et al (2001) data set data(khan2001) dim(khan2001$x) # 88 2308 hist(khan2001$x) khan2001$y # 5 levels # data set containing the SRBCT samples get.srbct = function() { data(khan2001) idx = which( khan2001$y == "non-SRBCT" ) x = khan2001$x[-idx,] y = factor(khan2001$y[-idx]) descr = khan2001$descr[-idx] list(x=x, y=y, descr=descr) } srbct = get.srbct() dim(srbct$x) # 83 2308 hist(srbct$x) srbct$y # 4 levels
predict.sda
performs class prediction.
## S3 method for class 'sda' predict(object, Xtest, verbose=TRUE, ...)
## S3 method for class 'sda' predict(object, Xtest, verbose=TRUE, ...)
object |
An |
Xtest |
A matrix containing the test data set. Note that the rows correspond to observations and the columns to variables. |
verbose |
Report shrinkage intensities (sda) and number of used features (predict.sda). |
... |
Additional arguments for generic predict. |
predict.sda
predicts class probabilities for each test sample and returns
a list with two components:
class |
a factor with the most probable class assignment for each test sample, and |
posterior |
a matrix containing the respective class posterior probabilities. |
Miiika Ahdesm\"aki and Korbinian Strimmer (https://strimmerlab.github.io).
# see the examples at the "sda" help page
# see the examples at the "sda" help page
sda
trains a LDA or DDA classifier using James-Stein-type shrinkage estimation.
sda(Xtrain, L, lambda, lambda.var, lambda.freqs, diagonal=FALSE, verbose=TRUE)
sda(Xtrain, L, lambda, lambda.var, lambda.freqs, diagonal=FALSE, verbose=TRUE)
Xtrain |
A matrix containing the training data set. Note that the rows correspond to observations and the columns to variables. |
L |
A factor with the class labels of the training samples. |
lambda |
Shrinkage intensity for the correlation matrix. If not specified it is
estimated from the data. |
lambda.var |
Shrinkage intensity for the variances. If not specified it is
estimated from the data. |
lambda.freqs |
Shrinkage intensity for the frequencies. If not specified it is
estimated from the data. |
diagonal |
Chooses between LDA (default, |
verbose |
Print out some info while computing. |
In order to train the LDA or DDA classifier, three separate shrinkage estimators are employed:
class frequencies: the estimator freqs.shrink
from Hausser and Strimmer (2008),
variances: the estimator var.shrink
from Opgen-Rhein and Strimmer (2007),
correlations: the estimator cor.shrink
from Sch\"afer and Strimmer (2005).
Note that the three corresponding regularization parameters are obtained analytically without resorting to computer intensive resampling.
sda
trains the classifier and returns an sda
object
with the following components needed for the subsequent prediction:
regularization |
a vector containing the three estimated shrinkage intensities, |
freqs |
the estimated class frequencies, |
alpha |
vector containing the intercepts used for prediction, |
beta |
matrix containing the coefficients used for prediction. |
Miika Ahdesm\"aki and Korbinian Strimmer (https://strimmerlab.github.io).
Ahdesm\"aki, A., and K. Strimmer. 2010. Feature selection in omics prediction problems using cat scores and false non-discovery rate control. Ann. Appl. Stat. 4: 503-519. <DOI:10.1214/09-AOAS277>
predict.sda
, sda.ranking
,
freqs.shrink
,
var.shrink
,
invcor.shrink
.
# load sda library library("sda") ########################## # training and test data # ########################## # data set containing the SRBCT samples get.srbct = function() { data(khan2001) idx = which( khan2001$y == "non-SRBCT" ) x = khan2001$x[-idx,] y = factor(khan2001$y[-idx]) descr = khan2001$descr[-idx] list(x=x, y=y, descr=descr) } srbct = get.srbct() # training data Xtrain = srbct$x[1:63,] Ytrain = srbct$y[1:63] Xtest = srbct$x[64:83,] Ytest = srbct$y[64:83] ################################################### # classification with correlation (shrinkage LDA) # ################################################### sda.fit = sda(Xtrain, Ytrain) ynew = predict(sda.fit, Xtest)$class # using all 2308 features sum(ynew != Ytest) ########################################################### # classification with diagonal covariance (shrinkage DDA) # ########################################################### sda.fit = sda(Xtrain, Ytrain, diagonal=TRUE) ynew = predict(sda.fit, Xtest)$class # using all 2308 features sum(ynew != Ytest) ################################################################# # for complete example scripts illustrating classification with # # feature selection visit https://strimmerlab.github.io/software/sda/ # #################################################################
# load sda library library("sda") ########################## # training and test data # ########################## # data set containing the SRBCT samples get.srbct = function() { data(khan2001) idx = which( khan2001$y == "non-SRBCT" ) x = khan2001$x[-idx,] y = factor(khan2001$y[-idx]) descr = khan2001$descr[-idx] list(x=x, y=y, descr=descr) } srbct = get.srbct() # training data Xtrain = srbct$x[1:63,] Ytrain = srbct$y[1:63] Xtest = srbct$x[64:83,] Ytest = srbct$y[64:83] ################################################### # classification with correlation (shrinkage LDA) # ################################################### sda.fit = sda(Xtrain, Ytrain) ynew = predict(sda.fit, Xtest)$class # using all 2308 features sum(ynew != Ytest) ########################################################### # classification with diagonal covariance (shrinkage DDA) # ########################################################### sda.fit = sda(Xtrain, Ytrain, diagonal=TRUE) ynew = predict(sda.fit, Xtest)$class # using all 2308 features sum(ynew != Ytest) ################################################################# # for complete example scripts illustrating classification with # # feature selection visit https://strimmerlab.github.io/software/sda/ # #################################################################
sda.ranking
determines a ranking of predictors by computing CAT scores
(correlation-adjusted t-scores)
between the group centroids and the pooled mean.
plot.sda.ranking
provides a graphical visualization of the top ranking features.
sda.ranking(Xtrain, L, lambda, lambda.var, lambda.freqs, ranking.score=c("entropy", "avg", "max"), diagonal=FALSE, fdr=TRUE, plot.fdr=FALSE, verbose=TRUE) ## S3 method for class 'sda.ranking' plot(x, top=40, arrow.col="blue", zeroaxis.col="red", ylab="Features", main, ...)
sda.ranking(Xtrain, L, lambda, lambda.var, lambda.freqs, ranking.score=c("entropy", "avg", "max"), diagonal=FALSE, fdr=TRUE, plot.fdr=FALSE, verbose=TRUE) ## S3 method for class 'sda.ranking' plot(x, top=40, arrow.col="blue", zeroaxis.col="red", ylab="Features", main, ...)
Xtrain |
A matrix containing the training data set. Note that the rows correspond to observations and the columns to variables. |
L |
A factor with the class labels of the training samples. |
lambda |
Shrinkage intensity for the correlation matrix. If not specified it is
estimated from the data. |
lambda.var |
Shrinkage intensity for the variances. If not specified it is
estimated from the data. |
lambda.freqs |
Shrinkage intensity for the frequencies. If not specified it is
estimated from the data. |
diagonal |
Chooses between LDA (default, |
ranking.score |
how to compute the summary score for each variable from the CAT scores of all classes - see Details. |
fdr |
compute FDR values and HC scores for each feature. |
plot.fdr |
Show plot with estimated FDR values. |
verbose |
Print out some info while computing. |
x |
An "sda.ranking" object – this is produced by the sda.ranking() function. |
top |
The number of top-ranking features shown in the plot (default: 40). |
arrow.col |
Color of the arrows in the plot (default is |
zeroaxis.col |
Color for the center zero axis (default is |
ylab |
Label written next to feature list (default is |
main |
Main title (if missing, |
... |
Other options passed on to generic plot(). |
For each predictor variable and centroid a shrinkage CAT scores of the mean versus the pooled mean is computed. If there are only two classes the CAT score vs. the pooled mean reduces to the CAT score between the two group means. Moreover, in the diagonal case (LDA) the (shrinkage) CAT score reduces to the (shrinkage) t-score.
The overall ranking of a feature is determine by computing a summary score from the CAT scores.
This is controlled by the option ranking.score
. The default setting
(ranking.score="entropy"
) uses mutual information
between the response and the respective predictors (ranking.score
) for ranking. This is equivalent to
a weighted sum of squared CAT scores across the classes. Another possibility is to employ
the average of the squared CAT scores for ranking (as suggested in Ahdesm\"aki and Strimmer 2010)
by setting ranking.score="avg"
. A third option is to use the maximum of the squared CAT scores across groups (similarly as in the PAM algorithm) via setting ranking.score="max"
.
Note that in the case of two classes all three options are equivalent and
lead to identical scores. Thus, the choice of ranking.score
is important only
in the multi-class setting. In the two-class case the features are simply ranked according to the
(shrinkage) squared CAT-scores (or t-scores, if there is no correlation among predictors).
The current default approach is to use ranking by mutual information (i.e. relative entropy
between full model vs. model without predictor) and to use shrinkage estimators of frequencies.
In order to reproduce exactly the ranking computed by previous versions (1.1.0 to 1.3.0) of the sda
package set the options ranking.score="avg"
and lambda.freqs=0
.
Calling sda.ranking
is step 1 in a classification analysis with the
sda package. Steps 2 and 3 are
sda
and predict.sda
See Zuber and Strimmer (2009) for CAT scores in general, and Ahdesm\"aki and Strimmer (2010) for details on multi-class CAT scores. For shrinkage t scores see Opgen-Rhein and Strimmer (2007).
sda.ranking
returns a matrix with the following columns:
idx |
original feature number |
score |
sum of the squared CAT scores across groups - this determines the overall ranking of a feature |
cat |
for each group and feature the cat score of the centroid versus the pooled mean |
If fdr=TRUE
then additionally local false discovery rate (FDR) values
as well as higher criticism (HC) scores are computed for each feature
(using fdrtool
).
Miika Ahdesm\"aki, Verena Zuber, Sebastian Gibb, and Korbinian Strimmer (https://strimmerlab.github.io).
Ahdesm\"aki, A., and K. Strimmer. 2010. Feature selection in omics prediction problems using cat scores and false non-discovery rate control. Ann. Appl. Stat. 4: 503-519. <DOI:10.1214/09-AOAS277>
Opgen-Rhein, R., and K. Strimmer. 2007. Accurate ranking of differentially expressed genes by a distribution-free shrinkage approach. Statist. Appl. Genet. Mol. Biol. 6:9. <DOI:10.2202/1544-6115.1252>
Zuber, V., and K. Strimmer. 2009. Gene ranking and biomarker discovery under correlation. Bioinformatics 25: 2700-2707. <DOI:10.1093/bioinformatics/btp460>
# load sda library library("sda") ################# # training data # ################# # prostate cancer set data(singh2002) # training data Xtrain = singh2002$x Ytrain = singh2002$y ######################################### # feature ranking (diagonal covariance) # ######################################### # ranking using t-scores (DDA) ranking.DDA = sda.ranking(Xtrain, Ytrain, diagonal=TRUE) ranking.DDA[1:10,] # plot t-scores for the top 40 genes plot(ranking.DDA, top=40) # number of features with local FDR < 0.8 # (i.e. features useful for prediction) sum(ranking.DDA[,"lfdr"] < 0.8) # number of features with local FDR < 0.2 # (i.e. significant non-null features) sum(ranking.DDA[,"lfdr"] < 0.2) # optimal feature set according to HC score plot(ranking.DDA[,"HC"], type="l") which.max( ranking.DDA[1:1000,"HC"] ) ##################################### # feature ranking (full covariance) # ##################################### # ranking using CAT-scores (LDA) ranking.LDA = sda.ranking(Xtrain, Ytrain, diagonal=FALSE) ranking.LDA[1:10,] # plot t-scores for the top 40 genes plot(ranking.LDA, top=40) # number of features with local FDR < 0.8 # (i.e. features useful for prediction) sum(ranking.LDA[,"lfdr"] < 0.8) # number of features with local FDR < 0.2 # (i.e. significant non-null features) sum(ranking.LDA[,"lfdr"] < 0.2) # optimal feature set according to HC score plot(ranking.LDA[,"HC"], type="l") which.max( ranking.LDA[1:1000,"HC"] )
# load sda library library("sda") ################# # training data # ################# # prostate cancer set data(singh2002) # training data Xtrain = singh2002$x Ytrain = singh2002$y ######################################### # feature ranking (diagonal covariance) # ######################################### # ranking using t-scores (DDA) ranking.DDA = sda.ranking(Xtrain, Ytrain, diagonal=TRUE) ranking.DDA[1:10,] # plot t-scores for the top 40 genes plot(ranking.DDA, top=40) # number of features with local FDR < 0.8 # (i.e. features useful for prediction) sum(ranking.DDA[,"lfdr"] < 0.8) # number of features with local FDR < 0.2 # (i.e. significant non-null features) sum(ranking.DDA[,"lfdr"] < 0.2) # optimal feature set according to HC score plot(ranking.DDA[,"HC"], type="l") which.max( ranking.DDA[1:1000,"HC"] ) ##################################### # feature ranking (full covariance) # ##################################### # ranking using CAT-scores (LDA) ranking.LDA = sda.ranking(Xtrain, Ytrain, diagonal=FALSE) ranking.LDA[1:10,] # plot t-scores for the top 40 genes plot(ranking.LDA, top=40) # number of features with local FDR < 0.8 # (i.e. features useful for prediction) sum(ranking.LDA[,"lfdr"] < 0.8) # number of features with local FDR < 0.2 # (i.e. significant non-null features) sum(ranking.LDA[,"lfdr"] < 0.2) # optimal feature set according to HC score plot(ranking.LDA[,"HC"], type="l") which.max( ranking.LDA[1:1000,"HC"] )
Gene expression data (6033 genes for 102 samples) from the microarray study of Singh et al. (2002).
data(singh2002)
data(singh2002)
singh2002$x
is a 102 x 6033 matrix containing the expression levels.
The rows contain the samples and the columns the genes.
singh2002$y
is a factor containing the diagnosis for each sample ("cancer" or "healthy").
This data set contains measurements of the gene expression of 6033 genes for 102 observations: 52 prostate cancer patients and 50 healty men.
The data are described in Singh et al. (2001) and are provided in exactly the form as used by Efron (2008).
D. Singh et al. 2002. Gene expression correlates of clinical prostate cancer behavior. Cancer Cell 1:203–209.
Efron, B. 2008. Empirical Bayes estimates for large-scale prediction problems. Technical Report, Standford University.
# load sda library library("sda") # load Singh et al (2001) data set data(singh2002) dim(singh2002$x) # 102 6033 hist(singh2002$x) singh2002$y # 2 levels
# load sda library library("sda") # load Singh et al (2001) data set data(singh2002) dim(singh2002$x) # 102 6033 hist(singh2002$x) singh2002$y # 2 levels