Title: Sparse Partial Least Squares (SPLS) Regression and Classification
Description: Provides functions for fitting sparse partial least squares (SPLS) regression and classification models (Chun and Keles (2010) <doi:10.1111/j.1467-9868.2009.00723.x>).
Authors: Dongjun Chung <[email protected]>, Hyonho Chun <[email protected]>, Sunduz Keles <[email protected]>
Maintainer: Valentin Todorov <[email protected]>
License: GPL (>= 2)
Version: 2.2-3
Built: 2024-11-03 06:59:39 UTC
Source: CRAN
Calculate bootstrapped confidence intervals of coefficients of the selected predictors and generate confidence interval plots.
ci.spls( object, coverage=0.95, B=1000, plot.it=FALSE, plot.fix="y",
         plot.var=NA, K=object$K, fit=object$fit )
object |
A fitted SPLS object. |
coverage |
Coverage of confidence intervals. Default is 0.95. |
B |
Number of bootstrap iterations. Default is 1000. |
plot.it |
Plot confidence intervals of coefficients? |
plot.fix |
If plot.fix="y", plot confidence intervals of the coefficients of the selected predictors for the response(s) given in plot.var. If plot.fix="x", plot confidence intervals of the given predictor(s) across all responses. |
plot.var |
Index vector of responses (if plot.fix="y") or predictors (if plot.fix="x") to be fixed in the plot. |
K |
Number of hidden components. Default is to use the same K as the original SPLS fit (object$K). |
fit |
PLS algorithm for model fitting. Alternatives are "kernelpls", "widekernelpls", "simpls", or "oscorespls". Default is to use the same fit algorithm as the original SPLS fit (object$fit). |
Invisibly returns a list with components:
cibeta |
A list with as many matrix elements as the number of responses. Each matrix element is p by 2; the i-th row lists the upper and lower bounds of the bootstrapped confidence interval of the i-th predictor. |
betahat |
Matrix of original coefficients of the SPLS fit. |
lbmat |
Matrix of lower bounds of confidence intervals (for internal use). |
ubmat |
Matrix of upper bounds of confidence intervals (for internal use). |
Dongjun Chung, Hyonho Chun, and Sunduz Keles.
Chun H and Keles S (2010), "Sparse partial least squares for simultaneous dimension reduction and variable selection", Journal of the Royal Statistical Society - Series B, Vol. 72, pp. 3–25.
correct.spls and spls.
data(mice)
# SPLS with eta=0.6 & 1 hidden component
f <- spls( mice$x, mice$y, K=1, eta=0.6 )
# Calculate confidence intervals of coefficients
ci.f <- ci.spls( f, plot.it=TRUE, plot.fix="x", plot.var=20 )
# Bootstrapped confidence intervals
cis <- ci.f$cibeta
cis[[20]] # equivalent, 'cis$1422478_a_at'
Plot estimated coefficients of the selected predictors in the SPLS object.
coefplot.spls( object, nwin=c(2,2), xvar=c(1:length(object$A)), ylimit=NA )
object |
A fitted SPLS object. |
nwin |
Vector of the number of rows and columns in a plotting area. Default is two rows and two columns, i.e., four plots. |
xvar |
Index of variables to be plotted among the set of the selected predictors. Default is to plot the coefficients of all the selected predictors. |
ylimit |
Range of the y axis (the coefficients) in the plot. If ylimit=NA (default), the range is determined from the coefficient estimates. |
This plot is useful for visualizing coefficient estimates of a variable for different responses. Hence, the function is applicable only with multivariate response SPLS.
NULL.
Dongjun Chung, Hyonho Chun, and Sunduz Keles.
Chun H and Keles S (2010), "Sparse partial least squares for simultaneous dimension reduction and variable selection", Journal of the Royal Statistical Society - Series B, Vol. 72, pp. 3–25.
ci.spls, correct.spls, and plot.spls.
data(yeast)
# SPLS with eta=0.7 & 8 hidden components
f <- spls( yeast$x, yeast$y, K=8, eta=0.7 )
# Draw estimated coefficient plot of the first four variables
# among the selected predictors
coefplot.spls( f, xvar=c(1:4), nwin=c(2,2) )
Correct initial SPLS coefficient estimates of the selected predictors based on bootstrapped confidence intervals and draw heatmap of original and corrected coefficient estimates.
correct.spls( object, plot.it=TRUE )
object |
An object obtained from the function ci.spls. |
plot.it |
Draw the heatmap of original coefficient estimates and corrected coefficient estimates? |
The set of the selected variables is updated by setting the coefficients with zero-containing confidence intervals to zero.
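The correction rule can be sketched in a few lines of plain R; the bounds and estimates below are hypothetical, and this is a sketch of the rule, not the package's internal code:

```r
# Hypothetical coefficient estimates and bootstrapped interval bounds
beta <- c(0.05, 0.25, -0.30)
lb   <- c(-0.20, 0.10, -0.50)   # lower bounds of the intervals
ub   <- c( 0.30, 0.40, -0.10)   # upper bounds of the intervals
# Set to zero every coefficient whose interval contains zero
beta[lb <= 0 & ub >= 0] <- 0
beta   # -> 0.00 0.25 -0.30
```

Only the first interval contains zero, so only the first coefficient is corrected to zero.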
Invisibly returns a matrix of corrected coefficient estimates.
Dongjun Chung, Hyonho Chun, and Sunduz Keles.
Chun H and Keles S (2010), "Sparse partial least squares for simultaneous dimension reduction and variable selection", Journal of the Royal Statistical Society - Series B, Vol. 72, pp. 3–25.
data(mice)
# SPLS with eta=0.6 & 1 latent component
f <- spls( mice$x, mice$y, K=1, eta=0.6 )
# Calculate confidence intervals of coefficients
ci.f <- ci.spls(f)
# Corrected coefficient estimates
cf <- correct.spls( ci.f )
cf[20,1:5]
Draw heatmap of v-fold cross-validated misclassification rates and return optimal eta (thresholding parameter) and K (number of hidden components).
cv.sgpls( x, y, fold=10, K, eta, scale.x=TRUE, plot.it=TRUE, br=TRUE,
          ftype='iden', n.core=8 )
x |
Matrix of predictors. |
y |
Vector of class indices. |
fold |
Number of cross-validation folds. Default is 10. |
K |
Number of hidden components. |
eta |
Thresholding parameter. |
scale.x |
Scale predictors by dividing each predictor variable by its sample standard deviation? |
plot.it |
Draw the heatmap of cross-validated misclassification rates? |
br |
Apply Firth's bias reduction procedure? |
ftype |
Type of Firth's bias reduction procedure. Alternatives are ftype='iden' or ftype='hat'. Default is ftype='iden'. |
n.core |
Number of CPUs to be used when parallel computing is utilized. |
Parallel computing can be utilized for faster computation. Users can change the number of CPUs to be used by changing the argument n.core.
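For instance, the following sketch restricts the search to two CPUs; the small grid and fold count are illustrative only:

```r
data(prostate)
# Cross-validated grid search using 2 CPUs instead of the default 8
cv <- cv.sgpls( prostate$x, prostate$y, K = c(1:3), eta = c(0.3, 0.6),
                fold = 5, scale.x = FALSE, n.core = 2 )
```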
Invisibly returns a list with components:
err.mat |
Matrix of cross-validated misclassification rates. Rows correspond to eta and columns correspond to K. |
eta.opt |
Optimal eta. |
K.opt |
Optimal K. |
Dongjun Chung and Sunduz Keles.
Chung D and Keles S (2010), "Sparse partial least squares classification for high dimensional data", Statistical Applications in Genetics and Molecular Biology, Vol. 9, Article 17.
print.sgpls, predict.sgpls, and coef.sgpls.
data(prostate)
set.seed(1)
# Misclassification rate plot. eta is searched between 0.1 and 0.9 and
# the number of hidden components is searched between 1 and 5
## Not run:
cv <- cv.sgpls(prostate$x, prostate$y, K = c(1:5), eta = seq(0.1,0.9,0.1),
               scale.x=FALSE, fold=5)
## End(Not run)
(sgpls(prostate$x, prostate$y, eta=cv$eta.opt, K=cv$K.opt, scale.x=FALSE))
Draw heatmap of v-fold cross-validated mean squared prediction error and return optimal eta (thresholding parameter) and K (number of hidden components).
cv.spls( x, y, fold=10, K, eta, kappa=0.5, select="pls2", fit="simpls",
         scale.x=TRUE, scale.y=FALSE, plot.it=TRUE )
x |
Matrix of predictors. |
y |
Vector or matrix of responses. |
fold |
Number of cross-validation folds. Default is 10. |
K |
Number of hidden components. |
eta |
Thresholding parameter. |
kappa |
Parameter to control the effect of the concavity of the objective function and the closeness of original and surrogate direction vectors. kappa is relevant only when responses have more than one variable; it should be between 0 and 0.5. Default is 0.5. |
select |
PLS algorithm for variable selection. Alternatives are "pls2" or "simpls". Default is "pls2". |
fit |
PLS algorithm for model fitting. Alternatives are "kernelpls", "widekernelpls", "simpls", or "oscorespls". Default is "simpls". |
scale.x |
Scale predictors by dividing each predictor variable by its sample standard deviation? |
scale.y |
Scale responses by dividing each response variable by its sample standard deviation? |
plot.it |
Draw heatmap of cross-validated mean squared prediction error? |
Invisibly returns a list with components:
mspemat |
Matrix of cross-validated mean squared prediction error. Rows correspond to eta and columns correspond to K. |
eta.opt |
Optimal eta. |
K.opt |
Optimal K. |
Dongjun Chung, Hyonho Chun, and Sunduz Keles.
Chun H and Keles S (2010), "Sparse partial least squares for simultaneous dimension reduction and variable selection", Journal of the Royal Statistical Society - Series B, Vol. 72, pp. 3–25.
print.spls, plot.spls, predict.spls, and coef.spls.
data(yeast)
set.seed(1)
# MSPE plot. eta is searched between 0.1 and 0.9 and
# the number of hidden components is searched between 1 and 10
## Not run:
cv <- cv.spls(yeast$x, yeast$y, K = c(1:10), eta = seq(0.1,0.9,0.1))
# Optimal eta and K
cv$eta.opt
cv$K.opt
(spls(yeast$x, yeast$y, eta=cv$eta.opt, K=cv$K.opt))
## End(Not run)
Draw heatmap of v-fold cross-validated misclassification rates and return optimal eta (thresholding parameter) and K (number of hidden components).
cv.splsda( x, y, fold=10, K, eta, kappa=0.5, classifier=c('lda','logistic'),
           scale.x=TRUE, plot.it=TRUE, n.core=8 )
x |
Matrix of predictors. |
y |
Vector of class indices. |
fold |
Number of cross-validation folds. Default is 10. |
K |
Number of hidden components. |
eta |
Thresholding parameter. |
kappa |
Parameter to control the effect of
the concavity of the objective function
and the closeness of original and surrogate direction vectors.
|
classifier |
Classifier used in the second step of SPLSDA. Alternatives are classifier="lda" or classifier="logistic". Default is classifier="lda". |
scale.x |
Scale predictors by dividing each predictor variable by its sample standard deviation? |
plot.it |
Draw the heatmap of the cross-validated misclassification rates? |
n.core |
Number of CPUs to be used when parallel computing is utilized. |
Parallel computing can be utilized for faster computation. Users can change the number of CPUs to be used by changing the argument n.core.
Invisibly returns a list with components:
err.mat |
Matrix of cross-validated misclassification rates. Rows correspond to eta and columns correspond to K. |
eta.opt |
Optimal eta. |
K.opt |
Optimal K. |
Dongjun Chung and Sunduz Keles.
Chung D and Keles S (2010), "Sparse partial least squares classification for high dimensional data", Statistical Applications in Genetics and Molecular Biology, Vol. 9, Article 17.
print.splsda, predict.splsda, and coef.splsda.
data(prostate)
set.seed(1)
# Misclassification rate plot. eta is searched between 0.1 and 0.9 and
# the number of hidden components is searched between 1 and 5
## Not run:
cv <- cv.splsda( prostate$x, prostate$y, K = c(1:5), eta = seq(0.1,0.9,0.1),
                 scale.x=FALSE, fold=5 )
## End(Not run)
(splsda( prostate$x, prostate$y, eta=cv$eta.opt, K=cv$K.opt, scale.x=FALSE ))
This is the Lymphoma Gene Expression dataset used in Chung and Keles (2010).
data(lymphoma)
A list with two components:
x: Gene expression data. A matrix with 62 rows and 4026 columns.
y: Class index. A vector with 62 elements.
The lymphoma dataset consists of 42 samples of diffuse large B-cell lymphoma (DLBCL),
9 samples of follicular lymphoma (FL),
and 11 samples of chronic lymphocytic leukemia (CLL).
DLBCL, FL, and CLL classes are coded as 0, 1, and 2, respectively, in the y vector. Matrix x contains the gene expression data; arrays were normalized, imputed, log transformed, and standardized to zero mean and unit variance across genes as described in Dettling (2004) and Dettling and Bühlmann (2002).
See Chung and Keles (2010) for more details.
Alizadeh A, Eisen MB, Davis RE, Ma C, Lossos IS, Rosenwald A, Boldrick JC, Sabet H, Tran T, Yu X, Powell JI, Yang L, Marti GE, Moore T, Hudson J Jr, Lu L, Lewis DB, Tibshirani R, Sherlock G, Chan WC, Greiner TC, Weisenburger DD, Armitage JO, Warnke R, Levy R, Wilson W, Grever MR, Byrd JC, Botstein D, Brown PO, and Staudt LM (2000), "Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling", Nature, Vol. 403, pp. 503–511.
Chung D and Keles S (2010), "Sparse partial least squares classification for high dimensional data", Statistical Applications in Genetics and Molecular Biology, Vol. 9, Article 17.
Dettling M (2004), "BagBoosting for tumor classification with gene expression data", Bioinformatics, Vol. 20, pp. 3583–3593.
Dettling M and Bühlmann P (2002), "Supervised clustering of genes", Genome Biology, Vol. 3, pp. research0069.1–0069.15.
data(lymphoma)
lymphoma$x[1:5,1:5]
lymphoma$y
This is the Mice dataset used in Chun and Keles (2010).
data(mice)
A list with two components:
x: Marker map data. A matrix with 60 rows and 145 columns.
y: Gene expression data. A matrix with 60 rows and 83 columns.
The Mice dataset was published by Lan et al. (2006). Matrix x is the marker map consisting of 145 microsatellite markers from 19 non-sex mouse chromosomes. Matrix y contains gene expression measurements of the 83 transcripts from liver tissues of 60 mice. This group of 83 transcripts is one of the clusters analyzed by Chun and Keles (2010). See Chun and Keles (2010) for more details.
Lan H, Chen M, Flowers JB, Yandell BS, Stapleton DS, Mata CM, Mui E, Flowers MT, Schueler KL, Manly KF, Williams RW, Kendziorski C, and Attie AD (2006), "Combined expression trait correlations and expression quantitative trait locus mapping", PLoS Genetics, Vol. 2, e6.
Chun H and Keles S (2009), "Expression quantitative trait loci mapping with multivariate sparse partial least squares regression", Genetics, Vol. 182, pp. 79–90.
data(mice)
mice$x[1:5,1:5]
mice$y[1:5,1:5]
Provide the coefficient path plot of SPLS regression as a function of the number of hidden components (K) when eta is fixed.
## S3 method for class 'spls'
plot( x, yvar=c(1:ncol(x$y)), ... )
x |
A fitted SPLS object. |
yvar |
Index vector of responses to be plotted. |
... |
Other parameters to be passed through to the generic plot. |
plot.spls provides the coefficient path plot of SPLS fits. The plot shows how estimated coefficients change as a function of the number of hidden components (K), when eta is fixed at the value used by the original SPLS fit.
NULL.
Dongjun Chung, Hyonho Chun, and Sunduz Keles.
Chun H and Keles S (2010), "Sparse partial least squares for simultaneous dimension reduction and variable selection", Journal of the Royal Statistical Society - Series B, Vol. 72, pp. 3–25.
print.spls, predict.spls, and coef.spls.
data(yeast)
# SPLS with eta=0.7 & 8 hidden components
f <- spls( yeast$x, yeast$y, K=8, eta=0.7 )
# Draw coefficient path plots for the first two responses
plot( f, yvar=c(1:2) )
Make predictions or extract coefficients from a fitted SGPLS object.
## S3 method for class 'sgpls'
predict( object, newx, type = c("fit","coefficient"),
         fit.type = c("class","response"), ... )
## S3 method for class 'sgpls'
coef( object, ... )
object |
A fitted SGPLS object. |
newx |
If type="fit", the matrix of predictors on which predictions are made; not used if type="coefficient". |
type |
If type="fit", predicted responses are returned; if type="coefficient", coefficient estimates are returned. Default is "fit". |
fit.type |
If fit.type="class", predicted classes are returned; if fit.type="response", predicted probabilities are returned. Default is "class". |
... |
Any additional arguments for predict or coef. |
Users can input either only the selected variables or all variables for newx.
Matrix of coefficient estimates if type="coefficient". Matrix of predicted responses if type="fit" (responses will be predicted classes if fit.type="class" or predicted probabilities if fit.type="response").
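As a sketch, both fit.type options can be requested from the same fitted object; the fit here mirrors the example on this page:

```r
data(prostate)
f <- sgpls( prostate$x, prostate$y, K=3, eta=0.55, scale.x=FALSE )
# Predicted class labels (the default)
pred.class <- predict( f, type="fit", fit.type="class" )
# Predicted class probabilities for the same observations
pred.prob  <- predict( f, type="fit", fit.type="response" )
```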
Dongjun Chung and Sunduz Keles.
Chung D and Keles S (2010), "Sparse partial least squares classification for high dimensional data", Statistical Applications in Genetics and Molecular Biology, Vol. 9, Article 17.
data(prostate)
# SGPLS with eta=0.55 & 3 hidden components
f <- sgpls( prostate$x, prostate$y, K=3, eta=0.55, scale.x=FALSE )
# Print out coefficients
coef.f <- coef(f)
coef.f[ coef.f!=0, ]
# Prediction on the training dataset
(pred.f <- predict( f, type="fit" ))
Make predictions or extract coefficients from a fitted SPLS object.
## S3 method for class 'spls'
predict( object, newx, type = c("fit","coefficient"), ... )
## S3 method for class 'spls'
coef( object, ... )
object |
A fitted SPLS object. |
newx |
If type="fit", the matrix of predictors on which predictions are made; not used if type="coefficient". |
type |
If type="fit", predicted responses are returned; if type="coefficient", coefficient estimates are returned. Default is "fit". |
... |
Any additional arguments for predict or coef. |
Users can input either only the selected variables or all variables for newx.
Matrix of coefficient estimates if type="coefficient". Matrix of predicted responses if type="fit".
Dongjun Chung, Hyonho Chun, and Sunduz Keles.
Chun H and Keles S (2010), "Sparse partial least squares for simultaneous dimension reduction and variable selection", Journal of the Royal Statistical Society - Series B, Vol. 72, pp. 3–25.
plot.spls and print.spls.
data(yeast)
# SPLS with eta=0.7 & 8 latent components
f <- spls( yeast$x, yeast$y, K=8, eta=0.7 )
# Coefficient estimates of the SPLS fit
coef.f <- coef(f)
coef.f[1:5,]
# Prediction on the training dataset
pred.f <- predict( f, type="fit" )
pred.f[1:5,]
Make predictions or extract coefficients from a fitted SPLSDA object.
## S3 method for class 'splsda'
predict( object, newx, type = c("fit","coefficient"),
         fit.type = c("class","response"), ... )
## S3 method for class 'splsda'
coef( object, ... )
object |
A fitted SPLSDA object. |
newx |
If type="fit", the matrix of predictors on which predictions are made; not used if type="coefficient". |
type |
If type="fit", predicted responses are returned; if type="coefficient", coefficient estimates are returned. Default is "fit". |
fit.type |
If fit.type="class", predicted classes are returned; if fit.type="response", predicted probabilities are returned. Default is "class". |
... |
Any additional arguments for predict or coef. |
Users can input either only the selected variables or all variables for newx.
Matrix of coefficient estimates if type="coefficient". Matrix of predicted responses if type="fit" (responses will be predicted classes if fit.type="class" or predicted probabilities if fit.type="response").
Dongjun Chung and Sunduz Keles.
Chung D and Keles S (2010), "Sparse partial least squares classification for high dimensional data", Statistical Applications in Genetics and Molecular Biology, Vol. 9, Article 17.
data(prostate)
# SPLSDA with eta=0.8 & 3 hidden components
f <- splsda( prostate$x, prostate$y, K=3, eta=0.8, scale.x=FALSE )
# Print out coefficients
coef.f <- coef(f)
coef.f[ coef.f!=0, ]
# Prediction on the training dataset
(pred.f <- predict( f, type="fit" ))
Print out the SGPLS fit and the number and list of selected predictors.
## S3 method for class 'sgpls'
print( x, ... )
x |
A fitted SGPLS object. |
... |
Additional arguments for the generic print. |
NULL.
Dongjun Chung and Sunduz Keles.
Chung D and Keles S (2010), "Sparse partial least squares classification for high dimensional data", Statistical Applications in Genetics and Molecular Biology, Vol. 9, Article 17.
predict.sgpls and coef.sgpls.
data(prostate)
# SGPLS with eta=0.55 & 3 hidden components
f <- sgpls( prostate$x, prostate$y, K=3, eta=0.55, scale.x=FALSE )
print(f)
Print out the SPLS fit and the number and list of selected predictors.
## S3 method for class 'spls'
print( x, ... )
x |
A fitted SPLS object. |
... |
Additional arguments for the generic print. |
NULL.
Dongjun Chung, Hyonho Chun, and Sunduz Keles.
Chun H and Keles S (2010), "Sparse partial least squares for simultaneous dimension reduction and variable selection", Journal of the Royal Statistical Society - Series B, Vol. 72, pp. 3–25.
plot.spls, predict.spls, and coef.spls.
data(yeast)
# SPLS with eta=0.7 & 8 hidden components
f <- spls( yeast$x, yeast$y, K=8, eta=0.7 )
print(f)
Print out the SPLSDA fit and the number and list of selected predictors.
## S3 method for class 'splsda'
print( x, ... )
x |
A fitted SPLSDA object. |
... |
Additional arguments for the generic print. |
NULL.
Dongjun Chung and Sunduz Keles.
Chung D and Keles S (2010), "Sparse partial least squares classification for high dimensional data", Statistical Applications in Genetics and Molecular Biology, Vol. 9, Article 17.
predict.splsda and coef.splsda.
data(prostate)
# SPLSDA with eta=0.8 & 3 hidden components
f <- splsda( prostate$x, prostate$y, K=3, eta=0.8, scale.x=FALSE )
print(f)
This is the Prostate Tumor Gene Expression dataset used in Chung and Keles (2010).
data(prostate)
A list with two components:
x: Gene expression data. A matrix with 102 rows and 6033 columns.
y: Class index. A vector with 102 elements.
The prostate dataset consists of 52 prostate tumor and 50 normal samples.
Normal and tumor classes are coded as 0 and 1, respectively, in the y vector. Matrix x contains the gene expression data; arrays were normalized, log transformed, and standardized to zero mean and unit variance across genes as described in Dettling (2004) and Dettling and Bühlmann (2002).
See Chung and Keles (2010) for more details.
Singh D, Febbo P, Ross K, Jackson D, Manola J, Ladd C, Tamayo P, Renshaw A, DAmico A, Richie J, Lander E, Loda M, Kantoff P, Golub T, and Sellers W (2002), "Gene expression correlates of clinical prostate cancer behavior", Cancer Cell, Vol. 1, pp. 203–209.
Chung D and Keles S (2010), "Sparse partial least squares classification for high dimensional data", Statistical Applications in Genetics and Molecular Biology, Vol. 9, Article 17.
Dettling M (2004), "BagBoosting for tumor classification with gene expression data", Bioinformatics, Vol. 20, pp. 3583–3593.
Dettling M and Bühlmann P (2002), "Supervised clustering of genes", Genome Biology, Vol. 3, pp. research0069.1–0069.15.
data(prostate)
prostate$x[1:5,1:5]
prostate$y
Fit an SGPLS classification model.
sgpls( x, y, K, eta, scale.x=TRUE, eps=1e-5, denom.eps=1e-20,
       zero.eps=1e-5, maxstep=100, br=TRUE, ftype='iden' )
x |
Matrix of predictors. |
y |
Vector of class indices. |
K |
Number of hidden components. |
eta |
Thresholding parameter. |
scale.x |
Scale predictors by dividing each predictor variable by its sample standard deviation? |
eps |
An effective zero for change in estimates. Default is 1e-5. |
denom.eps |
An effective zero for denominators. Default is 1e-20. |
zero.eps |
An effective zero for success probabilities. Default is 1e-5. |
maxstep |
Maximum number of Newton-Raphson iterations. Default is 100. |
br |
Apply Firth's bias reduction procedure? |
ftype |
Type of Firth's bias reduction procedure. Alternatives are ftype='iden' or ftype='hat'. Default is ftype='iden'. |
The SGPLS method is described in detail in Chung and Keles (2010).
SGPLS provides PLS-based classification with variable selection,
by incorporating sparse partial least squares (SPLS) proposed in Chun and Keles (2010)
into a generalized linear model (GLM) framework.
y is assumed to have numerical values 0, 1, ..., G, where G is one less than the number of classes.
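If the class labels are stored as a factor, base R can recode them into the required 0, 1, ..., G integers; this is a sketch with made-up labels, not part of the package:

```r
# Recode a factor response into the 0, 1, ..., G coding assumed by sgpls
cls <- factor( c("normal", "tumor", "tumor", "normal") )
y <- as.integer(cls) - 1L   # factor levels map to 0 and 1
y   # -> 0 1 1 0
```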
An sgpls object is returned. The print, predict, and coef methods use this object.
Dongjun Chung and Sunduz Keles.
Chung D and Keles S (2010), "Sparse partial least squares classification for high dimensional data", Statistical Applications in Genetics and Molecular Biology, Vol. 9, Article 17.
Chun H and Keles S (2010), "Sparse partial least squares for simultaneous dimension reduction and variable selection", Journal of the Royal Statistical Society - Series B, Vol. 72, pp. 3–25.
print.sgpls, predict.sgpls, and coef.sgpls.
data(prostate)
# SGPLS with eta=0.6 & 3 hidden components
(f <- sgpls(prostate$x, prostate$y, K=3, eta=0.6, scale.x=FALSE))
# Print out coefficients
coef.f <- coef(f)
coef.f[coef.f!=0, ]
Fit an SPLS regression model.
spls( x, y, K, eta, kappa=0.5, select="pls2", fit="simpls",
      scale.x=TRUE, scale.y=FALSE, eps=1e-4, maxstep=100, trace=FALSE )
x |
Matrix of predictors. |
y |
Vector or matrix of responses. |
K |
Number of hidden components. |
eta |
Thresholding parameter. |
kappa |
Parameter to control the effect of the concavity of the objective function and the closeness of original and surrogate direction vectors. kappa is relevant only when responses have more than one variable; it should be between 0 and 0.5. Default is 0.5. |
select |
PLS algorithm for variable selection. Alternatives are "pls2" or "simpls". Default is "pls2". |
fit |
PLS algorithm for model fitting. Alternatives are "kernelpls", "widekernelpls", "simpls", or "oscorespls". Default is "simpls". |
scale.x |
Scale predictors by dividing each predictor variable by its sample standard deviation? |
scale.y |
Scale responses by dividing each response variable by its sample standard deviation? |
eps |
An effective zero. Default is 1e-4. |
maxstep |
Maximum number of iterations when fitting direction vectors. Default is 100. |
trace |
Print out the progress of variable selection? |
The SPLS method is described in detail in Chun and Keles (2010).
SPLS directly imposes sparsity on the dimension reduction step of PLS
in order to achieve accurate prediction and variable selection simultaneously.
The option select refers to the PLS algorithm for variable selection. The option fit refers to the PLS algorithm for model fitting; spls utilizes the algorithms offered by the pls package for this purpose. See the help files of the function plsr in the pls package for more details. The user should install the pls package before using spls functions. The choices for select and fit are independent.
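Because the two choices are independent, they can be mixed freely. The sketch below selects variables with "pls2" while fitting with the pls package's "oscorespls" algorithm (assuming the pls package is installed; K and eta are illustrative):

```r
data(yeast)
# Variable selection via "pls2", model fitting via "oscorespls"
f <- spls( yeast$x, yeast$y, K=2, eta=0.7,
           select="pls2", fit="oscorespls" )
```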
An spls object is returned. The print, plot, predict, coef, ci.spls, and coefplot.spls methods use this object.
Dongjun Chung, Hyonho Chun, and Sunduz Keles.
Chun H and Keles S (2010), "Sparse partial least squares for simultaneous dimension reduction and variable selection", Journal of the Royal Statistical Society - Series B, Vol. 72, pp. 3–25.
print.spls, plot.spls, predict.spls, coef.spls, ci.spls, and coefplot.spls.
data(yeast)
# SPLS with eta=0.7 & 8 hidden components
(f <- spls(yeast$x, yeast$y, K=8, eta=0.7))
# Print out coefficients
coef.f <- coef(f)
coef.f[,1]
# Coefficient path plot
plot(f, yvar=1)
dev.new()
# Coefficient plot of selected variables
coefplot.spls(f, xvar=c(1:4))
Fit an SPLSDA classification model.
splsda( x, y, K, eta, kappa=0.5, classifier=c('lda','logistic'),
        scale.x=TRUE, ... )
x |
Matrix of predictors. |
y |
Vector of class indices. |
K |
Number of hidden components. |
eta |
Thresholding parameter. |
kappa |
Parameter to control the effect of the concavity of the objective function and the closeness of original and surrogate direction vectors. kappa is relevant only when responses have more than one variable; it should be between 0 and 0.5. Default is 0.5. |
classifier |
Classifier used in the second step of SPLSDA. Alternatives are classifier="lda" or classifier="logistic". Default is classifier="lda". |
scale.x |
Scale predictors by dividing each predictor variable by its sample standard deviation? |
... |
Other parameters to be passed through to the classifier specified by the classifier argument. |
The SPLSDA method is described in detail in Chung and Keles (2010).
SPLSDA provides a two-stage approach for PLS-based classification with variable selection,
by directly imposing sparsity on the dimension reduction step of PLS
using sparse partial least squares (SPLS) proposed in Chun and Keles (2010).
y is assumed to have numerical values 0, 1, ..., G, where G is one less than the number of classes. The option classifier refers to the classifier used in the second step of SPLSDA; splsda utilizes algorithms offered by the MASS and nnet packages for this purpose. If classifier="logistic", either logistic regression or multinomial regression is used. Linear discriminant analysis (LDA) is used if classifier="lda". splsda also utilizes algorithms offered by the pls package for fitting spls. The user should install the pls, MASS, and nnet packages before using splsda functions.
An splsda object is returned. The print, predict, and coef methods use this object.
Dongjun Chung and Sunduz Keles.
Chung D and Keles S (2010), "Sparse partial least squares classification for high dimensional data", Statistical Applications in Genetics and Molecular Biology, Vol. 9, Article 17.
Chun H and Keles S (2010), "Sparse partial least squares for simultaneous dimension reduction and variable selection", Journal of the Royal Statistical Society - Series B, Vol. 72, pp. 3–25.
print.splsda, predict.splsda, and coef.splsda.
data(prostate)
# SPLSDA with eta=0.8 & 3 hidden components
f <- splsda( prostate$x, prostate$y, K=3, eta=0.8, scale.x=FALSE )
print(f)
# Print out coefficients
coef.f <- coef(f)
coef.f[ coef.f!=0, ]
This is the Yeast Cell Cycle dataset used in Chun and Keles (2010).
data(yeast)
A list with two components:
x: ChIP-chip data. A matrix with 542 rows and 106 columns.
y: Cell cycle gene expression data. A matrix with 542 rows and 18 columns.
Matrix y is cell cycle gene expression data (Spellman et al., 1998) of 542 genes from an alpha factor based experiment. Each column corresponds to mRNA levels measured every 7 minutes during 119 minutes (a total of 18 measurements). Matrix x is the chromatin immunoprecipitation on chip (ChIP-chip) data of Lee et al. (2002), and it contains the binding information for 106 transcription factors. See Chun and Keles (2010) for more details.
Lee TI, Rinaldi NJ, Robert F, Odom DT, Bar-Joseph Z, Gerber GK, Hannett NM, Harbison CT, Thomson CM, Simon I, Zeitlinger J, Jennings EG, Murray HL, Gordon DB, Ren B, Wyrick JJ, Tagne JB, Volkert TL, Fraenkel E, Gifford DK, and Young RA (2002), "Transcriptional regulatory networks in Saccharomyces cerevisiae", Science, Vol. 298, pp. 799–804.
Spellman PT, Sherlock G, Zhang MQ, Iyer VR, Anders K, Eisen MB, Brown PO, Botstein D, and Futcher B (1998), "Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization", Molecular Biology of the Cell, Vol. 9, pp. 3273–3279.
Chun H and Keles S (2010), "Sparse partial least squares for simultaneous dimension reduction and variable selection", Journal of the Royal Statistical Society - Series B, Vol. 72, pp. 3–25.
data(yeast)
yeast$x[1:5,1:5]
yeast$y[1:5,1:5]