Title:  Prediction and Variable Selection Using Spike and Slab Regression 

Description:  Spike and slab for prediction and variable selection in linear regression models. Uses a generalized elastic net for variable selection. 
Authors:  Hemant Ishwaran <[email protected]> 
Maintainer:  Udaya B. Kogalur <[email protected]> 
License:  GPL (>= 3) 
Version:  1.1.6 
Built:  2024-03-20 07:57:45 UTC 
Source:  CRAN 
Computes the K-fold cross-validated mean-squared prediction error for the generalized elastic net from spike and slab regression. Returns a stability index for each variable.
cv.spikeslab(x = NULL, y = NULL, K = 10,
plot.it = TRUE, n.iter1 = 500, n.iter2 = 500, mse = TRUE,
bigp.smalln = FALSE, bigp.smalln.factor = 1, screen = (bigp.smalln),
r.effects = NULL, max.var = 500, center = TRUE, intercept = TRUE,
fast = TRUE, beta.blocks = 5, verbose = TRUE, save.all = TRUE,
ntree = 300, seed = NULL, ...)
x 
x-predictor matrix. 
y 
y-response values. 
K 
Number of folds. 
plot.it 
If TRUE, plots the mean prediction error and its standard error. 
n.iter1 
Number of burn-in Gibbs sampled values (i.e., discarded values). 
n.iter2 
Number of Gibbs sampled values, following burn-in. 
mse 
If TRUE, an external estimate for the overall variance is calculated. 
bigp.smalln 
Use if p >> n, where p is the number of variables and n is the sample size. 
bigp.smalln.factor 
Top n times bigp.smalln.factor variables are retained in the filtering step (applies only when bigp.smalln=TRUE). 
screen 
If TRUE, variables are first prefiltered. 
r.effects 
List used for grouping variables (see details below). 
max.var 
Maximum number of variables allowed in the final model. 
center 
If TRUE, variables are centered by their means. Default is TRUE and should only be adjusted in extreme examples. 
intercept 
If TRUE, an intercept is included in the model, otherwise no intercept is included. Default is TRUE. 
fast 
If TRUE, use blocked Gibbs sampling to accelerate the algorithm. 
beta.blocks 
Update beta using this number of blocks (applies only when fast=TRUE). 
verbose 
If TRUE, verbose output is sent to the terminal. 
save.all 
If TRUE, spikeslab object for each fold is saved and returned. 
ntree 
Number of trees used by random forests (applies only when mse=TRUE). 
seed 
Seed for random number generator. Must be a negative integer. 
... 
Further arguments passed to or from other methods. 
Invisibly returns a list with components:
spikeslab.obj 
Spike and slab object from the full data. 
cv.spikeslab.obj 
List containing spike and slab objects from each fold. Can be NULL. 
cv.fold 
List containing the cv splits. 
cv 
Mean-squared error for each fold for the gnet. 
cv.path 
A matrix of mean-squared errors for the gnet solution path. Rows correspond to model sizes, columns are the folds. 
stability 
Matrix containing stability for each variable, defined as the percentage of times a variable is identified over the K folds. Also includes bma and gnet coefficient values and their cv-fold-averaged values. 
bma 
bma coefficients from the full data in terms of the standardized x. 
bma.scale 
bma coefficients from the full data, scaled in terms of the original x. 
gnet 
cv-optimized gnet in terms of the standardized x. 
gnet.scale 
cv-optimized gnet in terms of the original x. 
gnet.model 
List of models selected by gnet over the K folds. 
gnet.path 
gnet path from the full data, scaled in terms of the original x. 
gnet.obj 
gnet object from fitting the full data (a lars-type object). 
gnet.obj.vars 
Variables (in order) used to calculate the gnet object. 
verbose 
Verbose details (used for printing). 
Hemant Ishwaran ([email protected])
J. Sunil Rao ([email protected])
Udaya B. Kogalur ([email protected])
Ishwaran H. and Rao J.S. (2005a). Spike and slab variable selection: frequentist and Bayesian strategies. Ann. Statist., 33:730-773.
Ishwaran H. and Rao J.S. (2010). Generalized ridge regression: geometry and computational solutions when p is larger than n.
Ishwaran H. and Rao J.S. (2011). Mixing generalized ridge regressions.
sparsePC.spikeslab, plot.spikeslab, predict.spikeslab, print.spikeslab.
## Not run:
#
# Example 1: 10-fold validation using parallel processing
#
data(ozoneI, package = "spikeslab")
y <- ozoneI[, 1]
x <- ozoneI[, -1]
cv.obj <- cv.spikeslab(x = x, y = y, parallel = 4)
plot(cv.obj, plot.type = "cv")
plot(cv.obj, plot.type = "path")
#
# Example 2: 10-fold validation using parallel processing
# (high dimensional diabetes data)
#
# add 2000 noise variables
data(diabetesI, package = "spikeslab")
diabetes.noise <- cbind(diabetesI,
noise = matrix(rnorm(nrow(diabetesI) * 2000), nrow(diabetesI)))
x <- diabetes.noise[, -1]
y <- diabetes.noise[, 1]
cv.obj <- cv.spikeslab(x = x, y = y, bigp.smalln = TRUE, parallel = 4)
plot(cv.obj)
## End(Not run)
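The returned components can be used directly to locate a good model size. A minimal sketch (not part of the package examples), assuming cv.obj comes from a call such as Example 1 above:
## Not run:
# average the mean-squared error over folds for each model size
# along the gnet path, and locate the smallest value
mse.path <- apply(cv.obj$cv.path, 1, mean, na.rm = TRUE)
which.min(mse.path)
# inspect per-variable stability over the K folds
head(cv.obj$stability)
## End(Not run)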
The data consists of 442 patients in which the response of interest is a
quantitative measure of disease progression. The data includes 10
baseline measurements for each patient, in addition to 45 interactions
and 9 quadratic terms, for a total of 64 variables for each patient.
The outcome is Y.
Efron B., Hastie T., Johnstone I. and Tibshirani R. (2004). Least angle regression (with discussion). Ann. Statist., 32:407-499.
data(diabetesI, package = "spikeslab")
Median house price for 506 census tracts of Boston from the 1970 census.
The original data comprises 506 observations and 13 variables but has
been modified here to include all pairwise interactions of main effects
and to include B-spline basis functions of up to 6 degrees of freedom
for all original predictors. In addition, all real-valued variables
were mapped to dummy variables representing a factor with three levels,
and all pairwise interactions of these dummy variables were added to the
design matrix. In total, the modified data contains 506 observations
and 658 variables. The outcome is the median house price, medv.
Harrison D. and Rubinfeld D.L. (1978). Hedonic prices and the demand for clean air. J. Envir. Economics Management, 5:81-102.
data(housingI, package = "spikeslab")
Gene expression cancer data set (Golub et al.) of samples from human acute myeloid (AML) and acute lymphoblastic leukemias (ALL). 3571 expression values on 72 individuals.
Golub T.R. et al. (1999). Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286(5439):531-537.
data(leukemia, package = "spikeslab")
The data consists of 366 readings of maximum daily ozone measured in the
Los Angeles basin. After removing missing values, the original data has
been expanded to include all pairwise interactions, as well as B-spline
basis functions (6 degrees of freedom), for each of the original 12
variables (9 meteorological variables and 3 variables recording date of
measurement: month, day of the month, and day of week). In total, the
modified data has 203 observations and 134 variables. The outcome is
ozone.
Breiman L. and Friedman J.H. (1985). Estimating optimal transformations for multiple regression and correlation. J. Amer. Stat. Assoc., 80:580-598.
data(ozoneI, package = "spikeslab")
Plots either the gnet solution path or the cross-validated mean-squared error (the latter applies only when cross-validation is used).
## S3 method for class 'spikeslab'
plot(x, plot.type = c("path", "cv"), breaks = FALSE, ...)
x 
An object of class spikeslab. 
plot.type 
Choosing "path" produces a plot of the gnet solution
path. The choice "cv" produces the meansquared error plot. The
latter applies only to objects from a 
breaks 
If TRUE, then vertical lines are drawn at each break point in the gnet solution path. 
... 
Further arguments passed to or from other methods. 
Hemant Ishwaran ([email protected])
J. Sunil Rao ([email protected])
Udaya B. Kogalur ([email protected])
Efron B., Hastie T., Johnstone I. and Tibshirani R. (2004). Least angle regression (with discussion). Ann. Statist., 32:407-499.
Ishwaran H. and Rao J.S. (2010). Generalized ridge regression: geometry and computational solutions when p is larger than n.
spikeslab, cv.spikeslab.
## Not run:
#
# Example 1: diabetes data
#
data(diabetesI, package = "spikeslab")
obj <- spikeslab(Y ~ . , diabetesI, verbose = TRUE)
plot(obj, plot.type = "path")
## End(Not run)
Prediction on test data using spike and slab regression.
## S3 method for class 'spikeslab'
predict(object, newdata = NULL, ...)
object 
An object of class spikeslab. 
newdata 
Data frame or x-matrix containing test data (if omitted, the training data is used). 
... 
Further arguments passed to or from other methods. 
Computes the predicted value using a test data set.
A vector of fitted values for the BMA and gnet and a matrix of fitted values for the gnet path. Also returns the grr mixing predictor if the object has been parsed by the mixing wrapper.
Hemant Ishwaran ([email protected])
J. Sunil Rao ([email protected])
Udaya B. Kogalur ([email protected])
Ishwaran H. and Rao J.S. (2003). Detecting differentially expressed genes in microarrays using Bayesian model selection. J. Amer. Stat. Assoc., 98:438-455.
Ishwaran H. and Rao J.S. (2005a). Spike and slab variable selection: frequentist and Bayesian strategies. Ann. Statist., 33:730-773.
Ishwaran H. and Rao J.S. (2005b). Spike and slab gene selection for multigroup microarray data. J. Amer. Stat. Assoc., 100:764-780.
Ishwaran H. and Rao J.S. (2010). Generalized ridge regression: geometry and computational solutions when p is larger than n.
Ishwaran H., Kogalur U.B. and Rao J.S. (2010). spikeslab: prediction and variable selection using spike and slab regression. R Journal, 2(2), 68-73.
Ishwaran H. and Rao J.S. (2011). Mixing generalized ridge regressions.
spikeslab.
## Not run:
#
# Example 1: get the predictor for the training data
#
data(diabetesI, package = "spikeslab")
x <- diabetesI[, -1]
y <- diabetesI[, 1]
obj <- spikeslab(x = x, y = y)
# gnet predictor
yhat.gnet <- predict(obj)$yhat.gnet
# an equivalent call is...
yhat.gnet <- predict(obj, newdata = x)$yhat.gnet
#
# Example 2: ozone data with interactions
#
data(ozoneI, package = "spikeslab")
train.pt <- sample(1:nrow(ozoneI), nrow(ozoneI) * 0.80)
obj <- spikeslab(ozone ~ . , ozoneI[train.pt, ])
ytest <- ozoneI$ozone[-train.pt]
ss.pred <- predict(obj, ozoneI[-train.pt, ])
yhat.bma <- ss.pred$yhat.bma
yhat.gnet <- ss.pred$yhat.gnet
plot(ytest, yhat.bma, ylab = "yhat", pch = 16, col = 4)
points(ytest, yhat.gnet, pch = 16, col = 2)
abline(0, 1, lty = 2, col = 2)
legend("bottomright", legend = c("bma", "gnet"), col = c(4, 2), pch = 16)
## End(Not run)
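To compare the two predictors numerically, here is a minimal sketch (not part of the package examples) continuing Example 2; test-set mean-squared error is one natural metric:
## Not run:
# test-set mean-squared error for the bma and gnet predictors
mse.bma <- mean((ytest - yhat.bma)^2)
mse.gnet <- mean((ytest - yhat.gnet)^2)
print(c(bma = mse.bma, gnet = mse.gnet))
## End(Not run)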
Print summary output from spike and slab analysis. Note that this is the default print method for the package.
## S3 method for class 'spikeslab'
print(x, ...)
x 
An object of class spikeslab. 
... 
Further arguments passed to or from other methods. 
Hemant Ishwaran ([email protected])
J. Sunil Rao ([email protected])
Udaya B. Kogalur ([email protected])
Ishwaran H. and Rao J.S. (2003). Detecting differentially expressed genes in microarrays using Bayesian model selection. J. Amer. Stat. Assoc., 98:438-455.
Ishwaran H. and Rao J.S. (2005a). Spike and slab variable selection: frequentist and Bayesian strategies. Ann. Statist., 33:730-773.
Ishwaran H. and Rao J.S. (2005b). Spike and slab gene selection for multigroup microarray data. J. Amer. Stat. Assoc., 100:764-780.
Ishwaran H. and Rao J.S. (2009). Generalized ridge regression: geometry and computational solutions when p is larger than n.
spikeslab.
## Not run:
#
# Example 1: diabetes data
#
data(diabetesI, package = "spikeslab")
obj <- spikeslab(Y ~ . , diabetesI, verbose = TRUE)
print(obj)
## End(Not run)
Variable selection for the multiclass gene prediction problem.
sparsePC.spikeslab(x = NULL, y = NULL, n.rep = 10,
n.iter1 = 150, n.iter2 = 100, n.prcmp = 5, max.genes = 100,
ntree = 1000, nodesize = 1, verbose = TRUE, ...)
x 
x matrix of gene expressions. 
y 
Class labels. 
n.rep 
Number of Monte Carlo replicates. 
n.iter1 
Number of burn-in Gibbs sampled values (i.e., discarded values). 
n.iter2 
Number of Gibbs sampled values, following burn-in. 
n.prcmp 
Number of principal components. 
max.genes 
Maximum number of genes in final model. 
ntree 
Number of trees used by random forests. 
nodesize 
Nodesize of trees. 
verbose 
If TRUE, verbose output is sent to the terminal. 
... 
Further arguments passed to or from other methods. 
Multiclass prediction using a hybrid combination of spike and slab
linear regression and random forest multiclass prediction (Ishwaran
and Rao, 2009). A pseudo y-vector of response values is calculated
using each of the top n.prcmp principal components of the x-gene
expression matrix. The generalized elastic net obtained from spike
and slab regression is used to select genes; one regression fit is
used for each of the pseudo y-response vectors. The final combined
set of genes is passed to random forests and used to construct a
multiclass forest predictor. This procedure is repeated n.rep times,
with each Monte Carlo replicate based on balanced cross-validation
with 2/3rds of the data used for training and 1/3rd used for testing.
---> Miscellanea:
Test set error is only computed when n.rep is larger than 1. If
n.rep=1, the full data is used without any cross-validation.
Invisibly returns the final set of selected genes, as well as the
complete set of genes selected over the n.rep Monte Carlo
replications. The random forest classifier is also returned.
The misclassification error rate, error rate for each class, and other summary information are output to the terminal.
Hemant Ishwaran ([email protected])
J. Sunil Rao ([email protected])
Udaya B. Kogalur ([email protected])
Ishwaran H. and Rao J.S. (2009). Generalized ridge regression: geometry and computational solutions when p is larger than n.
spikeslab.
## Not run:
#
# Example 1: leukemia data
#
data(leukemia, package = "spikeslab")
sparsePC.out <- sparsePC.spikeslab(x = leukemia[, -1], y = leukemia[, 1], n.rep = 3)
rf.obj <- sparsePC.out$rf.obj
varImpPlot(rf.obj)
## End(Not run)
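Because the returned classifier is a randomForest object, standard randomForest tools apply. A minimal sketch (not part of the package examples; it assumes the returned forest was grown on the full data, so that its out-of-bag predictions align with the original rows):
## Not run:
# out-of-bag confusion matrix for the returned forest classifier
table(observed = leukemia[, 1], predicted = predict(rf.obj))
## End(Not run)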
Fits a rescaled spike and slab model using a continuous bimodal prior. A generalized elastic net estimator is used for variable selection and estimation. Can be used for prediction and variable selection in low- and high-dimensional linear regression models.
spikeslab(formula, data = NULL, x = NULL, y = NULL,
n.iter1 = 500, n.iter2 = 500, mse = TRUE,
bigp.smalln = FALSE, bigp.smalln.factor = 1, screen = (bigp.smalln),
r.effects = NULL, max.var = 500, center = TRUE, intercept = TRUE,
fast = TRUE, beta.blocks = 5, verbose = FALSE, ntree = 300,
seed = NULL, ...)
formula 
A symbolic description of the model to be fit. 
data 
Data frame containing the data used in the formula. 
x 
x predictor matrix (can be used in place of formula and data frame call). 
y 
y response (can be used in place of formula and data frame call). 
n.iter1 
Number of burn-in Gibbs sampled values (i.e., discarded values). 
n.iter2 
Number of Gibbs sampled values, following burn-in. 
mse 
If TRUE, an external estimate for the overall variance is calculated using ridge regression or random forests (the latter is used when the degrees of freedom are low). Otherwise, the variance is included in the prior and estimated using Gibbs sampling. 
bigp.smalln 
Use if p >> n, where p is the number of variables and n is the sample size. 
bigp.smalln.factor 
Removes all variables except the top n times bigp.smalln.factor variables in the filtering step (applies only when bigp.smalln=TRUE). 
screen 
If TRUE, variables are prefiltered. 
r.effects 
List used for grouping variables (see details below). 
max.var 
Maximum number of variables allowed in the final model. 
center 
If TRUE, variables are centered by their means. Default is TRUE and should only be adjusted in extreme examples. 
intercept 
If TRUE, an intercept is included in the model, otherwise no intercept is included. Default is TRUE. 
fast 
If TRUE, use blocked Gibbs sampling to accelerate the algorithm. 
beta.blocks 
Update beta using this number of blocks (applies only when fast=TRUE). 
verbose 
If TRUE, verbose output is sent to the terminal. 
ntree 
Number of trees used by random forests (applies only when mse=TRUE). 
seed 
Seed for random number generator. Must be a negative integer. 
... 
Further arguments passed to or from other methods. 
---> General:
The spike and slab method is described in detail in Ishwaran and Rao
(2003, 2005a, 2005b and 2009). For high-dimensional problems in which
p >> n, where p is the number of variables and n is the sample size,
use the option bigp.smalln=TRUE. Doing so implements a three-stage
procedure:
(1) Filtering step. This removes all variables except the top n times
bigp.smalln.factor ones. Uses spike and slab regression with grouped
regularization (complexity) parameters.
(2) Model averaging step. Refit the model using only those predictors from step 1. Returns the posterior mean values from fitting a spike and slab model; referred to as the Bayesian model averaged (bma) estimate.
(3) Variable selection step. Select variables using the generalized elastic net (gnet).
The filtering step is omitted when bigp.smalln=FALSE. Filtering can,
however, be requested by setting screen=TRUE, although users should be
aware that this may degrade performance and should only be used when p
is on the same order as n.
Variables can be grouped using r.effects. Grouping has the effect of
forcing variables within a given group to share a common complexity
(regularization) parameter. To do so, define a list with each entry in
the list made up of the variable names to be grouped. There is no limit
to the number of groups. Any variable that does not appear in the list
will be assigned to a default group (the default group also has its own
group-specific regularization parameter). See Examples 1 and 3 below.
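A minimal sketch of the expected list structure (the variable names x1 through x5 are hypothetical placeholders, not from the package documentation):
# two groups, each sharing a group-specific regularization parameter;
# any variable not listed falls into the default group
r.eff <- list(c("x1", "x2", "x3"), c("x4", "x5"))
fit <- spikeslab(x = x, y = y, r.effects = r.eff)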
---> Miscellanea:
By default, fast=TRUE when bigp.smalln=TRUE. This invokes an
ultra-fast filtering step. Setting fast=FALSE invokes a more thorough
filtering method that may slightly improve inferential results, but
computational times become very slow. The trade-off is unlikely to be
justified.
The formula and data-frame call should be avoided in high-dimensional problems; instead, the x-predictor matrix and y-response vector should be passed directly (see Example 3). This avoids the huge overhead of parsing formulas in R.
By default, predictors are normalized to have mean 0 and variance 1.
Preprocessing also involves centering y, unless the user specifically
requests that the intercept be excluded from the model. Users can also
override centering of predictors by setting center=FALSE. Use with
extreme care.
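Given the returned components, the standardization can be checked directly. A minimal sketch (not from the package documentation), assuming obj is a fitted spikeslab object with center=TRUE and that no columns were dropped during preprocessing:
## Not run:
# xnew should match the original x centered by x.center and
# scaled by x.scale
x.std <- scale(as.matrix(obj$x), center = obj$x.center, scale = obj$x.scale)
all.equal(as.vector(x.std), as.vector(as.matrix(obj$xnew)))
## End(Not run)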
The verbose option sends output to the terminal showing the number of
Gibbs iterations and the current complexity (regularization)
parameter(s).
Depends on the randomForest package for estimating the variance when
mse=TRUE. Note that mse is overridden and set to FALSE when
bigp.smalln=TRUE.
Depends on the lars package for the variable selection step.
An object of class spikeslab with the following components:
summary 
Summary object. 
verbose 
Verbose details (used for printing). 
terms 
Terms. 
sigma.hat 
Estimated variance. 
y 
Original y. 
xnew 
Centered, rescaled x-matrix. 
x 
Original x-matrix. 
y.center 
Centering for original y. 
x.center 
Centering for original x-matrix. 
x.scale 
Scaling for original x-matrix. 
names 
Variable names. 
bma 
bma coefficients in terms of xnew. 
bma.scale 
bma coefficients rescaled in terms of original x. 
gnet 
gnet coefficients in terms of xnew. 
gnet.scale 
gnet coefficients rescaled in terms of original x. 
gnet.path 
gnet path scaled in terms of the original x. 
gnet.obj 
gnet object (a lars-type object). 
gnet.obj.vars 
Variables (in order) used to calculate the gnet object. 
gnet.parms 
Generalized ridge regression parameters used to define the gnet. 
phat 
Estimated model dimension. 
complexity 
Complexity (regularization) parameter estimates. 
ridge 
List containing ridge values used to determine the bma. 
models 
List containing the models sampled. 
Hemant Ishwaran ([email protected])
J. Sunil Rao ([email protected])
Udaya B. Kogalur ([email protected])
Breiman L. (2001). Random forests. Machine Learning, 45:5-32.
Efron B., Hastie T., Johnstone I. and Tibshirani R. (2004). Least angle regression (with discussion). Ann. Statist., 32:407-499.
Ishwaran H. and Rao J.S. (2003). Detecting differentially expressed genes in microarrays using Bayesian model selection. J. Amer. Stat. Assoc., 98:438-455.
Ishwaran H. and Rao J.S. (2005a). Spike and slab variable selection: frequentist and Bayesian strategies. Ann. Statist., 33:730-773.
Ishwaran H. and Rao J.S. (2005b). Spike and slab gene selection for multigroup microarray data. J. Amer. Stat. Assoc., 100:764-780.
Ishwaran H. and Rao J.S. (2010). Generalized ridge regression: geometry and computational solutions when p is larger than n.
Ishwaran H., Kogalur U.B. and Rao J.S. (2010). spikeslab: prediction and variable selection using spike and slab regression. R Journal, 2(2), 68-73.
Ishwaran H. and Rao J.S. (2011). Mixing generalized ridge regressions.
Zou H. and Hastie T. (2005). Regularization and variable selection via the elastic net. J. Royal Statist. Society B, 67(2):301-320.
cv.spikeslab, plot.spikeslab, predict.spikeslab, print.spikeslab, sparsePC.spikeslab.
#
# Example 1: diabetes data
#
# basic call
data(diabetesI, package = "spikeslab")
obj <- spikeslab(Y ~ . , diabetesI, verbose = TRUE)
print(obj)
plot(obj)
# grouping effect
# separate main effects and interactions into two groups
# use a group-specific regularization parameter for each group
xnames <- names(diabetesI[, -1])
r.eff <- vector("list", 2)
r.eff[[1]] <- xnames[c(1:10)]
r.eff[[2]] <- xnames[-c(1:10)]
obj2 <- spikeslab(Y ~ . , diabetesI, verbose = TRUE, r.effects = r.eff)
obj2
# extract the regularization parameters
print(apply(obj2$complexity, 2, summary))
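# a minimal sketch (not from the package examples): read the selected
# model off the gnet coefficients, assuming the coefficient vector
# aligns with obj$names
selected <- obj$names[obj$gnet != 0]
print(selected)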
## Not run:
#
# Example 2: high-dimensional noise (diabetes data)
#
# add 2000 noise variables
data(diabetesI, package = "spikeslab")
diabetes.noise <- cbind(diabetesI,
noise = matrix(rnorm(nrow(diabetesI) * 2000), nrow(diabetesI)))
# example of a big p, small n call
# don't use formula call; make call with x and y arguments
x <- diabetes.noise[, -1]
y <- diabetes.noise[, 1]
obj <- spikeslab(x = x, y = y, verbose = TRUE, bigp.smalln = TRUE, max.var = 100)
obj
# same example ... but now group variables
r.eff <- vector("list", 2)
r.eff[[1]] <- names(x)[c(1:100)]
r.eff[[2]] <- names(x)[-c(1:100)]
obj2 <- spikeslab(x = x, y = y, verbose = TRUE, bigp.smalln = TRUE,
r.effects = r.eff, max.var = 100)
obj2
#
# Example 3: housing data with interactions
#
# another example of a big p, small n call
data(housingI, package = "spikeslab")
obj <- spikeslab(medv ~ ., housingI, verbose = TRUE,
bigp.smalln = TRUE, max.var = 200)
print(obj)
## End(Not run)
Show the NEWS file of the spikeslab package.
spikeslab.news(...)
... 
Further arguments passed to or from other methods. 
None.
Hemant Ishwaran ([email protected])