Title: | Extending Lasso Model Fitting to Big Data |
---|---|
Description: | Extend lasso and elastic-net model fitting for ultra high-dimensional, multi-gigabyte data sets that cannot be loaded into memory. Designed to be more memory- and computation-efficient than existing lasso-fitting packages like 'glmnet' and 'ncvreg', thus allowing the user to analyze big data analysis even on an ordinary laptop. |
Authors: | Yaohui Zeng [aut], Chuyi Wang [aut], Tabitha Peter [aut], Patrick Breheny [aut, cre] |
Maintainer: | Patrick Breheny <[email protected]> |
License: | GPL-3 |
Version: | 1.6.0 |
Built: | 2024-10-31 22:19:30 UTC |
Source: | CRAN |
Extend lasso and elastic-net linear, logistic and cox regression models for ultrahigh-dimensional, multi-gigabyte data sets that cannot be loaded into available RAM. This package utilizes memory-mapped files to store the massive data on the disk and only read those into memory whenever necessary during model fitting. Moreover, some advanced feature screening rules are proposed and implemented to accelerate the model fitting. As a result, this package is much more memory- and computation-efficient and highly scalable as compared to existing lasso-fitting packages such as glmnet and ncvreg, thus allowing for powerful big data analysis even with only an ordinary laptop.
Package: | biglasso |
Type: | Package |
Version: | 1.4-1 |
Date: | 2021-01-29 |
License: | GPL-3 |
Penalized regression models, in particular the lasso, have been extensively applied to analyzing high-dimensional data sets. However, due to the memory limit, existing R packages are not capable of fitting lasso models for ultrahigh-dimensional, multi-gigabyte data sets which have been increasingly seen in many areas such as genetics, biomedical imaging, genome sequencing and high-frequency finance.
This package aims to fill the gap by extending lasso model fitting to Big
Data in R. Version >= 1.2-3 represents a major redesign where the source
code is converted into C++ (previously in C), and new feature screening
rules, as well as OpenMP parallel computing, are implemented. Some key
features of biglasso
are summarized as below:
it utilizes memory-mapped files to store the massive data on the disk, only loading data into memory when necessary during model fitting. Consequently, it's able to seamlessly data-larger-than-RAM cases.
it is built upon pathwise coordinate descent algorithm with warm start, active set cycling, and feature screening strategies, which has been proven to be one of fastest lasso solvers.
in incorporates our newly developed hybrid and adaptive screening that outperform state-of-the-art screening rules such as the sequential strong rule (SSR) and the sequential EDPP rule (SEDPP) with additional 1.5x to 4x speedup.
the implementation is designed to be as
memory-efficient as possible by eliminating extra copies of the data created
by other R packages, making it at least 2x more memory-efficient than
glmnet
.
the underlying computation is implemented in C++, and parallel computing with OpenMP is also supported.
For more information:
Benchmarking results: https://github.com/pbreheny/biglasso
Tutorial: https://pbreheny.github.io/biglasso/articles/biglasso.html
Technical paper: https://arxiv.org/abs/1701.05936
The input design matrix X must be a bigmemory::big.matrix()
object.
This can be created by the function as.big.matrix
in the R package
bigmemory.
If the data (design matrix) is very large (e.g. 10 GB) and stored in an external
file, which is often the case for big data, X can be created by calling the
function setupX()
.
In this case, there are several restrictions about the data file:
the data file must be a well-formated ASCII-file, with each row corresponding to an observation and each column a variable;
the data file must contain only one single type. Current version only
supports double
type;
the data file must contain only numeric variables. If there are categorical variables, the user needs to create dummy variables for each categorical varable (by adding additional columns).
Future versions will try to address these restrictions.
Denote the number of observations and variables be, respectively, n
and p
. It's worth noting that the package is more suitable for wide
data (ultrahigh-dimensional, p >> n
) as compared to long data
(n >> p
). This is because the model fitting algorithm takes advantage
of sparsity assumption of high-dimensional data. To just give the user some
ideas, below are some benchmarking results of the total computing time (in
seconds) for solving lasso-penalized linear regression along a sequence of
100 values of the tuning parameter. In all cases, assume 20 non-zero
coefficients equal +/- 2 in the true model. (Based on Version 1.2-3,
screening rule "SSR-BEDPP" is used)
For wide data case (p > n
), n = 1,000
:
p |
1,000 | 10,000 | 100,000 | 1,000,000 |
Size of X |
9.5 MB | 95 MB | 950 MB | 9.5 GB |
Elapsed time (s) | 0.11 | 0.83 | 8.47 | 85.50 |
%
For long data case (n >> p
), p = 1,000
:
%
%n |
1,000 | 10,000 | 100,000 | 1,000,000 |
%Size of X |
9.5 MB | 95 MB | 950 MB | 9.5 GB |
%Elapsed time (s) | 2.50 | 11.43 | 83.69 | 1090.62 |
% |
Yaohui Zeng, Chuyi Wang, Tabitha Peter, and Patrick Breheny
Zeng, Y., and Breheny, P. (2017). The biglasso Package: A Memory- and Computation-Efficient Solver for Lasso Model Fitting with Big Data in R. https://arxiv.org/abs/1701.05936.
Tibshirani, R., Bien, J., Friedman, J., Hastie, T., Simon, N., Taylor, J., and Tibshirani, R. J. (2012). Strong rules for discarding predictors in lasso-type problems. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 74(2), 245-266.
Wang, J., Zhou, J., Wonka, P., and Ye, J. (2013). Lasso screening rules via dual polytope projection. In Advances in Neural Information Processing Systems, pp. 1070-1078.
Xiang, Z. J., and Ramadge, P. J. (2012). Fast lasso screening tests based on correlations. In Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on (pp. 2137-2140). IEEE.
Wang, J., Zhou, J., Liu, J., Wonka, P., and Ye, J. (2014). A safe screening rule for sparse logistic regression. In Advances in Neural Information Processing Systems, pp. 1053-1061.
## Not run: ## Example of reading data from external big data file, fit lasso model, ## and run cross validation in parallel # simulated design matrix, 1000 observations, 500,000 variables, ~ 5GB # there are 10 true variables with non-zero coefficient 2. xfname <- 'x_e3_5e5.txt' yfname <- 'y_e3_5e5.txt' # response vector time <- system.time( X <- setupX(xfname, sep = '\t') # create backing files (.bin, .desc) ) print(time) # ~ 7 minutes; this is just one-time operation dim(X) # the big.matrix then can be retrieved by its descriptor file (.desc) in any new R session. rm(X) xdesc <- 'x_e3_5e5.desc' X <- attach.big.matrix(xdesc) dim(X) y <- as.matrix(read.table(yfname, header = F)) time.fit <- system.time( fit <- biglasso(X, y, family = 'gaussian', screen = 'Hybrid') ) print(time.fit) # ~ 44 seconds for fitting a lasso model along the entire solution path # cross validation in parallel seed <- 1234 time.cvfit <- system.time( cvfit <- cv.biglasso(X, y, family = 'gaussian', screen = 'Hybrid', seed = seed, ncores = 4, nfolds = 10) ) print(time.cvfit) # ~ 3 minutes for 10-fold cross validation plot(cvfit) summary(cvfit) ## End(Not run)
## Not run: ## Example of reading data from external big data file, fit lasso model, ## and run cross validation in parallel # simulated design matrix, 1000 observations, 500,000 variables, ~ 5GB # there are 10 true variables with non-zero coefficient 2. xfname <- 'x_e3_5e5.txt' yfname <- 'y_e3_5e5.txt' # response vector time <- system.time( X <- setupX(xfname, sep = '\t') # create backing files (.bin, .desc) ) print(time) # ~ 7 minutes; this is just one-time operation dim(X) # the big.matrix then can be retrieved by its descriptor file (.desc) in any new R session. rm(X) xdesc <- 'x_e3_5e5.desc' X <- attach.big.matrix(xdesc) dim(X) y <- as.matrix(read.table(yfname, header = F)) time.fit <- system.time( fit <- biglasso(X, y, family = 'gaussian', screen = 'Hybrid') ) print(time.fit) # ~ 44 seconds for fitting a lasso model along the entire solution path # cross validation in parallel seed <- 1234 time.cvfit <- system.time( cvfit <- cv.biglasso(X, y, family = 'gaussian', screen = 'Hybrid', seed = seed, ncores = 4, nfolds = 10) ) print(time.cvfit) # ~ 3 minutes for 10-fold cross validation plot(cvfit) summary(cvfit) ## End(Not run)
Extend lasso model fitting to big data that cannot be loaded into memory. Fit solution paths for linear, logistic or Cox regression models penalized by lasso, ridge, or elastic-net over a grid of values for the regularization parameter lambda.
biglasso( X, y, row.idx = 1:nrow(X), penalty = c("lasso", "ridge", "enet"), family = c("gaussian", "binomial", "cox", "mgaussian"), alg.logistic = c("Newton", "MM"), screen = c("Adaptive", "SSR", "Hybrid", "None"), safe.thresh = 0, update.thresh = 1, ncores = 1, alpha = 1, lambda.min = ifelse(nrow(X) > ncol(X), 0.001, 0.05), nlambda = 100, lambda.log.scale = TRUE, lambda, eps = 1e-07, max.iter = 1000, dfmax = ncol(X) + 1, penalty.factor = rep(1, ncol(X)), warn = TRUE, output.time = FALSE, return.time = TRUE, verbose = FALSE )
biglasso( X, y, row.idx = 1:nrow(X), penalty = c("lasso", "ridge", "enet"), family = c("gaussian", "binomial", "cox", "mgaussian"), alg.logistic = c("Newton", "MM"), screen = c("Adaptive", "SSR", "Hybrid", "None"), safe.thresh = 0, update.thresh = 1, ncores = 1, alpha = 1, lambda.min = ifelse(nrow(X) > ncol(X), 0.001, 0.05), nlambda = 100, lambda.log.scale = TRUE, lambda, eps = 1e-07, max.iter = 1000, dfmax = ncol(X) + 1, penalty.factor = rep(1, ncol(X)), warn = TRUE, output.time = FALSE, return.time = TRUE, verbose = FALSE )
X |
The design matrix, without an intercept. It must be a
double type |
y |
The response vector for |
row.idx |
The integer vector of row indices of |
penalty |
The penalty to be applied to the model. Either |
family |
Either |
alg.logistic |
The algorithm used in logistic regression. If "Newton" then the exact hessian is used (default); if "MM" then a majorization-minimization algorithm is used to set an upper-bound on the hessian matrix. This can be faster, particularly in data-larger-than-RAM case. |
screen |
The feature screening rule used at each |
safe.thresh |
the threshold value between 0 and 1 that controls when to stop safe test. For example, 0.01 means to stop safe test at next lambda iteration if the number of features rejected by safe test at current lambda iteration is not larger than 1\ to always turn off safe test, whereas 0 (default) means to turn off safe test if the number of features rejected by safe test is 0 at current lambda. |
update.thresh |
the non negative threshold value that controls how often to update the reference of safe rules for "Adaptive" methods. Smaller value means updating more often. |
ncores |
The number of OpenMP threads used for parallel computing. |
alpha |
The elastic-net mixing parameter that controls the relative contribution from the lasso (l1) and the ridge (l2) penalty. The penalty is defined as
|
lambda.min |
The smallest value for lambda, as a fraction of lambda.max. Default is .001 if the number of observations is larger than the number of covariates and .05 otherwise. |
nlambda |
The number of lambda values. Default is 100. |
lambda.log.scale |
Whether compute the grid values of lambda on log scale (default) or linear scale. |
lambda |
A user-specified sequence of lambda values. By default, a
sequence of values of length |
eps |
Convergence threshold for inner coordinate descent. The
algorithm iterates until the maximum change in the objective after any
coefficient update is less than |
max.iter |
Maximum number of iterations. Default is 1000. |
dfmax |
Upper bound for the number of nonzero coefficients. Default is no upper bound. However, for large data sets, computational burden may be heavy for models with a large number of nonzero coefficients. |
penalty.factor |
A multiplicative factor for the penalty applied to
each coefficient. If supplied, |
warn |
Return warning messages for failures to converge and model saturation? Default is TRUE. |
output.time |
Whether to print out the start and end time of the model fitting. Default is FALSE. |
return.time |
Whether to return the computing time of the model fitting. Default is TRUE. |
verbose |
Whether to output the timing of each lambda iteration. Default is FALSE. |
The objective function for linear regression or multiple responses linear regression
(family = "gaussian"
or family = "mgaussian"
) is
where for family = "mgaussian"
), a group-lasso type penalty is applied.
For logistic regression
(family = "binomial"
) it is
, for cox regression, breslow approximation for ties is applied.
Several advanced feature screening rules are implemented. For
lasso-penalized linear regression, all the options of screen
are
applicable. Our proposal adaptive rule - "Adaptive"
- achieves highest speedup
so it's the recommended one, especially for ultrahigh-dimensional large-scale
data sets. For cox regression and/or the elastic net penalty, only
"SSR"
is applicable for now. More efficient rules are under development.
An object with S3 class "biglasso"
for
"gaussian", "binomial", "cox"
families, or an object with S3 class
"mbiglasso"
for "mgaussian"
family, with following variables.
beta |
The fitted matrix of coefficients, store in sparse matrix
representation. The number of rows is equal to the number of coefficients,
whereas the number of columns is equal to |
iter |
A vector of length |
lambda |
The sequence of regularization parameter values in the path. |
penalty |
Same as above. |
family |
Same as above. |
alpha |
Same as above. |
loss |
A vector containing either the residual sum of squares
(for |
penalty.factor |
Same as above. |
n |
The number of observations used in the model fitting. It's equal to
|
center |
The sample mean vector of the variables, i.e., column mean of
the sub-matrix of |
scale |
The sample standard deviation of the variables, i.e., column
standard deviation of the sub-matrix of |
y |
The response vector used in the model fitting. Depending on
|
screen |
Same as above. |
col.idx |
The indices of features that have 'scale' value greater than 1e-6. Features with 'scale' less than 1e-6 are removed from model fitting. |
rejections |
The number of features rejected at each value of |
safe_rejections |
The number of features rejected by safe rules at each
value of |
Yaohui Zeng, Chuyi Wang and Patrick Breheny
biglasso-package, setupX()
, cv.biglasso()
, plot.biglasso()
, ncvreg::ncvreg()
## Linear regression data(colon) X <- colon$X y <- colon$y X.bm <- as.big.matrix(X) # lasso, default par(mfrow=c(1,2)) fit.lasso <- biglasso(X.bm, y, family = 'gaussian') plot(fit.lasso, log.l = TRUE, main = 'lasso') # elastic net fit.enet <- biglasso(X.bm, y, penalty = 'enet', alpha = 0.5, family = 'gaussian') plot(fit.enet, log.l = TRUE, main = 'elastic net, alpha = 0.5') ## Logistic regression data(colon) X <- colon$X y <- colon$y X.bm <- as.big.matrix(X) # lasso, default par(mfrow = c(1, 2)) fit.bin.lasso <- biglasso(X.bm, y, penalty = 'lasso', family = "binomial") plot(fit.bin.lasso, log.l = TRUE, main = 'lasso') # elastic net fit.bin.enet <- biglasso(X.bm, y, penalty = 'enet', alpha = 0.5, family = "binomial") plot(fit.bin.enet, log.l = TRUE, main = 'elastic net, alpha = 0.5') ## Cox regression set.seed(10101) N <- 1000; p <- 30; nzc <- p/3 X <- matrix(rnorm(N * p), N, p) beta <- rnorm(nzc) fx <- X[, seq(nzc)] %*% beta/3 hx <- exp(fx) ty <- rexp(N, hx) tcens <- rbinom(n = N, prob = 0.3, size = 1) # censoring indicator y <- cbind(time = ty, status = 1 - tcens) # y <- Surv(ty, 1 - tcens) with library(survival) X.bm <- as.big.matrix(X) fit <- biglasso(X.bm, y, family = "cox") plot(fit, main = "cox") ## Multiple responses linear regression set.seed(10101) n=300; p=300; m=5; s=10; b=1 x = matrix(rnorm(n * p), n, p) beta = matrix(seq(from=-b,to=b,length.out=s*m),s,m) y = x[,1:s] %*% beta + matrix(rnorm(n*m,0,1),n,m) x.bm = as.big.matrix(x) fit = biglasso(x.bm, y, family = "mgaussian") plot(fit, main = "mgaussian")
## Linear regression data(colon) X <- colon$X y <- colon$y X.bm <- as.big.matrix(X) # lasso, default par(mfrow=c(1,2)) fit.lasso <- biglasso(X.bm, y, family = 'gaussian') plot(fit.lasso, log.l = TRUE, main = 'lasso') # elastic net fit.enet <- biglasso(X.bm, y, penalty = 'enet', alpha = 0.5, family = 'gaussian') plot(fit.enet, log.l = TRUE, main = 'elastic net, alpha = 0.5') ## Logistic regression data(colon) X <- colon$X y <- colon$y X.bm <- as.big.matrix(X) # lasso, default par(mfrow = c(1, 2)) fit.bin.lasso <- biglasso(X.bm, y, penalty = 'lasso', family = "binomial") plot(fit.bin.lasso, log.l = TRUE, main = 'lasso') # elastic net fit.bin.enet <- biglasso(X.bm, y, penalty = 'enet', alpha = 0.5, family = "binomial") plot(fit.bin.enet, log.l = TRUE, main = 'elastic net, alpha = 0.5') ## Cox regression set.seed(10101) N <- 1000; p <- 30; nzc <- p/3 X <- matrix(rnorm(N * p), N, p) beta <- rnorm(nzc) fx <- X[, seq(nzc)] %*% beta/3 hx <- exp(fx) ty <- rexp(N, hx) tcens <- rbinom(n = N, prob = 0.3, size = 1) # censoring indicator y <- cbind(time = ty, status = 1 - tcens) # y <- Surv(ty, 1 - tcens) with library(survival) X.bm <- as.big.matrix(X) fit <- biglasso(X.bm, y, family = "cox") plot(fit, main = "cox") ## Multiple responses linear regression set.seed(10101) n=300; p=300; m=5; s=10; b=1 x = matrix(rnorm(n * p), n, p) beta = matrix(seq(from=-b,to=b,length.out=s*m),s,m) y = x[,1:s] %*% beta + matrix(rnorm(n*m,0,1),n,m) x.bm = as.big.matrix(x) fit = biglasso(x.bm, y, family = "mgaussian") plot(fit, main = "mgaussian")
This function is intended for users who know exactly what they're doing and want complete control over the fitting process. It
does NOT add an intercept
does NOT standardize the design matrix
does NOT set up a path for lambda (the lasso tuning parameter) all of the above are critical steps in data analysis. However, a direct API has been provided for use in situations where the lasso fitting process is an internal component of a more complicated algorithm and standardization must be handled externally.
biglasso_fit( X, y, r, init = rep(0, ncol(X)), xtx, penalty = "lasso", lambda, alpha = 1, gamma, ncores = 1, max.iter = 1000, eps = 1e-05, dfmax = ncol(X) + 1, penalty.factor = rep(1, ncol(X)), warn = TRUE, output.time = FALSE, return.time = TRUE )
biglasso_fit( X, y, r, init = rep(0, ncol(X)), xtx, penalty = "lasso", lambda, alpha = 1, gamma, ncores = 1, max.iter = 1000, eps = 1e-05, dfmax = ncol(X) + 1, penalty.factor = rep(1, ncol(X)), warn = TRUE, output.time = FALSE, return.time = TRUE )
X |
The design matrix, without an intercept. It must be a
double type |
y |
The response vector |
r |
Residuals (length n vector) corresponding to |
init |
Initial values for beta. Default: zero (length p vector) |
xtx |
X scales: the jth element should equal |
penalty |
String specifying which penalty to use. Default is 'lasso', Other options are 'SCAD' and 'MCP' (the latter are non-convex) |
lambda |
A single value for the lasso tuning parameter. |
alpha |
The elastic-net mixing parameter that controls the relative contribution from the lasso (l1) and the ridge (l2) penalty. The penalty is defined as:
|
gamma |
Tuning parameter value for nonconvex penalty. Defaults are
3.7 for |
ncores |
The number of OpenMP threads used for parallel computing. |
max.iter |
Maximum number of iterations. Default is 1000. |
eps |
Convergence threshold for inner coordinate descent. The
algorithm iterates until the maximum change in the objective
after any coefficient update is less than |
dfmax |
Upper bound for the number of nonzero coefficients. Default is no upper bound. However, for large data sets, computational burden may be heavy for models with a large number of nonzero coefficients. |
penalty.factor |
A multiplicative factor for the penalty applied to
each coefficient. If supplied, |
warn |
Return warning messages for failures to converge and model saturation? Default is TRUE. |
output.time |
Whether to print out the start and end time of the model fitting. Default is FALSE. |
return.time |
Whether to return the computing time of the model fitting. Default is TRUE. |
Note:
Hybrid safe-strong rules are turned off for biglasso_fit()
, as these rely
on standardization
Currently, the function only works with linear regression
(family = 'gaussian'
).
An object with S3 class "biglasso"
with following variables.
beta |
The vector of estimated coefficients |
iter |
A vector of length |
resid |
Vector of residuals calculated from estimated coefficients. |
lambda |
The sequence of regularization parameter values in the path. |
alpha |
Same as in |
loss |
A vector containing either the residual sum of squares of the fitted model at each value of lambda. |
penalty.factor |
Same as in |
n |
The number of observations used in the model fitting. |
y |
The response vector used in the model fitting. |
Tabitha Peter and Patrick Breheny
data(Prostate) X <- cbind(1, Prostate$X) xtx <- apply(X, 2, crossprod)/nrow(X) y <- Prostate$y X.bm <- as.big.matrix(X) init <- rep(0, ncol(X)) fit <- biglasso_fit(X = X.bm, y = y, r=y, init = init, xtx = xtx, lambda = 0.1, penalty.factor=c(0, rep(1, ncol(X)-1)), max.iter = 10000) fit$beta fit <- biglasso_fit(X = X.bm, y = y, r=y, init = init, xtx = xtx, penalty='MCP', lambda = 0.1, penalty.factor=c(0, rep(1, ncol(X)-1)), max.iter = 10000) fit$beta
data(Prostate) X <- cbind(1, Prostate$X) xtx <- apply(X, 2, crossprod)/nrow(X) y <- Prostate$y X.bm <- as.big.matrix(X) init <- rep(0, ncol(X)) fit <- biglasso_fit(X = X.bm, y = y, r=y, init = init, xtx = xtx, lambda = 0.1, penalty.factor=c(0, rep(1, ncol(X)-1)), max.iter = 10000) fit$beta fit <- biglasso_fit(X = X.bm, y = y, r=y, init = init, xtx = xtx, penalty='MCP', lambda = 0.1, penalty.factor=c(0, rep(1, ncol(X)-1)), max.iter = 10000) fit$beta
This function is intended for users who know exactly what they're doing and want complete control over the fitting process. It
does NOT add an intercept
does NOT standardize the design matrix both of the above are critical steps in data analysis. However, a direct API has been provided for use in situations where the lasso fitting process is an internal component of a more complicated algorithm and standardization must be handled externally.
biglasso_path( X, y, r, init = rep(0, ncol(X)), xtx, penalty = "lasso", lambda, alpha = 1, gamma, ncores = 1, max.iter = 1000, eps = 1e-05, dfmax = ncol(X) + 1, penalty.factor = rep(1, ncol(X)), warn = TRUE, output.time = FALSE, return.time = TRUE )
biglasso_path( X, y, r, init = rep(0, ncol(X)), xtx, penalty = "lasso", lambda, alpha = 1, gamma, ncores = 1, max.iter = 1000, eps = 1e-05, dfmax = ncol(X) + 1, penalty.factor = rep(1, ncol(X)), warn = TRUE, output.time = FALSE, return.time = TRUE )
X |
The design matrix, without an intercept. It must be a
double type |
y |
The response vector |
r |
Residuals (length n vector) corresponding to |
init |
Initial values for beta. Default: zero (length p vector) |
xtx |
X scales: the jth element should equal |
penalty |
String specifying which penalty to use. Default is 'lasso', Other options are 'SCAD' and 'MCP' (the latter are non-convex) |
lambda |
A vector of numeric values the lasso tuning parameter. |
alpha |
The elastic-net mixing parameter that controls the relative contribution from the lasso (l1) and the ridge (l2) penalty. The penalty is defined as:
|
gamma |
Tuning parameter value for nonconvex penalty. Defaults are
3.7 for |
ncores |
The number of OpenMP threads used for parallel computing. |
max.iter |
Maximum number of iterations. Default is 1000. |
eps |
Convergence threshold for inner coordinate descent. The
algorithm iterates until the maximum change in the objective
after any coefficient update is less than |
dfmax |
Upper bound for the number of nonzero coefficients. Default is no upper bound. However, for large data sets, computational burden may be heavy for models with a large number of nonzero coefficients. |
penalty.factor |
A multiplicative factor for the penalty applied to
each coefficient. If supplied, |
warn |
Return warning messages for failures to converge and model saturation? Default is TRUE. |
output.time |
Whether to print out the start and end time of the model fitting. Default is FALSE. |
return.time |
Whether to return the computing time of the model fitting. Default is TRUE. |
biglasso_path()
works identically to biglasso_fit()
except it offers the
additional option of fitting models across a path of tuning parameter values.
Note:
Hybrid safe-strong rules are turned off for biglasso_fit()
, as these rely
on standardization
Currently, the function only works with linear regression
(family = 'gaussian'
).
An object with S3 class "biglasso"
with following variables.
beta |
A sparse matrix where rows are estimates a given coefficient across all values of lambda |
iter |
A vector of length |
resid |
Vector of residuals calculated from estimated coefficients. |
lambda |
The sequence of regularization parameter values in the path. |
alpha |
Same as in |
loss |
A vector containing either the residual sum of squares of the fitted model at each value of lambda. |
penalty.factor |
Same as in |
n |
The number of observations used in the model fitting. |
y |
The response vector used in the model fitting. |
Tabitha Peter and Patrick Breheny
data(Prostate) X <- cbind(1, Prostate$X) xtx <- apply(X, 2, crossprod)/nrow(X) y <- Prostate$y X.bm <- as.big.matrix(X) init <- rep(0, ncol(X)) fit <- biglasso_path(X = X.bm, y = y, r = y, init = init, xtx = xtx, lambda = c(0.5, 0.1, 0.05, 0.01, 0.001), penalty.factor=c(0, rep(1, ncol(X)-1)), max.iter=2000) fit$beta fit <- biglasso_path(X = X.bm, y = y, r = y, init = init, xtx = xtx, lambda = c(0.5, 0.1, 0.05, 0.01, 0.001), penalty='MCP', penalty.factor=c(0, rep(1, ncol(X)-1)), max.iter = 2000) fit$beta
data(Prostate) X <- cbind(1, Prostate$X) xtx <- apply(X, 2, crossprod)/nrow(X) y <- Prostate$y X.bm <- as.big.matrix(X) init <- rep(0, ncol(X)) fit <- biglasso_path(X = X.bm, y = y, r = y, init = init, xtx = xtx, lambda = c(0.5, 0.1, 0.05, 0.01, 0.001), penalty.factor=c(0, rep(1, ncol(X)-1)), max.iter=2000) fit$beta fit <- biglasso_path(X = X.bm, y = y, r = y, init = init, xtx = xtx, lambda = c(0.5, 0.1, 0.05, 0.01, 0.001), penalty='MCP', penalty.factor=c(0, rep(1, ncol(X)-1)), max.iter = 2000) fit$beta
The data file contains gene expression data of 62 samples (40 tumor samples, 22 normal samples) from colon-cancer patients analyzed with an Affymetrix oligonucleotide Hum6000 array.
A list of 2 variables included in colon
:
X
: a 62-by-2000 matrix that records the gene expression data.
Used as design matrix.
y
: a binary vector of length 62 recording the sample status: 1 =
tumor; 0 = normal. Used as response vector.
The raw data can be found on Bioconductor: https://bioconductor.org/packages/release/data/experiment/html/colonCA.html.
U. Alon et al. (1999): Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissue probed by oligonucleotide arrays. Proc. Natl. Acad. Sci. USA 96, 6745-6750. https://www.pnas.org/doi/abs/10.1073/pnas.96.12.6745.
data(colon) X <- colon$X y <- colon$y str(X) dim(X) X.bm <- as.big.matrix(X, backingfile = "") # convert to big.matrix object str(X.bm) dim(X.bm)
data(colon) X <- colon$X y <- colon$y str(X) dim(X) X.bm <- as.big.matrix(X, backingfile = "") # convert to big.matrix object str(X.bm) dim(X.bm)
Perform k-fold cross validation for penalized regression models over a grid of values for the regularization parameter lambda.
cv.biglasso( X, y, row.idx = 1:nrow(X), family = c("gaussian", "binomial", "cox", "mgaussian"), eval.metric = c("default", "MAPE", "auc", "class"), ncores = parallel::detectCores(), ..., nfolds = 5, seed, cv.ind, trace = FALSE, grouped = TRUE )
cv.biglasso( X, y, row.idx = 1:nrow(X), family = c("gaussian", "binomial", "cox", "mgaussian"), eval.metric = c("default", "MAPE", "auc", "class"), ncores = parallel::detectCores(), ..., nfolds = 5, seed, cv.ind, trace = FALSE, grouped = TRUE )
X |
The design matrix, without an intercept, as in |
y |
The response vector, as in |
row.idx |
The integer vector of row indices of |
family |
Either |
eval.metric |
The evaluation metric for the cross-validated error and
for choosing optimal |
ncores |
The number of cores to use for parallel execution of the
cross-validation folds, run on a cluster created by the |
... |
Additional arguments to |
nfolds |
The number of cross-validation folds. Default is 5. |
seed |
The seed of the random number generator in order to obtain reproducible results. |
cv.ind |
Which fold each observation belongs to. By default the
observations are randomly assigned by |
trace |
If set to TRUE, cv.biglasso will inform the user of its progress by announcing the beginning of each CV fold. Default is FALSE. |
grouped |
Whether to calculate CV standard error ( |
The function calls biglasso
nfolds
times, each time leaving
out 1/nfolds
of the data. The cross-validation error is based on the
residual sum of squares when family="gaussian"
and the binomial
deviance when family="binomial"
.
The S3 class object cv.biglasso
inherits class ncvreg::cv.ncvreg()
. So S3
functions such as "summary", "plot"
can be directly applied to the
cv.biglasso
object.
An object with S3 class "cv.biglasso"
which inherits from
class "cv.ncvreg"
. The following variables are contained in the
class (adopted from ncvreg::cv.ncvreg()
).
cve |
The error for each value of |
cvse |
The estimated standard error associated with each value of for |
lambda |
The sequence of regularization parameter values along which the cross-validation error was calculated. |
fit |
The fitted |
min |
The index of |
lambda.min |
The value of |
lambda.1se |
The largest value of |
null.dev |
The deviance for the intercept-only model. |
pe |
If |
cv.ind |
Same as above. |
Yaohui Zeng and Patrick Breheny
biglasso()
, plot.cv.biglasso()
, summary.cv.biglasso()
, setupX()
## Not run: ## cv.biglasso data(colon) X <- colon$X y <- colon$y X.bm <- as.big.matrix(X) ## logistic regression cvfit <- cv.biglasso(X.bm, y, family = 'binomial', seed = 1234, ncores = 2) par(mfrow = c(2, 2)) plot(cvfit, type = 'all') summary(cvfit) ## End(Not run)
## Not run: ## cv.biglasso data(colon) X <- colon$X y <- colon$y X.bm <- as.big.matrix(X) ## logistic regression cvfit <- cv.biglasso(X.bm, y, family = 'binomial', seed = 1234, ncores = 2) par(mfrow = c(2, 2)) plot(cvfit, type = 'all') summary(cvfit) ## End(Not run)
Produce a plot of the coefficient paths for a fitted biglasso()
object.
## S3 method for class 'biglasso' plot(x, alpha = 1, log.l = TRUE, ...)
## S3 method for class 'biglasso' plot(x, alpha = 1, log.l = TRUE, ...)
x |
Fitted |
alpha |
Controls alpha-blending, helpful when the number of covariates is large. Default is alpha=1. |
log.l |
Should horizontal axis be on the log scale? Default is TRUE. |
... |
Other graphical parameters to |
Yaohui Zeng and Patrick Breheny
## See examples in "biglasso"
## See examples in "biglasso"
Plot the cross-validation curve from a cv.biglasso()
object,
along with standard error bars.
## S3 method for class 'cv.biglasso' plot( x, log.l = TRUE, type = c("cve", "rsq", "scale", "snr", "pred", "all"), selected = TRUE, vertical.line = TRUE, col = "red", ... )
## S3 method for class 'cv.biglasso' plot( x, log.l = TRUE, type = c("cve", "rsq", "scale", "snr", "pred", "all"), selected = TRUE, vertical.line = TRUE, col = "red", ... )
x |
A |
log.l |
Should horizontal axis be on the log scale? Default is TRUE. |
type |
What to plot on the vertical axis. |
selected |
If |
vertical.line |
If |
col |
Controls the color of the dots (CV estimates). |
... |
Other graphical parameters to |
Error bars representing approximate 68\
along with the estimates at value of lambda
. For rsq
and
snr
, these confidence intervals are quite crude, especially near.
Yaohui Zeng and Patrick Breheny
## See examples in "cv.biglasso"
## See examples in "cv.biglasso"
Produce a plot of the coefficient paths for a fitted multiple responses
mbiglasso
object.
## S3 method for class 'mbiglasso' plot(x, alpha = 1, log.l = TRUE, norm.beta = TRUE, ...)
## S3 method for class 'mbiglasso' plot(x, alpha = 1, log.l = TRUE, norm.beta = TRUE, ...)
x |
Fitted |
alpha |
Controls alpha-blending, helpful when the number of covariates is large. Default is alpha=1. |
log.l |
Should horizontal axis be on the log scale? Default is TRUE. |
norm.beta |
Should the vertical axis be the l2 norm of coefficients for each variable? Default is TRUE. If False, the vertical axis is the coefficients. |
... |
Other graphical parameters to |
Chuyi Wang
## See examples in "biglasso"
## See examples in "biglasso"
biglasso
objectExtract predictions (fitted reponse, coefficients, etc.) from a
fitted biglasso()
object.
## S3 method for class 'biglasso' predict( object, X, row.idx = 1:nrow(X), type = c("link", "response", "class", "coefficients", "vars", "nvars"), lambda, which = 1:length(object$lambda), ... ) ## S3 method for class 'mbiglasso' predict( object, X, row.idx = 1:nrow(X), type = c("link", "response", "coefficients", "vars", "nvars"), lambda, which = 1:length(object$lambda), k = 1, ... ) ## S3 method for class 'biglasso' coef(object, lambda, which = 1:length(object$lambda), drop = TRUE, ...) ## S3 method for class 'mbiglasso' coef(object, lambda, which = 1:length(object$lambda), intercept = TRUE, ...)
## S3 method for class 'biglasso' predict( object, X, row.idx = 1:nrow(X), type = c("link", "response", "class", "coefficients", "vars", "nvars"), lambda, which = 1:length(object$lambda), ... ) ## S3 method for class 'mbiglasso' predict( object, X, row.idx = 1:nrow(X), type = c("link", "response", "coefficients", "vars", "nvars"), lambda, which = 1:length(object$lambda), k = 1, ... ) ## S3 method for class 'biglasso' coef(object, lambda, which = 1:length(object$lambda), drop = TRUE, ...) ## S3 method for class 'mbiglasso' coef(object, lambda, which = 1:length(object$lambda), intercept = TRUE, ...)
object |
A fitted |
X |
Matrix of values at which predictions are to be made. It must be a
|
row.idx |
Similar to that in |
type |
Type of prediction:
|
lambda |
Values of the regularization parameter |
which |
Indices of the penalty parameter |
... |
Not used. |
k |
Index of the response to predict in multiple responses regression (
|
drop |
If coefficients for a single value of |
intercept |
Whether the intercept should be included in the returned
coefficients. For |
The object returned depends on type
.
Yaohui Zeng and Patrick Breheny
## Logistic regression data(colon) X <- colon$X y <- colon$y X.bm <- as.big.matrix(X, backingfile = "") fit <- biglasso(X.bm, y, penalty = 'lasso', family = "binomial") coef <- coef(fit, lambda=0.05, drop = TRUE) coef[which(coef != 0)] predict(fit, X.bm, type="link", lambda=0.05)[1:10] predict(fit, X.bm, type="response", lambda=0.05)[1:10] predict(fit, X.bm, type="class", lambda=0.1)[1:10] predict(fit, type="vars", lambda=c(0.05, 0.1)) predict(fit, type="nvars", lambda=c(0.05, 0.1))
## Logistic regression data(colon) X <- colon$X y <- colon$y X.bm <- as.big.matrix(X, backingfile = "") fit <- biglasso(X.bm, y, penalty = 'lasso', family = "binomial") coef <- coef(fit, lambda=0.05, drop = TRUE) coef[which(coef != 0)] predict(fit, X.bm, type="link", lambda=0.05)[1:10] predict(fit, X.bm, type="response", lambda=0.05)[1:10] predict(fit, X.bm, type="class", lambda=0.1)[1:10] predict(fit, type="vars", lambda=c(0.05, 0.1)) predict(fit, type="nvars", lambda=c(0.05, 0.1))
cv.biglasso()
objectExtract predictions from a fitted cv.biglasso()
object.
## S3 method for class 'cv.biglasso' predict( object, X, row.idx = 1:nrow(X), type = c("link", "response", "class", "coefficients", "vars", "nvars"), lambda = object$lambda.min, which = object$min, ... ) ## S3 method for class 'cv.biglasso' coef(object, lambda = object$lambda.min, which = object$min, ...)
## S3 method for class 'cv.biglasso' predict( object, X, row.idx = 1:nrow(X), type = c("link", "response", "class", "coefficients", "vars", "nvars"), lambda = object$lambda.min, which = object$min, ... ) ## S3 method for class 'cv.biglasso' coef(object, lambda = object$lambda.min, which = object$min, ...)
object |
A fitted |
X |
Matrix of values at which predictions are to be made. It must be a
|
row.idx |
Similar to that in |
type |
Type of prediction:
|
lambda |
Values of the regularization parameter |
which |
Indices of the penalty parameter |
... |
Not used. |
The object returned depends on type
.
Yaohui Zeng and Patrick Breheny
## Not run: ## predict.cv.biglasso data(colon) X <- colon$X y <- colon$y X.bm <- as.big.matrix(X, backingfile = "") fit <- biglasso(X.bm, y, penalty = 'lasso', family = "binomial") cvfit <- cv.biglasso(X.bm, y, penalty = 'lasso', family = "binomial", seed = 1234, ncores = 2) coef <- coef(cvfit) coef[which(coef != 0)] predict(cvfit, X.bm, type = "response") predict(cvfit, X.bm, type = "link") predict(cvfit, X.bm, type = "class") predict(cvfit, X.bm, lambda = "lambda.1se") ## End(Not run)
## Not run: ## predict.cv.biglasso data(colon) X <- colon$X y <- colon$y X.bm <- as.big.matrix(X, backingfile = "") fit <- biglasso(X.bm, y, penalty = 'lasso', family = "binomial") cvfit <- cv.biglasso(X.bm, y, penalty = 'lasso', family = "binomial", seed = 1234, ncores = 2) coef <- coef(cvfit) coef[which(coef != 0)] predict(cvfit, X.bm, type = "response") predict(cvfit, X.bm, type = "link") predict(cvfit, X.bm, type = "class") predict(cvfit, X.bm, lambda = "lambda.1se") ## End(Not run)
Set up the design matrix X as a big.matrix
object based on external
massive data file stored on disk that cannot be fullly loaded into memory.
The data file must be a well-formated ASCII-file, and contains only one
single type. Current version only supports double
type. Other
restrictions about the data file are described in
biglasso-package. This function reads the massive data, and
creates a big.matrix
object. By default, the resulting
big.matrix
is file-backed, and can be shared across processors or
nodes of a cluster.
setupX( filename, dir = getwd(), sep = ",", backingfile = paste0(unlist(strsplit(filename, split = "\\."))[1], ".bin"), descriptorfile = paste0(unlist(strsplit(filename, split = "\\."))[1], ".desc"), type = "double", ... )
setupX( filename, dir = getwd(), sep = ",", backingfile = paste0(unlist(strsplit(filename, split = "\\."))[1], ".bin"), descriptorfile = paste0(unlist(strsplit(filename, split = "\\."))[1], ".desc"), type = "double", ... )
filename |
The name of the data file. For example, "dat.txt". |
dir |
The directory used to store the binary and descriptor files
associated with the |
sep |
The field separator character. For example, "," for comma-delimited files (the default); "\t" for tab-delimited files. |
backingfile |
The binary file associated with the file-backed
|
descriptorfile |
The descriptor file used for the description of the
file-backed |
type |
The data type. Only "double" is supported for now. |
... |
Additional arguments that can be passed into function
|
For a data set, this function needs to be called only one time to set up the
big.matrix
object with two backing files (.bin, .desc) created in
current working directory. Once set up, the data can be "loaded" into any
(new) R session by calling attach.big.matrix(discriptorfile)
.
This function is a simple wrapper of bigmemory::read.big.matrix()
. See
bigmemory for more
details.
A big.matrix
object corresponding to a file-backed
bigmemory::big.matrix()
. It's ready to be used as the design matrix X
in
biglasso()
and cv.biglasso()
.
Yaohui Zeng and Patrick Breheny
biglasso()
, cv.ncvreg()
, biglasso-package
## see the example in "biglasso-package"
## see the example in "biglasso-package"
Summary method for cv.biglasso
objects.
## S3 method for class 'cv.biglasso' summary(object, ...) ## S3 method for class 'summary.cv.biglasso' print(x, digits, ...)
## S3 method for class 'cv.biglasso' summary(object, ...) ## S3 method for class 'summary.cv.biglasso' print(x, digits, ...)
object |
A |
... |
Further arguments passed to or from other methods. |
x |
A |
digits |
Number of digits past the decimal point to print out. Can be a vector specifying different display digits for each of the five non-integer printed values. |
summary.cv.biglasso
produces an object with S3 class
"summary.cv.biglasso"
. The class has its own print method and contains
the following list elements:
penalty |
The penalty used by |
model |
Either |
n |
Number of observations |
p |
Number of regression coefficients (not including the intercept). |
min |
The index of |
lambda |
The sequence of |
cve |
Cross-validation error (deviance). |
r.squared |
Proportion of variance explained by the model, as estimated by cross-validation. |
snr |
Signal to noise ratio, as estimated by cross-validation. |
sigma |
For linear regression models, the scale parameter estimate. |
pe |
For logistic regression models, the prediction error (misclassification error). |
Yaohui Zeng and Patrick Breheny
biglasso()
, cv.biglasso()
, plot.cv.biglasso()
, biglasso-package
## See examples in "cv.biglasso" and "biglasso-package"
## See examples in "cv.biglasso" and "biglasso-package"