---
title: "biglasso"
author: "Yaohui Zeng, Chuyi Wang, Patrick Breheny"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{biglasso}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 6,
  fig.height = 4
)
```

# Small Data

## Linear regression

```{r}
library(biglasso)
data(colon)
X <- colon$X
y <- colon$y
dim(X)
```

```{r}
X[1:5, 1:5]
```

```{r}
## convert X to a big.matrix object
## X.bm is a pointer to the data matrix
X.bm <- as.big.matrix(X)
str(X.bm)
```

```{r}
dim(X.bm)
```

```{r}
X.bm[1:5, 1:5]
## same results as X[1:5, 1:5]
```

```{r}
## fit entire solution path, using our newly proposed "Adaptive" screening rule (default)
fit <- biglasso(X.bm, y)
plot(fit)
```

## Cross-Validation

```{r}
## 10-fold cross-validation in parallel
## (falls back to 2 cores if 4 are not available)
cvfit <- tryCatch(
  {
    cv.biglasso(X.bm, y, seed = 1234, nfolds = 10, ncores = 4)
  },
  error = function(cond) {
    cv.biglasso(X.bm, y, seed = 1234, nfolds = 10, ncores = 2)
  }
)
```

After cross-validation, we can do a few things:

- Plot the cross-validation results:

```{r}
par(mfrow = c(2, 2), mar = c(3.5, 3.5, 3, 1), mgp = c(2.5, 0.5, 0))
plot(cvfit, type = "all")
```

- Summarize the CV object:

```{r}
summary(cvfit)
```

- Extract the non-zero coefficients at the optimal $\lambda$ value:

```{r}
coef(cvfit)[which(coef(cvfit) != 0),]
```

## Logistic Regression

```{r}
data(Heart)
X <- Heart$X
y <- Heart$y
X.bm <- as.big.matrix(X)
fit <- biglasso(X.bm, y, family = "binomial")
plot(fit)
```

## Cox Regression

```{r}
library(survival)
X <- heart[,4:7]
y <- Surv(heart$stop - heart$start, heart$event)
X.bm <- as.big.matrix(X)
fit <- biglasso(X.bm, y, family = "cox")
plot(fit)
```

## Multiple response Linear Regression (multi-task learning)

```{r}
set.seed(10101)
n <- 300; p <- 300; m <- 5; s <- 10; b <- 1
x <- matrix(rnorm(n * p), n, p)
beta <- matrix(seq(from = -b, to = b, length.out = s * m), s, m)
y <- x[, 1:s] %*% beta + matrix(rnorm(n * m, 0, 1), n, m)
x.bm <- as.big.matrix(x)
fit <- biglasso(x.bm, y, family = "mgaussian")
plot(fit)
```

# Big Data

When the raw data file is very large, it is better to convert it into a file-backed `big.matrix`, which uses a file cache on disk. The function `setupX` reads the raw data file and creates a backing file (.bin) and a descriptor file (.desc) for the data matrix:

```{r}
## The data has 1000 observations, 5000 features, and 10 non-zero coefficients.
## This is not actually very big, but vignettes in R are supposed to render
## quickly. Much larger data can be handled in the same way.
if(!file.exists('BigX.bin')) {
  X <- matrix(rnorm(1000 * 5000), 1000, 5000)
  beta <- c(-5:5)
  y <- as.numeric(X[,1:11] %*% beta)
  write.csv(X, "BigX.csv", row.names = FALSE)
  write.csv(y, "y.csv", row.names = FALSE)
  ## Pretend that the data in "BigX.csv" is too large to fit into memory
  X.bm <- setupX("BigX.csv", header = TRUE)
}
```

Note that this conversion only needs to be run once. Afterwards, the data can be retrieved in any new R session by attaching its descriptor file (.desc):

```{r, warning=FALSE}
rm(list = c("X", "X.bm", "y")) # Pretend starting a new session
X.bm <- attach.big.matrix("BigX.desc")
y <- read.csv("y.csv")[,1]
```

This is very appealing for big data analysis because we do not need to read the raw data again in each R session, which would be very time-consuming.
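
As a small, self-contained illustration of the round trip described above, the sketch below writes a tiny matrix to a file-backed `big.matrix` in a temporary directory and then re-attaches it from its descriptor file. It calls `bigmemory` directly rather than `setupX`, and the object names (`demo.mat`, `demo.fb`, `demo.again`) and file names (`demo.bin`, `demo.desc`) are hypothetical, chosen only for this example.

```{r}
## Illustrative sketch only: all names and files below are hypothetical
library(bigmemory)  # already attached by library(biglasso)

## Use a fresh temporary directory so no existing backing file is reused
demo.dir <- tempfile("bm-demo")
dir.create(demo.dir)

demo.mat <- matrix(as.numeric(1:20), nrow = 5, ncol = 4)

## Write the matrix to disk: creates demo.bin (data) and demo.desc (descriptor)
demo.fb <- as.big.matrix(demo.mat,
                         backingpath = demo.dir,
                         backingfile = "demo.bin",
                         descriptorfile = "demo.desc")
rm(demo.fb)  # drop the R object; the backing files stay on disk

## Re-attach from the descriptor file, as a new session would
demo.again <- attach.big.matrix(file.path(demo.dir, "demo.desc"))
is.filebacked(demo.again)
all.equal(demo.again[, ], demo.mat)
```

Only the path to the descriptor file is needed to re-attach the matrix. The values themselves remain in the backing file on disk.
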
The code below again fits a lasso-penalized linear model and runs 10-fold cross-validation:

```{r}
system.time({fit <- biglasso(X.bm, y)})
```

```{r}
plot(fit)
```

```{r}
# 10-fold cross-validation in parallel
# (falls back to 2 cores if 4 are not available)
tryCatch(
  {
    system.time({cvfit <- cv.biglasso(X.bm, y, seed = 1234, ncores = 4, nfolds = 10)})
  },
  error = function(cond) {
    # `<<-` is needed here: a plain `<-` would only assign inside the handler
    system.time({cvfit <<- cv.biglasso(X.bm, y, seed = 1234, ncores = 2, nfolds = 10)})
  }
)
```

```{r}
par(mfrow = c(2, 2), mar = c(3.5, 3.5, 3, 1), mgp = c(2.5, 0.5, 0))
plot(cvfit, type = "all")
```

# Links

* [biglasso on CRAN](https://cran.r-project.org/package=biglasso)
* [biglasso on GitHub](https://github.com/pbreheny/biglasso)
* [biglasso website](https://pbreheny.github.io/biglasso/index.html)
* [big.matrix manipulation](https://cran.r-project.org/package=bigmemory/index.html)