---
title: "Best Orthogonalized Subset Selection (BOSS)"
author: "Sen Tian"
date: "`r Sys.Date()`"
#output: rmarkdown::html_vignette
output: pdf_document
vignette: >
  %\VignetteIndexEntry{Best Orthogonalized Subset Selection (BOSS)}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
references:
- id: Tian2021
  title: On the Use of Information Criteria for Subset Selection in Least Squares Regression
  author:
  - family: Tian
    given: Sen
  - family: Hurvich
    given: Clifford M.
  - family: Simonoff
    given: Jeffrey S.
  URL: 'https://arxiv.org/abs/1911.10191'
  issued:
    year: 2021
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

## Installation

We maintain a GitHub page for the package and keep the most up-to-date version there. To install it, simply run the following commands in the console:
```{r, eval=FALSE}
library(devtools)
install_github(repo="sentian/BOSSreg", subdir="r-package")
```

A stable version can be installed from CRAN using
```{r, eval=FALSE}
install.packages("BOSSreg", repos = "http://cran.us.r-project.org")
```

## Introduction

BOSS is a least squares-based subset selection method. It takes the following steps:

* order the predictors based on their partial correlations with the response;

* perform best subset regression upon the orthogonal basis of the ordered predictors;

* transform the coefficients back to the original space;

* choose the optimal solution using the selection rule AICc-hdf.

Here hdf is a heuristic degrees of freedom for BOSS that can be plugged into an information criterion such as AICc, yielding the selection rule AICc-hdf, which is defined as
\begin{equation*}
\text{AICc-hdf} = n \log\left(\frac{\text{RSS}}{n}\right) + n \frac{n+\text{hdf}}{n-\text{hdf}-2}.
\end{equation*}
More details can be found in @Tian2021.

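To make the definition concrete, here is a minimal sketch of AICc-hdf written out as an **R** function. It is not part of the package interface (the package computes the criterion internally); `rss` and `hdf` stand for the residual sum of squares and the heuristic degrees of freedom of a candidate subset, and `n` is the sample size.
```{r}
# A sketch of the AICc-hdf criterion; rss, hdf and n are assumed to be
# the residual sum of squares, heuristic df and sample size, respectively
aicc_hdf <- function(rss, hdf, n) {
  n * log(rss / n) + n * (n + hdf) / (n - hdf - 2)
}
```
Smaller values of the criterion are preferred, and it is evaluated along the entire BOSS solution path, giving one value per subset size.
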
This vignette is structured as follows. We start by simulating a dataset. We then introduce the components, functionality and basic usage of the package. This is followed by a discussion contrasting BOSS and forward stepwise regression (FS). Finally, we study real data examples and compare BOSS with some popular regularization methods. Note that this vignette is based on **R-3.6.1**. Slightly different results may be obtained using pre-3.6.0 versions of **R**, since the default underlying random number generator was changed in version 3.6.0.

## Simulated datasets

The data generating mechanism is $y=X\beta+\epsilon$. We consider a sparse model where only a few predictors matter, with a high signal-to-noise ratio. The detailed parameters are given as follows:
```{r}
n = 200     # Number of observations
p = 14      # Number of predictors
p0 = 6      # Number of active predictors (beta_j != 0)
rho = 0.9   # Correlation between predictors
nrep = 1000 # Number of replications of y to be generated
SNR = 7     # Signal-to-noise ratio
seed = 65   # The seed for reproducibility
```

We make the predictors with $\beta_j \ne 0$ pairwise correlated with opposite effects. We generate $1000$ replicated datasets where the response $y$ is generated with fixed $X$. The columns of $X$ and $y$ are constructed to have zero mean, so we can exclude the intercept term from model fitting.

```{r}
library(MASS)
# Function to generate the data
# Columns of X have mean 0 and norm 1, y has mean 0
simu.data <- function(n, p, p0, rho, nrep, SNR, seed){
  # True beta
  beta = c(rep(c(1,-1), p0/2), rep(0, p-p0))
  names(beta) = paste0('X', seq(1,p))
  # Covariance matrix
  covmatrix = matrix(0, nrow=p, ncol=p)
  diag(covmatrix) = 1
  for(i in 1:(p0/2)){
    covmatrix[2*i-1, 2*i] = covmatrix[2*i, 2*i-1] = rho
  }
  # Generate the predictors given the correlation structure
  set.seed(seed)
  x = mvrnorm(n, mu=rep(0,p), Sigma=covmatrix)
  x = scale(x, center=TRUE, scale=FALSE)
  colnorm = apply(x, 2, function(m){sqrt(sum(m^2))})
  x = scale(x, center=FALSE, scale=colnorm) # standardization
  # Sigma calculated based on SNR
  sd = sqrt(t(beta/colnorm) %*% covmatrix %*% (beta/colnorm) / SNR)
  mu = x %*% beta
  # Generate replications of y by fixing X
  y = matrix(rep(mu, each=nrep), ncol=nrep, byrow=TRUE) +
    scale(matrix(rnorm(n*nrep, mean=0, sd=sd), nrow=n, ncol=nrep), center=TRUE, scale=FALSE)
  return(list(x=x, y=y, beta=beta, sigma=sd))
}
dataset = simu.data(n, p, p0, rho, nrep, SNR, seed)
x = dataset$x
y = dataset$y
beta = dataset$beta
mu = x %*% beta
sigma = dataset$sigma
```

The first $p_0=6$ predictors are active with $\beta_j \ne 0$.
```{r}
print(beta)
```

## An illustration of the package

Fitting the model is simple.
```{r}
library(BOSSreg)
# Choose a single replication as illustration
rep = seed
# Fit the model
boss_model = boss(x, y[,rep], intercept = FALSE)
```

The 'boss' object contains estimated coefficient vectors for the entire solution paths of both BOSS and FS.
```{r}
betahat_boss = boss_model$beta_boss
betahat_fs = boss_model$beta_fs
print(dim(betahat_boss))
```

By default, it also provides the hdf for BOSS and multiple information criteria.
```{r, fig.width=3, fig.height=3, fig.show='hold'}
# The heuristic degrees of freedom
plot(0:p, boss_model$hdf, main='hdf', ylab='', xlab='subset size', type='b')
abline(0, 1, lty=2)
# AICc-hdf (scaled by 1/n, and up to a constant)
plot(0:p, boss_model$IC_boss$aicc, main='AICc-hdf', ylab='', xlab='subset size', type='b')
```

The optimal estimated coefficient vector and fitted mean vector can be obtained as follows.
```{r}
# The default is chosen by AICc
betahat_aicc = coef(boss_model)
muhat_aicc = predict(boss_model, newx=x)
# Use Cp rather than AICc
betahat_cp = coef(boss_model, ic='cp')
muhat_cp = predict(boss_model, newx=x, ic='cp')
```

In addition to information criteria, K-fold cross-validation (CV) with multiple replications can be used as a selection rule, with 10-fold CV and a single replication being the default choice.
```{r}
# The default is 10-fold CV with 1 replication
set.seed(seed)
boss_cv_model = cv.boss(x, y[,rep], intercept=FALSE)
# Coefficient vector selected by minimizing CV error
betahat_cv = coef(boss_cv_model)
# Fitted values
muhat_cv = predict(boss_cv_model, newx=x)
```

Calling 'cv.boss' runs CV for FS as well.
```{r}
# Coefficient vector for FS selected by CV
betahat_fs_cv = coef(boss_cv_model, method='fs')
# Fitted values
muhat_fs_cv = predict(boss_cv_model, newx=x, method='fs')
```

Here is a comparison of the coefficient vectors selected using different selection rules. The first three columns are for BOSS while the last column is for FS.
```{r}
tmp = cbind(betahat_aicc, betahat_cp, betahat_cv, betahat_fs_cv)
dimnames(tmp) = list(dimnames(tmp)[[1]], c('BOSS AICc', 'BOSS Cp', 'BOSS CV', 'FS CV'))
print(tmp)
```

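To connect the AICc-hdf formula from the Introduction with these objects, the criterion can also be recomputed by hand from the solution path. The following is a minimal sketch, reusing the `aicc_hdf` helper sketched in the Introduction; it assumes that the columns of `betahat_boss` and the entries of `boss_model$hdf` and `boss_model$IC_boss$aicc` are all ordered by subset size $0,\dots,p$, as in the plots above. Since the stored criterion may be scaled and shifted by constants, we only compare the locations of the minima.
```{r}
# Residual sum of squares along the BOSS solution path
rss_path = Matrix::colSums((y[,rep] - x %*% betahat_boss)^2)
# AICc-hdf computed by hand, using the sketch from the Introduction
aicc_manual = aicc_hdf(rss_path, boss_model$hdf, n)
# The subset size minimizing the hand-computed criterion should typically
# agree with the one minimizing the criterion stored in the 'boss' object
c(manual = which.min(aicc_manual), stored = which.min(boss_model$IC_boss$aicc)) - 1
```
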
## Comparing the solutions of BOSS and FS at every subset size

We see that FS gives a denser solution than BOSS in this case. Under the specific design of the true model, the true active predictors ($X_1,\cdots,X_6$) are pairwise correlated with opposite effects. Each such pair (e.g. $(X_1,X_2)$) jointly leads to a high $R^2$, but each predictor on its own contributes little. As a result, FS can have trouble bringing the true active predictors into the model in the early steps. For example, as indicated below, the inactive predictor $X_9$ joins in the first step. By contrast, BOSS uses the same predictor ordering as FS, but performs best subset regression on the orthogonal basis of the ordered predictors, which gives it the chance to re-evaluate (or effectively re-order) the predictors.
```{r}
# X9 joins first
print(boss_model$steps_x)
```

Let's set aside the selection rule for now, and compare the solutions of the two methods at every subset size. The subset size is the number of predictors $k$ for FS, and the number of orthogonalized predictors included for BOSS. We calculate the average RMSE at each subset size based on $1000$ replications. The RMSE is defined as
\begin{equation*}
\text{RMSE} = \sqrt{\frac{1}{n} \lVert \hat{\mu} - X\beta \rVert_2^2}.
\end{equation*}

```{r, eval=FALSE}
# Function to calculate RMSE
calc.rmse <- function(muhat){
  sqrt( Matrix::colSums(sweep(muhat, 1, mu)^2) / n )
}
rmse_solutionpath = list(BOSS=list(), FS=list())
for(rep in 1:nrep){
  boss_model = boss(x, y[,rep], intercept=FALSE)
  # RMSE along the solution path
  rmse_solutionpath[['BOSS']][[rep]] = calc.rmse(x %*% boss_model$beta_boss)
  rmse_solutionpath[['FS']][[rep]] = calc.rmse(x %*% boss_model$beta_fs)
}
# saveRDS(rmse_solutionpath, 'vignettes/rmse_solutionpath.rds')
```

```{r, include=FALSE}
rmse_solutionpath = readRDS(gzcon(url('https://raw.githubusercontent.com/sentian/BOSSreg/master/r-package/vignettes/rmse_solutionpath.rds')))
```

BOSS clearly provides a better solution path than FS for subset sizes smaller than $8$.
```{r, fig.width=5, fig.height=4}
# Average RMSE over replications
rmse_avg = lapply(rmse_solutionpath, function(xx){colMeans(do.call(rbind, xx))})
plot(0:p, rmse_avg$FS, col='blue', pch=1, main='Average RMSE along the solution path',
     ylab='RMSE', xlab='Subset size')
points(0:p, rmse_avg$BOSS, col='red', pch=3)
legend('topright', legend = c('FS', 'BOSS'), col=c('blue', 'red'), pch=c(1,3))
```

Next, we bring back the selection rules and compare their performances. The selection rule is AICc-hdf for BOSS and 10-fold CV for FS. BOSS shows better predictive performance, and it provides sparser solutions than FS does.
```{r, eval=FALSE}
rmse = nvar = list(BOSS=c(), FS=c())
set.seed(seed)
for(rep in 1:nrep){
  boss_cv_model = cv.boss(x, y[,rep], intercept=FALSE)
  # RMSE for the optimal subset selected via a selection rule
  rmse[['BOSS']][rep] = calc.rmse(predict(boss_cv_model$boss, newx = x)) # AICc
  rmse[['FS']][rep] = calc.rmse(predict(boss_cv_model, newx = x, method = 'fs')) # CV
  # Number of variables
  nvar[['BOSS']][rep] = sum(coef(boss_cv_model$boss)!=0)
  nvar[['FS']][rep] = sum(coef(boss_cv_model, method='fs')!=0)
}
# saveRDS(list(rmse=rmse, nvar=nvar), '/vignettes/boss_fs.rds')
```

```{r, include=FALSE}
tmp = readRDS(gzcon(url('https://raw.githubusercontent.com/sentian/BOSSreg/master/r-package/vignettes/boss_fs.rds')))
rmse = tmp$rmse
nvar = tmp$nvar
```

```{r, fig.width=3, fig.height=3, fig.show='hold'}
# Make the plots
boxplot(rmse, outline=FALSE, main='RMSE')
boxplot(nvar, outline=FALSE, main='Number of predictors')
```

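As a compact numeric summary to accompany the boxplots (a small sketch, assuming the `rmse` and `nvar` lists loaded above), we can average the two metrics over the $1000$ replications.
```{r}
# Average RMSE and average number of selected predictors over the replications
sapply(rmse, mean)
sapply(nvar, mean)
```
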
## Real data examples

We compare the performance of BOSS with FS and some popular regularization methods on several real datasets. We consider four datasets from the [StatLib library](http://lib.stat.cmu.edu/datasets/): 'boston housing', 'hitters', 'auto' and 'college'. An intercept term is included in all of the procedures. We present the results in this section and provide the code in the Appendix at the end of this document.

The selection rule is AICc for BOSS and the LASSO, and 10-fold CV for FS and SparseNet. We use the **R** packages *glmnet* and *sparsenet* to fit the LASSO and SparseNet, respectively. We see that BOSS has the minimum RMSE for the 'hitters' and 'auto' datasets, while the LASSO has the minimum RMSE for the 'boston housing' and 'college' datasets. Thanks to an efficient implementation of cyclic coordinate descent, *glmnet* provides an extremely fast LASSO solution. BOSS is also relatively computationally efficient, and is much faster than the remaining methods.

```{r, include=FALSE}
library(ISLR)
dataset = list()
# Boston Housing data
tmp = Boston
tmp = na.omit(tmp)
tmp$chas = as.factor(tmp$chas)
dataset$boston$x = data.matrix(tmp[,!names(tmp) %in% 'medv'])
dataset$boston$y = tmp$medv
# MLB hitters salary
tmp = Hitters
tmp = na.omit(tmp)
tmp[,c('League', 'Division', 'NewLeague')] = lapply(tmp[,c('League', 'Division', 'NewLeague')], as.factor)
dataset$hitters$x = data.matrix(tmp[,!(names(tmp) %in% c('Salary'))])
dataset$hitters$y = tmp$Salary
# College data
tmp = College
tmp$Private = as.factor(tmp$Private)
dataset$college$x = data.matrix(tmp[,!(names(tmp) %in% c('Outstate'))])
dataset$college$y = tmp$Outstate
# Auto data
tmp = Auto
dataset$auto$x = data.matrix(tmp[,!(names(tmp) %in% c('mpg','name','origin'))])
dataset$auto$y = tmp$mpg
```

```{r, echo=FALSE}
library(knitr)
library(kableExtra)
# Read the results
result = readRDS(gzcon(url('https://raw.githubusercontent.com/sentian/BOSSreg/master/r-package/vignettes/realdata.rds')))
# Function to extract the results
tmp_function <- function(method){
  unlist(lapply(result, function(xx){
    unlist(lapply(xx, function(yy){
      round(mean(yy[[method]]), 3)
    }))
  }))
}
tmp = data.frame(Dataset = rep(names(result), each=3),
                 n_p = rep(unlist(lapply(dataset, function(xx){paste(dim(xx$x), collapse = ', ')})), each=3),
                 Metrics = rep(c('RMSE', '# predictors', 'running time (s)'), length(result)),
                 BOSS = tmp_function('boss'),
                 FS = tmp_function('fs'),
                 LASSO = tmp_function('lasso'),
                 SparseNet = tmp_function('sparsenet'))
rownames(tmp) = NULL
colnames(tmp)[2] = 'n, p'
kable(tmp, align = "c") %>%
  kable_styling(full_width = F) %>%
  column_spec(1, bold = T) %>%
  collapse_rows(columns = 1:2, valign = "middle")
```

## References

\newpage

## Appendix: Code for the real data examples

The following code is used to pre-process the datasets. We remove all of the entries with 'NA' values. We recode binary categorical variables as numeric variables and remove categorical variables with more than two categories.
```{r, eval=FALSE}
library(ISLR)
dataset = list()
# Boston Housing data
tmp = Boston
tmp = na.omit(tmp)
tmp$chas = as.factor(tmp$chas)
dataset$boston$x = data.matrix(tmp[,!names(tmp) %in% 'medv'])
dataset$boston$y = tmp$medv
# MLB hitters salary
tmp = Hitters
tmp = na.omit(tmp)
tmp[,c('League', 'Division', 'NewLeague')] = lapply(tmp[,c('League', 'Division', 'NewLeague')], as.factor)
dataset$hitters$x = data.matrix(tmp[,!(names(tmp) %in% c('Salary'))])
dataset$hitters$y = tmp$Salary
# College data
tmp = College
tmp$Private = as.factor(tmp$Private)
dataset$college$x = data.matrix(tmp[,!(names(tmp) %in% c('Outstate'))])
dataset$college$y = tmp$Outstate
# Auto data
tmp = Auto
dataset$auto$x = data.matrix(tmp[,!(names(tmp) %in% c('mpg','name','origin'))])
dataset$auto$y = tmp$mpg
```

The following code calculates the leave-one-out error, the number of predictors and the timing for each fitting procedure. Note that it took roughly 20 minutes to run on a single core of a local machine with a 2.7 GHz i7 processor and 16 GB RAM.
```{r, eval=FALSE}
library(glmnet)
library(sparsenet)
rmse <- function(y_hat, y){
  sqrt(sum( (y_hat - y)^2 / length(y)) )
}
rdresult <- function(x, y, nrep, seed){
  p = dim(x)[2]
  allmethods = c('lasso','sparsenet','boss','fs')
  error = numvar = time = replicate(length(allmethods), rep(NA,nrep), simplify=F)
  names(error) = names(numvar) = names(time) = allmethods

  set.seed(seed)
  for(i in 1:nrep){
    index = 1:nrow(x)
    index = index[-i]
    x.train = x[index, , drop=FALSE]
    y.train = y[index]
    x.test = x[-index, , drop=FALSE]
    x.test.withint = cbind(rep(1,nrow(x.test)), x.test)
    y.test = y[-index]

    # BOSS
    ptm = proc.time()
    boss_model = boss(x.train, y.train, intercept = TRUE)
    time_tmp = proc.time() - ptm
    boss_pred = as.numeric( predict(boss_model, newx=x.test) )
    error$boss[i] = rmse(boss_pred, y.test)
    numvar$boss[i] = sum(coef(boss_model)!=0)
    time$boss[i] = time_tmp[3]

    # FS
    ptm = proc.time()
    boss_cv_model = cv.boss(x.train, y.train)
    time_tmp = proc.time() - ptm
    fs_pred = as.numeric( predict(boss_cv_model, newx=x.test, method='fs') )
    error$fs[i] = rmse(fs_pred, y.test)
    numvar$fs[i] = sum(coef(boss_cv_model, method='fs')!=0)
    time$fs[i] = time_tmp[3]

    # LASSO
    ptm = proc.time()
    lasso_model = glmnet(x.train, y.train, intercept=TRUE)
    lasso_aicc = as.numeric(calc.ic(predict(lasso_model, newx=x.train), y.train,
                                    ic='aicc', df=lasso_model$df+1))
    lasso_pred = predict(lasso_model, newx=x.test, s=lasso_model$lambda[which.min(lasso_aicc)])
    time_tmp = proc.time() - ptm
    error$lasso[i] = rmse(lasso_pred, y.test)
    numvar$lasso[i] = sum(coef(lasso_model, s=lasso_model$lambda[which.min(lasso_aicc)])!=0)
    time$lasso[i] = time_tmp[3]

    # SparseNet
    ptm = proc.time()
    sparsenet_cv_model = cv.sparsenet(x.train, y.train)
    time_tmp = proc.time() - ptm
    sparsenet_pred = predict(sparsenet_cv_model, newx=x.test, which='parms.min')
    error$sparsenet[i] = rmse(sparsenet_pred, y.test)
    numvar$sparsenet[i] = sum(coef(sparsenet_cv_model, which='parms.min')!=0)
    time$sparsenet[i] = time_tmp[3]
  }
  return(list(error=error, numvar=numvar, time=time))
}
result = lapply(dataset, function(xx){rdresult(xx$x, xx$y, nrow(xx$x), seed)})
# saveRDS(result, '/vignettes/realdata.rds')
```

This is the code to construct the table in the 'Real data examples' section.

```{r, eval=FALSE}
library(knitr)
library(kableExtra)
# Function to extract the results
tmp_function <- function(method){
  unlist(lapply(result, function(xx){
    unlist(lapply(xx, function(yy){
      round(mean(yy[[method]]), 3)
    }))
  }))
}
tmp = data.frame(Dataset = rep(names(result), each=3),
                 n_p = rep(unlist(lapply(dataset, function(xx){paste(dim(xx$x), collapse = ', ')})), each=3),
                 Metrics = rep(c('RMSE', '# predictors', 'running time (s)'), length(result)),
                 BOSS = tmp_function('boss'),
                 FS = tmp_function('fs'),
                 LASSO = tmp_function('lasso'),
                 SparseNet = tmp_function('sparsenet'))
rownames(tmp) = NULL
colnames(tmp)[2] = 'n, p'
kable(tmp, align = "c") %>%
  kable_styling(full_width = F) %>%
  column_spec(1, bold = T) %>%
  collapse_rows(columns = 1:2, valign = "middle")
```
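
Finally, as a rough usage sketch (not evaluated here), the pipeline above can be tried on a single dataset with only a handful of leave-one-out fits in order to get quick, approximate versions of the table entries; the choice of 20 fits below is arbitrary and for illustration only.
```{r, eval=FALSE}
# Quick check on the 'auto' dataset: leave out each of the first 20 observations
# in turn, instead of running the full leave-one-out loop over all observations
result_auto = rdresult(dataset$auto$x, dataset$auto$y, nrep = 20, seed = 65)
sapply(result_auto$error, mean)  # average RMSE per method
sapply(result_auto$time, mean)   # average running time (s) per method
```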