---
title: Guide to grpsel
author: Ryan Thompson
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Guide to grpsel}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 5
)
```

## Introduction

`grpsel` is an R package for group subset selection (see [https://arxiv.org/abs/2105.12081](https://arxiv.org/abs/2105.12081)). For a response vector $\mathbf{y}=(y_1,\ldots,y_n)^\top$, predictor matrix $\mathbf{X}=(\mathbf{x}_1,\ldots,\mathbf{x}_n)^\top$, and a set of $g$ groups, `grpsel` is capable of approximately solving problems of the form:
$$
\min_\beta \sum_{i=1}^n\ell(\mathbf{x}_i^\top\beta,y_i)+\lambda\sum_{k=1}^g1(\|\beta_k\|\neq0)+\gamma\sum_{k=1}^g\|\beta_k\|^q,\quad q\in\{1,2\},
$$
where the first term is a loss function (square or logistic), the second term is a group subset selection penalty, and the third term is a shrinkage penalty (specifically, a group lasso penalty if $q=1$ or a ridge penalty if $q=2$). The notation $\beta_k$ denotes the coefficients of $\beta$ belonging to the $k$th group.

Group-sparse regression arises in numerous settings in modern data analysis, including selection with categorical predictors, multitask (multiresponse) learning, hierarchical selection, and nonparametric additive regression. We demonstrate some of these applications below.

The `grpsel` package provides a simple set of functions for handling grouped selection in R. The two main functions provided by the package are `grpsel()` and `cv.grpsel()`, responsible for model fitting and cross-validation, respectively.

The `grpsel()` function provides a convenient way of performing group subset selection for a path of $\lambda$. To demonstrate this functionality, let's simulate some grouped data.

```{r}
set.seed(123)
n <- 100 # Number of observations
p <- 10 # Number of predictors
g <- 5 # Number of groups
group <- rep(1:g, each = p / g) # Group structure
beta <- numeric(p)
beta[which(group %in% 1:2)] <- 1 # First two groups are nonzero
x <- matrix(rnorm(n * p), n, p)
y <- x %*% beta + rnorm(n)
```

The first two groups explain the response while the rest are unimportant.

```{r}
library(grpsel)
fit <- grpsel(x, y, group)
```

The `group` argument is optional, and if left unspecified, each predictor will be assigned to its own group (leading to ungrouped variable selection). The values of $\lambda$ are automatically computed from the data, providing a path of solutions from the null model to the full model. These solutions can be extracted using the `coef()` function.

```{r}
coef(fit)
```

Each of the columns above corresponds to a set of estimated coefficients for a particular value of $\lambda$, with the first row containing the intercept terms. These coefficients can be visualised via the `plot()` function.

```{r}
plot(fit)
```

The plot above omits the intercept terms and displays the coefficients by the number of selected groups rather than by $\lambda$. The `predict()` function is available to make predictions for new data.

```{r}
x.new <- matrix(rnorm(10 * p), 10, p)
predict(fit, x.new)
```

Again, the columns represent predictions for different values of $\lambda$.

By default, `grpsel` sets $\gamma=0$. In certain situations, the shrinkage induced by setting $\gamma>0$ is desirable (e.g., when the level of noise is high). To fit the model for both $\lambda$ and $\gamma$, use the argument `penalty = 'grSubset+grLasso'` or `penalty = 'grSubset+Ridge'`.
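Regardless of the penalty, the groups selected in any solution along the path can be read off from the nonzero coefficients. Below is a minimal sketch of this bookkeeping in base R, using the `fit` and `group` objects from above (the choice of the third solution is purely illustrative):

```{r}
beta.hat <- coef(fit) # One column per value of lambda
j <- min(3, ncol(beta.hat)) # An arbitrary solution along the path
unique(group[beta.hat[- 1, j] != 0]) # Groups with nonzero coefficients
```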
To extract the coefficients for specific values of $\lambda$ and $\gamma$, the `lambda` and `gamma` arguments of `coef()` are available.

```{r}
fit <- grpsel(x, y, group, penalty = 'grSubset+grLasso')
coef(fit, lambda = 0.05, gamma = 0.1)
fit <- grpsel(x, y, group, penalty = 'grSubset+Ridge')
coef(fit, lambda = 0.05, gamma = 0.1)
```

Similar arguments exist for `predict()`.

In practice, $\lambda$ and $\gamma$ usually need to be cross-validated. The `cv.grpsel()` function provides a convenient way to perform group subset selection with cross-validation.

```{r}
fit <- cv.grpsel(x, y, group, penalty = 'grSubset+Ridge', nfold = 10) # 10-fold cross-validation
```

The cross-validation results are easily visualised using the `plot()` function.

```{r}
plot(fit)
```

The plot above shows the cross-validated loss (square or logistic) as a function of the number of selected groups for the best cross-validated value of $\gamma$. Plots for other values of $\gamma$ can be produced using the `gamma` argument of `plot()`.

The `coef()` and `predict()` functions applied to the output of `cv.grpsel()` return results corresponding to the values of $\lambda$ and $\gamma$ that minimise the cross-validated loss.

```{r}
coef(fit)
predict(fit, x.new)
```

`grpsel()` does not need to be run after using `cv.grpsel()`, as the latter calls the former and saves the result as `fit$fit`.

Finally, to perform a logistic regression fit, set `loss = 'logistic'`.

```{r}
y <- rbinom(n, 1, 1 / (1 + exp(- x %*% beta)))
fit <- cv.grpsel(x, y, group, loss = 'logistic')
coef(fit)
```

## Overlapping groups

It is straightforward to model overlapping groups using `grpsel`. To demonstrate, suppose there are ten predictors spread among two groups: $\{1,2,3,4,5,6\}$ and $\{5,6,7,8,9,10\}$, where $x_5$ and $x_6$ belong to both groups.

```{r}
x <- matrix(rnorm(n * p), n, p)
y <- rowSums(x) + rnorm(n)
group <- list(1:6, 5:10)
fit <- grpsel(x, y, group)
```

The object `group` is now a list of groups rather than a vector.

```{r}
coef(fit)
```

Under the hood, the overlapping groups are handled using a latent coefficient approach. See [https://arxiv.org/abs/2105.12081](https://arxiv.org/abs/2105.12081) for more information.

## Algorithms

The primary algorithm driving `grpsel` is coordinate descent. Sometimes, when the groups are strongly correlated, the estimate produced by coordinate descent can be improved using local search. This algorithm runs on top of coordinate descent. To use local search, set `local.search = TRUE`.

```{r}
group <- rep(1:g, each = p / g)
x <- matrix(rnorm(n * p), n, p) + matrix(rnorm(n), n, p) # Shared factor induces strong correlation
beta[which(group %in% 1:2)] <- 1 # First two groups are nonzero
y <- x %*% beta + rnorm(n)
fit <- cv.grpsel(x, y, group)
coef(fit)
fit <- cv.grpsel(x, y, group, local.search = TRUE)
coef(fit)
```

In this high-correlation example, the correct groups are not selected without local search.

## Demonstrations

In this section, we show how `grpsel` can be used to fit a variety of statistical models.

### Multitask learning

Multitask learning is useful when the response is a matrix and it is suspected that each of its columns can be explained by the same subset of predictors. In this case, we would like to perform a single fit rather than individual fits on each column. In this example, we will simulate ten response variables, each depending on the first five predictors.
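The single fit rests on a standard vectorisation identity. Writing the multitask model as $\mathbf{Y}=\mathbf{X}\mathbf{B}+\mathbf{E}$ with $\mathbf{Y}\in\mathbb{R}^{n\times m}$ and coefficient matrix $\mathbf{B}\in\mathbb{R}^{p\times m}$, stacking the columns gives
$$
\operatorname{vec}(\mathbf{Y})=(\mathbf{I}_m\otimes\mathbf{X})\operatorname{vec}(\mathbf{B})+\operatorname{vec}(\mathbf{E}),
$$
so the $m$ regressions become a single regression on the block-diagonal matrix $\mathbf{I}_m\otimes\mathbf{X}$, and collecting the coefficients of each predictor (one per response) into a common group yields $p$ groups of size $m$. The code below implements exactly this via `matrix(y, ncol = 1)`, `diag(m) %x% x`, and `rep(1:p, m)`.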
```{r}
m <- 10 # Number of response variables
beta <- matrix(0, p, m)
beta[1:5, ] <- 1
x <- matrix(rnorm(n * p), n, p)
y <- x %*% beta + matrix(rnorm(n * m), n, m)
y <- matrix(y, ncol = 1) # Vectorised response
x <- diag(m) %x% x # Block-diagonal predictor matrix
group <- rep(1:p, m) # Each predictor's coefficients form a group
cvfit <- cv.grpsel(x, y, group)
matrix(coef(cvfit)[- 1, , drop = FALSE], ncol = m)
```

The groups have enforced the constraint that all ten response variables share the same subset of predictors.

### Nonparametric additive regression

Group selection can be used to fit sparse nonparametric additive models of the form
$$y=\sum_{j=1}^pf_j(x_j)+\epsilon.$$
To fit these models, we can approximate $f_1,\ldots,f_p$ using regression splines. In this example, the predictors are uniform on $[0,1]$ and the generating functions are
$$
\begin{aligned}
f_1(x) &= \sin(2\pi x), \\
f_2(x) &= \cos(2\pi x), \\
f_3(x) &= 0,\,\text{and} \\
&~~\vdots \\
f_p(x) &= 0.
\end{aligned}
$$
We will use natural cubic splines with five basis functions to fit this model. The basis functions of each spline form a group, and we have one such group for each predictor.

```{r}
x <- matrix(runif(n * p), n, p)
y <- sinpi(2 * x[, 1]) + cospi(2 * x[, 2]) + rnorm(n, sd = 0.1)
df <- 5 # Number of spline basis functions per predictor
splines <- lapply(1:p, \(j) splines::ns(x[, j], df = df))
x.s <- do.call(cbind, splines)
group <- rep(1:p, each = df)
fit <- cv.grpsel(x.s, y, group)
```

Let's check that the first two predictors have been selected correctly.

```{r}
unique(group[coef(fit)[- 1] != 0])
```

The fitted functions can be plotted as follows.

```{r}
library(ggplot2)
beta0 <- coef(fit)[1]
beta <- coef(fit)[- 1]
# Intercept plus the average contribution of the other splines
int.x1 <- (beta0 + colMeans(x.s[, - (1:df)]) %*% beta[- (1:df)])[, ]
int.x2 <- (beta0 + colMeans(x.s[, - (1:df + df)]) %*% beta[- (1:df + df)])[, ]
ggplot(data.frame(x = seq(0, 1, length.out = 101)), aes(x = x)) +
  stat_function(fun = \(x) sinpi(2 * x), aes(linetype = 'True function')) +
  stat_function(fun = \(x) int.x1 + drop(predict(splines[[1]], x) %*% beta[1:df]),
                aes(linetype = 'Fitted function')) +
  xlab('x') +
  ylab('f(x)')
ggplot(data.frame(x = seq(0, 1, length.out = 101)), aes(x = x)) +
  stat_function(fun = \(x) cospi(2 * x), aes(linetype = 'True function')) +
  stat_function(fun = \(x) int.x2 + drop(predict(splines[[2]], x) %*% beta[1:df + df]),
                aes(linetype = 'Fitted function')) +
  xlab('x') +
  ylab('f(x)')
```

### Hierarchical interactions

When modelling interactions between predictors, it is often desirable to enforce the condition that an interaction can be selected only when its constituent predictors are selected, i.e., the coefficient on $x_1x_2$ can be nonzero only when the coefficients on $x_1$ and $x_2$ are nonzero. It is straightforward to enforce this hierarchical constraint using group selection. In this example, the main effects $x_1$, $x_2$, and $x_3$ are nonzero, as well as the interaction $x_1x_2$.

```{r}
x <- matrix(rnorm(n * p), n, p)
y <- x[, 1] + x[, 2] + x[, 3] + x[, 1] * x[, 2] + rnorm(n, sd = 0.1)
x.int <- model.matrix(~ - 1 + . ^ 2, data = as.data.frame(x))
# Each interaction's group also contains its two parent main effects
group <- c(1:p, mapply(c, combn(1:p, 2, simplify = FALSE), 1:choose(p, 2) + p, SIMPLIFY = FALSE))
fit <- cv.grpsel(x.int, y, group)
colnames(x.int)[coef(fit)[- 1] != 0]
```

The fitted model respects the hierarchy constraint.
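As a quick numerical check of this claim, one can verify that both parent main effects of every selected interaction are themselves selected. The sketch below assumes the `x.int` and `fit` objects from above and relies only on the `V<i>:V<j>` column naming produced by `model.matrix()`:

```{r}
selected <- colnames(x.int)[coef(fit)[- 1] != 0]
mains <- selected[!grepl(':', selected, fixed = TRUE)] # Selected main effects
parents <- strsplit(selected[grepl(':', selected, fixed = TRUE)], ':', fixed = TRUE)
# TRUE if every selected interaction has both parents among the main effects
all(vapply(parents, \(v) all(v %in% mains), logical(1)))
```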