| Title: | Continuous Optimisation Towards Best Subset Selection |
|---|---|
| Description: | Best subset selection in generalised linear models via continuous optimisation. Reformulates the NP-hard discrete subset selection problem as a continuous optimisation over the hypercube [0,1]^p, solved via a Frank-Wolfe homotopy algorithm with closed-form ridge inner solves. Supports linear (Gaussian), binary logistic, and multinomial regression. For methodological details see Moka, Liquet, Zhu and Muller (2024) <doi:10.1007/s11222-024-10387-8> and Mathur, Liquet, Muller and Moka (2026) <doi:10.48550/arXiv.2603.21952>. |
| Authors: | Benoit Liquet [aut, cre] (ORCID: <https://orcid.org/0000-0002-8136-2294>), Anant Mathur [aut], Sarat Moka [aut] |
| Maintainer: | Benoit Liquet <[email protected]> |
| License: | GPL-3 |
| Version: | 0.1.0 |
| Built: | 2026-05-11 22:09:33 UTC |
| Source: | https://github.com/cran/combss |
Extract the subset of selected feature indices.
## S3 method for class 'combss' coef(object, k = NULL, ...)## S3 method for class 'combss' coef(object, k = NULL, ...)
object |
A |
k |
Subset size to return. Defaults to the best |
... |
Unused. |
Integer vector of selected column indices (1-indexed).
set.seed(1) x <- matrix(rnorm(100 * 20), 100, 20) y <- as.numeric(x[, 1] + x[, 2] + rnorm(100) * 0.1) fit <- combss(x, y, family = "gaussian", q = 5) coef(fit, k = 2)set.seed(1) x <- matrix(rnorm(100 * 20), 100, 20) y <- as.numeric(x[, 1] + x[, 2] + rnorm(100) * 0.1) fit <- combss(x, y, family = "gaussian", q = 5) coef(fit, k = 2)
Reformulates the NP-hard discrete subset selection problem as a continuous
optimisation over the hypercube [0, 1]^p, solved via a Frank-Wolfe
homotopy algorithm. The inner ridge problem is solved with glmnet. For
each subset size k = 1, ..., q, COMBSS returns the selected feature set;
if validation data is supplied, the best k is chosen by validation MSE
(Gaussian) or accuracy (binomial / multinomial).
combss( x, y, family = c("gaussian", "binomial", "multinomial"), q = NULL, x_val = NULL, y_val = NULL, lam_ridge = 0, Niter = 25, alpha = 0.01, scale = TRUE, mandatory = NULL, verbose = FALSE )combss( x, y, family = c("gaussian", "binomial", "multinomial"), q = NULL, x_val = NULL, y_val = NULL, lam_ridge = 0, Niter = 25, alpha = 0.01, scale = TRUE, mandatory = NULL, verbose = FALSE )
x |
Numeric design matrix |
y |
Response. Numeric for |
family |
One of |
q |
Maximum subset size. Defaults to |
x_val, y_val
|
Optional validation data. When provided, the best |
lam_ridge |
Ridge penalty for the inner solver. Default |
Niter |
Number of homotopy iterations. Default |
alpha |
Frank-Wolfe step size. Default |
scale |
Column-normalise |
mandatory |
Optional integer vector of 1-indexed columns of |
verbose |
Print progress. Default |
An object of class "combss", a list with elements:
family: one of "gaussian", "binomial", "multinomial"
subset_list: list of length up to q; element k is an integer
vector of selected column indices (1-indexed) for subset size k
subset: best subset (requires validation data; otherwise NULL)
mse / accuracy: validation metric for the best subset
(requires validation data)
lam_ridge: ridge penalty used
q: maximum subset size evaluated
Moka, S., Liquet, B., Zhu, H. and Muller, S. (2024). COMBSS: best subset selection via continuous optimization. Statistics and Computing. doi:10.1007/s11222-024-10387-8
Mathur, A., Liquet, B., Muller, S. and Moka, S. (2026). Parsimonious Subset Selection for Generalized Linear Models with Biomedical Applications. arXiv:2603.21952.
set.seed(1) n <- 200; p <- 30 beta <- c(3, 2, 1.5, 1, 0.5, rep(0, p - 5)) x <- matrix(rnorm(n * p), n, p) y <- as.numeric(x %*% beta + rnorm(n) * 0.5) fit <- combss(x, y, family = "gaussian", q = 10) fit$subset_list[1:5]set.seed(1) n <- 200; p <- 30 beta <- c(3, 2, 1.5, 1, 0.5, rep(0, p - 5)) x <- matrix(rnorm(n * p), n, p) y <- as.numeric(x %*% beta + rnorm(n) * 0.5) fit <- combss(x, y, family = "gaussian", q = 10) fit$subset_list[1:5]
Mirrors combss.cv.select_lambda in the Python combss package. For each
candidate lambda and each subset size k = 1, ..., q, runs COMBSS to
obtain a subset and evaluates LOOCV error of a refit. Returns the best
lambda overall and the best lambda per k.
combss_cv( x, y, family = c("gaussian", "binomial", "multinomial"), q, lambda_grid = NULL, Niter = 25, verbose = FALSE )combss_cv( x, y, family = c("gaussian", "binomial", "multinomial"), q, lambda_grid = NULL, Niter = 25, verbose = FALSE )
x |
Numeric design matrix |
y |
Response (numeric / binary / factor depending on |
family |
One of |
q |
Maximum subset size. |
lambda_grid |
Candidate ridge penalties. Defaults to
|
Niter |
Number of homotopy iterations passed to |
verbose |
Print progress. |
List with:
best_lambda: ridge penalty with best mean LOOCV metric across k
best_lambda_per_k: numeric vector, length q
results: data frame (lambda, k, loocv_metric)
combss fit.Refits the chosen subset on the original training data via OLS or ridge
(Gaussian / binomial / multinomial as appropriate) and predicts on newx.
Note: the training data is not stored on the fit object, so newx must be
supplied along with x_train and y_train.
## S3 method for class 'combss' predict(object, newx, x_train, y_train, k = NULL, ...)## S3 method for class 'combss' predict(object, newx, x_train, y_train, k = NULL, ...)
object |
A |
newx |
Numeric matrix of new observations. |
x_train, y_train
|
Original training data used to refit on the chosen subset. |
k |
Subset size to use. Defaults to the best |
... |
Unused. |
Numeric vector of predictions for newx.
set.seed(1) x <- matrix(rnorm(100 * 20), 100, 20) y <- as.numeric(x[, 1] + x[, 2] + rnorm(100) * 0.1) fit <- combss(x, y, family = "gaussian", q = 5) newx <- matrix(rnorm(5 * 20), 5, 20) predict(fit, newx = newx, x_train = x, y_train = y, k = 2)set.seed(1) x <- matrix(rnorm(100 * 20), 100, 20) y <- as.numeric(x[, 1] + x[, 2] + rnorm(100) * 0.1) fit <- combss(x, y, family = "gaussian", q = 5) newx <- matrix(rnorm(5 * 20), 5, 20) predict(fit, newx = newx, x_train = x, y_train = y, k = 2)