Title: | Fast Best Subset Selection |
---|---|
Description: | Extremely efficient toolkit for solving the best subset selection problem <https://www.jmlr.org/papers/v23/21-1060.html>. This package is its R interface. The package implements and generalizes algorithms designed in <doi:10.1073/pnas.2014241117> that exploits a novel sequencing-and-splicing technique to guarantee exact support recovery and globally optimal solution in polynomial times for linear model. It also supports best subset selection for logistic regression, Poisson regression, Cox proportional hazard model, Gamma regression, multiple-response regression, multinomial logistic regression, ordinal regression, (sequential) principal component analysis, and robust principal component analysis. The other valuable features such as the best subset of group selection <doi:10.1287/ijoc.2022.1241> and sure independence screening <doi:10.1111/j.1467-9868.2008.00674.x> are also provided. |
Authors: | Jin Zhu [aut, cre] , Zezhi Wang [aut], Liyuan Hu [aut], Junhao Huang [aut], Kangkang Jiang [aut], Yanhang Zhang [aut], Borui Tang [aut], Shiyun Lin [aut], Junxian Zhu [aut], Canhong Wen [aut], Heping Zhang [aut] , Xueqin Wang [aut] , spectra contributors [cph] (Spectra implementation) |
Maintainer: | Jin Zhu <[email protected]> |
License: | GPL (>= 3) | file LICENSE |
Version: | 0.4.9 |
Built: | 2024-12-09 06:59:14 UTC |
Source: | CRAN |
Adaptive best-subset selection for regression, (multi-class) classification, counting-response, censored-response, positive response, multi-response modeling in polynomial times.
## Default S3 method: abess( x, y, family = c("gaussian", "binomial", "poisson", "cox", "mgaussian", "multinomial", "gamma", "ordinal"), tune.path = c("sequence", "gsection"), tune.type = c("gic", "ebic", "bic", "aic", "cv"), weight = NULL, normalize = NULL, fit.intercept = TRUE, beta.low = -.Machine$double.xmax, beta.high = .Machine$double.xmax, c.max = 2, support.size = NULL, gs.range = NULL, lambda = 0, always.include = NULL, group.index = NULL, init.active.set = NULL, splicing.type = 2, max.splicing.iter = 20, screening.num = NULL, important.search = NULL, warm.start = TRUE, nfolds = 5, foldid = NULL, cov.update = FALSE, newton = c("exact", "approx"), newton.thresh = 1e-06, max.newton.iter = NULL, early.stop = FALSE, ic.scale = 1, num.threads = 0, seed = 1, ... ) ## S3 method for class 'formula' abess(formula, data, subset, na.action, ...)
## Default S3 method: abess( x, y, family = c("gaussian", "binomial", "poisson", "cox", "mgaussian", "multinomial", "gamma", "ordinal"), tune.path = c("sequence", "gsection"), tune.type = c("gic", "ebic", "bic", "aic", "cv"), weight = NULL, normalize = NULL, fit.intercept = TRUE, beta.low = -.Machine$double.xmax, beta.high = .Machine$double.xmax, c.max = 2, support.size = NULL, gs.range = NULL, lambda = 0, always.include = NULL, group.index = NULL, init.active.set = NULL, splicing.type = 2, max.splicing.iter = 20, screening.num = NULL, important.search = NULL, warm.start = TRUE, nfolds = 5, foldid = NULL, cov.update = FALSE, newton = c("exact", "approx"), newton.thresh = 1e-06, max.newton.iter = NULL, early.stop = FALSE, ic.scale = 1, num.threads = 0, seed = 1, ... ) ## S3 method for class 'formula' abess(formula, data, subset, na.action, ...)
x |
Input matrix, of dimension |
y |
The response variable, of |
family |
One of the following models:
|
tune.path |
The method to be used to select the optimal support size. For
|
tune.type |
The type of criterion for choosing the support size.
Available options are |
weight |
Observation weights. When |
normalize |
Options for normalization.
|
fit.intercept |
A boolean value indicating whether to fit an intercept.
We assume the data has been centered if |
beta.low |
A single value specifying the lower bound of |
beta.high |
A single value specifying the upper bound of |
c.max |
an integer splicing size. Default is: |
support.size |
An integer vector representing the alternative support sizes.
Only used for |
gs.range |
A integer vector with two elements.
The first element is the minimum model size considered by golden-section,
the later one is the maximum one. Default is |
lambda |
A single lambda value for regularized best subset selection. Default is 0. |
always.include |
An integer vector containing the indexes of variables that should always be included in the model. |
group.index |
A vector of integers indicating the which group each variable is in.
For variables in the same group, they should be located in adjacent columns of |
init.active.set |
A vector of integers indicating the initial active set.
Default: |
splicing.type |
Optional type for splicing.
If |
max.splicing.iter |
The maximum number of performing splicing algorithm.
In most of the case, only a few times of splicing iteration can guarantee the convergence.
Default is |
screening.num |
An integer number. Preserve |
important.search |
An integer number indicating the number of
important variables to be splicing.
When |
warm.start |
Whether to use the last solution as a warm start. Default is |
nfolds |
The number of folds in cross-validation. Default is |
foldid |
an optional integer vector of values between 1, ..., nfolds identifying what fold each observation is in.
The default |
cov.update |
A logical value only used for |
newton |
A character specify the Newton's method for fitting generalized linear models,
it should be either |
newton.thresh |
a numeric value for controlling positive convergence tolerance.
The Newton's iterations converge when |
max.newton.iter |
a integer giving the maximal number of Newton's iteration iterations.
Default is |
early.stop |
A boolean value decide whether early stopping.
If |
ic.scale |
A non-negative value used for multiplying the penalty term
in information criterion. Default: |
num.threads |
An integer decide the number of threads to be
concurrently used for cross-validation (i.e., |
seed |
Seed to be used to divide the sample into cross-validation folds.
Default is |
... |
further arguments to be passed to or from methods. |
formula |
an object of class " |
data |
a data frame containing the variables in the |
subset |
an optional vector specifying a subset of observations to be used. |
na.action |
a function which indicates
what should happen when the data contain |
Best-subset selection aims to find a small subset of predictors,
so that the resulting model is expected to have the most desirable prediction accuracy.
Best-subset selection problem under the support size is
where is arbitrary convex functions. In
the GLM case,
is the log-likelihood function; in the Cox
model,
is the log partial-likelihood function.
The best subset selection problem is solved by the splicing algorithm in this package, see Zhu (2020) for details.
Under mild conditions, the algorithm exactly solve this problem in polynomial time.
This algorithm exploits the idea of sequencing and splicing to reach a stable solution in finite steps when
is fixed.
The parameters
c.max
, splicing.type
and max.splicing.iter
allow user control the splicing technique flexibly.
On the basis of our numerical experiment results, we assign properly parameters to the these parameters as the default
such that the precision and runtime are well balanced, we suggest users keep the default values unchanged.
Please see this online page for more details about the splicing algorithm.
To find the optimal support size ,
we provide various criterion like GIC, AIC, BIC and cross-validation error to determine it.
More specifically, the sequence of models implied by
support.size
are fit by the splicing algorithm.
And the solved model with least information criterion or cross-validation error is the optimal model.
The sequential searching for the optimal model is somehow time-wasting.
A faster strategy is golden section (GS), which only need to specify gs.range
.
More details about GS is referred to Zhang et al (2021).
It is worthy to note that the parameters newton
, max.newton.iter
and newton.thresh
allows
user control the parameter estimation in non-gaussian models.
The parameter estimation procedure use Newton method or approximated Newton method (only consider the diagonal elements in the Hessian matrix).
Again, we suggest to use the default values unchanged because the same reason for the parameter c.max
.
abess
support some well-known advanced statistical methods to analyze data, including
sure independent screening: helpful for ultra-high dimensional predictors (i.e., ). Use the parameter
screening.num
to retain the marginally most important predictors. See Fan et al (2008) for more details.
best subset of group selection: helpful when predictors have group structure. Use the parameter group.index
to specify the group structure of predictors. See Zhang et al (2021) for more details.
regularization best subset selection: helpful when signal-to-ratio is relatively small. Use the parameter
lambda
to control the magnitude of the regularization term.
nuisance selection: helpful when the prior knowledge of important predictors is available. Use the parameter always.include
to retain the important predictors.
The arbitrary combination of the four methods are definitely support.
Please see online vignettes for more details about the advanced features support by abess
.
A S3 abess
class object, which is a list
with the following components:
beta |
A |
intercept |
An intercept vector of length |
dev |
the deviance of length |
tune.value |
A value of tuning criterion of length |
nobs |
The number of sample used for training. |
nvars |
The number of variables used for training. |
family |
Type of the model. |
tune.path |
The path type for tuning parameters. |
support.size |
The actual |
edf |
The effective degree of freedom.
It is the same as |
best.size |
The best support size selected by the tuning value. |
tune.type |
The criterion type for tuning parameters. |
tune.path |
The strategy for tuning parameters. |
screening.vars |
The character vector specify the feature
selected by feature screening.
It would be an empty character vector if |
call |
The original call to |
Jin Zhu, Junxian Zhu, Canhong Wen, Heping Zhang, Xueqin Wang
A polynomial algorithm for best-subset selection problem. Junxian Zhu, Canhong Wen, Jin Zhu, Heping Zhang, Xueqin Wang. Proceedings of the National Academy of Sciences Dec 2020, 117 (52) 33117-33123; doi:10.1073/pnas.2014241117
A Splicing Approach to Best Subset of Groups Selection. Zhang, Yanhang, Junxian Zhu, Jin Zhu, and Xueqin Wang (2023). INFORMS Journal on Computing, 35:1, 104-119. doi:10.1287/ijoc.2022.1241
abess: A Fast Best-Subset Selection Library in Python and R. Zhu Jin, Xueqin Wang, Liyuan Hu, Junhao Huang, Kangkang Jiang, Yanhang Zhang, Shiyun Lin, and Junxian Zhu. Journal of Machine Learning Research 23, no. 202 (2022): 1-7.
Sure independence screening for ultrahigh dimensional feature space. Fan, J. and Lv, J. (2008), Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70: 849-911. doi:10.1111/j.1467-9868.2008.00674.x
Targeted Inference Involving High-Dimensional Data Using Nuisance Penalized Regression. Qiang Sun & Heping Zhang (2020). Journal of the American Statistical Association, doi:10.1080/01621459.2020.1737079
print.abess
,
predict.abess
,
coef.abess
,
extract.abess
,
plot.abess
,
deviance.abess
.
library(abess) Sys.setenv("OMP_THREAD_LIMIT" = 2) n <- 100 p <- 20 support.size <- 3 ################ linear model ################ dataset <- generate.data(n, p, support.size) abess_fit <- abess(dataset[["x"]], dataset[["y"]]) ## helpful generic functions: print(abess_fit) coef(abess_fit, support.size = 3) predict(abess_fit, newx = dataset[["x"]][1:10, ], support.size = c(3, 4) ) str(extract(abess_fit, 3)) deviance(abess_fit) plot(abess_fit) plot(abess_fit, type = "tune") ################ logistic model ################ dataset <- generate.data(n, p, support.size, family = "binomial") ## allow cross-validation to tuning abess_fit <- abess(dataset[["x"]], dataset[["y"]], family = "binomial", tune.type = "cv" ) abess_fit ################ poisson model ################ dataset <- generate.data(n, p, support.size, family = "poisson") abess_fit <- abess(dataset[["x"]], dataset[["y"]], family = "poisson", tune.type = "cv" ) abess_fit ################ Cox model ################ dataset <- generate.data(n, p, support.size, family = "cox") abess_fit <- abess(dataset[["x"]], dataset[["y"]], family = "cox", tune.type = "cv" ) ################ Multivariate gaussian model ################ dataset <- generate.data(n, p, support.size, family = "mgaussian") abess_fit <- abess(dataset[["x"]], dataset[["y"]], family = "mgaussian", tune.type = "cv" ) plot(abess_fit, type = "l2norm") ################ Multinomial model (multi-classification) ################ dataset <- generate.data(n, p, support.size, family = "multinomial") abess_fit <- abess(dataset[["x"]], dataset[["y"]], family = "multinomial", tune.type = "cv" ) predict(abess_fit, newx = dataset[["x"]][1:10, ], support.size = c(3, 4), type = "response" ) ################ Ordinal regression ################ dataset <- generate.data(n, p, support.size, family = "ordinal", class.num = 4) abess_fit <- abess(dataset[["x"]], dataset[["y"]], family = "ordinal", tune.type = "cv" ) coef <- coef(abess_fit, support.size = abess_fit[["best.size"]])[[1]] predict(abess_fit, newx = dataset[["x"]][1:10, ], support.size = c(3, 4), type = "response" ) ########## Best group subset selection ############# dataset <- generate.data(n, p, support.size) group_index <- rep(1:10, each = 2) abess_fit <- abess(dataset[["x"]], dataset[["y"]], group.index = group_index) str(extract(abess_fit)) ################ Golden section searching ################ dataset <- generate.data(n, p, support.size) abess_fit <- abess(dataset[["x"]], dataset[["y"]], tune.path = "gsection") abess_fit ################ Feature screening ################ p <- 1000 dataset <- generate.data(n, p, support.size) abess_fit <- abess(dataset[["x"]], dataset[["y"]], screening.num = 100 ) str(extract(abess_fit)) ################ Sparse predictor ################ require(Matrix) p <- 1000 dataset <- generate.data(n, p, support.size) dataset[["x"]][abs(dataset[["x"]]) < 1] <- 0 dataset[["x"]] <- Matrix(dataset[["x"]]) abess_fit <- abess(dataset[["x"]], dataset[["y"]]) str(extract(abess_fit)) ################ Formula interface ################ data("trim32") abess_fit <- abess(y ~ ., data = trim32) abess_fit
library(abess) Sys.setenv("OMP_THREAD_LIMIT" = 2) n <- 100 p <- 20 support.size <- 3 ################ linear model ################ dataset <- generate.data(n, p, support.size) abess_fit <- abess(dataset[["x"]], dataset[["y"]]) ## helpful generic functions: print(abess_fit) coef(abess_fit, support.size = 3) predict(abess_fit, newx = dataset[["x"]][1:10, ], support.size = c(3, 4) ) str(extract(abess_fit, 3)) deviance(abess_fit) plot(abess_fit) plot(abess_fit, type = "tune") ################ logistic model ################ dataset <- generate.data(n, p, support.size, family = "binomial") ## allow cross-validation to tuning abess_fit <- abess(dataset[["x"]], dataset[["y"]], family = "binomial", tune.type = "cv" ) abess_fit ################ poisson model ################ dataset <- generate.data(n, p, support.size, family = "poisson") abess_fit <- abess(dataset[["x"]], dataset[["y"]], family = "poisson", tune.type = "cv" ) abess_fit ################ Cox model ################ dataset <- generate.data(n, p, support.size, family = "cox") abess_fit <- abess(dataset[["x"]], dataset[["y"]], family = "cox", tune.type = "cv" ) ################ Multivariate gaussian model ################ dataset <- generate.data(n, p, support.size, family = "mgaussian") abess_fit <- abess(dataset[["x"]], dataset[["y"]], family = "mgaussian", tune.type = "cv" ) plot(abess_fit, type = "l2norm") ################ Multinomial model (multi-classification) ################ dataset <- generate.data(n, p, support.size, family = "multinomial") abess_fit <- abess(dataset[["x"]], dataset[["y"]], family = "multinomial", tune.type = "cv" ) predict(abess_fit, newx = dataset[["x"]][1:10, ], support.size = c(3, 4), type = "response" ) ################ Ordinal regression ################ dataset <- generate.data(n, p, support.size, family = "ordinal", class.num = 4) abess_fit <- abess(dataset[["x"]], dataset[["y"]], family = "ordinal", tune.type = "cv" ) coef <- coef(abess_fit, support.size = abess_fit[["best.size"]])[[1]] predict(abess_fit, newx = dataset[["x"]][1:10, ], support.size = c(3, 4), type = "response" ) ########## Best group subset selection ############# dataset <- generate.data(n, p, support.size) group_index <- rep(1:10, each = 2) abess_fit <- abess(dataset[["x"]], dataset[["y"]], group.index = group_index) str(extract(abess_fit)) ################ Golden section searching ################ dataset <- generate.data(n, p, support.size) abess_fit <- abess(dataset[["x"]], dataset[["y"]], tune.path = "gsection") abess_fit ################ Feature screening ################ p <- 1000 dataset <- generate.data(n, p, support.size) abess_fit <- abess(dataset[["x"]], dataset[["y"]], screening.num = 100 ) str(extract(abess_fit)) ################ Sparse predictor ################ require(Matrix) p <- 1000 dataset <- generate.data(n, p, support.size) dataset[["x"]][abs(dataset[["x"]]) < 1] <- 0 dataset[["x"]] <- Matrix(dataset[["x"]]) abess_fit <- abess(dataset[["x"]], dataset[["y"]]) str(extract(abess_fit)) ################ Formula interface ################ data("trim32") abess_fit <- abess(y ~ ., data = trim32) abess_fit
Adaptive best subset selection for principal component analysis
abesspca( x, type = c("predictor", "gram"), sparse.type = c("fpc", "kpc"), cor = FALSE, kpc.num = NULL, support.size = NULL, gs.range = NULL, tune.path = c("sequence", "gsection"), tune.type = c("gic", "aic", "bic", "ebic", "cv"), nfolds = 5, foldid = NULL, ic.scale = 1, c.max = NULL, always.include = NULL, group.index = NULL, screening.num = NULL, splicing.type = 1, max.splicing.iter = 20, warm.start = TRUE, num.threads = 0, ... )
abesspca( x, type = c("predictor", "gram"), sparse.type = c("fpc", "kpc"), cor = FALSE, kpc.num = NULL, support.size = NULL, gs.range = NULL, tune.path = c("sequence", "gsection"), tune.type = c("gic", "aic", "bic", "ebic", "cv"), nfolds = 5, foldid = NULL, ic.scale = 1, c.max = NULL, always.include = NULL, group.index = NULL, screening.num = NULL, splicing.type = 1, max.splicing.iter = 20, warm.start = TRUE, num.threads = 0, ... )
x |
A matrix object. It can be either a predictor matrix
where each row is an observation and each column is a predictor or
a sample covariance/correlation matrix.
If |
type |
If |
sparse.type |
If |
cor |
A logical value. If |
kpc.num |
A integer decide the number of principal components to be sequentially considered. |
support.size |
It is a flexible input. If it is an integer vector.
It represents the support sizes to be considered for each principal component.
If it is a |
gs.range |
A integer vector with two elements.
The first element is the minimum model size considered by golden-section,
the later one is the maximum one. Default is |
tune.path |
The method to be used to select the optimal support size. For
|
tune.type |
The type of criterion for choosing the support size.
Available options are |
nfolds |
The number of folds in cross-validation. Default is |
foldid |
an optional integer vector of values between 1, ..., nfolds identifying what fold each observation is in.
The default |
ic.scale |
A non-negative value used for multiplying the penalty term
in information criterion. Default: |
c.max |
an integer splicing size. The default of |
always.include |
An integer vector containing the indexes of variables that should always be included in the model. |
group.index |
A vector of integers indicating the which group each variable is in.
For variables in the same group, they should be located in adjacent columns of |
screening.num |
An integer number. Preserve |
splicing.type |
Optional type for splicing.
If |
max.splicing.iter |
The maximum number of performing splicing algorithm.
In most of the case, only a few times of splicing iteration can guarantee the convergence.
Default is |
warm.start |
Whether to use the last solution as a warm start. Default is |
num.threads |
An integer decide the number of threads to be
concurrently used for cross-validation (i.e., |
... |
further arguments to be passed to or from methods. |
Adaptive best subset selection for principal component analysis (abessPCA) aim to solve the non-convex optimization problem:
where is support size.
Here,
is covariance matrix, i.e.,
A generic splicing technique is implemented to
solve this problem.
By exploiting the warm-start initialization, the non-convex optimization
problem at different support size (specified by support.size
)
can be efficiently solved.
The abessPCA can be conduct sequentially for each component.
Please see the multiple principal components Section on the website
for more details about this function.
For abesspca
function, the arguments kpc.num
control the number of components to be consider.
When sparse.type = "fpc"
but support.size
is not supplied,
it is set as support.size = 1:min(ncol(x), 100)
if group.index = NULL
;
otherwise, support.size = 1:min(length(unique(group.index)), 100)
.
When sparse.type = "kpc"
but support.size
is not supplied,
then for 20\
it is set as min(ncol(x), 100)
if group.index = NULL
;
otherwise, min(length(unique(group.index)), 100)
.
A S3 abesspca
class object, which is a list
with the following components:
coef |
A |
nvars |
The number of variables. |
sparse.type |
The same as input. |
support.size |
The actual support.size values used. Note that it is not necessary the same as the input if the later have non-integer values or duplicated values. |
ev |
A vector with size |
tune.value |
A value of tuning criterion of length |
kpc.num |
The number of principal component being considered. |
var.pc |
The variance of principal components obtained by performing standard PCA. |
cum.var.pc |
Cumulative sums of |
var.all |
If |
pev |
A vector with the same length as |
pev.pc |
It records the percent of explained variance (compared to |
tune.type |
The criterion type for tuning parameters. |
tune.path |
The strategy for tuning parameters. |
call |
The original call to |
It is worthy to note that, if sparse.type == "kpc"
, the coef
, support.size
, ev
, tune.value
, pev
and pev.pc
in list are list
objects.
Some parameters not described in the Details Section is explained in the document for abess
because the meaning of these parameters are very similar.
Jin Zhu, Junxian Zhu, Ruihuang Liu, Junhao Huang, Xueqin Wang
A polynomial algorithm for best-subset selection problem. Junxian Zhu, Canhong Wen, Jin Zhu, Heping Zhang, Xueqin Wang. Proceedings of the National Academy of Sciences Dec 2020, 117 (52) 33117-33123; doi:10.1073/pnas.2014241117
Sparse principal component analysis. Hui Zou, Hastie Trevor, and Tibshirani Robert. Journal of computational and graphical statistics 15.2 (2006): 265-286. doi:10.1198/106186006X113430
print.abesspca
,
coef.abesspca
,
plot.abesspca
.
library(abess) Sys.setenv("OMP_THREAD_LIMIT" = 2) ## predictor matrix input: head(USArrests) pca_fit <- abesspca(USArrests) pca_fit plot(pca_fit) ## covariance matrix input: cov_mat <- stats::cov(USArrests) * (nrow(USArrests) - 1) / nrow(USArrests) pca_fit <- abesspca(cov_mat, type = "gram") pca_fit ## robust covariance matrix input: rob_cov <- MASS::cov.rob(USArrests)[["cov"]] rob_cov <- (rob_cov + t(rob_cov)) / 2 pca_fit <- abesspca(rob_cov, type = "gram") pca_fit ## K-component principal component analysis pca_fit <- abesspca(USArrests, sparse.type = "kpc", support.size = 1:4 ) coef(pca_fit) plot(pca_fit) plot(pca_fit, "coef") ## select support size via cross-validation ## n <- 500 p <- 50 support_size <- 3 dataset <- generate.spc.matrix(n, p, support_size, snr = 20) spca_fit <- abesspca(dataset[["x"]], tune.type = "cv", nfolds = 5) plot(spca_fit, type = "tune")
library(abess) Sys.setenv("OMP_THREAD_LIMIT" = 2) ## predictor matrix input: head(USArrests) pca_fit <- abesspca(USArrests) pca_fit plot(pca_fit) ## covariance matrix input: cov_mat <- stats::cov(USArrests) * (nrow(USArrests) - 1) / nrow(USArrests) pca_fit <- abesspca(cov_mat, type = "gram") pca_fit ## robust covariance matrix input: rob_cov <- MASS::cov.rob(USArrests)[["cov"]] rob_cov <- (rob_cov + t(rob_cov)) / 2 pca_fit <- abesspca(rob_cov, type = "gram") pca_fit ## K-component principal component analysis pca_fit <- abesspca(USArrests, sparse.type = "kpc", support.size = 1:4 ) coef(pca_fit) plot(pca_fit) plot(pca_fit, "coef") ## select support size via cross-validation ## n <- 500 p <- 50 support_size <- 3 dataset <- generate.spc.matrix(n, p, support_size, snr = 20) spca_fit <- abesspca(dataset[["x"]], tune.type = "cv", nfolds = 5) plot(spca_fit, type = "tune")
Decompose a matrix into the summation of low-rank matrix and sparse matrix via the best subset selection approach
abessrpca( x, rank, support.size = NULL, tune.path = c("sequence", "gsection"), gs.range = NULL, tune.type = c("gic", "aic", "bic", "ebic"), ic.scale = 1, lambda = 0, always.include = NULL, group.index = NULL, c.max = NULL, splicing.type = 2, max.splicing.iter = 1, warm.start = TRUE, important.search = NULL, max.newton.iter = 1, newton.thresh = 0.001, num.threads = 0, seed = 1, ... )
abessrpca( x, rank, support.size = NULL, tune.path = c("sequence", "gsection"), gs.range = NULL, tune.type = c("gic", "aic", "bic", "ebic"), ic.scale = 1, lambda = 0, always.include = NULL, group.index = NULL, c.max = NULL, splicing.type = 2, max.splicing.iter = 1, warm.start = TRUE, important.search = NULL, max.newton.iter = 1, newton.thresh = 0.001, num.threads = 0, seed = 1, ... )
x |
A matrix object. |
rank |
A positive integer value specify the rank of the low-rank matrix. |
support.size |
An integer vector representing the alternative support sizes.
Only used for |
tune.path |
The method to be used to select the optimal support size. For
|
gs.range |
A integer vector with two elements.
The first element is the minimum model size considered by golden-section,
the later one is the maximum one. Default is |
tune.type |
The type of criterion for choosing the support size. Available options are "gic", "ebic", "bic" and "aic". Default is "gic". |
ic.scale |
A non-negative value used for multiplying the penalty term
in information criterion. Default: |
lambda |
A single lambda value for regularized best subset selection. Default is 0. |
always.include |
An integer vector containing the indexes of variables that should always be included in the model. |
group.index |
A vector of integers indicating the which group each variable is in.
For variables in the same group, they should be located in adjacent columns of |
c.max |
an integer splicing size. Default is: |
splicing.type |
Optional type for splicing.
If |
max.splicing.iter |
The maximum number of performing splicing algorithm.
In most of the case, only a few times of splicing iteration can guarantee the convergence.
Default is |
warm.start |
Whether to use the last solution as a warm start. Default is |
important.search |
An integer number indicating the number of
important variables to be splicing.
When |
max.newton.iter |
a integer giving the maximal number of Newton's iteration iterations.
Default is |
newton.thresh |
a numeric value for controlling positive convergence tolerance.
The Newton's iterations converge when |
num.threads |
An integer decide the number of threads to be
concurrently used for cross-validation (i.e., |
seed |
Seed to be used to divide the sample into cross-validation folds.
Default is |
... |
further arguments to be passed to or from methods. |
Adaptive best subset selection for robust principal component analysis aim to find two latent matrices and
such that the original matrix
can be appropriately approximated:
where is a low-rank matrix,
is a sparse matrix,
is a dense noise matrix.
Generic splicing technique can be employed to solve this problem by iteratively improve the quality of the estimation of
.
For a given support set , the optimization problem:
still a non-convex optimization problem. We use the hard-impute algorithm proposed in one of the reference to solve this problem.
The hard-impute algorithm is an iterative algorithm, people can set max.newton.iter
and newton.thresh
to
control the solution precision of the optimization problem.
(Here, the name of the two parameters are somehow abused to make the parameters cross functions have an unified name.)
According to our experiments,
we assign properly parameters to the two parameter as the default such that the precision and runtime are well balanced,
we suggest users keep the default values unchanged.
A S3 abessrpca
class object, which is a list
with the following components:
S |
A list with |
L |
The low rank matrix estimation. |
nobs |
The number of sample used for training. |
nvars |
The number of variables used for training. |
rank |
The rank of matrix |
loss |
The loss of objective function. |
tune.value |
A value of tuning criterion of length |
support.size |
The actual support.size values used. Note that it is not necessary the same as the input if the later have non-integer values or duplicated values. |
tune.type |
The criterion type for tuning parameters. |
call |
The original call to |
Some parameters not described in the Details Section is explained in the document for abess
because the meaning of these parameters are very similar.
At present, regularization and group selection are not support,
and thus, set
lambda
and group.index
have no influence on the output.
This feature will coming soon.
A polynomial algorithm for best-subset selection problem. Junxian Zhu, Canhong Wen, Jin Zhu, Heping Zhang, Xueqin Wang. Proceedings of the National Academy of Sciences Dec 2020, 117 (52) 33117-33123; doi:10.1073/pnas.2014241117
Emmanuel J. Candès, Xiaodong Li, Yi Ma, and John Wright. 2011. Robust principal component analysis? Journal of the ACM. 58, 3, Article 11 (May 2011), 37 pages. doi:10.1145/1970392.1970395
Mazumder, Rahul, Trevor Hastie, and Robert Tibshirani. Spectral regularization algorithms for learning large incomplete matrices. The Journal of Machine Learning Research 11 (2010): 2287-2322.
library(abess) Sys.setenv("OMP_THREAD_LIMIT" = 2) n <- 30 p <- 30 true_S_size <- 60 true_L_rank <- 2 dataset <- generate.matrix(n, p, support.size = true_S_size, rank = true_L_rank) res <- abessrpca(dataset[["x"]], rank = true_L_rank, support.size = 50:70) print(res) coef(res) plot(res, type = "tune") plot(res, type = "loss") plot(res, type = "S")
library(abess) Sys.setenv("OMP_THREAD_LIMIT" = 2) n <- 30 p <- 30 true_S_size <- 60 true_L_rank <- 2 dataset <- generate.matrix(n, p, support.size = true_S_size, rank = true_L_rank) res <- abessrpca(dataset[["x"]], rank = true_L_rank, support.size = 50:70) print(res) coef(res) plot(res, type = "tune") plot(res, type = "loss") plot(res, type = "S")
abess
" object.This function provides estimated
coefficients from a fitted "abess
" object.
## S3 method for class 'abess' coef(object, support.size = NULL, sparse = TRUE, ...)
## S3 method for class 'abess' coef(object, support.size = NULL, sparse = TRUE, ...)
object |
An " |
support.size |
An integer vector specifies
the coefficient fitted at given |
sparse |
A logical value, specifying whether the coefficients should be
presented as sparse matrix or not. Default: |
... |
Other arguments. |
A coefficient matrix when fitting an univariate model including gaussian, binomial, poisson, and cox; otherwise, a list containing coefficient matrices. For a coefficient matrix, each row is a variable, and each column is a support size.
print.abess
,
predict.abess
,
coef.abess
,
extract.abess
,
plot.abess
,
deviance.abess
.
abesspca
" object.This function provides estimated
coefficients from a fitted "abesspca
" object.
## S3 method for class 'abesspca' coef(object, support.size = NULL, kpc = NULL, sparse = TRUE, ...)
## S3 method for class 'abesspca' coef(object, support.size = NULL, kpc = NULL, sparse = TRUE, ...)
object |
An " |
support.size |
An integer vector specifies
the coefficient fitted at given |
kpc |
An integer vector specifies
the coefficient fitted at given principal component.
If |
sparse |
A logical value, specifying whether the coefficients should be
presented as sparse matrix or not. Default: |
... |
Other arguments. |
A matrix with length(support.size)
columns.
Each column corresponds to a sparse loading for the first principal component,
where the number of non-zeros entries depends on the support.size
.
print.abesspca
,
coef.abesspca
,
plot.abesspca
.
abessrpca
" object.This function provides estimated
coefficients from a fitted "abessrpca
" object.
## S3 method for class 'abessrpca' coef(object, support.size = NULL, sparse = TRUE, ...)
## S3 method for class 'abessrpca' coef(object, support.size = NULL, sparse = TRUE, ...)
object |
An " |
support.size |
An integer vector specifies
the sparse matrix fitted at given |
sparse |
A logical value, specifying whether the coefficients should be
presented as sparse matrix or not. Default: |
... |
Other arguments. |
A list with length(support.size)
number of dgCMatrix,
each of which is the estimation the sparse component.
abess
" object.Similar to other deviance methods,
which returns deviance from a fitted "abess
" object.
## S3 method for class 'abess' deviance(object, type = c("standard", "gic", "ebic", "bic", "aic"), ...)
## S3 method for class 'abess' deviance(object, type = c("standard", "gic", "ebic", "bic", "aic"), ...)
object |
A " |
type |
The type of deviance.
One of the following: |
... |
additional arguments |
A numeric vector.
print.abess
,
predict.abess
,
coef.abess
,
extract.abess
,
plot.abess
,
deviance.abess
.
abess
" object.Extract the fixed-support-size model's information such as the selected predictors, coefficient estimation, and so on.
extract(object, support.size = NULL, ...) ## S3 method for class 'abess' extract(object, support.size = NULL, ...)
extract(object, support.size = NULL, ...) ## S3 method for class 'abess' extract(object, support.size = NULL, ...)
object |
An " |
support.size |
An integer value specifies
the model size fitted at given |
... |
Other arguments. |
A list
object including the following components:
beta |
A |
intercept |
The fitted intercept value. |
support.size |
The |
support.beta |
The |
support.vars |
The character vector gives variables in the support set. |
tune.value |
The tuning value of the model. |
dev |
The deviance of the model. |
print.abess
,
predict.abess
,
coef.abess
,
extract.abess
,
plot.abess
,
deviance.abess
.
Generate simulated data under the generalized linear model and Cox proportional hazard model.
generate.data( n, p, support.size = NULL, rho = 0, family = c("gaussian", "binomial", "poisson", "cox", "mgaussian", "multinomial", "gamma", "ordinal"), beta = NULL, cortype = 1, snr = 10, sigma = NULL, weibull.shape = 1, uniform.max = 1, y.dim = 3, class.num = 3, seed = 1 )
generate.data( n, p, support.size = NULL, rho = 0, family = c("gaussian", "binomial", "poisson", "cox", "mgaussian", "multinomial", "gamma", "ordinal"), beta = NULL, cortype = 1, snr = 10, sigma = NULL, weibull.shape = 1, uniform.max = 1, y.dim = 3, class.num = 3, seed = 1 )
n |
The number of observations. |
p |
The number of predictors of interest. |
support.size |
The number of nonzero coefficients in the underlying regression
model. Can be omitted if |
rho |
A parameter used to characterize the pairwise correlation in
predictors. Default is |
family |
The distribution of the simulated response. |
beta |
The coefficient values in the underlying regression model.
If it is supplied, |
cortype |
The correlation structure.
|
snr |
A numerical value controlling the signal-to-noise ratio (SNR). The SNR is defined as
as the variance of |
sigma |
The variance of the gaussian noise. Default |
weibull.shape |
The shape parameter of the Weibull distribution.
It works only when |
uniform.max |
A parameter controlling censored rate.
A large value implies a small censored rate;
otherwise, a large censored rate.
It works only when |
y.dim |
Response's Dimension. It works only when |
class.num |
The number of class. It works only when |
seed |
random seed. Default: |
For family = "gaussian"
, the data model is
The underlying regression coefficient has
uniform distribution [m, 100m] and
For family= "binomial"
, the data model is
The underlying regression coefficient has
uniform distribution [2m, 10m] and
For family = "poisson"
, the data is modeled to have
an exponential distribution:
The underlying regression coefficient has
uniform distribution [2m, 10m] and
For family = "gamma"
, the data is modeled to have
a gamma distribution:
where is shape parameter in a gamma distribution.
The underlying regression coefficient
has
uniform distribution [2m, 100m] and
For family = "ordinal"
, the data is modeled to have
an ordinal distribution.
For family = "cox"
, the model for failure time is
where is a uniform random variable with range [0, 1].
The centering time
is generated from
uniform distribution
,
then we define the censor status as
and observed time as
.
The underlying regression coefficient
has
uniform distribution [2m, 10m],
where
.
For family = "mgaussian"
, the data model is
The non-zero values of regression matrix are sampled from
uniform distribution [m, 100m] and
For family= "multinomial"
, the data model is
The non-zero values of regression coefficient has
uniform distribution [2m, 10m] and
In the above models, and
,
where
is determined by the
snr
and q is y.dim
.
A list
object comprising:
x |
Design matrix of predictors. |
y |
Response variable. |
beta |
The coefficients used in the underlying regression model. |
Jin Zhu
# Generate simulated data n <- 200 p <- 20 support.size <- 5 dataset <- generate.data(n, p, support.size) str(dataset)
# Generate simulated data n <- 200 p <- 20 support.size <- 5 dataset <- generate.data(n, p, support.size) str(dataset)
Generate simulated matrix that is the superposition of a low-rank component and a sparse component.
generate.matrix( n, p, rank = NULL, support.size = NULL, beta = NULL, snr = Inf, sigma = NULL, seed = 1 )
generate.matrix( n, p, rank = NULL, support.size = NULL, beta = NULL, snr = Inf, sigma = NULL, seed = 1 )
n |
The number of observations. |
p |
The number of predictors of interest. |
rank |
The rank of low-rank matrix. |
support.size |
The number of nonzero coefficients in the underlying regression
model. Can be omitted if |
beta |
The coefficient values in the underlying regression model.
If it is supplied, |
snr |
A positive value controlling the signal-to-noise ratio (SNR).
A larger SNR implies the identification of sparse matrix is much easier.
Default |
sigma |
A numerical value supplied the variance of the gaussian noise.
Default |
seed |
random seed. Default: |
The low rank matrix is generated by
, where
is an
-by-
matrix and
is a
-by-
matrix.
Each element in
(or
) are i.i.d. drawn from
.
The sparse matrix is an
-by-
matrix.
It is generated by choosing a support set of size
support.size
uniformly at random.
The non-zero entries in are independent Bernoulli (-1, +1) entries.
The noise matrix is an
-by-
matrix,
the elements in
are i.i.d. gaussian random variable
with standard deviation
.
The SNR is defined as
as the variance of vectorized matrix divided
by
.
The matrix is the superposition of
,
,
:
A list
object comprising:
x |
An |
L |
The latent low rank matrix. |
S |
The latent sparse matrix. |
Jin Zhu
# Generate simulated data n <- 30 p <- 20 dataset <- generate.matrix(n, p) stats::heatmap(as.matrix(dataset[["S"]]), Rowv = NA, Colv = NA, scale = "none", col = grDevices::cm.colors(256), frame.plot = TRUE, margins = c(2.4, 2.4) )
# Generate simulated data n <- 30 p <- 20 dataset <- generate.matrix(n, p) stats::heatmap(as.matrix(dataset[["S"]]), Rowv = NA, Colv = NA, scale = "none", col = grDevices::cm.colors(256), frame.plot = TRUE, margins = c(2.4, 2.4) )
Generate simulated matrix that its principal component are sparse linear combination of its columns.
generate.spc.matrix( n, p, support.size = 3, snr = 20, sigma = NULL, sparse.loading = NULL, seed = 1 )
generate.spc.matrix( n, p, support.size = 3, snr = 20, sigma = NULL, sparse.loading = NULL, seed = 1 )
n |
The number of observations. |
p |
The number of predictors of interest. |
support.size |
A integer specify the number of non-zero entries in the first column of loading matrix. |
snr |
A positive value controlling the signal-to-noise ratio (SNR).
A larger SNR implies the identification of sparse matrix is much easier.
Default |
sigma |
A numerical vector with length |
sparse.loading |
A |
seed |
random seed. Default: |
The methods for generating the matrix is detailedly described in the APPENDIX A: Data generation Section in Schipper et al (2021).
A list
object comprising:
x |
An |
coef |
The sparse loading matrix used to generate x. |
support.size |
A vector recording the number of non-zero entries in each . |
Model selection techniques for sparse weight-based principal component analysis. de Schipper, Niek C and Van Deun, Katrijn. Journal of Chemometrics. 2021. doi:10.1002/cem.3289.
abess
" objectProduces a coefficient/deviance/tuning-value plot for a fitted "abess" object.
## S3 method for class 'abess' plot( x, type = c("coef", "l2norm", "dev", "dev.ratio", "tune"), label = FALSE, ... )
## S3 method for class 'abess' plot( x, type = c("coef", "l2norm", "dev", "dev.ratio", "tune"), label = FALSE, ... )
x |
A " |
type |
The type of terms to be plot in the y-axis.
One of the following: |
label |
A logical value.
If |
... |
Other graphical parameters to plot |
No return value, called for side effects.
If family = "mgaussian"
, family = "ordinal"
or family = "multinomial"
,
a coefficient plot is produced for
each dimension of multivariate response.
print.abess
,
predict.abess
,
coef.abess
,
extract.abess
,
plot.abess
,
deviance.abess
.
dataset <- generate.data(100, 20, 3) abess_fit <- abess(dataset[["x"]], dataset[["y"]]) plot(abess_fit) plot(abess_fit, type = "l2norm") plot(abess_fit, type = "dev") plot(abess_fit, type = "tune")
dataset <- generate.data(100, 20, 3) abess_fit <- abess(dataset[["x"]], dataset[["y"]]) plot(abess_fit) plot(abess_fit, type = "l2norm") plot(abess_fit, type = "dev") plot(abess_fit, type = "tune")
abess
" objectProduces a coefficient/deviance/tuning-value plot for a fitted "abess" object.
## S3 method for class 'abesspca' plot(x, type = c("pev", "coef", "tune"), label = FALSE, ...)
## S3 method for class 'abesspca' plot(x, type = c("pev", "coef", "tune"), label = FALSE, ...)
x |
A " |
type |
The type of terms to be plot in the y-axis.
One of the following:
|
label |
A logical value.
If |
... |
Other graphical parameters to plot |
No return value, called for side effects.
If family = "mgaussian"
or family = "multinomial"
,
a coefficient plot is produced for
each dimension of multivariate response.
print.abesspca
,
coef.abesspca
,
plot.abesspca
.
abess_fit <- abesspca(USArrests, support.size = 1:4, sparse.type = "kpc") plot(abess_fit) plot(abess_fit, type = "coef") plot(abess_fit, type = "tune")
abess_fit <- abesspca(USArrests, support.size = 1:4, sparse.type = "kpc") plot(abess_fit) plot(abess_fit, type = "coef") plot(abess_fit, type = "tune")
abessrpca
" objectProduces a sparse-matrix/loss/tuning-value plot
for a fitted "abessrpca
" object.
## S3 method for class 'abessrpca' plot(x, type = c("S", "loss", "tune"), support.size = NULL, label = TRUE, ...)
## S3 method for class 'abessrpca' plot(x, type = c("S", "loss", "tune"), support.size = NULL, label = TRUE, ...)
x |
A " |
type |
The plot type.
One of the following:
|
support.size |
An integer vector specifies
the sparse matrix fitted at given |
label |
A logical value.
If |
... |
Other graphical parameters to |
No return value, called for side effects.
abess
" object.Make predictions from a fitted "abess
" object.
## S3 method for class 'abess' predict(object, newx, type = c("link", "response"), support.size = NULL, ...)
## S3 method for class 'abess' predict(object, newx, type = c("link", "response"), support.size = NULL, ...)
object |
An " |
newx |
New data used for prediction. If omitted, the fitted linear predictors are used. |
type |
|
support.size |
An integer value specifies
the model size fitted at given |
... |
Additional arguments affecting the predictions produced. |
The object returned depends on the types of family.
print.abess
,
predict.abess
,
coef.abess
,
extract.abess
,
plot.abess
,
deviance.abess
.
abess
" objectPrints the fitted model and returns it invisibly.
## S3 method for class 'abess' print(x, digits = max(5, getOption("digits") - 5), ...)
## S3 method for class 'abess' print(x, digits = max(5, getOption("digits") - 5), ...)
x |
A " |
digits |
Minimum number of significant digits to be used. |
... |
additional print arguments |
Print a data.frame
with three columns:
the first column is support size of model;
the second column is deviance of model;
the last column is the tuning value of the certain tuning type.
No return value, called for side effects
print.abess
,
predict.abess
,
coef.abess
,
extract.abess
,
plot.abess
,
deviance.abess
.
abesspca
" objectPrints the fitted model and returns it invisibly.
## S3 method for class 'abesspca' print(x, digits = max(5, getOption("digits") - 5), ...)
## S3 method for class 'abesspca' print(x, digits = max(5, getOption("digits") - 5), ...)
x |
A " |
digits |
Minimum number of significant digits to be used. |
... |
additional print arguments |
Print a data.frame
with three columns:
the first column is support size of model;
the second column is the explained variance of model;
the last column is the percent of explained variance of model.
No return value, called for side effects
print.abesspca
,
coef.abesspca
,
plot.abesspca
.
abessrpca
" objectPrints the fitted model and returns it invisibly.
## S3 method for class 'abessrpca' print(x, digits = max(5, getOption("digits") - 5), ...)
## S3 method for class 'abessrpca' print(x, digits = max(5, getOption("digits") - 5), ...)
x |
A " |
digits |
Minimum number of significant digits to be used. |
... |
additional print arguments |
Print a data.frame
with three columns:
the first column is support size of model;
the second column is the explained variance of model;
the last column is the percent of explained variance of model.
No return value, called for side effects
Gene expression data (500 gene probes for 120 samples) from the microarray experiments of mammalianeye tissue samples of Scheetz et al. (2006).
A data frame with 120 rows and 501 variables, where the first variable is the expression level of TRIM32 gene, and the remaining 500 variables are 500 gene probes.
In this study, laboratory rats (Rattus norvegicus) were studied to learn about gene expression and regulation in the mammalian eye. Inbred rat strains were crossed and tissue extracted from the eyes of 120 animals from the F2 generation. Microarrays were used to measure levels of RNA expression in the isolated eye tissues of each subject. Of the 31,000 different probes, 18,976 were detected at a sufficient level to be considered expressed in the mammalian eye. For the purposes of this analysis, we treat one of those genes, Trim32, as the outcome. Trim32 is known to be linked with a genetic disorder called Bardet-Biedl Syndrome (BBS): the mutation (P130S) in Trim32 gives rise to BBS.
This data set contains 120 samples with 500 predictors. The 500 predictors are features with maximum marginal correlation to Trim32 gene.
T. Scheetz, k. Kim, R. Swiderski, A. Philp, T. Braun, K. Knudtson, A. Dorrance, G. DiBona, J. Huang, T. Casavant, V. Sheffield, E. Stone. Regulation of gene expression in the mammalian eye and its relevance to eye disease. Proceedings of the National Academy of Sciences of the United States of America, 2006.