Title: | Unified Algorithm for Non-convex Penalized Estimation for Generalized Linear Models |
---|---|
Description: | An efficient unified nonconvex penalized estimation algorithm for Gaussian (linear), binomial Logit (logistic), Poisson, multinomial Logit, and Cox proportional hazard regression models. The unified algorithm is implemented based on the convex concave procedure and the algorithm can be applied to most of the existing nonconvex penalties. The algorithm also supports convex penalty: least absolute shrinkage and selection operator (LASSO). Supported nonconvex penalties include smoothly clipped absolute deviation (SCAD), minimax concave penalty (MCP), truncated LASSO penalty (TLP), clipped LASSO (CLASSO), sparse ridge (SRIDGE), modified bridge (MBRIDGE) and modified log (MLOG). For high-dimensional data (data set with many variables), the algorithm selects relevant variables producing a parsimonious regression model. Kim, D., Lee, S. and Kwon, S. (2018) <arXiv:1811.05061>, Lee, S., Kwon, S. and Kim, Y. (2016) <doi:10.1016/j.csda.2015.08.019>, Kwon, S., Lee, S. and Kim, Y. (2015) <doi:10.1016/j.csda.2015.07.001>. (This research is funded by Julian Virtue Professorship from Center for Applied Research at Pepperdine Graziadio Business School and the National Research Foundation of Korea.) |
Authors: | Dongshin Kim [aut, cre, cph], Sunghoon Kwon [aut, cph], Sangin Lee [aut, cph] |
Maintainer: | Dongshin Kim <[email protected]> |
License: | GPL (>= 3) |
Version: | 1.0.0 |
Built: | 2024-12-22 06:29:35 UTC |
Source: | CRAN |
This package fits generalized linear models with various non-convex penalties.
Supported regression models are Gaussian (linear), binomial Logit (logistic), multinomial Logit, Poisson and Cox proportional hazard.
A unified algorithm is implemented in ncpen based on the convex concave procedure or difference convex algorithm that can be applied to most existing non-convex penalties.
The available penalties in the package are the least absolute shrinkage and selection operator (LASSO), smoothly clipped absolute deviation (SCAD), minimax concave penalty (MCP), truncated LASSO penalty (TLP), clipped LASSO (CLASSO), sparse ridge (SRIDGE), modified bridge (MBRIDGE), and modified log (MLOG) penalties.
The package accepts a design matrix X and a vector of responses y, and produces the regularization path over a grid of values for the tuning parameter lambda.
It also provides user-friendly processes for plotting, selecting tuning parameters using cross-validation or the generalized information criterion (GIC), L2-regularization, penalty weights, standardization and intercept.
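To make the penalty names above concrete, here is a minimal base-R sketch of the LASSO, SCAD and MCP penalty functions in their textbook forms (Fan and Li, 2001; Zhang, 2010). The function names and the default `tau` of `mcp_penalty` are illustrative assumptions, and the parameterization used inside ncpen may differ.

```r
# Textbook penalty functions; NOT taken from the ncpen source code.
# beta: coefficient(s), lambda: tuning parameter, tau: concavity parameter.

lasso_penalty = function(beta, lambda) lambda * abs(beta)

# SCAD: linear near zero, quadratic in the middle, constant beyond tau * lambda
scad_penalty = function(beta, lambda, tau = 3.7) {
  b = abs(beta)
  ifelse(lambda >= b,
         lambda * b,
         ifelse(tau * lambda >= b,
                (2 * tau * lambda * b - b^2 - lambda^2) / (2 * (tau - 1)),
                lambda^2 * (tau + 1) / 2))
}

# MCP: quadratic up to tau * lambda, constant afterwards
# (tau = 3 is an illustrative default, an assumption of this sketch)
mcp_penalty = function(beta, lambda, tau = 3) {
  b = abs(beta)
  ifelse(tau * lambda >= b,
         lambda * b - b^2 / (2 * tau),
         tau * lambda^2 / 2)
}

beta = c(0, 0.5, 1, 5)
cbind(lasso = lasso_penalty(beta, 1),
      scad  = scad_penalty(beta, 1),
      mcp   = mcp_penalty(beta, 1))
```

Unlike the LASSO, both non-convex penalties level off for large coefficients, which is why they shrink large effects less.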
This research is funded by Julian Virtue Professorship from Center for Applied Research at Pepperdine Graziadio Business School and the National Research Foundation of Korea.
Dongshin Kim, Sunghoon Kwon and Sangin Lee
Kim, D., Lee, S. and Kwon, S. (2018). A unified algorithm for the non-convex penalized estimation: The ncpen package. http://arxiv.org/abs/1811.05061.
Kwon, S., Lee, S. and Kim, Y. (2016). Moderately clipped LASSO. Computational Statistics and Data Analysis, 92C, 53-67.
Lee, S., Kwon, S. and Kim, Y. (2016). A modified local quadratic approximation algorithm for penalized optimization problems. Computational Statistics and Data Analysis, 94, 275-286.
Choi, H., Kim, Y. and Kwon, S. (2013). Sparse bridge estimation with a diverging number of parameters. Statistics and Its Interface, 6, 231-242.
The function returns the optimal vector of coefficients.
## S3 method for class 'cv.ncpen'
coef(object, type = c("rmse", "like"), ...)
object |
(cv.ncpen object) fitted cv.ncpen object. |
type |
(character) cross-validated error type, either "rmse" or "like". |
... |
other S3 parameters. Not used. |
Each error type is defined in cv.ncpen.
The optimal coefficients vector selected by cross-validation.
type |
error type. |
lambda |
the optimal lambda selected by CV. |
beta |
the optimal coefficients selected by CV. |
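The selection step described above can be sketched in base R as follows. This is only an illustration under stated assumptions: `select_by_cv` is a hypothetical helper, the error values and coefficients are made up, and the sketch assumes the optimal lambda is simply the one with the smallest cross-validated error.

```r
# Hypothetical helper: pick the lambda with the smallest CV error and
# return the matching column of the coefficient matrix.
select_by_cv = function(lambda, cv.error, beta, type = "rmse") {
  opt = which.min(cv.error)
  list(type = type, lambda = lambda[opt], beta = beta[, opt])
}

# Made-up inputs: 4 lambda values, one CV error per lambda,
# and a coefficient matrix with one column per lambda.
lambda = c(1, 0.5, 0.25, 0.125)
rmse   = c(2.1, 1.4, 1.2, 1.3)
beta   = rbind(intercept = c(0.9, 1.0, 1.0, 1.0),
               x1        = c(0.0, 0.8, 1.1, 1.2),
               x2        = c(0.0, 0.0, 0.0, 0.1))

opt = select_by_cv(lambda, rmse, beta)
opt$lambda
opt$beta
```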
Dongshin Kim, Sunghoon Kwon, Sangin Lee
Lee, S., Kwon, S. and Kim, Y. (2016). A modified local quadratic approximation algorithm for penalized optimization problems. Computational Statistics and Data Analysis, 94, 275-286.
cv.ncpen, plot.cv.ncpen, gic.ncpen
### linear regression with scad penalty
sam = sam.gen.ncpen(n=200,p=10,q=5,cf.min=0.5,cf.max=1,corr=0.5)
x.mat = sam$x.mat; y.vec = sam$y.vec
fit = cv.ncpen(y.vec=y.vec,x.mat=x.mat,n.lambda=10)
coef(fit)
### logistic regression with classo penalty
sam = sam.gen.ncpen(n=200,p=10,q=5,cf.min=0.5,cf.max=1,corr=0.5,family="binomial")
x.mat = sam$x.mat; y.vec = sam$y.vec
fit = cv.ncpen(y.vec=y.vec,x.mat=x.mat,n.lambda=10,family="binomial",penalty="classo")
coef(fit)
### multinomial regression with sridge penalty
sam = sam.gen.ncpen(n=200,p=10,q=5,k=3,cf.min=0.5,cf.max=1,corr=0.5,family="multinomial")
x.mat = sam$x.mat; y.vec = sam$y.vec
fit = cv.ncpen(y.vec=y.vec,x.mat=x.mat,n.lambda=10,family="multinomial",penalty="sridge")
coef(fit)
The function returns the coefficients matrix for all lambda values.
## S3 method for class 'ncpen'
coef(object, ...)
object |
(ncpen object) fitted ncpen object. |
... |
other S3 parameters. Not used. |
beta |
The coefficients matrix or list for |
Dongshin Kim, Sunghoon Kwon, Sangin Lee
Lee, S., Kwon, S. and Kim, Y. (2016). A modified local quadratic approximation algorithm for penalized optimization problems. Computational Statistics and Data Analysis, 94, 275-286.
### linear regression with scad penalty
sam = sam.gen.ncpen(n=200,p=20,q=5,cf.min=0.5,cf.max=1,corr=0.5)
x.mat = sam$x.mat; y.vec = sam$y.vec
fit = ncpen(y.vec=y.vec,x.mat=x.mat)
coef(fit)
### multinomial regression with classo penalty
sam = sam.gen.ncpen(n=200,p=20,q=5,k=3,cf.min=0.5,cf.max=1,corr=0.5,family="multinomial")
x.mat = sam$x.mat; y.vec = sam$y.vec
fit = ncpen(y.vec=y.vec,x.mat=x.mat,family="multinomial",penalty="classo")
coef(fit)
The function returns controlled samples and tuning parameters for ncpen by eliminating unnecessary errors.
control.ncpen(y.vec, x.mat, family = c("gaussian", "binomial", "poisson", "multinomial", "cox"), penalty = c("scad", "mcp", "tlp", "lasso", "classo", "ridge", "sridge", "mbridge", "mlog"), x.standardize = TRUE, intercept = TRUE, lambda = NULL, n.lambda = NULL, r.lambda = NULL, w.lambda = NULL, gamma = NULL, tau = NULL, alpha = NULL, aiter.max = 100, b.eps = 1e-07)
y.vec |
(numeric vector) response vector.
Must be 0,1 for family = "binomial". |
x.mat |
(numeric matrix) design matrix without intercept.
The censoring indicator must be included at the last column of the design matrix for family = "cox". |
family |
(character) regression model. Supported models are
"gaussian", "binomial", "poisson", "multinomial" and "cox". |
penalty |
(character) penalty function.
Supported penalties are
"scad", "mcp", "tlp", "lasso", "classo", "ridge", "sridge", "mbridge" and "mlog". |
x.standardize |
(logical) whether to standardize x.mat. |
intercept |
(logical) whether to include an intercept in the model. |
lambda |
(numeric vector) user-specified sequence of lambda values. |
n.lambda |
(numeric) the number of lambda values. |
r.lambda |
(numeric) ratio of the smallest lambda value to the largest. |
w.lambda |
(numeric vector) penalty weights for each coefficient (see references). If a penalty weight is set to 0, the corresponding coefficient is always nonzero. |
gamma |
(numeric) additional tuning parameter for controlling shrinkage effect of |
tau |
(numeric) concavity parameter of the penalties (see reference).
Default is 3.7 for |
alpha |
(numeric) ridge effect (weight between the penalty and ridge penalty) (see details).
Default value is 1. If penalty is |
aiter.max |
(numeric) maximum number of iterations in CD algorithm. |
b.eps |
(numeric) convergence threshold for coefficients vector. |
The function is mainly for internal use, but it is useful when users want to extract proper tuning parameters for ncpen.
Do not supply the samples from control.ncpen into ncpen or cv.ncpen directly, to avoid unexpected errors.
An object with S3 class ncpen.
y.vec |
response vector. |
x.mat |
design matrix adjusted to supplied options such as family and intercept. |
family |
regression model. |
penalty |
penalty. |
x.standardize |
whether to standardize x.mat. |
intercept |
whether to include the intercept. |
std |
scale factor for x.mat. |
lambda |
lambda values for the analysis. |
n.lambda |
the number of lambda values. |
r.lambda |
ratio of the smallest lambda value to the largest. |
w.lambda |
penalty weights for each coefficient. |
gamma |
additional tuning parameter for controlling shrinkage effect of |
tau |
concavity parameter of the penalties (see references). |
alpha |
ridge effect (amount of ridge penalty). see details. |
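The interplay of lambda, n.lambda and r.lambda can be illustrated with a small base-R sketch. It assumes a log-spaced grid running from a largest value down to r.lambda times that value, a common convention for regularization paths. Whether control.ncpen uses exactly this spacing is an assumption, and `make_lambda_grid` and its default values are hypothetical.

```r
# Hypothetical helper: decreasing, log-spaced lambda grid.
# lambda.max: largest lambda; r.lambda: ratio smallest/largest;
# n.lambda: number of grid points. Defaults are illustrative only.
make_lambda_grid = function(lambda.max, n.lambda = 10, r.lambda = 0.01) {
  exp(seq(log(lambda.max), log(r.lambda * lambda.max), length.out = n.lambda))
}

grid = make_lambda_grid(lambda.max = 2, n.lambda = 5, r.lambda = 0.01)
grid
```

A log-spaced grid puts more points near small lambda values, where the set of selected variables typically changes fastest.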
Dongshin Kim, Sunghoon Kwon, Sangin Lee
Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96, 1348-60.
Zhang, C.H. (2010). Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, 38(2), 894-942.
Shen, X., Pan, W., Zhu, Y. and Zhou, H. (2013). On constrained and regularized high-dimensional regression. Annals of the Institute of Statistical Mathematics, 65(5), 807-832.
Kwon, S., Lee, S. and Kim, Y. (2016). Moderately clipped LASSO. Computational Statistics and Data Analysis, 92C, 53-67.
Kwon, S., Kim, Y. and Choi, H. (2013). Sparse bridge estimation with a diverging number of parameters. Statistics and Its Interface, 6, 231-242.
Huang, J., Horowitz, J.L. and Ma, S. (2008). Asymptotic properties of bridge estimators in sparse high-dimensional regression models. The Annals of Statistics, 36(2), 587-613.
Zou, H. and Li, R. (2008). One-step sparse estimates in nonconcave penalized likelihood models. The Annals of Statistics, 36(4), 1509.
Lee, S., Kwon, S. and Kim, Y. (2016). A modified local quadratic approximation algorithm for penalized optimization problems. Computational Statistics and Data Analysis, 94, 275-286.
### linear regression with scad penalty
sam = sam.gen.ncpen(n=200,p=10,q=5,cf.min=0.5,cf.max=1,corr=0.5)
x.mat = sam$x.mat; y.vec = sam$y.vec
tun = control.ncpen(y.vec=y.vec,x.mat=x.mat,n.lambda=10,tau=1)
tun$tau
### multinomial regression with sridge penalty
sam = sam.gen.ncpen(n=200,p=10,q=5,k=3,cf.min=0.5,cf.max=1,corr=0.5,family="multinomial")
x.mat = sam$x.mat; y.vec = sam$y.vec
tun = control.ncpen(y.vec=y.vec,x.mat=x.mat,n.lambda=10,
                    family="multinomial",penalty="sridge",gamma=10)
### cox regression with scad penalty
sam = sam.gen.ncpen(n=200,p=10,q=5,r=0.2,cf.min=0.5,cf.max=1,corr=0.5,family="cox")
x.mat = sam$x.mat; y.vec = sam$y.vec
tun = control.ncpen(y.vec=y.vec,x.mat=x.mat,n.lambda=10,family="cox",penalty="scad")
Performs k-fold cross-validation (CV) for nonconvex penalized regression models over a sequence of the regularization parameter lambda.
cv.ncpen(y.vec, x.mat, family = c("gaussian", "linear", "binomial", "logit", "poisson", "multinomial", "cox"), penalty = c("scad", "mcp", "tlp", "lasso", "classo", "ridge", "sridge", "mbridge", "mlog"), x.standardize = TRUE, intercept = TRUE, lambda = NULL, n.lambda = NULL, r.lambda = NULL, w.lambda = NULL, gamma = NULL, tau = NULL, alpha = NULL, df.max = 50, cf.max = 100, proj.min = 10, add.max = 10, niter.max = 30, qiter.max = 10, aiter.max = 100, b.eps = 1e-06, k.eps = 1e-04, c.eps = 1e-06, cut = TRUE, local = FALSE, local.initial = NULL, n.fold = 10, fold.id = NULL)
y.vec |
(numeric vector) response vector.
Must be 0,1 for family = "binomial". |
x.mat |
(numeric matrix) design matrix without intercept.
The censoring indicator must be included at the last column of the design matrix for family = "cox". |
family |
(character) regression model. Supported models are
"gaussian", "linear", "binomial", "logit", "poisson", "multinomial" and "cox". |
penalty |
(character) penalty function.
Supported penalties are
"scad", "mcp", "tlp", "lasso", "classo", "ridge", "sridge", "mbridge" and "mlog". |
x.standardize |
(logical) whether to standardize x.mat. |
intercept |
(logical) whether to include an intercept in the model. |
lambda |
(numeric vector) user-specified sequence of lambda values. |
n.lambda |
(numeric) the number of lambda values. |
r.lambda |
(numeric) ratio of the smallest lambda value to the largest. |
w.lambda |
(numeric vector) penalty weights for each coefficient (see references). If a penalty weight is set to 0, the corresponding coefficient is always nonzero. |
gamma |
(numeric) additional tuning parameter for controlling shrinkage effect of |
tau |
(numeric) concavity parameter of the penalties (see reference).
Default is 3.7 for |
alpha |
(numeric) ridge effect (weight between the penalty and ridge penalty) (see details).
Default value is 1. If penalty is |
df.max |
(numeric) the maximum number of nonzero coefficients. |
cf.max |
(numeric) the maximum of absolute value of nonzero coefficients. |
proj.min |
(numeric) the projection cycle inside CD algorithm (largely internal use. See details). |
add.max |
(numeric) the maximum number of variables added in CCCP iterations (largely internal use. See references). |
niter.max |
(numeric) maximum number of iterations in CCCP. |
qiter.max |
(numeric) maximum number of quadratic approximations in each CCCP iteration. |
aiter.max |
(numeric) maximum number of iterations in CD algorithm. |
b.eps |
(numeric) convergence threshold for coefficients vector. |
k.eps |
(numeric) convergence threshold for KKT conditions. |
c.eps |
(numeric) convergence threshold for KKT conditions (largely internal use). |
cut |
(logical) convergence threshold for KKT conditions (largely internal use). |
local |
(logical) whether to use local initial estimator for path construction. It may take a long time. |
local.initial |
(numeric vector) initial estimator for local = TRUE. |
n.fold |
(numeric) number of folds for CV. |
fold.id |
(numeric vector) fold ids from 1 to k that indicate fold configuration. |
Two kinds of CV errors are returned: root mean squared error and negative log-likelihood.
The results depend on the random partition made internally.
To choose the optimal coefficients from the CV results, use coef.cv.ncpen.
ncpen does not search values of gamma, tau and alpha.
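The two CV error types can be sketched in base R as below. This is an illustration under assumptions, not the package's internal code: the helper names are hypothetical, the negative log-likelihood is written for the binomial case only, and `lm` stands in for a penalized fit to keep the k-fold loop self-contained.

```r
# Root mean squared prediction error
rmse_error = function(y, y.hat) sqrt(mean((y - y.hat)^2))

# Negative log-likelihood for 0/1 responses with fitted probabilities p.hat
neg_loglik_binomial = function(y, p.hat) {
  -sum(y * log(p.hat) + (1 - y) * log(1 - p.hat))
}

# A 5-fold CV loop on simulated data; lm() is only a stand-in model.
set.seed(1)
n = 100
d = data.frame(x = rnorm(n))
d$y = 1 + 2 * d$x + rnorm(n)
fold.id = sample(rep(1:5, length.out = n))   # random partition into folds

cv.rmse = sapply(1:5, function(k) {
  train = fold.id != k
  fit = lm(y ~ x, data = d[train, ])
  rmse_error(d$y[!train], predict(fit, newdata = d[!train, ]))
})
mean(cv.rmse)
```

Because the folds are drawn at random, rerunning without a fixed seed gives slightly different errors, which is the dependence on the random partition noted above.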
An object with S3 class cv.ncpen.
ncpen.fit |
ncpen object fitted from the whole samples. |
fold.index |
fold ids of the samples. |
rmse |
root mean squared errors from CV. |
like |
negative log-likelihoods from CV. |
lambda |
sequence of lambda values. |
Dongshin Kim, Sunghoon Kwon, Sangin Lee
Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96, 1348-60.
Zhang, C.H. (2010). Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, 38(2), 894-942.
Shen, X., Pan, W., Zhu, Y. and Zhou, H. (2013). On constrained and regularized high-dimensional regression. Annals of the Institute of Statistical Mathematics, 65(5), 807-832.
Kwon, S., Lee, S. and Kim, Y. (2016). Moderately clipped LASSO. Computational Statistics and Data Analysis, 92C, 53-67.
Kwon, S., Kim, Y. and Choi, H. (2013). Sparse bridge estimation with a diverging number of parameters. Statistics and Its Interface, 6, 231-242.
Huang, J., Horowitz, J.L. and Ma, S. (2008). Asymptotic properties of bridge estimators in sparse high-dimensional regression models. The Annals of Statistics, 36(2), 587-613.
Zou, H. and Li, R. (2008). One-step sparse estimates in nonconcave penalized likelihood models. The Annals of Statistics, 36(4), 1509.
Lee, S., Kwon, S. and Kim, Y. (2016). A modified local quadratic approximation algorithm for penalized optimization problems. Computational Statistics and Data Analysis, 94, 275-286.
plot.cv.ncpen, coef.cv.ncpen, ncpen, predict.ncpen
### linear regression with scad penalty
sam = sam.gen.ncpen(n=200,p=10,q=5,cf.min=0.5,cf.max=1,corr=0.5,family="gaussian")
x.mat = sam$x.mat; y.vec = sam$y.vec
fit = cv.ncpen(y.vec=y.vec,x.mat=x.mat,n.lambda=10,family="gaussian", penalty="scad")
coef(fit)
Performs k-fold cross-validation (CV) for nonconvex penalized regression models over a sequence of the regularization parameter lambda.
cv.ncpen.reg(formula, data, family = c("gaussian", "linear", "binomial", "logit", "multinomial", "cox", "poisson"), penalty = c("scad", "mcp", "tlp", "lasso", "classo", "ridge", "sridge", "mbridge", "mlog"), x.standardize = TRUE, intercept = TRUE, lambda = NULL, n.lambda = NULL, r.lambda = NULL, w.lambda = NULL, gamma = NULL, tau = NULL, alpha = NULL, df.max = 50, cf.max = 100, proj.min = 10, add.max = 10, niter.max = 30, qiter.max = 10, aiter.max = 100, b.eps = 1e-06, k.eps = 1e-04, c.eps = 1e-06, cut = TRUE, local = FALSE, local.initial = NULL, n.fold = 10, fold.id = NULL)
formula |
(formula) regression formula. To include/exclude the intercept, use the intercept argument. |
data |
(numeric matrix or data.frame) contains both y and X. Each row is an observation vector.
The censoring indicator must be included at the last column of the data for family = "cox". |
family |
(character) regression model. Supported models are
"gaussian", "linear", "binomial", "logit", "multinomial", "cox" and "poisson". |
penalty |
(character) penalty function.
Supported penalties are
"scad", "mcp", "tlp", "lasso", "classo", "ridge", "sridge", "mbridge" and "mlog". |
x.standardize |
(logical) whether to standardize the design matrix. |
intercept |
(logical) whether to include an intercept in the model. |
lambda |
(numeric vector) user-specified sequence of lambda values. |
n.lambda |
(numeric) the number of lambda values. |
r.lambda |
(numeric) ratio of the smallest lambda value to the largest. |
w.lambda |
(numeric vector) penalty weights for each coefficient (see references). If a penalty weight is set to 0, the corresponding coefficient is always nonzero. |
gamma |
(numeric) additional tuning parameter for controlling shrinkage effect of |
tau |
(numeric) concavity parameter of the penalties (see reference).
Default is 3.7 for |
alpha |
(numeric) ridge effect (weight between the penalty and ridge penalty) (see details).
Default value is 1. If penalty is |
df.max |
(numeric) the maximum number of nonzero coefficients. |
cf.max |
(numeric) the maximum of absolute value of nonzero coefficients. |
proj.min |
(numeric) the projection cycle inside CD algorithm (largely internal use. See details). |
add.max |
(numeric) the maximum number of variables added in CCCP iterations (largely internal use. See references). |
niter.max |
(numeric) maximum number of iterations in CCCP. |
qiter.max |
(numeric) maximum number of quadratic approximations in each CCCP iteration. |
aiter.max |
(numeric) maximum number of iterations in CD algorithm. |
b.eps |
(numeric) convergence threshold for coefficients vector. |
k.eps |
(numeric) convergence threshold for KKT conditions. |
c.eps |
(numeric) convergence threshold for KKT conditions (largely internal use). |
cut |
(logical) convergence threshold for KKT conditions (largely internal use). |
local |
(logical) whether to use local initial estimator for path construction. It may take a long time. |
local.initial |
(numeric vector) initial estimator for local = TRUE. |
n.fold |
(numeric) number of folds for CV. |
fold.id |
(numeric vector) fold ids from 1 to k that indicate fold configuration. |
Two kinds of CV errors are returned: root mean squared error and negative log-likelihood.
The results depend on the random partition made internally.
To choose the optimal coefficients from the CV results, use coef.cv.ncpen.
ncpen does not search values of gamma, tau and alpha.
An object with S3 class cv.ncpen.
ncpen.fit |
ncpen object fitted from the whole samples. |
fold.index |
fold ids of the samples. |
rmse |
root mean squared errors from CV. |
like |
negative log-likelihoods from CV. |
lambda |
sequence of lambda values. |
Dongshin Kim, Sunghoon Kwon, Sangin Lee
Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96, 1348-60.
Zhang, C.H. (2010). Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, 38(2), 894-942.
Shen, X., Pan, W., Zhu, Y. and Zhou, H. (2013). On constrained and regularized high-dimensional regression. Annals of the Institute of Statistical Mathematics, 65(5), 807-832.
Kwon, S., Lee, S. and Kim, Y. (2016). Moderately clipped LASSO. Computational Statistics and Data Analysis, 92C, 53-67.
Kwon, S., Kim, Y. and Choi, H. (2013). Sparse bridge estimation with a diverging number of parameters. Statistics and Its Interface, 6, 231-242.
Huang, J., Horowitz, J.L. and Ma, S. (2008). Asymptotic properties of bridge estimators in sparse high-dimensional regression models. The Annals of Statistics, 36(2), 587-613.
Zou, H. and Li, R. (2008). One-step sparse estimates in nonconcave penalized likelihood models. The Annals of Statistics, 36(4), 1509.
Lee, S., Kwon, S. and Kim, Y. (2016). A modified local quadratic approximation algorithm for penalized optimization problems. Computational Statistics and Data Analysis, 94, 275-286.
plot.cv.ncpen, coef.cv.ncpen, ncpen, predict.ncpen
### linear regression with scad penalty
sam = sam.gen.ncpen(n=200,p=5,q=5,cf.min=0.5,cf.max=1,corr=0.5,family="gaussian")
x.mat = sam$x.mat; y.vec = sam$y.vec
data = cbind(y.vec, x.mat)
colnames(data) = c("y", paste("xv", 1:ncol(x.mat), sep = ""))
fit1 = cv.ncpen.reg(formula = y ~ xv1 + xv2 + xv3 + xv4 + xv5, data = data, n.lambda=10,
                    family="gaussian", penalty="scad")
fit2 = cv.ncpen(y.vec=y.vec,x.mat=x.mat,n.lambda=10,family="gaussian", penalty="scad")
coef(fit1)
This function is for internal use only.
excluded(excluded.pair, a, b)
excluded.pair |
a pair. |
a |
first column to be compared. |
b |
second column to be compared. |
TRUE if excluded, FALSE otherwise.
The function returns the fold configuration of the samples for CV.
fold.cv.ncpen(c.vec, n.fold = 10, family = c("gaussian", "binomial", "multinomial", "cox", "poisson"))
c.vec |
(numeric vector) vector for construction of CV ids:
censoring indicator for family = "cox" and response vector for the other families. |
n.fold |
(numeric) number of folds for CV. |
family |
(character) regression model. Supported models are
"gaussian", "binomial", "multinomial", "cox" and "poisson". |
fold ids of the samples.
idx |
fold ids. |
n.fold |
the number of folds. |
family |
the model. |
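A fold configuration of this kind can be sketched in base R. The two helpers below are hypothetical. The first assigns balanced random fold ids, the second does so within each level of `c.vec`, which is one plausible reason for passing a class label or censoring indicator. Whether fold.cv.ncpen stratifies in this way is an assumption of the sketch.

```r
# Balanced random fold ids from 1 to n.fold
make_fold_ids = function(n, n.fold = 10) {
  sample(rep(seq_len(n.fold), length.out = n))
}

# Fold ids balanced within each level of c.vec (assumed stratification)
make_stratified_fold_ids = function(c.vec, n.fold = 10) {
  idx = integer(length(c.vec))
  for (cl in unique(c.vec)) {
    pos = which(c.vec == cl)
    idx[pos] = sample(rep(seq_len(n.fold), length.out = length(pos)))
  }
  idx
}

set.seed(1)
ids = make_fold_ids(25, n.fold = 5)
c.vec = rep(0:1, c(40, 60))               # e.g. 0/1 responses
strat = make_stratified_fold_ids(c.vec, n.fold = 10)
table(strat, c.vec)                       # folds by class
```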
Dongshin Kim, Sunghoon Kwon, Sangin Lee
Lee, S., Kwon, S. and Kim, Y. (2016). A modified local quadratic approximation algorithm for penalized optimization problems. Computational Statistics and Data Analysis, 94, 275-286.
cv.ncpen, plot.cv.ncpen, gic.ncpen
### linear regression with scad penalty
sam = sam.gen.ncpen(n=200,p=20,q=5,cf.min=0.5,cf.max=1,corr=0.5)
x.mat = sam$x.mat; y.vec = sam$y.vec
fold.id = fold.cv.ncpen(c.vec=y.vec,n.fold=10)
### logistic regression with classo penalty
sam = sam.gen.ncpen(n=200,p=20,q=5,cf.min=0.5,cf.max=1,corr=0.5,family="binomial")
x.mat = sam$x.mat; y.vec = sam$y.vec
fold.id = fold.cv.ncpen(c.vec=y.vec,n.fold=10,family="binomial")
### Poisson regression with mlog penalty
sam = sam.gen.ncpen(n=200,p=20,q=5,cf.min=0.5,cf.max=1,corr=0.5,family="poisson")
x.mat = sam$x.mat; y.vec = sam$y.vec
fold.id = fold.cv.ncpen(c.vec=y.vec,n.fold=10,family="poisson")
### multinomial regression with sridge penalty
sam = sam.gen.ncpen(n=200,p=20,q=5,k=3,cf.min=0.5,cf.max=1,corr=0.5,family="multinomial")
x.mat = sam$x.mat; y.vec = sam$y.vec
fold.id = fold.cv.ncpen(c.vec=y.vec,n.fold=10,family="multinomial")
### cox regression with mcp penalty
sam = sam.gen.ncpen(n=200,p=20,q=5,r=0.2,cf.min=0.5,cf.max=1,corr=0.5,family="cox")
x.mat = sam$x.mat; y.vec = sam$y.vec
fold.id = fold.cv.ncpen(c.vec=x.mat[,21],n.fold=10,family="cox")
The function provides the selection of the regularization parameter lambda based on the GIC including AIC and BIC.
gic.ncpen(fit, weight = NULL, verbose = TRUE, ...)
fit |
(ncpen object) fitted ncpen object. |
weight |
(numeric) the weight factor for various information criteria.
Default is BIC if weight = NULL. |
verbose |
(logical) whether to plot the GIC curve. |
... |
other graphical parameters to plot. |
Users can supply various weight values (see references). For example, weight=2, weight=log(n), weight=log(log(p))*log(n) and weight=log(log(n))*log(p) correspond to AIC, BIC (fixed dimensional model), modified BIC (diverging dimensional model) and GIC (high dimensional model), respectively.
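The four weights listed above are easy to compute for a given sample size n and number of variables p. The sketch below also shows how a weight could enter a criterion of the usual form "-2 log-likelihood + weight * degrees of freedom". That form, the helper names and the input numbers are assumptions for illustration, not a description of gic.ncpen's internals.

```r
# Weights from the text for sample size n and dimension p
gic_weights = function(n, p) {
  c(AIC  = 2,
    BIC  = log(n),
    mBIC = log(log(p)) * log(n),
    GIC  = log(log(n)) * log(p))
}

# Assumed criterion form: -2 * loglik + weight * (number of nonzero coefficients)
select_by_gic = function(neg2loglik, df, weight) {
  gic = neg2loglik + weight * df
  list(gic = gic, opt = which.min(gic))
}

w = gic_weights(n = 200, p = 20)
w

# Made-up fit summaries along a path of four lambda values
res = select_by_gic(neg2loglik = c(400, 310, 300, 298),
                    df         = c(1, 3, 5, 9),
                    weight     = w[["BIC"]])
res$opt
```

Heavier weights penalize model size more strongly, so they tend to select sparser models along the same path.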
The coefficients matrix.
gic |
the GIC values. |
lambda |
the sequence of lambda values used to calculate GIC. |
opt.beta |
the optimal coefficients selected by GIC. |
opt.lambda |
the optimal lambda value. |
Dongshin Kim, Sunghoon Kwon, Sangin Lee
Wang, H., Li, R. and Tsai, C.L. (2007). Tuning parameter selectors for the smoothly clipped absolute deviation method. Biometrika, 94(3), 553-568.
Wang, H., Li, B. and Leng, C. (2009). Shrinkage tuning parameter selection with a diverging number of parameters. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71(3), 671-683.
Kim, Y., Kwon, S. and Choi, H. (2012). Consistent Model Selection Criteria on High Dimensions. Journal of Machine Learning Research, 13, 1037-1057.
Fan, Y. and Tang, C.Y. (2013). Tuning parameter selection in high dimensional penalized likelihood. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 75(3), 531-552.
Lee, S., Kwon, S. and Kim, Y. (2016). A modified local quadratic approximation algorithm for penalized optimization problems. Computational Statistics and Data Analysis, 94, 275-286.
### linear regression with scad penalty
sam = sam.gen.ncpen(n=200,p=20,q=5,cf.min=0.5,cf.max=1,corr=0.5)
x.mat = sam$x.mat; y.vec = sam$y.vec
fit = ncpen(y.vec=y.vec,x.mat=x.mat)
gic.ncpen(fit,pch="*",type="b")
### multinomial regression with classo penalty
sam = sam.gen.ncpen(n=200,p=20,q=5,k=3,cf.min=0.5,cf.max=1,corr=0.5,family="multinomial")
x.mat = sam$x.mat; y.vec = sam$y.vec
fit = ncpen(y.vec=y.vec,x.mat=x.mat,family="multinomial",penalty="classo")
gic.ncpen(fit,pch="*",type="b")
interact.data interacts all the data in a data.frame or matrix.
interact.data(data, base.cols = NULL, exclude.pair = NULL)
data |
a data.frame or matrix. |
base.cols |
indicates columns from one category.
Interactions among variables from the same base.col will be avoided. For example, if three indicator columns,
"ChannelR", "ChannelC" and "ChannelB", are created from a categorical column "Channel", then the interactions among them
can be excluded by assigning base.cols = "Channel". |
exclude.pair |
the pairs to be excluded from interactions. This should be a list of pairs, for example list(c("bb", "cc")). |
This returns a matrix object which contains the interactions.
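The basic idea of building interaction columns can be sketched in base R as element-wise products of all column pairs. The helper `pairwise_interactions` and its "a.b" column naming are hypothetical. The sketch does not implement base.cols or exclude.pair, and whether interact.data also returns the original columns is not assumed here.

```r
# Hypothetical helper: products of all pairs of columns.
pairwise_interactions = function(data) {
  data = as.matrix(data)
  pairs = combn(colnames(data), 2)            # 2 x (number of pairs)
  out = sapply(seq_len(ncol(pairs)), function(j) {
    data[, pairs[1, j]] * data[, pairs[2, j]]
  })
  out = matrix(out, nrow = nrow(data))        # keep matrix shape for 1 row
  colnames(out) = apply(pairs, 2, paste, collapse = ".")
  out
}

df = data.frame(aa = 1:3, bb = 4:6, cc = 7:9)
int = pairwise_interactions(df)
int
```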
df = data.frame(1:3, 4:6, 7:9, 10:12, 13:15);
colnames(df) = c("aa", "bb", "cc", "dd", "aa2");
df
interact.data(df);
interact.data(df, base.cols = "aa");
interact.data(df, base.cols = "aa", exclude.pair = list(c("bb", "cc")));
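To make the idea of interacting columns and excluding pairs concrete, here is a minimal base R sketch. The helper `pairwise.products` is hypothetical and is not part of ncpen; it only forms products of column pairs and skips the pairs listed in `exclude.pair`. It does not attempt to reproduce how `interact.data` handles `base.cols` or names its output columns.

```r
# Hypothetical helper (not part of ncpen): pairwise products of columns,
# skipping any pair listed in exclude.pair.
pairwise.products <- function(data, exclude.pair = NULL) {
  data <- as.matrix(data)
  nm <- colnames(data)
  out <- list()
  for (i in seq_len(ncol(data) - 1)) {
    for (j in (i + 1):ncol(data)) {
      pair <- c(nm[i], nm[j])
      # a pair is skipped when it matches an excluded pair in either order
      skip <- any(vapply(exclude.pair, function(p) setequal(p, pair), logical(1)))
      if (!skip) out[[paste(pair, collapse = ":")]] <- data[, i] * data[, j]
    }
  }
  do.call(cbind, out)
}

df <- data.frame(aa = 1:3, bb = 4:6, cc = 7:9)
int.all <- pairwise.products(df)
int.sub <- pairwise.products(df, exclude.pair = list(c("bb", "cc")))
```

Here `int.all` holds one column per pair of variables, and `int.sub` drops the product of "bb" and "cc".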
This function creates the ncpen y vector and x matrix from data using a formula.
make.ncpen.data(formula, data)
formula |
(formula) regression formula. Intercept will not be created. |
data |
(numeric matrix or data.frame) contains both y and X. |
List of y vector and x matrix.
y.vec |
y |
x.mat |
x |
Dongshin Kim, Sunghoon Kwon, Sangin Lee
data = data.frame(y = 1:5, x1 = 6:10, x2 = 11:15);
formula = log(y) ~ log(x1) + x2;
make.ncpen.data(formula, data);
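For readers who want to see what turning a formula into a response vector and an intercept-free design matrix involves, the following base R sketch uses `model.frame` and `model.matrix`. It is an illustration under the assumption that the transformation is a standard model-frame expansion; it is not the package's own implementation, and the names `mf`, `y.vec` and `x.mat` are chosen only for this example.

```r
data <- data.frame(y = 1:5, x1 = 6:10, x2 = 11:15)
f <- log(y) ~ log(x1) + x2

mf <- model.frame(f, data)        # evaluates log(y), log(x1) and x2
y.vec <- model.response(mf)       # transformed response
x.mat <- model.matrix(f, mf)      # design matrix, still with an intercept column
# drop the intercept column so that only the predictors remain
x.mat <- x.mat[, colnames(x.mat) != "(Intercept)", drop = FALSE]
```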
This is an internal-use-only function. The manual is left blank on purpose.
native_cpp_ncpen_fun_(y_vec, x_mat0, w_vec0, lam_vec0, gam, tau, alp, d_max, iter_max, qiter_max, qiiter_max, b_eps, k_eps, p_eff, cut, c_eps, add, family, penalty, loc, ob_vec, div)
y_vec |
. |
x_mat0 |
. |
w_vec0 |
. |
lam_vec0 |
. |
gam |
. |
tau |
. |
alp |
. |
d_max |
. |
iter_max |
. |
qiter_max |
. |
qiiter_max |
. |
b_eps |
. |
k_eps |
. |
p_eff |
. |
cut |
. |
c_eps |
. |
add |
. |
family |
. |
penalty |
. |
loc |
. |
ob_vec |
. |
div |
. |
.
This is an internal-use-only function. The manual is left blank on purpose.
native_cpp_nr_fun_(fam, y_vec, x_mat, iter_max, b_eps)
fam |
. |
y_vec |
. |
x_mat |
. |
iter_max |
. |
b_eps |
. |
.
This is an internal-use-only function. The manual is left blank on purpose.
native_cpp_obj_fun_(name, y_vec, x_mat, b_vec)
name |
. |
y_vec |
. |
x_mat |
. |
b_vec |
. |
.
This is an internal-use-only function. The manual is left blank on purpose.
native_cpp_obj_grad_fun_(name, y_vec, x_mat, b_vec)
name |
. |
y_vec |
. |
x_mat |
. |
b_vec |
. |
.
This is an internal-use-only function. The manual is left blank on purpose.
native_cpp_obj_hess_fun_(name, y_vec, x_mat, b_vec)
name |
. |
y_vec |
. |
x_mat |
. |
b_vec |
. |
.
This is an internal-use-only function. The manual is left blank on purpose.
native_cpp_p_ncpen_fun_(y_vec, x_mat, b_vec, w_vec, lam, gam, tau, alp, iter_max, qiter_max, qiiter_max, b_eps, k_eps, p_eff, cut, c_eps, family, penalty)
y_vec |
. |
x_mat |
. |
b_vec |
. |
w_vec |
. |
lam |
. |
gam |
. |
tau |
. |
alp |
. |
iter_max |
. |
qiter_max |
. |
qiiter_max |
. |
b_eps |
. |
k_eps |
. |
p_eff |
. |
cut |
. |
c_eps |
. |
family |
. |
penalty |
. |
.
This is an internal-use-only function. The manual is left blank on purpose.
native_cpp_pen_fun_(name, b_vec, lam, gam, tau)
name |
. |
b_vec |
. |
lam |
. |
gam |
. |
tau |
. |
.
This is an internal-use-only function. The manual is left blank on purpose.
native_cpp_pen_grad_fun_(name, b_vec, lam, gam, tau)
name |
. |
b_vec |
. |
lam |
. |
gam |
. |
tau |
. |
.
This is an internal-use-only function. The manual is left blank on purpose.
native_cpp_qlasso_fun_(q_mat, l_vec, b_vec0, w_vec, lam, iter_max, iiter_max, b_eps, k_eps, p_eff, q_rank, cut, c_eps)
q_mat |
. |
l_vec |
. |
b_vec0 |
. |
w_vec |
. |
lam |
. |
iter_max |
. |
iiter_max |
. |
b_eps |
. |
k_eps |
. |
p_eff |
. |
q_rank |
. |
cut |
. |
c_eps |
. |
.
This is an internal-use-only function. The manual is left blank on purpose.
native_cpp_set_dev_mode_(dev_mode)
dev_mode |
. |
.
Fits generalized linear models by penalized maximum likelihood estimation.
The coefficients path is computed for the regression model over a grid of the regularization parameter lambda.
Fits Gaussian (linear), binomial Logit (logistic), Poisson, multinomial Logit regression models, and
Cox proportional hazard model with various non-convex penalties.
ncpen(y.vec, x.mat, family = c("gaussian", "linear", "binomial", "logit", "poisson", "multinomial", "cox"), penalty = c("scad", "mcp", "tlp", "lasso", "classo", "ridge", "sridge", "mbridge", "mlog"), x.standardize = TRUE, intercept = TRUE, lambda = NULL, n.lambda = NULL, r.lambda = NULL, w.lambda = NULL, gamma = NULL, tau = NULL, alpha = NULL, df.max = 50, cf.max = 100, proj.min = 10, add.max = 10, niter.max = 30, qiter.max = 10, aiter.max = 100, b.eps = 1e-07, k.eps = 1e-04, c.eps = 1e-06, cut = TRUE, local = FALSE, local.initial = NULL)
y.vec |
(numeric vector) response vector.
Must be 0,1 for |
x.mat |
(numeric matrix) design matrix without intercept.
The censoring indicator must be included at the last column of the design matrix for |
family |
(character) regression model. Supported models are
|
penalty |
(character) penalty function.
Supported penalties are
|
x.standardize |
(logical) whether to standardize |
intercept |
(logical) whether to include an intercept in the model. |
lambda |
(numeric vector) user-specified sequence of |
n.lambda |
(numeric) the number of |
r.lambda |
(numeric) ratio of the smallest |
w.lambda |
(numeric vector) penalty weights for each coefficient (see references). If a penalty weight is set to 0, the corresponding coefficient is always nonzero. |
gamma |
(numeric) additional tuning parameter for controlling shrinkage effect of |
tau |
(numeric) concavity parameter of the penalties (see reference).
Default is 3.7 for |
alpha |
(numeric) ridge effect (weight between the penalty and ridge penalty) (see details).
Default value is 1. If penalty is |
df.max |
(numeric) the maximum number of nonzero coefficients. |
cf.max |
(numeric) the maximum of absolute value of nonzero coefficients. |
proj.min |
(numeric) the projection cycle inside CD algorithm (largely internal use. See details). |
add.max |
(numeric) the maximum number of variables added in CCCP iterations (largely internal use. See references). |
niter.max |
(numeric) maximum number of iterations in CCCP. |
qiter.max |
(numeric) maximum number of quadratic approximations in each CCCP iteration. |
aiter.max |
(numeric) maximum number of iterations in CD algorithm. |
b.eps |
(numeric) convergence threshold for coefficients vector in CD algorithm. |
k.eps |
(numeric) convergence threshold for KKT conditions. |
c.eps |
(numeric) convergence threshold for KKT conditions (largely internal use). |
cut |
(logical) convergence threshold for KKT conditions (largely internal use). |
local |
(logical) whether to use local initial estimator for path construction. It may take a long time. |
local.initial |
(numeric vector) initial estimator for |
The sequence of models indexed by lambda is fit by using the concave convex procedure (CCCP) and the coordinate descent (CD) algorithm (see references).
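The objective function is
(sum of squared residuals)/2n + [alpha*penalty + (1-alpha)*ridge]
for gaussian and
(negative log-likelihood)/n + [alpha*penalty + (1-alpha)*ridge]
for the others, assuming the canonical link.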
The algorithm applies the warm start strategy (see references) and tries projections after proj.min iterations in the CD algorithm, which makes the algorithm fast and stable.
x.standardize makes each column of x.mat have the same Euclidean length, but the coefficients are re-scaled back to the original scale.
In the multinomial case, the coefficients are expressed in vector form. Use coef.ncpen.
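As an illustration of what a non-convex penalty with a concavity parameter looks like, the sketch below evaluates the textbook SCAD penalty of Fan and Li (2001) in base R. The function name `scad.penalty` is hypothetical, and the parameterization (with `tau = 3.7`, the value suggested by Fan and Li) is the standard one from the literature; it is assumed, not confirmed, that the package uses exactly this form internally.

```r
# Textbook SCAD penalty (Fan and Li, 2001); hypothetical helper, not ncpen code.
scad.penalty <- function(b, lambda, tau = 3.7) {
  a <- abs(b)
  ifelse(a <= lambda,
         lambda * a,                                   # LASSO-like near zero
         ifelse(a <= tau * lambda,
                (2 * tau * lambda * a - a^2 - lambda^2) / (2 * (tau - 1)),
                lambda^2 * (tau + 1) / 2))             # constant for large |b|
}

b <- c(0, 0.5, 1, 2, 5)
pen <- scad.penalty(b, lambda = 1)
```

The penalty grows linearly for small coefficients and flattens out beyond `tau * lambda`, which is what reduces the bias on large coefficients compared with LASSO.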
An object with S3 class ncpen.
y.vec |
response vector. |
x.mat |
design matrix. |
family |
regression model. |
penalty |
penalty. |
x.standardize |
whether to standardize |
intercept |
whether to include the intercept. |
std |
scale factor for |
lambda |
sequence of |
w.lambda |
penalty weights. |
gamma |
extra shrinkage parameter for |
alpha |
ridge effect. |
local |
whether to use local initial estimator. |
local.initial |
local initial estimator for |
beta |
fitted coefficients. Use |
df |
the number of non-zero coefficients. |
Dongshin Kim, Sunghoon Kwon, Sangin Lee
Kim, D., Lee, S. and Kwon, S. (2018). A unified algorithm for the non-convex penalized estimation: The ncpen package. http://arxiv.org/abs/1811.05061.
Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96, 1348-60.
Zhang, C.H. (2010). Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, 38(2), 894-942.
Shen, X., Pan, W., Zhu, Y. and Zhou, H. (2013). On constrained and regularized high-dimensional regression. Annals of the Institute of Statistical Mathematics, 65(5), 807-832.
Kwon, S., Lee, S. and Kim, Y. (2016). Moderately clipped LASSO. Computational Statistics and Data Analysis, 92C, 53-67.
Kwon, S., Kim, Y. and Choi, H. (2013). Sparse bridge estimation with a diverging number of parameters. Statistics and Its Interface, 6, 231-242.
Huang, J., Horowitz, J.L. and Ma, S. (2008). Asymptotic properties of bridge estimators in sparse high-dimensional regression models. The Annals of Statistics, 36(2), 587-613.
Zou, H. and Li, R. (2008). One-step sparse estimates in nonconcave penalized likelihood models. The Annals of Statistics, 36(4), 1509.
Lee, S., Kwon, S. and Kim, Y. (2016). A modified local quadratic approximation algorithm for penalized optimization problems. Computational Statistics and Data Analysis, 94, 275-286.
coef.ncpen, plot.ncpen, gic.ncpen, predict.ncpen, cv.ncpen
### linear regression with scad penalty
sam = sam.gen.ncpen(n=200,p=10,q=5,cf.min=0.5,cf.max=1,corr=0.5,family="gaussian")
x.mat = sam$x.mat; y.vec = sam$y.vec
fit = ncpen(y.vec=y.vec,x.mat=x.mat,family="gaussian", penalty="scad")
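A further usage sketch, this time for a logistic model with the MCP penalty, may help show how the returned components can be inspected. It uses only arguments and value components listed on this page (`lambda`, `df`); the `seed` value and the choice of `penalty = "mcp"` are arbitrary, and the block is guarded so that it does nothing when the package is not installed.

```r
# Runs only if ncpen is installed; seed and penalty choice are arbitrary.
if (requireNamespace("ncpen", quietly = TRUE)) {
  library(ncpen)
  sam <- sam.gen.ncpen(n = 200, p = 10, q = 5, cf.min = 0.5, cf.max = 1,
                       corr = 0.5, seed = 1234, family = "binomial")
  fit <- ncpen(y.vec = sam$y.vec, x.mat = sam$x.mat,
               family = "binomial", penalty = "mcp")
  print(head(fit$lambda))  # start of the lambda sequence used for the path
  print(head(fit$df))      # number of non-zero coefficients along the path
}
```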
Fits generalized linear models by penalized maximum likelihood estimation.
The coefficients path is computed for the regression model over a grid of the regularization parameter lambda.
Fits Gaussian (linear), binomial Logit (logistic), Poisson, multinomial Logit regression models, and
Cox proportional hazard model with various non-convex penalties.
ncpen.reg(formula, data, family = c("gaussian", "linear", "binomial", "logit", "multinomial", "cox", "poisson"), penalty = c("scad", "mcp", "tlp", "lasso", "classo", "ridge", "sridge", "mbridge", "mlog"), x.standardize = TRUE, intercept = TRUE, lambda = NULL, n.lambda = NULL, r.lambda = NULL, w.lambda = NULL, gamma = NULL, tau = NULL, alpha = NULL, df.max = 50, cf.max = 100, proj.min = 10, add.max = 10, niter.max = 30, qiter.max = 10, aiter.max = 100, b.eps = 1e-07, k.eps = 1e-04, c.eps = 1e-06, cut = TRUE, local = FALSE, local.initial = NULL)
formula |
(formula) regression formula. To include/exclude intercept, use |
data |
(numeric matrix or data.frame) contains both y and X. Each row is an observation vector.
The censoring indicator must be included at the last column of the data for |
family |
(character) regression model. Supported models are
|
penalty |
(character) penalty function.
Supported penalties are
|
x.standardize |
(logical) whether to standardize |
intercept |
(logical) whether to include an intercept in the model. |
lambda |
(numeric vector) user-specified sequence of |
n.lambda |
(numeric) the number of |
r.lambda |
(numeric) ratio of the smallest |
w.lambda |
(numeric vector) penalty weights for each coefficient (see references). If a penalty weight is set to 0, the corresponding coefficient is always nonzero. |
gamma |
(numeric) additional tuning parameter for controlling shrinkage effect of |
tau |
(numeric) concavity parameter of the penalties (see reference).
Default is 3.7 for |
alpha |
(numeric) ridge effect (weight between the penalty and ridge penalty) (see details).
Default value is 1. If penalty is |
df.max |
(numeric) the maximum number of nonzero coefficients. |
cf.max |
(numeric) the maximum of absolute value of nonzero coefficients. |
proj.min |
(numeric) the projection cycle inside CD algorithm (largely internal use. See details). |
add.max |
(numeric) the maximum number of variables added in CCCP iterations (largely internal use. See references). |
niter.max |
(numeric) maximum number of iterations in CCCP. |
qiter.max |
(numeric) maximum number of quadratic approximations in each CCCP iteration. |
aiter.max |
(numeric) maximum number of iterations in CD algorithm. |
b.eps |
(numeric) convergence threshold for coefficients vector. |
k.eps |
(numeric) convergence threshold for KKT conditions. |
c.eps |
(numeric) convergence threshold for KKT conditions (largely internal use). |
cut |
(logical) convergence threshold for KKT conditions (largely internal use). |
local |
(logical) whether to use local initial estimator for path construction. It may take a long time. |
local.initial |
(numeric vector) initial estimator for |
The sequence of models indexed by lambda is fit by using the concave convex procedure (CCCP) and the coordinate descent (CD) algorithm (see references).
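The objective function is
(sum of squared residuals)/2n + [alpha*penalty + (1-alpha)*ridge]
for gaussian and
(negative log-likelihood)/n + [alpha*penalty + (1-alpha)*ridge]
for the others, assuming the canonical link.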
The algorithm applies the warm start strategy (see references) and tries projections after proj.min iterations in the CD algorithm, which makes the algorithm fast and stable.
x.standardize makes each column of x.mat have the same Euclidean length, but the coefficients are re-scaled back to the original scale.
In the multinomial case, the coefficients are expressed in vector form. Use coef.ncpen.
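The standardize-then-rescale step described above can be sketched in base R with ordinary least squares standing in for the penalized fit. This is an illustration only: it assumes columns are scaled to unit Euclidean length without centering, which may differ from the exact scaling the package applies, and the variable names are chosen for this example.

```r
set.seed(1)
x <- cbind(rnorm(50), 10 * rnorm(50))      # two columns on very different scales
y <- as.vector(x %*% c(1, 0.2) + rnorm(50))

std <- sqrt(colSums(x^2))                  # Euclidean length of each column
xs <- sweep(x, 2, std, "/")                # every column now has length 1

b.scaled <- coef(lm(y ~ xs - 1))           # fit on the standardized design
b.orig <- b.scaled / std                   # re-scale back to the original units
b.direct <- coef(lm(y ~ x - 1))            # fit directly on the original design
```

For an unpenalized fit the re-scaled coefficients agree with the direct fit; with a penalty, standardizing first puts all columns on an equal footing before shrinkage is applied.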
An object with S3 class ncpen.
y.vec |
response vector. |
x.mat |
design matrix. |
family |
regression model. |
penalty |
penalty. |
x.standardize |
whether to standardize |
intercept |
whether to include the intercept. |
std |
scale factor for |
lambda |
sequence of |
w.lambda |
penalty weights. |
gamma |
extra shrinkage parameter for |
alpha |
ridge effect. |
local |
whether to use local initial estimator. |
local.initial |
local initial estimator for |
beta |
fitted coefficients. Use |
df |
the number of non-zero coefficients. |
Dongshin Kim, Sunghoon Kwon, Sangin Lee
Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96, 1348-60.
Zhang, C.H. (2010). Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, 38(2), 894-942.
Shen, X., Pan, W., Zhu, Y. and Zhou, H. (2013). On constrained and regularized high-dimensional regression. Annals of the Institute of Statistical Mathematics, 65(5), 807-832.
Kwon, S., Lee, S. and Kim, Y. (2016). Moderately clipped LASSO. Computational Statistics and Data Analysis, 92C, 53-67.
Kwon, S., Kim, Y. and Choi, H. (2013). Sparse bridge estimation with a diverging number of parameters. Statistics and Its Interface, 6, 231-242.
Huang, J., Horowitz, J.L. and Ma, S. (2008). Asymptotic properties of bridge estimators in sparse high-dimensional regression models. The Annals of Statistics, 36(2), 587-613.
Zou, H. and Li, R. (2008). One-step sparse estimates in nonconcave penalized likelihood models. The Annals of Statistics, 36(4), 1509.
Lee, S., Kwon, S. and Kim, Y. (2016). A modified local quadratic approximation algorithm for penalized optimization problems. Computational Statistics and Data Analysis, 94, 275-286.
coef.ncpen, plot.ncpen, gic.ncpen, predict.ncpen, cv.ncpen
### linear regression with scad penalty
sam = sam.gen.ncpen(n=200,p=5,q=5,cf.min=0.5,cf.max=1,corr=0.5,family="gaussian")
x.mat = sam$x.mat; y.vec = sam$y.vec
data = cbind(y.vec, x.mat)
colnames(data) = c("y", paste("xv", 1:ncol(x.mat), sep = ""))
fit1 = ncpen.reg(formula = y ~ xv1 + xv2 + xv3 + xv4 + xv5, data = data, family="gaussian", penalty="scad")
fit2 = ncpen(y.vec=y.vec,x.mat=x.mat);
Produces a plot of the cross-validated errors from a cv.ncpen object.
## S3 method for class 'cv.ncpen'
plot(x, type = c("rmse", "like"), log.scale = FALSE, ...)
x |
fitted |
type |
(character) a cross-validated error type which is either |
log.scale |
(logical) whether to use log scale of lambda for horizontal axis. |
... |
other graphical parameters to |
Dongshin Kim, Sunghoon Kwon, Sangin Lee
Lee, S., Kwon, S. and Kim, Y. (2016). A modified local quadratic approximation algorithm for penalized optimization problems. Computational Statistics and Data Analysis, 94, 275-286.
### linear regression with scad penalty
par(mfrow=c(1,2))
sam = sam.gen.ncpen(n=500,p=10,q=5,cf.min=0.5,cf.max=1,corr=0.5,family="gaussian")
x.mat = sam$x.mat; y.vec = sam$y.vec
fit = cv.ncpen(y.vec=y.vec,x.mat=x.mat,n.lambda=50,family="gaussian", penalty="scad")
plot(fit)
plot(fit,log.scale=FALSE)
Produces a plot of the coefficient paths for a fitted ncpen object. Class-wise paths can be drawn for multinomial.
## S3 method for class 'ncpen'
plot(x, log.scale = FALSE, mult.type = c("mat", "vec"), ...)
x |
(ncpen object) Fitted |
log.scale |
(logical) whether to use log scale of lambda for horizontal axis. |
mult.type |
(character) additional option for |
... |
other graphical parameters to |
Dongshin Kim, Sunghoon Kwon, Sangin Lee
Lee, S., Kwon, S. and Kim, Y. (2016). A modified local quadratic approximation algorithm for penalized optimization problems. Computational Statistics and Data Analysis, 94, 275-286.
### linear regression with scad penalty
sam = sam.gen.ncpen(n=200,p=20,q=5,cf.min=0.5,cf.max=1,corr=0.5)
x.mat = sam$x.mat; y.vec = sam$y.vec
fit = ncpen(y.vec=y.vec,x.mat=x.mat)
plot(fit)
### multinomial regression with classo penalty
sam = sam.gen.ncpen(n=200,p=20,q=5,k=3,cf.min=0.5,cf.max=1,corr=0.5,family="multinomial")
x.mat = sam$x.mat; y.vec = sam$y.vec
fit = ncpen(y.vec=y.vec,x.mat=x.mat,family="multinomial",penalty="classo")
plot(fit)
plot(fit,mult.type="vec",log.scale=TRUE)
power.data raises the data to the given power and returns a data.frame whose column names end with tail.
power.data(data, power, tail = "_pow")
data |
a |
power |
power. |
tail |
tail text for column names for powered data. For example, if a column "sales" is powered by 4 (= |
This returns a matrix object.
df = data.frame(a = 1:3, b= 4:6);
power.data(df, 2, ".pow");
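The effect of powering columns and tagging their names can be sketched in base R as follows. This is a minimal illustration, assuming the tail is simply appended to each column name; the exact naming scheme produced by `power.data` may differ.

```r
df <- data.frame(a = 1:3, b = 4:6)

pw <- as.matrix(df)^2                          # element-wise square of every column
colnames(pw) <- paste0(colnames(df), ".pow")   # append the tail to each name
pw
```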
The function provides various types of predictions from a fitted ncpen object:
response, regression, probability, root mean squared error (RMSE), negative log-likelihood (LIKE).
## S3 method for class 'ncpen'
predict(object, type = c("y", "reg", "prob", "rmse", "like"), new.y.vec = NULL, new.x.mat = NULL, prob.cut = 0.5, ...)
object |
(ncpen object) fitted |
type |
(character) type of prediction.
|
new.y.vec |
(numeric vector). vector of new response at which predictions are to be made. |
new.x.mat |
(numeric matrix). matrix of new design at which predictions are to be made. |
prob.cut |
(numeric) threshold value of probability for |
... |
other S3 parameters. Not used. |
Prediction values depending on type for all lambda values.
Dongshin Kim, Sunghoon Kwon, Sangin Lee
Lee, S., Kwon, S. and Kim, Y. (2016). A modified local quadratic approximation algorithm for penalized optimization problems. Computational Statistics and Data Analysis, 94, 275-286.
### linear regression with scad penalty
sam = sam.gen.ncpen(n=200,p=20,q=5,cf.min=0.5,cf.max=1,corr=0.5)
x.mat = sam$x.mat; y.vec = sam$y.vec
fit = ncpen(y.vec=y.vec[1:190],x.mat=x.mat[1:190,])
predict(fit,"y",new.x.mat=x.mat[191:200,])
### logistic regression with classo penalty
sam = sam.gen.ncpen(n=200,p=20,q=5,k=3,cf.min=0.5,cf.max=1,corr=0.5,family="binomial")
x.mat = sam$x.mat; y.vec = sam$y.vec
fit = ncpen(y.vec=y.vec[1:190],x.mat=x.mat[1:190,],family="binomial",penalty="classo")
predict(fit,"y",new.x.mat=x.mat[191:200,])
predict(fit,"y",new.x.mat=x.mat[191:200,],prob.cut=0.3)
predict(fit,"reg",new.x.mat=x.mat[191:200,])
predict(fit,"prob",new.x.mat=x.mat[191:200,])
### multinomial regression with classo penalty
sam = sam.gen.ncpen(n=200,p=20,q=5,k=3,cf.min=0.5,cf.max=1,corr=0.5,family="multinomial")
x.mat = sam$x.mat; y.vec = sam$y.vec
fit = ncpen(y.vec=y.vec[1:190],x.mat=x.mat[1:190,],family="multinomial",penalty="classo")
predict(fit,"y",new.x.mat=x.mat[191:200,])
predict(fit,"reg",new.x.mat=x.mat[191:200,])
predict(fit,"prob",new.x.mat=x.mat[191:200,])
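The relationship between the "reg", "prob" and "y" prediction types in the logistic case can be sketched with base R alone. The linear predictor values below are made up for illustration, and the use of a strict `>` comparison against the threshold is an assumption; the package may treat ties at the cut-off differently.

```r
reg <- c(-2, -0.5, 0, 0.5, 2)        # hypothetical linear predictors ("reg")
prob <- plogis(reg)                  # logistic probabilities, 1 / (1 + exp(-reg))

y.half <- as.numeric(prob > 0.5)     # class labels with the default cut of 0.5
y.low <- as.numeric(prob > 0.3)      # a lower cut labels more cases as 1
```

Lowering the cut from 0.5 to 0.3 moves the observations with probabilities between the two thresholds into class 1.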
Generate a synthetic dataset based on the correlation structure from generalized linear models.
sam.gen.ncpen(n = 100, p = 50, q = 10, k = 3, r = 0.3, cf.min = 0.5, cf.max = 1, corr = 0.5, seed = NULL, family = c("gaussian", "binomial", "multinomial", "cox", "poisson"))
n |
(numeric) the number of samples. |
p |
(numeric) the number of variables. |
q |
(numeric) the number of nonzero coefficients. |
k |
(numeric) the number of classes for |
r |
(numeric) the ratio of censoring for |
cf.min |
(numeric) value of the minimum coefficient. |
cf.max |
(numeric) value of the maximum coefficient. |
corr |
(numeric) strength of correlations in the correlation structure. |
seed |
(numeric) seed number for random generation. Default does not use seed. |
family |
(character) model type. |
A design matrix for regression models is generated from the multivariate normal distribution with a correlation structure.
Then the response variables are computed with a specific model based on the true coefficients (see references).
Note that the censoring indicator is located in the last column of x.mat for cox.
An object with list class containing
x.mat |
design matrix. |
y.vec |
responses. |
b.vec |
true coefficients. |
Dongshin Kim, Sunghoon Kwon, Sangin Lee
Kwon, S., Lee, S. and Kim, Y. (2016). Moderately clipped LASSO. Computational Statistics and Data Analysis, 92C, 53-67.
Kwon, S. and Kim, Y. (2012). Large sample properties of the SCAD-penalized maximum likelihood estimation on high dimensions. Statistica Sinica, 629-653.
### linear regression
sam = sam.gen.ncpen(n=200,p=20,q=5,cf.min=0.5,cf.max=1,corr=0.5)
x.mat = sam$x.mat; y.vec = sam$y.vec
head(x.mat); head(y.vec)
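To show how a correlated design and a Gaussian response could be simulated, here is a base R sketch. It assumes an autoregressive-type correlation `corr^|i-j|` between columns i and j and unit noise variance; both are assumptions made for illustration, and the structure and coefficient layout used by `sam.gen.ncpen` may differ.

```r
set.seed(1)
n <- 200; p <- 5; corr <- 0.5

# assumed correlation structure: corr^|i - j| between columns i and j
sigma <- corr^abs(outer(1:p, 1:p, "-"))

# multivariate normal rows via the Cholesky factor of sigma
x.mat <- matrix(rnorm(n * p), n, p) %*% chol(sigma)

b.vec <- c(1, 0.75, 0.5, 0, 0)                   # sparse "true" coefficients
y.vec <- as.vector(x.mat %*% b.vec + rnorm(n))   # gaussian response
```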
This is an internal-use-only function.
same.base(base.cols, a, b)
base.cols |
vector of base column names. |
a |
first column to be compared. |
b |
second column to be compared. |
TRUE if same base, FALSE otherwise.
to.indicators converts a categorical variable into a data.frame with indicator (0 or 1) variables for each category.
to.indicators(vec, exclude.base = TRUE, base = NULL, prefix = NULL)
vec |
a categorical vector. |
exclude.base |
|
base |
a base category removed from the indicator matrix. This option works
only when the |
prefix |
a prefix to be used for column names of the output matrix.
Default is "cat_" if |
This returns a matrix object which contains the indicators.
a1 = 4:10;
b1 = c("aa", "bb", "cc");
to.indicators(a1, base = 10);
to.indicators(b1, base = "bb", prefix = "T_");
to.indicators(as.data.frame(b1), base = "bb");
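Indicator coding with a dropped base category can be sketched in base R as below. The variable names and the "T_" prefix are chosen for this example only, and the sketch does not claim to match the column ordering or naming that `to.indicators` produces.

```r
vec <- c("aa", "bb", "cc", "aa")
lev <- sort(unique(vec))

# one 0/1 column per category
ind <- sapply(lev, function(l) as.numeric(vec == l))
colnames(ind) <- paste0("T_", lev)

# drop the base category "bb" to avoid a redundant column
ind.nobase <- ind[, colnames(ind) != "T_bb", drop = FALSE]
```

Dropping one category matters when the model has an intercept, because the full set of indicators sums to one in every row.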
Converts a data.frame to an ncpen-usable matrix.
This automates the processes of to.indicators and interact.data.
First, it converts categorical variables to a series of indicators.
All other numerical and logical variables are preserved.
Then, if interact.all == TRUE, all the variables are interacted.
to.ncpen.x.mat(df, base = NULL, interact.all = FALSE, base.cols = NULL, exclude.pair = NULL)
df |
a |
base |
a base category removed from the indicator variables. This |
interact.all |
indicates whether to interact all the columns ( |
base.cols |
indicates columns derived from a same column. For example, if |
exclude.pair |
the pairs will be excluded from interactions. This should be a |
This returns a matrix object.
df = data.frame(num = c(1, 2, 3, 4, 5), ctr = c("K", "O", "R", "R", "K"),
                logi = c(TRUE, TRUE, FALSE, FALSE, TRUE), age = c(10, 20, 30, 40, 50),
                age_sq = c(10, 20, 30, 40, 50)^2, loc = c("b", "a", "c", "a", "b"),
                FTHB = c(1,0,1,0,1), PRM = c(0,1,0,1,0), PMI = c(1,1,0,0,0));
to.ncpen.x.mat(df, interact.all = TRUE, base.cols = c("age"), exclude.pair = list(c("FTHB", "PRM")));