Title: Tree-Guided Rare Feature Selection and Logic Aggregation
Description: Implementation of the tree-guided feature selection and logic aggregation approach introduced in Chen et al. (2024) <doi:10.1080/01621459.2024.2326621>. The method enables the selection and aggregation of large-scale rare binary features with a known hierarchical structure using a convex, linearly-constrained regularized regression framework. The package facilitates the application of this method to both linear regression and binary classification problems by solving the optimization problem via the smoothing proximal gradient descent algorithm (Chen et al. (2012) <doi:10.1214/11-AOAS514>).
Authors: Jianmin Chen [aut, cre], Kun Chen [aut]
Maintainer: Jianmin Chen <jianminc000@gmail.com>
License: GPL-3
Version: 0.1.2
Built: 2025-03-17 22:20:13 UTC
Source: CRAN
This package provides functions and visualization tools for fitting the Tree-Guided Rare Feature Selection and Logic Aggregation model.
Jianmin Chen <jianminc000@gmail.com>, Kun Chen
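To get started, install the package from CRAN and load it. A minimal sketch, assuming the CRAN package name TSLA (inferred from the TSLA.fit() and cv.TSLA() function prefixes):

install.packages("TSLA")  # assumed package name
library(TSLA)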
Function to output group norms on the gamma coefficients, based on the g_idx and C_2 matrices.
cal2norm(u, g_idx, type)
u: Input vector on which the group norms are computed.
g_idx: Group structure matrix defined by the get_tree_object() function.
type: Controls the output; one setting returns the sum of the group norms, the other returns the individual group norms.
Sum of group norms or individual group norms.
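A minimal sketch of calling cal2norm(), reusing the expanded data x2 and tree.expand from the examples below. The construction of u as C_2 %*% gamma.hat and the encoding type = 1 for the summed norm are assumptions, not confirmed by this documentation; gamma.hat stands for a hypothetical vector of fitted node coefficients:

tree.obj <- get_tree_object(x2, tree.expand, penalty = "CL2")
u <- as.vector(tree.obj$C_2 %*% gamma.hat)  # assumed input: C_2 applied to node coefficients
cal2norm(u, tree.obj$g_idx, type = 1)       # assumed: type = 1 returns the sum of group norms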
Synthetic data used to illustrate how to use TSLA with classification.
data(ClassificationExample)
List containing the following elements:
tree.org: Original tree structure with 42 leaf nodes and 5 different levels.
x.org: Original design matrix with 42 binary features and 400 observations.
x1: Unpenalized covariate.
y: Binary response of length 400.
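For instance, the data can be loaded and inspected as follows (element names taken from the examples below):

data(ClassificationExample)
str(ClassificationExample)        # list with tree.org, x.org, x1, y
dim(ClassificationExample$x.org)  # 400 observations by 42 binary features
table(ClassificationExample$y)    # binary response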
Get coefficients from a TSLA.fit object.
coef_TSLA(object, ...)
object: A fit output from TSLA.fit().
...: Other parameters.
A list of coefficients for each combination of lambda and alpha. The first dimension is indexed by the coefficient, the second dimension is indexed by lambda, and the third dimension is indexed by alpha.
Intercept: Intercept.
cov.coef: Coefficients for unpenalized features.
gamma.coef: Node coefficients; also see the details in getetmat().
beta.coef: Regression coefficients. Each coefficient is multiplied by the sign applied to the corresponding expanded feature; see the details in getetmat().
beta.coef.adj: Regression coefficients with the effects of the signs removed.
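As a sketch, the coefficients of a fit (such as fit1 from the TSLA.fit() example below) can be extracted and indexed over the tuning grid:

cf <- coef_TSLA(fit1)
names(cf)                 # Intercept, cov.coef, gamma.coef, beta.coef, beta.coef.adj
cf$beta.coef.adj[, 1, 1]  # sign-adjusted coefficients at the first (lambda, alpha) pair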
Conduct cross validation to select the optimal tuning parameters in TSLA.
cv.TSLA(
  y,
  X_1 = NULL,
  X_2,
  treemat,
  family = c("ls", "logit"),
  penalty = c("CL2", "RFS-Sum"),
  pred.loss = c("MSE", "AUC", "deviance"),
  gamma.init = NULL,
  weight = NULL,
  nfolds = 5,
  group.weight = NULL,
  feature.weight = NULL,
  control = list(),
  modstr = list()
)
y: Response in matrix form; continuous for family = "ls" and binary (0/1) for family = "logit".
X_1: Design matrix for unpenalized features (excluding the intercept). Needs to be in matrix form.
X_2: Expanded design matrix for the penalized features in matrix form; the output x.expand from getetmat().
treemat: Expanded tree structure in matrix form; the output tree.expand from getetmat().
family: Two options. Use "ls" for least squares problems and "logit" for logistic regression problems.
penalty: Two options for the group penalty on the node coefficients gamma: "CL2" or "RFS-Sum".
pred.loss: Model performance metric for cross validation. Use "MSE" when family = "ls"; use "AUC" or "deviance" when family = "logit".
gamma.init: Initial value for the optimization. Default is a zero vector. The length should equal 1 plus the number of node coefficients.
weight: A vector of length two, used for logistic regression only. The first element is the weight for y = 1 and the second element is the weight for y = 0.
nfolds: Number of cross validation folds. Default is 5.
group.weight: User-defined weights for the group penalty. Needs to be a vector whose length equals the number of groups.
feature.weight: User-defined weights for each predictor after expansion.
control: A list of parameters controlling algorithm convergence, including maxit (maximum number of iterations), mu (smoothing parameter), tol (convergence tolerance), and verbose (whether to print progress).
modstr: A list of parameters controlling the tuning parameters, including nlambda (number of lambda values), lambda (a user-specified lambda sequence), and alpha (the alpha sequence).
A list of cross validation results.
lambda.min: The lambda value with the best prediction performance.
alpha.min: The alpha value with the best prediction performance.
cvm: A (number-of-lambda * number-of-alpha) matrix saving the means of the cross validation loss across folds.
cvsd: A (number-of-lambda * number-of-alpha) matrix saving the standard deviations of the cross validation loss across folds.
TSLA.fit: Outputs from TSLA.fit().
Intercept.min: Intercept corresponding to (lambda.min, alpha.min).
cov.min: Coefficients of unpenalized features corresponding to (lambda.min, alpha.min).
beta.min: Coefficients of binary features corresponding to (lambda.min, alpha.min).
gamma.min: Node coefficients corresponding to (lambda.min, alpha.min).
groupnorm.min: Group norms of the node coefficients corresponding to (lambda.min, alpha.min).
lambda.min.index: Index of the best lambda in the lambda sequence.
alpha.min.index: Index of the best alpha in the alpha sequence.
# Load the synthetic data
data(ClassificationExample)
tree.org <- ClassificationExample$tree.org  # original tree structure
x2.org <- ClassificationExample$x.org       # original design matrix
x1 <- ClassificationExample$x1
y <- ClassificationExample$y                # response

# Do the tree-guided expansion
expand.data <- getetmat(tree.org, x2.org)
x2 <- expand.data$x.expand              # expanded design matrix
tree.expand <- expand.data$tree.expand  # expanded tree structure

# Do train-test split
idtrain <- 1:200
x1.train <- as.matrix(x1[idtrain, ])
x2.train <- x2[idtrain, ]
y.train <- y[idtrain, ]
x1.test <- as.matrix(x1[-idtrain, ])
x2.test <- x2[-idtrain, ]
y.test <- y[-idtrain, ]

# Specify some model parameters
set.seed(100)
control <- list(maxit = 100, mu = 1e-3, tol = 1e-5, verbose = FALSE)
modstr <- list(nlambda = 5, alpha = seq(0, 1, length.out = 5))
simu.cv <- cv.TSLA(y = y.train, X_1 = x1.train, X_2 = x2.train,
                   treemat = tree.expand, family = 'logit',
                   penalty = 'CL2', pred.loss = 'AUC',
                   gamma.init = NULL, weight = c(1, 1), nfolds = 5,
                   group.weight = NULL, feature.weight = NULL,
                   control = control, modstr = modstr)

# Do prediction with the selected tuning parameters on the test set
# and report the AUC.
rmid <- simu.cv$TSLA.fit$rmid  # all-zero columns removed in fitting
if (length(rmid) > 0) {
  x2.test <- x2.test[, -rmid]
}
y.new <- predict_cvTSLA(simu.cv, x1.test, x2.test)
library(pROC)
auc(as.vector(y.test), as.vector(y.new))
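The cross-validation surface and the selected tuning pair can then be inspected directly from the returned object:

# Mean CV loss at the selected pair, and the selected values themselves
simu.cv$cvm[simu.cv$lambda.min.index, simu.cv$alpha.min.index]
c(simu.cv$lambda.min, simu.cv$alpha.min)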
This function generates all the intermediate quantities based on the tree-guided reparameterization.
get_tree_object( X_2, treemat, penalty = c("CL2", "DL2", "RFS-Sum"), group.weight = NULL, feature.weight = NULL )
X_2: Expanded design matrix in matrix form; the output x.expand from getetmat().
treemat: Expanded tree structure in matrix form; the output tree.expand from getetmat().
penalty: Group penalty on the node coefficients gamma: "CL2", "DL2", or "RFS-Sum".
group.weight: User-defined weights for the group penalty. Needs to be a vector whose length equals the number of groups.
feature.weight: User-defined weights for each predictor after expansion.
A list consisting of the quantities needed for the SPG optimization.
C_1: C_1 matrix for the generalized lasso penalty.
CNorm_1: Nuclear norm of matrix C_1.
C_2: C_2 matrix for the group lasso penalty.
CNorm_2: Nuclear norm of matrix C_2.
A: A (number-of-leaf * number-of-node) binary matrix containing the linear constraints. Recall that the leaf coefficients satisfy beta = A %*% gamma.
g_idx: A (number-of-group * 3) matrix describing the group structure. Each row corresponds to one group; its entries record the starting row of the group in C_2, the ending row, and the group size.
M2: A (number-of-leaf * number-of-level) node index matrix, with indices running from 1 to the number of nodes; the root node has index equal to the number of nodes. Each row corresponds to a variable at the finest level and each column corresponds to an ordered classification level; the entry values in each column are the unique indices of the variables at that level. Moving to the right, the number of unique values becomes fewer.
Tree: A (number-of-group * number-of-node) group index matrix. Each row is a group, and the column order is the same as the node index order in M2. The (i, j) element of the matrix is 1 if the jth node belongs to the ith group and 0 otherwise.
A.adj: A (number-of-leaf * number-of-node) binary matrix containing the linear constraints, used with x.expand.adj; the effects of the signs are removed (see getetmat()).
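A short sketch of inspecting these quantities, reusing the expanded data x2 and tree.expand from the cv.TSLA() example:

tree.obj <- get_tree_object(x2, tree.expand, penalty = "CL2")
dim(tree.obj$A)       # number-of-leaf by number-of-node constraint matrix
head(tree.obj$g_idx)  # group boundaries within C_2
dim(tree.obj$Tree)    # number-of-group by number-of-node group index matrix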
Function that generates aggregated features based on the TSLA output.
getaggr(TSLA.object, X_2, X_2.org, lambda.index, alpha.index)
TSLA.object: A fit output from TSLA.fit().
X_2: Expanded design matrix in matrix form.
X_2.org: Original design matrix in matrix form.
lambda.index: Index of the lambda value in the lambda sequence.
alpha.index: Index of the alpha value in the alpha sequence.
A data.frame of the aggregated features.
dataset: Aggregated features.
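For example, the aggregated features at the tuning pair selected by cross validation (simu.cv from the cv.TSLA() example) can be recovered as:

agg <- getaggr(simu.cv$TSLA.fit, x2, x2.org,
               simu.cv$lambda.min.index, simu.cv$alpha.min.index)
head(agg)  # data.frame of aggregated features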
Generate the expanded design matrix and the expanded tree structure by adding interactions in conformity with the tree structure.
getetmat(tmatrix, dmatrix)
tmatrix: Tree structure of the original features in matrix form.
dmatrix: Original design matrix in matrix form.
This function is used by the TSLA method only when the penalty is selected as "CL2". The all-zero columns produced by the interactions are excluded from the output.

For the TSLA method, the signs of the coefficients in the linear constraints depend on the order of the term. To better extend the method in implementation, we apply the signs to the feature vectors instead of the regression coefficients; for instance, a two-way interaction enters through the feature vector -x1*x2 instead of x1*x2. The expanded design matrix x.expand from this function is adjusted by the signs, and the A matrix and all the coefficients estimated by the package should be interpreted correspondingly. We also provide x.expand.adj, A.adj, and beta.coef.adj as the quantities with the effects of the signs removed.
The input tree structure of the original features needs to be constructed as follows: each row corresponds to a variable at the finest level; each column corresponds to an ordered classification level, with the leaf level at the left-most and the root level at the right-most; the entry values in each column are the index of the ancestor node of the variable at that level. Moving from left to right, the number of unique values in the column becomes fewer.
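As a toy illustration of this format (index values chosen for illustration only), the following matrix encodes four leaf variables over three levels; note that the number of unique indices per column decreases from left to right:

# Leaves 1-4 at the finest level; leaves 1-2 share ancestor node 5 and
# leaves 3-4 share ancestor node 6; node 7 is the root.
tmatrix <- matrix(c(1, 2, 3, 4,
                    5, 5, 6, 6,
                    7, 7, 7, 7),
                  nrow = 4, ncol = 3)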
A list.
x.expand: The design matrix after expansion. Each column is multiplied by the sign determined by the order of its term; see Details.
tree.expand: The tree structure after expansion.
x.expand.adj: The design matrix after expansion with the effects of the signs removed.
Evaluate the prediction performance under the classification settings.
getperform( ytest, ypretest, family, threshold.method = c("youden", "specificity.control", "quantile"), specificity = NULL )
ytest: Response vector of the test data.
ypretest: Predicted probabilities for the test data.
family: "ls" or "logit". MSE is returned when "ls" is used.
threshold.method: Method used to select the threshold.
specificity: User-defined specificity or quantile.
The function supports three methods for selecting the threshold of the predicted probability.

threshold.method = "youden": the optimal threshold corresponds to the point that maximizes the distance to the identity (diagonal) line of the ROC curve.

threshold.method = "specificity.control": the optimal threshold is the smallest value that ensures the required specificity.

threshold.method = "quantile": the optimal threshold is the required quantile of the predicted probabilities.
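A sketch of evaluating test-set classification performance with the Youden threshold, reusing y.test and the predicted probabilities y.new from the cv.TSLA() example:

perf <- getperform(y.test, y.new, family = "logit",
                   threshold.method = "youden")
perf$AUC          # area under the ROC curve
perf$sensitivity  # sensitivity at the selected threshold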
List of measures.
AUC: Area under the ROC curve.
AUPRC: Area under the precision-recall curve.
threshold: Selected probability threshold.
sensitivity: Sensitivity at the selected threshold.
ppv: Positive predictive value at the selected threshold.
specificity: Specificity at the selected threshold.
true.positive: Number of true positives at the selected threshold.
false.positive: Number of false positives at the selected threshold.
Return a tree plot.
plot_TSLA(TSLA.object, X_2, X_2.org, lambda.index, alpha.index)
TSLA.object: A fit output from TSLA.fit().
X_2: Expanded design matrix in matrix form.
X_2.org: Original design matrix in matrix form.
lambda.index: Index of the lambda value in the lambda sequence.
alpha.index: Index of the alpha value in the alpha sequence.
A tree plot.
# Load the synthetic data
data(ClassificationExample)
tree.org <- ClassificationExample$tree.org  # original tree structure
x2.org <- ClassificationExample$x.org       # original design matrix
x1 <- ClassificationExample$x1
y <- ClassificationExample$y                # response

# Do the tree-guided expansion
expand.data <- getetmat(tree.org, x2.org)
x2 <- expand.data$x.expand              # expanded design matrix
tree.expand <- expand.data$tree.expand  # expanded tree structure

# Do train-test split
idtrain <- 1:200
x1.train <- as.matrix(x1[idtrain, ])
x2.train <- x2[idtrain, ]
y.train <- y[idtrain, ]
x1.test <- as.matrix(x1[-idtrain, ])
x2.test <- x2[-idtrain, ]
y.test <- y[-idtrain, ]

# Specify some model parameters
set.seed(100)
control <- list(maxit = 100, mu = 1e-3, tol = 1e-5, verbose = FALSE)
modstr <- list(nlambda = 5, alpha = seq(0, 1, length.out = 5))
simu.cv <- cv.TSLA(y = y.train, X_1 = x1.train, X_2 = x2.train,
                   treemat = tree.expand, family = 'logit',
                   penalty = 'CL2', pred.loss = 'AUC',
                   gamma.init = NULL, weight = c(1, 1), nfolds = 5,
                   group.weight = NULL, feature.weight = NULL,
                   control = control, modstr = modstr)

# Plot the aggregation structure at the selected tuning parameters
plot_TSLA(simu.cv$TSLA.fit, x2, x2.org,
          simu.cv$lambda.min.index, simu.cv$alpha.min.index)
A convenient function to get prediction from the selected tuning parameters by cross validation.
predict_cvTSLA( object, X_1_new = NULL, X_2_new, type = c("response", "link"), ... )
object: A fit output from cv.TSLA().
X_1_new: New unpenalized features in matrix form.
X_2_new: New binary features in matrix form.
type: Two options: "response" or "link". The two options differ only for family = "logit": "response" returns the predicted probabilities, while "link" returns the linear predictor.
...: Other parameters.
Predictions.
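For a logistic fit, the two type options can be compared directly; a sketch reusing objects from the cv.TSLA() example (x2.test with the all-zero columns rmid already removed):

prob <- predict_cvTSLA(simu.cv, x1.test, x2.test, type = "response")  # predicted probabilities
eta  <- predict_cvTSLA(simu.cv, x1.test, x2.test, type = "link")      # linear predictor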
Generate predictions for the response.
predict_TSLA( object, X_1_new = NULL, X_2_new, type = c("response", "link"), ... )
object: A fit output from TSLA.fit().
X_1_new: New unpenalized features in matrix form.
X_2_new: New binary features in matrix form.
type: Two options: "response" or "link". The two options differ only for family = "logit": "response" returns the predicted probabilities, while "link" returns the linear predictor.
...: Other parameters.
Predictions. The first dimension is indexed by the observation, the second dimension is indexed by lambda, and the third dimension is indexed by alpha.
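A sketch of predicting over the full tuning grid from a TSLA.fit object (fit1 from the example below):

pred <- predict_TSLA(fit1, X_2_new = x2, type = "response")
dim(pred)  # observation by lambda by alpha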
Synthetic data used to illustrate how to use TSLA with regression.
data(RegressionExample)
List containing the following elements:
tree.org: Original tree structure with 42 leaf nodes and 5 different levels.
x.org: Original design matrix with 42 binary features and 400 observations.
y: Continuous response of length 400.
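For instance, the data can be loaded and inspected as follows (element names taken from the example below):

data(RegressionExample)
str(RegressionExample)        # list with tree.org, x.org, y
dim(RegressionExample$x.org)  # 400 observations by 42 binary features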
Find the solutions with a Smoothing Proximal Gradient (SPG) algorithm for a sequence of lambda and alpha values.
TSLA.fit(
  y,
  X_1 = NULL,
  X_2,
  treemat,
  family = c("ls", "logit"),
  penalty = c("CL2", "RFS-Sum"),
  gamma.init = NULL,
  weight = NULL,
  group.weight = NULL,
  feature.weight = NULL,
  control = list(),
  modstr = list()
)
y: Response in matrix form; continuous for family = "ls" and binary (0/1) for family = "logit".
X_1: Design matrix for unpenalized features (excluding the intercept). Needs to be in matrix form.
X_2: Expanded design matrix for the penalized features in matrix form; the output x.expand from getetmat().
treemat: Expanded tree structure in matrix form; the output tree.expand from getetmat().
family: Two options. Use "ls" for least squares problems and "logit" for logistic regression problems.
penalty: Two options for the group penalty on the node coefficients gamma: "CL2" or "RFS-Sum".
gamma.init: Initial value for the optimization. Default is a zero vector. The length should equal 1 plus the number of node coefficients.
weight: A vector of length two, used for logistic regression only. The first element is the weight for y = 1 and the second element is the weight for y = 0.
group.weight: User-defined weights for the group penalty. Needs to be a vector whose length equals the number of groups.
feature.weight: User-defined weights for each predictor after expansion.
control: A list of parameters controlling algorithm convergence, including maxit (maximum number of iterations), mu (smoothing parameter), tol (convergence tolerance), and verbose (whether to print progress).
modstr: A list of parameters controlling the tuning parameters, including nlambda (number of lambda values), lambda (a user-specified lambda sequence), and alpha (the alpha sequence).
We adopt the warm start technique to speed up the calculation. The warm start is applied with a fixed value of alpha and a descending sequence of lambda.

The objective function for "ls" combines the least squares loss with a lambda-scaled regularizer consisting of a group lasso penalty on the node coefficients gamma and a generalized lasso penalty, subject to the linear constraints induced by the tree-guided reparameterization. The objective function for "logit" replaces the least squares loss with the (weighted) logistic loss, subject to the same constraints. Note that, in this package, the input parameter "alpha" is the tuning parameter for the generalized lasso penalty.

Details for the "penalty" option:

For penalty = "CL2", see the details of the "Child-l2" penalty in the main paper (Chen et al., 2024).

For penalty = "RFS-Sum", the theoretical optimal weights are used; please check the details in the paper "Rare feature selection in high dimensions" (Yan and Bien, 2021).
A list of model fitting results.
gammacoef: Estimates of the node coefficients gamma.
groupnorm: Weighted norms for each group.
lambda.seq: Sequence of lambda values.
alpha.seq: Tuning parameter sequence for the generalized lasso penalty.
rmid: Column indices of the all-zero features removed before fitting.
family: The family option used in fitting: "ls" or "logit".
cov.name: Names of the unpenalized features.
bin.name: Names of the binary features.
tree.object: Outputs from get_tree_object().
Chen, J., Aseltine, R. H., Wang, F., & Chen, K. (2024). Tree-guided rare feature selection and logic aggregation with electronic health records data. Journal of the American Statistical Association, 119(547), 1765-1777. doi:10.1080/01621459.2024.2326621.

Chen, X., Lin, Q., Kim, S., Carbonell, J. G., & Xing, E. P. (2012). Smoothing proximal gradient method for general structured sparse regression. The Annals of Applied Statistics, 6(2), 719-752. doi:10.1214/11-AOAS514.

Yan, X., & Bien, J. (2021). Rare feature selection in high dimensions. Journal of the American Statistical Association, 116(534), 887-900. doi:10.1080/01621459.2020.1796677.
# Load the synthetic data
data(RegressionExample)
tree.org <- RegressionExample$tree.org  # original tree structure
x2.org <- RegressionExample$x.org       # original design matrix
y <- RegressionExample$y                # response

# Do the tree-guided expansion
expand.data <- getetmat(tree.org, x2.org)
x2 <- expand.data$x.expand              # expanded design matrix
tree.expand <- expand.data$tree.expand  # expanded tree structure

# Specify some model parameters
set.seed(100)
control <- list(maxit = 100, mu = 1e-3, tol = 1e-5, verbose = FALSE)

# Fit the model at a single pair of lambda and alpha
modstr <- list(lambda = 1, alpha = 0.1)
x1 <- NULL
fit1 <- TSLA.fit(y, x1, x2, tree.expand, family = 'ls',
                 penalty = 'CL2', gamma.init = NULL, weight = NULL,
                 group.weight = NULL, feature.weight = NULL,
                 control, modstr)

# Get group norms from fit1
fit1$groupnorm