Title: Variable Selection for Multiply Imputed Data
Description: Penalized regression methods, such as lasso and elastic net, are used in many biomedical applications when simultaneous regression coefficient estimation and variable selection are desired. However, missing data complicate the implementation of these methods, particularly when missingness is handled using multiple imputation. Applying a variable selection algorithm to each imputed dataset separately will likely lead to different sets of selected predictors, making it difficult to ascertain a final active set without resorting to ad hoc combination rules. 'miselect' presents Stacked Adaptive Elastic Net (saenet) and Grouped Adaptive LASSO (galasso) for continuous and binary outcomes, developed by Du et al (2022) <doi:10.1080/10618600.2022.2035739>. By construction, these methods force selection of the same variables across multiply imputed datasets. 'miselect' also provides cross-validated variants of these methods.
Authors: Michael Kleinsasser [cre], Alexander Rix [aut], Jiacong Du [aut]
Maintainer: Michael Kleinsasser <[email protected]>
License: GPL-3
Version: 0.9.2
Built: 2024-10-31 22:27:46 UTC
Source: CRAN
Extract Coefficients From a "cv.galasso" Object
## S3 method for class 'cv.galasso'
coef(object, lambda = object$lambda.min, ...)
object: A "cv.galasso" fit.
lambda: Chosen value of lambda. Must be between min(lambda) and max(lambda). Default is lambda.min.
...: Additional unused arguments.
A list of numeric vectors containing the coefficients from running galasso on lambda for each imputation.
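A minimal sketch of pulling coefficients out of a cross-validated fit (assuming `fit` is an existing "cv.galasso" object, such as the one built in the examples further below):

```r
cfs <- coef(fit)                               # per-imputation coefficients at lambda.min
cfs.1se <- coef(fit, lambda = fit$lambda.1se)  # sparser model, one-standard-error rule
```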
Extract Coefficients From a "cv.saenet" Object
## S3 method for class 'cv.saenet'
coef(object, lambda = object$lambda.min, alpha = object$alpha.min, ...)
object: A "cv.saenet" fit.
lambda: Chosen value of lambda. Must be between min(lambda) and max(lambda). Default is lambda.min.
alpha: Chosen value of alpha. Must be between min(alpha) and max(alpha). Default is alpha.min.
...: Additional unused arguments.
A numeric vector containing the coefficients from running saenet on lambda and alpha.
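Because the imputations are stacked, a single vector comes back rather than a list; a sketch assuming `fit` is an existing "cv.saenet" object:

```r
cfs <- coef(fit)  # coefficients at (lambda.min, alpha.min)
cfs.1se <- coef(fit, lambda = fit$lambda.1se, alpha = fit$alpha.1se)
```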
Extract Coefficients From a "galasso" Object
## S3 method for class 'galasso'
coef(object, lambda, ...)
object: A "galasso" fit.
lambda: Chosen value of lambda. Must be between min(lambda) and max(lambda).
...: Additional unused arguments.
A list of length D containing the coefficient estimates from running galasso on lambda. coef.galasso averages the estimates across imputations to return a single vector instead of a matrix.
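Unlike the cross-validated variant, a plain "galasso" fit supplies no default, so lambda must be picked explicitly; a sketch assuming `fit` is a "galasso" object whose fitted lambda sequence is stored in `fit$lambda`:

```r
# evaluate the averaged coefficients at, say, the 10th lambda in the sequence
cfs <- coef(fit, lambda = fit$lambda[10])
```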
Extract Coefficients From a "saenet" Object
## S3 method for class 'saenet'
coef(object, lambda, alpha, ...)
object: A "saenet" fit.
lambda: Chosen value of lambda. Must be between min(lambda) and max(lambda).
alpha: Chosen value of alpha. Must be between min(alpha) and max(alpha).
...: Additional unused arguments.
A numeric vector containing the coefficients from running saenet on lambda and alpha.
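As with coef.galasso, lambda (and here also alpha) must be supplied explicitly; a sketch assuming `fit` is a "saenet" object with its lambda sequence in `fit$lambda`:

```r
cfs <- coef(fit, lambda = fit$lambda[10], alpha = 1)
```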
Does k-fold cross-validation for galasso, and returns an optimal value for lambda.
cv.galasso(
    x,
    y,
    pf,
    adWeight,
    family = c("gaussian", "binomial"),
    nlambda = 100,
    lambda.min.ratio = ifelse(isTRUE(all.equal(adWeight, rep(1, p))), 0.001, 1e-06),
    lambda = NULL,
    nfolds = 5,
    foldid = NULL,
    maxit = 1000,
    eps = 1e-05
)
x: A list of numeric design matrices, one n x p matrix per imputation. No matrix should contain an NA.
y: A list of numeric response vectors, one length-n vector per imputation. No vector should contain an NA.
pf: Penalty factor. Can be used to differentially penalize certain variables.
adWeight: Numeric vector of length p representing the adaptive weights for the L1 penalty.
family: The type of response. "gaussian" implies a continuous response and "binomial" implies a binary response. Default is "gaussian".
nlambda: Length of automatically generated lambda sequence. If lambda is non NULL, nlambda is ignored. Default is 100.
lambda.min.ratio: Ratio that determines the minimum value of lambda when automatically generating a lambda sequence. If lambda is not NULL, lambda.min.ratio is ignored. Default is 1e-3 when adWeight is 1 for every predictor and 1e-6 otherwise.
lambda: Optional numeric vector of lambdas to fit. If NULL, a lambda sequence is generated automatically. Default is NULL.
nfolds: Number of folds to use for cross validation. Default is 5, minimum is 3.
foldid: An optional length-n vector assigning each observation to a fold. If supplied, it overrides nfolds.
maxit: Maximum number of iterations to run. Default is 1000.
eps: Tolerance for convergence. Default is 1e-5.
cv.galasso works by adding a group penalty to the aggregated objective function to ensure selection consistency across imputations. Simulations suggest that the "stacked" objective function approach (i.e., saenet) tends to be more computationally efficient and to have better estimation and selection properties.
An object of type "cv.galasso" with the following elements:
call: The call that generated the output.
lambda: The sequence of lambdas fit.
cvm: Average cross validation error for each lambda. For family = "gaussian", cvm corresponds to mean squared error; for binomial, cvm corresponds to deviance.
cvse: Standard error of cvm.
galasso.fit: A "galasso" object fit to the full data.
lambda.min: The lambda value for the model with the minimum cross validation error.
lambda.1se: The lambda value for the sparsest model within one standard error of the minimum cross validation error.
df: The number of nonzero coefficients for each value of lambda.
Du, J., Boss, J., Han, P., Beesley, L. J., Kleinsasser, M., Goutman, S. A., ... & Mukherjee, B. (2022). Variable selection with multiply-imputed datasets: choosing between stacked and grouped methods. Journal of Computational and Graphical Statistics, 31(4), 1063-1075. <doi:10.1080/10618600.2022.2035739>
library(miselect)
library(mice)

set.seed(48109)

# Using the mice defaults for sake of example only.
mids <- mice(miselect.df, m = 5, printFlag = FALSE)
dfs <- lapply(1:5, function(i) complete(mids, action = i))

# Generate list of imputed design matrices and imputed responses
x <- list()
y <- list()
for (i in 1:5) {
    x[[i]] <- as.matrix(dfs[[i]][, paste0("X", 1:20)])
    y[[i]] <- dfs[[i]]$Y
}

pf <- rep(1, 20)
adWeight <- rep(1, 20)

fit <- cv.galasso(x, y, pf, adWeight)

# By default 'coef' returns the betas for lambda.min.
coef(fit)
Does k-fold cross-validation for saenet, and returns optimal values for lambda and alpha.
cv.saenet(
    x,
    y,
    pf,
    adWeight,
    weights,
    family = c("gaussian", "binomial"),
    alpha = 1,
    nlambda = 100,
    lambda.min.ratio = ifelse(isTRUE(all.equal(adWeight, rep(1, p))), 0.001, 1e-06),
    lambda = NULL,
    nfolds = 5,
    foldid = NULL,
    maxit = 1000,
    eps = 1e-05
)
x: A list of numeric design matrices, one n x p matrix per imputation. No matrix should contain an NA.
y: A list of numeric response vectors, one length-n vector per imputation. No vector should contain an NA.
pf: Penalty factor of length p. Can be used to differentially penalize certain variables.
adWeight: Numeric vector of length p representing the adaptive weights for the L1 penalty.
weights: Numeric vector of length n containing the proportion observed (non-missing) for each row in the un-imputed data.
family: The type of response. "gaussian" implies a continuous response and "binomial" implies a binary response. Default is "gaussian".
alpha: Elastic net parameter. Can be a vector to cross validate over. Default is 1.
nlambda: Length of automatically generated lambda sequence. If lambda is non NULL, nlambda is ignored. Default is 100.
lambda.min.ratio: Ratio that determines the minimum value of lambda when automatically generating a lambda sequence. If lambda is not NULL, lambda.min.ratio is ignored. Default is 1e-3 when adWeight is 1 for every predictor and 1e-6 otherwise.
lambda: Optional numeric vector of lambdas to fit. If NULL, a lambda sequence is generated automatically. Default is NULL.
nfolds: Number of folds to use for cross validation. Default is 5, minimum is 3.
foldid: An optional length-n vector assigning each observation to a fold. If supplied, it overrides nfolds.
maxit: Maximum number of iterations to run. Default is 1000.
eps: Tolerance for convergence. Default is 1e-5.
cv.saenet works by stacking the multiply imputed data into a single matrix and running a weighted adaptive elastic net on it. Simulations suggest that the "stacked" objective function approach tends to be more computationally efficient and to have better estimation and selection properties.

Due to stacking, the lambda sequence cv.saenet generates automatically may end up underestimating lambda.max, and thus the degrees of freedom may be nonzero at the first lambda value.
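A quick way to check for this is to inspect the nonzero-coefficient count at the first (largest) lambda; a sketch assuming `fit` is a fitted "cv.saenet" object and that the nonzero counts are stored in an element named `df`:

```r
# a value greater than 0 at the largest lambda suggests lambda.max was underestimated
fit$df[1]
```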
An object of type "cv.saenet" with the following elements:
call: The call that generated the output.
lambda: Sequence of lambdas fit.
cvm: Average cross validation error for each lambda and alpha. For family = "gaussian", cvm corresponds to mean squared error; for binomial, cvm corresponds to deviance.
cvse: Standard error of cvm.
saenet.fit: A "saenet" object fit to the full data.
lambda.min: The lambda value for the model with the minimum cross validation error.
lambda.1se: The lambda value for the sparsest model within one standard error of the minimum cross validation error.
alpha.min: The alpha value for the model with the minimum cross validation error.
alpha.1se: The alpha value for the sparsest model within one standard error of the minimum cross validation error.
df: The number of nonzero coefficients for each value of lambda and alpha.
Du, J., Boss, J., Han, P., Beesley, L. J., Kleinsasser, M., Goutman, S. A., ... & Mukherjee, B. (2022). Variable selection with multiply-imputed datasets: choosing between stacked and grouped methods. Journal of Computational and Graphical Statistics, 31(4), 1063-1075. <doi:10.1080/10618600.2022.2035739>
library(miselect)
library(mice)

set.seed(48109)

# Using the mice defaults for sake of example only.
mids <- mice(miselect.df, m = 5, printFlag = FALSE)
dfs <- lapply(1:5, function(i) complete(mids, action = i))

# Generate list of imputed design matrices and imputed responses
x <- list()
y <- list()
for (i in 1:5) {
    x[[i]] <- as.matrix(dfs[[i]][, paste0("X", 1:20)])
    y[[i]] <- dfs[[i]]$Y
}

# Calculate observational weights
weights <- 1 - rowMeans(is.na(miselect.df))

pf <- rep(1, 20)
adWeight <- rep(1, 20)

# Since 'Y' is a binary variable, we use 'family = "binomial"'
fit <- cv.saenet(x, y, pf, adWeight, weights, family = "binomial")

# By default 'coef' returns the betas for (lambda.min, alpha.min)
coef(fit)

# You can also cross validate over alpha
fit <- cv.saenet(x, y, pf, adWeight, weights, family = "binomial",
                 alpha = c(.5, 1))

# Get selected variables from the 1 standard error rule
coef(fit, lambda = fit$lambda.1se, alpha = fit$alpha.1se)
galasso fits an adaptive LASSO for multiply imputed data. "galasso" supports both continuous and binary responses.
galasso(
    x,
    y,
    pf,
    adWeight,
    family = c("gaussian", "binomial"),
    nlambda = 100,
    lambda.min.ratio = ifelse(isTRUE(all.equal(adWeight, rep(1, p))), 0.001, 1e-06),
    lambda = NULL,
    maxit = 10000,
    eps = 1e-05
)
x: A list of numeric design matrices, one n x p matrix per imputation. No matrix should contain an NA.
y: A list of numeric response vectors, one length-n vector per imputation. No vector should contain an NA.
pf: Penalty factor. Can be used to differentially penalize certain variables.
adWeight: Numeric vector of length p representing the adaptive weights for the L1 penalty.
family: The type of response. "gaussian" implies a continuous response and "binomial" implies a binary response. Default is "gaussian".
nlambda: Length of automatically generated lambda sequence. If lambda is non NULL, nlambda is ignored. Default is 100.
lambda.min.ratio: Ratio that determines the minimum value of lambda when automatically generating a lambda sequence. If lambda is not NULL, lambda.min.ratio is ignored. Default is 1e-3 when adWeight is 1 for every predictor and 1e-6 otherwise.
lambda: Optional numeric vector of lambdas to fit. If NULL, a lambda sequence is generated automatically. Default is NULL.
maxit: Maximum number of iterations to run. Default is 10000.
eps: Tolerance for convergence. Default is 1e-5.
galasso works by adding a group penalty to the aggregated objective function to ensure selection consistency across imputations. The objective function is

    minimize over beta:  - sum_{d=1}^{D} L(beta_d) + lambda * sum_{j=1}^{p} pf_j * a_j * sqrt( sum_{d=1}^{D} beta_{dj}^2 )

where L is the log likelihood, a is the vector of adaptive weights, and pf is the penalty factor. Simulations suggest that the "stacked" objective function approach (i.e., saenet) tends to be more computationally efficient and to have better estimation and selection properties. However, the advantage of galasso is that it allows one to look at the differences between coefficient estimates across imputations.
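That per-imputation view can be sketched as follows, assuming `fit` is a "galasso" object whose per-imputation estimates are stored as a list of matrices in `fit$beta` (one nlambda x (p + 1) matrix per imputation):

```r
# spread of the estimate of the first predictor across imputations,
# at the 10th lambda in the sequence (column 1 is the intercept)
est <- sapply(fit$beta, function(b) b[10, 2])
range(est)
```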
An object of type "galasso" with subtype "galasso.gaussian" or "galasso.binomial", depending on which family was used, with the following elements:
lambda: Sequence of lambda fit.
beta: A list of length D containing the coefficient estimates from running galasso at each value of lambda. Each element in the list is a nlambda x (p + 1) matrix.
df: Number of nonzero betas at each value of lambda.
Du, J., Boss, J., Han, P., Beesley, L. J., Kleinsasser, M., Goutman, S. A., ... & Mukherjee, B. (2022). Variable selection with multiply-imputed datasets: choosing between stacked and grouped methods. Journal of Computational and Graphical Statistics, 31(4), 1063-1075. <doi:10.1080/10618600.2022.2035739>
library(miselect)
library(mice)

mids <- mice(miselect.df, m = 5, printFlag = FALSE)
dfs <- lapply(1:5, function(i) complete(mids, action = i))

# Generate list of imputed design matrices and imputed responses
x <- list()
y <- list()
for (i in 1:5) {
    x[[i]] <- as.matrix(dfs[[i]][, paste0("X", 1:20)])
    y[[i]] <- dfs[[i]]$Y
}

pf <- rep(1, 20)
adWeight <- rep(1, 20)

fit <- galasso(x, y, pf, adWeight)
This synthetic dataset is taken from the first simulation case of the miselect paper.
miselect.df
A data.frame with 500 observations on 21 variables:
Y: Binary response.
X1-X20: Covariates with missing data.
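A quick inspection of the bundled data (assuming the miselect package is installed):

```r
library(miselect)
dim(miselect.df)              # 500 rows, 21 columns
colMeans(is.na(miselect.df))  # proportion of missingness per variable
```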
print.cv.galasso prints the fit and returns it invisibly.
## S3 method for class 'cv.galasso'
print(x, ...)
x: An object of type "cv.galasso" to print.
...: Further arguments passed to or from other methods.
print.cv.saenet prints the fit and returns it invisibly.
## S3 method for class 'cv.saenet'
print(x, ...)
x: An object of type "cv.saenet" to print.
...: Further arguments passed to or from other methods.
Fits an adaptive elastic net for multiply imputed data. The data are stacked and penalized so that each imputation selects the same betas at each value of lambda. "saenet" supports both continuous and binary responses.
saenet(
    x,
    y,
    pf,
    adWeight,
    weights,
    family = c("gaussian", "binomial"),
    alpha = 1,
    nlambda = 100,
    lambda.min.ratio = ifelse(isTRUE(all.equal(adWeight, rep(1, p))), 0.001, 1e-06),
    lambda = NULL,
    maxit = 1000,
    eps = 1e-05
)
x: A list of numeric design matrices, one n x p matrix per imputation. No matrix should contain an NA.
y: A list of numeric response vectors, one length-n vector per imputation. No vector should contain an NA.
pf: Penalty factor. Can be used to differentially penalize certain variables.
adWeight: Numeric vector of length p representing the adaptive weights for the L1 penalty.
weights: Numeric vector of length n containing the proportion observed (non-missing) for each row in the un-imputed data.
family: The type of response. "gaussian" implies a continuous response and "binomial" implies a binary response. Default is "gaussian".
alpha: Elastic net parameter. Can be a vector to cross validate over. Default is 1.
nlambda: Length of automatically generated lambda sequence. If lambda is non NULL, nlambda is ignored. Default is 100.
lambda.min.ratio: Ratio that determines the minimum value of lambda when automatically generating a lambda sequence. If lambda is not NULL, lambda.min.ratio is ignored. Default is 1e-3 when adWeight is 1 for every predictor and 1e-6 otherwise.
lambda: Optional numeric vector of lambdas to fit. If NULL, a lambda sequence is generated automatically. Default is NULL.
maxit: Maximum number of iterations to run. Default is 1000.
eps: Tolerance for convergence. Default is 1e-5.
saenet works by stacking the multiply imputed data into a single matrix and running a weighted adaptive elastic net on it. The objective function is

    minimize over beta:  - sum_{d=1}^{D} sum_{i=1}^{n} o_i * L(beta | x_{di}, y_{di}) + lambda * [ alpha * sum_{j=1}^{p} pf_j * a_j * |beta_j| + ((1 - alpha) / 2) * sum_{j=1}^{p} pf_j * beta_j^2 ]

where L is the log likelihood, o = w / m is the vector of observation weights, a is the vector of adaptive weights, and pf is the penalty factor. Simulations suggest that the "stacked" objective function approach (i.e., saenet) tends to be more computationally efficient and to have better estimation and selection properties. However, the advantage of galasso is that it allows one to look at the differences between coefficient estimates across imputations.
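In practice the selected set is read off the fitted coefficients; a sketch assuming `fit` is a cross-validated "cv.saenet" object:

```r
cfs <- coef(fit)      # coefficients at (lambda.min, alpha.min)
which(cfs[-1] != 0)   # positions of the selected predictors (intercept dropped)
```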
An object of type "saenet" with subtype "saenet.gaussian" or "saenet.binomial", depending on which family was used, with the following elements:
lambda: Sequence of lambda fit.
beta: A nlambda x nalpha x (p + 1) array giving the estimated betas at each value of lambda and alpha.
df: Number of nonzero betas at each value of lambda and alpha.
Du, J., Boss, J., Han, P., Beesley, L. J., Kleinsasser, M., Goutman, S. A., ... & Mukherjee, B. (2022). Variable selection with multiply-imputed datasets: choosing between stacked and grouped methods. Journal of Computational and Graphical Statistics, 31(4), 1063-1075. <doi:10.1080/10618600.2022.2035739>
library(miselect)
library(mice)

mids <- mice(miselect.df, m = 5, printFlag = FALSE)
dfs <- lapply(1:5, function(i) complete(mids, action = i))

# Generate list of imputed design matrices and imputed responses
x <- list()
y <- list()
for (i in 1:5) {
    x[[i]] <- as.matrix(dfs[[i]][, paste0("X", 1:20)])
    y[[i]] <- dfs[[i]]$Y
}

# Calculate observational weights
weights <- 1 - rowMeans(is.na(miselect.df))

pf <- rep(1, 20)
adWeight <- rep(1, 20)

# Since 'Y' is a binary variable, we use 'family = "binomial"'
fit <- saenet(x, y, pf, adWeight, weights, family = "binomial")