Title: | Hierarchical GxE Interactions in a Regularized Regression Model |
---|---|
Description: | The method focuses on a single environmental exposure and induces a main-effect-before-interaction hierarchical structure for the joint selection of interaction terms in a regularized regression model. For details see Zemlianskaia et al. (2021) <arxiv:2103.13510>. |
Authors: | Natalia Zemlianskaia |
Maintainer: | Natalia Zemlianskaia <[email protected]> |
License: | MIT + file LICENSE |
Version: | 1.0.2 |
Built: | 2024-11-05 06:35:16 UTC |
Source: | CRAN |
The method focuses on a single environmental exposure and induces a main-effect-before-interaction hierarchical structure for the joint selection of interaction terms in a regularized regression model. For details see Zemlianskaia et al. (2021) <arxiv:2103.13510>.
Natalia Zemlianskaia
Maintainer: Natalia Zemlianskaia <[email protected]>
"A Scalable Hierarchical Lasso for Gene-Environment Interactions", Natalia Zemlianskaia, W.James Gauderman, Juan Pablo Lewinger https://arxiv.org/abs/2103.13510
Generates genotypes data matrix G (sample_size
by p
), vector of environmental measurments E, and an outcome vector Y of size sample_size
. Simulates training, validation, and test datasets.
data.gen(sample_size = 100, p = 20, n_g_non_zero = 15, n_gxe_non_zero = 10, family = "gaussian", mode = "strong_hierarchical", normalize = FALSE, normalize_response = FALSE, seed = 1, pG = 0.2, pE = 0.3, n_confounders = NULL)
data.gen(sample_size = 100, p = 20, n_g_non_zero = 15, n_gxe_non_zero = 10, family = "gaussian", mode = "strong_hierarchical", normalize = FALSE, normalize_response = FALSE, seed = 1, pG = 0.2, pE = 0.3, n_confounders = NULL)
sample_size |
sample size of the data |
p |
total number of main effects |
n_g_non_zero |
number of non-zero main effects to generate |
n_gxe_non_zero |
number of non-zero interaction effects to generate |
family |
"gaussian" for continous outcome Y and "binomial" for binary 0/1 outcome |
mode |
either "strong_hierarchical", "hierarchical", or "anti_hierarchical". In the strong hierarchical mode the hierarchical structure is maintained (beta_g = 0 then beta_gxe = 0) and also |beta_g| >= |beta_gxe|. In the hierarchical mode the hierarchical structure is maintained, but |beta_G| < |beta_gxe|. In the anti_hierarchical mode the hierarchical structure is violated (beta_g = 0 then beta_gxe != 0). |
normalize |
|
normalize_response |
|
pG |
genotypes prevalence, value from 0 to 1 |
pE |
environment prevalence, value from 0 to 1 |
seed |
random seed |
n_confounders |
number of confounders to generate, either |
A list of simulated datasets and generating coefficients
G_train , G_valid , G_test
|
generated genotypes matrices |
E_train , E_valid , E_test
|
generated vectors of environmental values |
Y_train , Y_valid , Y_test
|
generated outcome vectors |
C_train , C_valid , C_test
|
generated confounders matrices |
GxE_train , GxE_valid , GxE_test
|
generated GxE matrix |
Beta_G |
main effect coefficients vector |
Beta_GxE |
interaction coefficients vector |
beta_0 |
intercept coefficient value |
beta_E |
environment coefficient value |
Beta_C |
confounders coefficient values |
index_beta_non_zero , index_beta_gxe_non_zero , index_beta_zero , index_beta_gxe_zero
|
inner data generation variables |
n_g_non_zero |
number of non-zero main effects generated |
n_gxe_non_zero |
number of non-zero interactions generated |
n_total_non_zero |
total number of non-zero variables |
SNR_g |
signal-to-noise ratio for the main effects |
SNR_gxe |
signal-to-noise ratio for the interactions |
family , p , sample_size , mode , seed
|
input simulation parameters |
data = data.gen(sample_size=100, p=100) G = data$G_train; GxE = data$GxE_train E = data$E_train; Y = data$Y_train
data = data.gen(sample_size=100, p=100) G = data$G_train; GxE = data$GxE_train E = data$E_train; Y = data$Y_train
A function to obtain coefficients from the model fit object corresponding to the desired pair of tuning parameters lambda
= (lambda_1
, lambda_2
).
gesso.coef(fit, lambda)
gesso.coef(fit, lambda)
fit |
model fit object obtained either by using function |
lambda |
a pair of tuning parameters organized in a tibble (ex: |
A list of model coefficients corresponding to lambda
values of tuning parameters
beta_0 |
estimated intercept value |
beta_e |
estimated environmental coefficient value |
beta_g |
a vector of estimated main effect coefficients |
beta_c |
a vector of estimated confounders coefficients |
beta_gxe |
a vector of estimated interaction coefficients |
data = data.gen() model = gesso.cv(data$G_train, data$E_train, data$Y_train, grid_size=20, parallel=TRUE, nfolds=3) gxe_coefficients = gesso.coef(model$fit, model$lambda_min)$beta_gxe g_coefficients = gesso.coef(model$fit, model$lambda_min)$beta_g
data = data.gen() model = gesso.cv(data$G_train, data$E_train, data$Y_train, grid_size=20, parallel=TRUE, nfolds=3) gxe_coefficients = gesso.coef(model$fit, model$lambda_min)$beta_gxe g_coefficients = gesso.coef(model$fit, model$lambda_min)$beta_g
A function to obtain coefficients with target_b_gxe_non_zero
specified to control the desired sparsity of interactions in the model.
gesso.coefnum(cv_model, target_b_gxe_non_zero, less_than = TRUE)
gesso.coefnum(cv_model, target_b_gxe_non_zero, less_than = TRUE)
cv_model |
cross-validated model fit object obtained by using function |
target_b_gxe_non_zero |
number of non-zero interactions we want to inlcude in the model |
less_than |
|
A list of model coefficients corresponding to the best model that contains at most or at least target_b_gxe_non_zero
non-zero interaction terms.
The target model is selected based on the averaged cross-validation (cv) results: for each pair of parameters lambda
=(lambda_1, lambda_2) in the grid and each cv fold we obtain a number of non-zero estimated interaction terms, then average cv results by lambda
and choose the tuning parameters corresponding to the minimum average cv loss that have at most or at least target_b_gxe_non_zero
non-zero interaction terms. Returned coefficients are obtained by fitting the model on the full data with the selected tuning parameters.
Note that the number of estimated non-zero interactions will only approximately reflect the numbers obtained on cv datasets.
beta_0 |
estimated intercept value |
beta_e |
estimated environmental coefficient value |
beta_g |
a vector of estimated main effect coefficients |
beta_gxe |
a vector of estimated interaction coefficients |
beta_c |
a vector of estimated confounders coefficients |
data = data.gen() model = gesso.cv(data$G_train, data$E_train, data$Y_train) model_coefficients = gesso.coefnum(model, 5) gxe_coefficients = model_coefficients$beta_gxe; sum(gxe_coefficients!=0)
data = data.gen() model = gesso.cv(data$G_train, data$E_train, data$Y_train) model_coefficients = gesso.coefnum(model, 5) gxe_coefficients = model_coefficients$beta_gxe; sum(gxe_coefficients!=0)
Performs nfolds
-fold cross-validation to tune hyperparmeters lambda_1
and lambda_2
for the gesso model.
gesso.cv(G, E, Y, C = NULL, normalize = TRUE, normalize_response = FALSE, grid = NULL, grid_size = 20, grid_min_ratio = NULL, alpha = NULL, family = "gaussian", type_measure = "loss", fold_ids = NULL, nfolds = 4, parallel = TRUE, seed = 42, tolerance = 1e-3, max_iterations = 5000, min_working_set_size = 100, verbose = TRUE)
gesso.cv(G, E, Y, C = NULL, normalize = TRUE, normalize_response = FALSE, grid = NULL, grid_size = 20, grid_min_ratio = NULL, alpha = NULL, family = "gaussian", type_measure = "loss", fold_ids = NULL, nfolds = 4, parallel = TRUE, seed = 42, tolerance = 1e-3, max_iterations = 5000, min_working_set_size = 100, verbose = TRUE)
G |
matrix of main effects of size |
E |
vector of environmental measurments |
Y |
outcome vector. Set |
C |
matrix of confounders of size |
normalize |
|
normalize_response |
|
grid |
grid sequence for tuning hyperparameters, we use the same grid for |
grid_size |
specify |
grid_min_ratio |
parameter to determine |
alpha |
if |
family |
|
type_measure |
loss to use for cross-validation. Specity |
fold_ids |
option to input custom folds assignments |
tolerance |
tolerance for the dual gap convergence criterion |
max_iterations |
maximum number of iterations |
min_working_set_size |
minimum size of the working set |
nfolds |
number of cross-validation splits |
parallel |
|
seed |
set random seed to control random folds assignments |
verbose |
|
A list of objects
cv_result |
a tibble with cross-validation results: averaged across folds loss and the number of non-zero coefficients for each value of (
|
lambda_min |
a tibble of optimal ( |
fit |
list, return of the function gesso.fit on the full data |
grid |
vector of values used for hyperparameters tuning |
full_cv_result |
inner variables |
data = data.gen() tune_model = gesso.cv(data$G_train, data$E_train, data$Y_train, grid_size=20, parallel=TRUE, nfolds=3) gxe_coefficients = gesso.coef(tune_model$fit, tune_model$lambda_min)$beta_gxe g_coefficients = gesso.coef(tune_model$fit, tune_model$lambda_min)$beta_g
data = data.gen() tune_model = gesso.cv(data$G_train, data$E_train, data$Y_train, grid_size=20, parallel=TRUE, nfolds=3) gxe_coefficients = gesso.coef(tune_model$fit, tune_model$lambda_min)$beta_gxe g_coefficients = gesso.coef(tune_model$fit, tune_model$lambda_min)$beta_g
Fits gesso model over the two dimentional grid of hyperparmeters lambda_1
and lambda_2
, returns estimated coefficients for each pair of hyperparameters.
gesso.fit(G, E, Y, C = NULL, normalize = TRUE, normalize_response = FALSE, grid = NULL, grid_size = 20, grid_min_ratio = NULL, alpha = NULL, family = "gaussian", weights = NULL, tolerance = 1e-3, max_iterations = 5000, min_working_set_size = 100, verbose = FALSE)
gesso.fit(G, E, Y, C = NULL, normalize = TRUE, normalize_response = FALSE, grid = NULL, grid_size = 20, grid_min_ratio = NULL, alpha = NULL, family = "gaussian", weights = NULL, tolerance = 1e-3, max_iterations = 5000, min_working_set_size = 100, verbose = FALSE)
G |
matrix of main effects of size |
E |
vector of environmental measurments |
Y |
outcome vector. Set |
C |
matrix of confounders of size |
normalize |
|
normalize_response |
|
grid |
grid sequence for tuning hyperparameters, we use the same grid for |
grid_size |
specify |
grid_min_ratio |
parameter to determine |
alpha |
if |
family |
|
tolerance |
tolerance for the dual gap convergence criterion |
max_iterations |
maximum number of iterations |
min_working_set_size |
minimum size of the working set |
weights |
inner fitting parameter |
verbose |
|
A list of estimated coefficients and other model fit metrics for each pair of hyperparameters (lambda_1
, lambda_2
)
beta_0 |
vector of estimated intercept values of size |
beta_e |
vector of estimated environment coefficients of size |
beta_g |
matrix of estimated main effects coefficients organized by rows, size ( |
beta_gxe |
matrix of estimated interactions coefficients organized by rows, size ( |
beta_c |
matrix of estimated confounders coefficients organized by rows, size ( |
num_iterations |
number of iterations until convergence for each fit |
working_set_size |
maximum number of variables in the working set for each fit |
has_converged |
1 if the model converged within given |
objective_value |
objective function (loss) value for each fit |
beta_g_nonzero |
number of estimated non-zero main effects for each fit |
beta_gxe_nonzero |
number of estimated non-zero interactions for each fit |
lambda_1 |
|
lambda_2 |
|
grid |
vector of values used for hyperparameters tuning |
data = data.gen() fit = gesso.fit(G=data$G_train, E=data$E_train, Y=data$Y_train, normalize=TRUE) plot(fit$beta_g_nonzero, pch=19, cex=0.4, ylab="num of non-zero features", xlab="lambdas path") points(fit$beta_gxe_nonzero, pch=19, cex=0.4, col="red")
data = data.gen() fit = gesso.fit(G=data$G_train, E=data$E_train, Y=data$Y_train, normalize=TRUE) plot(fit$beta_g_nonzero, pch=19, cex=0.4, ylab="num of non-zero features", xlab="lambdas path") points(fit$beta_gxe_nonzero, pch=19, cex=0.4, col="red")
Predict new outcome vector based on the new data and estimated model coefficients.
gesso.predict(beta_0, beta_e, beta_g, beta_gxe, new_G, new_E, beta_c=NULL, new_C=NULL, family = "gaussian")
gesso.predict(beta_0, beta_e, beta_g, beta_gxe, new_G, new_E, beta_c=NULL, new_C=NULL, family = "gaussian")
beta_0 |
estimated intercept value |
beta_e |
estimated environmental coefficient value |
beta_g |
a vector of estimated main effect coefficients |
beta_gxe |
a vector of estimated interaction coefficients |
new_G |
matrix of main effects, variables organized by columns |
new_E |
vector of environmental measurments |
beta_c |
a vector of estimated confounders coefficients |
new_C |
matrix of confounders, variables organized by columns |
family |
set |
Returns a vector of predicted values
data = data.gen() tune_model = gesso.cv(data$G_train, data$E_train, data$Y_train) coefficients = gesso.coef(tune_model$fit, tune_model$lambda_min) beta_0 = coefficients$beta_0; beta_e = coefficients$beta_e beta_g = coefficients$beta_g; beta_gxe = coefficients$beta_gxe new_G = data$G_test; new_E = data$E_test new_Y = gesso.predict(beta_0, beta_e, beta_g, beta_gxe, new_G, new_E) cor(new_Y, data$Y_test)^2
data = data.gen() tune_model = gesso.cv(data$G_train, data$E_train, data$Y_train) coefficients = gesso.coef(tune_model$fit, tune_model$lambda_min) beta_0 = coefficients$beta_0; beta_e = coefficients$beta_e beta_g = coefficients$beta_g; beta_gxe = coefficients$beta_gxe new_G = data$G_test; new_E = data$E_test new_Y = gesso.predict(beta_0, beta_e, beta_g, beta_gxe, new_G, new_E) cor(new_Y, data$Y_test)^2
Calculates principal selection metrics for the binary zero/non-zero classification problem (sensitivity, specificity, precision, auc).
selection.metrics(true_b_g, true_b_gxe, estimated_b_g, estimated_b_gxe)
selection.metrics(true_b_g, true_b_gxe, estimated_b_g, estimated_b_gxe)
true_b_g |
vector of true main effect coefficients |
true_b_gxe |
vector of true interaction coefficients |
estimated_b_g |
vector of estimated main effect coefficients |
estimated_b_gxe |
vector of estimated interaction coefficients |
A list of principal selection metrics
b_g_non_zero |
number of non-zero main effects |
b_gxe_non_zero |
number of non-zero interactions |
mse_b_g |
mean squared error for estimation of main effects effect sizes |
mse_b_gxe |
mean squared error for estimation of interactions effect sizes |
sensitivity_g |
recall of the non-zero main effects |
specificity_g |
recall of the zero main effects |
precision_g |
precision with respect to non-zero main effects |
sensitivity_gxe |
recall of the non-zero interactions |
specificity_gxe |
recall of the zero interactions |
precision_gxe |
precision with respect to non-zero interactions |
auc_g |
area under the curve for zero/non-zero binary classification problem for main effects |
auc_gxe |
area under the curve for zero/non-zero binary classification problem for interactions |
data = data.gen() model = gesso.cv(data$G_train, data$E_train, data$Y_train) gxe_coefficients = gesso.coef(model$fit, model$lambda_min)$beta_gxe g_coefficients = gesso.coef(model$fit, model$lambda_min)$beta_g selection.metrics(data$Beta_G, data$Beta_GxE, g_coefficients, gxe_coefficients)
data = data.gen() model = gesso.cv(data$G_train, data$E_train, data$Y_train) gxe_coefficients = gesso.coef(model$fit, model$lambda_min)$beta_gxe g_coefficients = gesso.coef(model$fit, model$lambda_min)$beta_g selection.metrics(data$Beta_G, data$Beta_GxE, g_coefficients, gxe_coefficients)