Package 'gesso' reference manual

Title:	Hierarchical GxE Interactions in a Regularized Regression Model
Description:	The method focuses on a single environmental exposure and induces a main-effect-before-interaction hierarchical structure for the joint selection of interaction terms in a regularized regression model. For details see Zemlianskaia et al. (2021) <arxiv:2103.13510>.
Authors:	Natalia Zemlianskaia
Maintainer:	Natalia Zemlianskaia <[email protected]>
License:	MIT + file LICENSE
Version:	1.0.2
Built:	2025-03-05 07:10:09 UTC
Source:	CRAN

Hierarchical GxE Interactions in a Regularized Regression Model

Description

The method focuses on a single environmental exposure and induces a main-effect-before-interaction hierarchical structure for the joint selection of interaction terms in a regularized regression model. For details see Zemlianskaia et al. (2021) <arxiv:2103.13510>.

Author(s)

Natalia Zemlianskaia

Maintainer: Natalia Zemlianskaia <[email protected]>

References

"A Scalable Hierarchical Lasso for Gene-Environment Interactions", Natalia Zemlianskaia, W.James Gauderman, Juan Pablo Lewinger https://arxiv.org/abs/2103.13510

Data Generation

Description

Generates genotypes data matrix G (sample_size by p), vector of environmental measurments E, and an outcome vector Y of size sample_size. Simulates training, validation, and test datasets.

Usage

data.gen(sample_size = 100, p = 20, n_g_non_zero = 15, n_gxe_non_zero = 10, 
         family = "gaussian", mode = "strong_hierarchical", 
         normalize = FALSE, normalize_response = FALSE, 
         seed = 1, pG = 0.2, pE = 0.3,
         n_confounders = NULL)
data.gen(sample_size = 100, p = 20, n_g_non_zero = 15, n_gxe_non_zero = 10, 
         family = "gaussian", mode = "strong_hierarchical", 
         normalize = FALSE, normalize_response = FALSE, 
         seed = 1, pG = 0.2, pE = 0.3,
         n_confounders = NULL)

Arguments

`sample_size`	sample size of the data
`p`	total number of main effects
`n_g_non_zero`	number of non-zero main effects to generate
`n_gxe_non_zero`	number of non-zero interaction effects to generate
`family`	"gaussian" for continous outcome Y and "binomial" for binary 0/1 outcome
`mode`	either "strong_hierarchical", "hierarchical", or "anti_hierarchical". In the strong hierarchical mode the hierarchical structure is maintained (beta_g = 0 then beta_gxe = 0) and also \|beta_g\| >= \|beta_gxe\|. In the hierarchical mode the hierarchical structure is maintained, but \|beta_G\| < \|beta_gxe\|. In the anti_hierarchical mode the hierarchical structure is violated (beta_g = 0 then beta_gxe != 0).
`normalize`	`TRUE` to normalize matrix `G` and vector `E`
`normalize_response`	`TRUE` to normalize vector `Y`
`pG`	genotypes prevalence, value from 0 to 1
`pE`	environment prevalence, value from 0 to 1
`seed`	random seed
`n_confounders`	number of confounders to generate, either `NULL` or `>1`

Value

A list of simulated datasets and generating coefficients

`G_train`, `G_valid`, `G_test`	generated genotypes matrices
`E_train`, `E_valid`, `E_test`	generated vectors of environmental values
`Y_train`, `Y_valid`, `Y_test`	generated outcome vectors
`C_train`, `C_valid`, `C_test`	generated confounders matrices
`GxE_train`, `GxE_valid`, `GxE_test`	generated GxE matrix
`Beta_G`	main effect coefficients vector
`Beta_GxE`	interaction coefficients vector
`beta_0`	intercept coefficient value
`beta_E`	environment coefficient value
`Beta_C`	confounders coefficient values
`index_beta_non_zero`, `index_beta_gxe_non_zero`, `index_beta_zero`, `index_beta_gxe_zero`	inner data generation variables
`n_g_non_zero`	number of non-zero main effects generated
`n_gxe_non_zero`	number of non-zero interactions generated
`n_total_non_zero`	total number of non-zero variables
`SNR_g`	signal-to-noise ratio for the main effects
`SNR_gxe`	signal-to-noise ratio for the interactions
`family`, `p`, `sample_size`, `mode`, `seed`	input simulation parameters

Examples

data = data.gen(sample_size=100, p=100)
G = data$G_train; GxE = data$GxE_train
E = data$E_train; Y = data$Y_train
data = data.gen(sample_size=100, p=100)
G = data$G_train; GxE = data$GxE_train
E = data$E_train; Y = data$Y_train

Get model coefficients

Description

A function to obtain coefficients from the model fit object corresponding to the desired pair of tuning parameters lambda = (lambda_1, lambda_2).

Usage

gesso.coef(fit, lambda)
gesso.coef(fit, lambda)

Arguments

`fit`	model fit object obtained either by using function `gesso.fit` or `gesso.cv`
`lambda`	a pair of tuning parameters organized in a tibble (ex: `lambda = tibble(lambda_1=grid[1], lambda_2=grid[1])`)

Value

A list of model coefficients corresponding to lambda values of tuning parameters

`beta_0`	estimated intercept value
`beta_e`	estimated environmental coefficient value
`beta_g`	a vector of estimated main effect coefficients
`beta_c`	a vector of estimated confounders coefficients
`beta_gxe`	a vector of estimated interaction coefficients

Examples

data = data.gen()
model = gesso.cv(data$G_train, data$E_train, data$Y_train, grid_size=20, 
        parallel=TRUE, nfolds=3)
gxe_coefficients = gesso.coef(model$fit, model$lambda_min)$beta_gxe                
g_coefficients = gesso.coef(model$fit, model$lambda_min)$beta_g     
data = data.gen()
model = gesso.cv(data$G_train, data$E_train, data$Y_train, grid_size=20, 
        parallel=TRUE, nfolds=3)
gxe_coefficients = gesso.coef(model$fit, model$lambda_min)$beta_gxe                
g_coefficients = gesso.coef(model$fit, model$lambda_min)$beta_g

Get model coefficients with specified number of non-zero interactions

Description

A function to obtain coefficients with target_b_gxe_non_zero specified to control the desired sparsity of interactions in the model.

Usage

gesso.coefnum(cv_model, target_b_gxe_non_zero, less_than = TRUE)
gesso.coefnum(cv_model, target_b_gxe_non_zero, less_than = TRUE)

Arguments

`cv_model`	cross-validated model fit object obtained by using function `gesso.cv`
`target_b_gxe_non_zero`	number of non-zero interactions we want to inlcude in the model
`less_than`	`TRUE` if we want to control a number of at most non-zero interactions, `FALSE` if we want to control a number of at least non-zero interactions

Value

A list of model coefficients corresponding to the best model that contains at most or at least target_b_gxe_non_zero non-zero interaction terms.

The target model is selected based on the averaged cross-validation (cv) results: for each pair of parameters lambda=(lambda_1, lambda_2) in the grid and each cv fold we obtain a number of non-zero estimated interaction terms, then average cv results by lambda and choose the tuning parameters corresponding to the minimum average cv loss that have at most or at least target_b_gxe_non_zero non-zero interaction terms. Returned coefficients are obtained by fitting the model on the full data with the selected tuning parameters.

Note that the number of estimated non-zero interactions will only approximately reflect the numbers obtained on cv datasets.

`beta_0`	estimated intercept value
`beta_e`	estimated environmental coefficient value
`beta_g`	a vector of estimated main effect coefficients
`beta_gxe`	a vector of estimated interaction coefficients
`beta_c`	a vector of estimated confounders coefficients

Examples

data = data.gen()
model = gesso.cv(data$G_train, data$E_train, data$Y_train)
model_coefficients = gesso.coefnum(model, 5)
gxe_coefficients = model_coefficients$beta_gxe; sum(gxe_coefficients!=0)              
data = data.gen()
model = gesso.cv(data$G_train, data$E_train, data$Y_train)
model_coefficients = gesso.coefnum(model, 5)
gxe_coefficients = model_coefficients$beta_gxe; sum(gxe_coefficients!=0)

Cross-Validation

Description

Performs nfolds-fold cross-validation to tune hyperparmeters lambda_1 and lambda_2 for the gesso model.

Usage

gesso.cv(G, E, Y, C = NULL, normalize = TRUE, normalize_response = FALSE, grid = NULL,
         grid_size = 20, grid_min_ratio = NULL, alpha = NULL, family = "gaussian", 
         type_measure = "loss", fold_ids = NULL, nfolds = 4, 
         parallel = TRUE, seed = 42, tolerance = 1e-3, max_iterations = 5000, 
         min_working_set_size = 100, verbose = TRUE)
gesso.cv(G, E, Y, C = NULL, normalize = TRUE, normalize_response = FALSE, grid = NULL,
         grid_size = 20, grid_min_ratio = NULL, alpha = NULL, family = "gaussian", 
         type_measure = "loss", fold_ids = NULL, nfolds = 4, 
         parallel = TRUE, seed = 42, tolerance = 1e-3, max_iterations = 5000, 
         min_working_set_size = 100, verbose = TRUE)

Arguments

`G`	matrix of main effects of size `n x p`, variables organized by columns
`E`	vector of environmental measurments
`Y`	outcome vector. Set `family="gaussian"` for the continuous outcome and `family="binomial"` for the binary outcome with 0/1 levels
`C`	matrix of confounders of size `n x m`, variables organized by columns
`normalize`	`TRUE` to normalize matrix `G` and vector `E`
`normalize_response`	`TRUE` to normalize vector `Y` (for `family="gaussian"`)
`grid`	grid sequence for tuning hyperparameters, we use the same grid for `lambda_1` and `lambda_2`
`grid_size`	specify `grid_size` to generate grid automatically. Grid is generated by calculating `max_lambda` from the data (smallest lambda such that all the coefficients are zero). `min_lambda` is calculated as a product of `max_lambda` and `grid_min_ratio`. The program then generates `grid_size` values equidistant on the log10 scale from `min_lambda` to `max_lambda`
`grid_min_ratio`	parameter to determine `min_lambda` (smallest value for the grid of lambdas), default is 0.1 for p > n, 0.01 otherwise
`alpha`	if `NULL` independent 2D grid is used for (`lambda_1`, `lambda_2`), else 1D grid is used where `lambda_2` = `alpha` * `lambda_1`, i.e. (`lambda_1`, `alpha` * `lambda_1`)
`family`	`"gaussian"` for continuous outcome and `"binomial"` for binary
`type_measure`	loss to use for cross-validation. Specity `type_measure="loss"` for neative log likelihood or `type_measure="auc"` for AUC (for `family="binomial"` only)
`fold_ids`	option to input custom folds assignments
`tolerance`	tolerance for the dual gap convergence criterion
`max_iterations`	maximum number of iterations
`min_working_set_size`	minimum size of the working set
`nfolds`	number of cross-validation splits
`parallel`	`TRUE` to enable parallel cross-validation
`seed`	set random seed to control random folds assignments
`verbose`	`TRUE` to print messages

Value

A list of objects

`cv_result`	a tibble with cross-validation results: averaged across folds loss and the number of non-zero coefficients for each value of (`lambda_1`, `lambda_2`) path. Could be used for custom parameters tuning (ex: select (`lambda_1`, `lambda_2`) with a sertain number of non-zero main effects and/or a sertain number of interactions). `mean_loss` averaged across folds loss value, vector of size `lambda_1``lambda_2` `mean_beta_g_nonzero` averaged across folds number of non-zero main effects, vector of size `lambda_1``lambda_2` `mean_beta_gxe_nonzero` averaged across folds number of non-zero interactions, vector of size `lambda_1`*`lambda_2` `lambda_1` `lambda_1` pass, decreasing `lambda_2` `lambda_2` pass, oscillating
`lambda_min`	a tibble of optimal (`lambda_1`, `lambda_2`) values, tuning parameter values that give minimum cross-validation loss (`mean_loss`)
`fit`	list, return of the function gesso.fit on the full data
`grid`	vector of values used for hyperparameters tuning
`full_cv_result`	inner variables

Examples

data = data.gen()
tune_model = gesso.cv(data$G_train, data$E_train, data$Y_train, 
                      grid_size=20, parallel=TRUE, nfolds=3)
gxe_coefficients = gesso.coef(tune_model$fit, tune_model$lambda_min)$beta_gxe        
g_coefficients = gesso.coef(tune_model$fit, tune_model$lambda_min)$beta_g          
data = data.gen()
tune_model = gesso.cv(data$G_train, data$E_train, data$Y_train, 
                      grid_size=20, parallel=TRUE, nfolds=3)
gxe_coefficients = gesso.coef(tune_model$fit, tune_model$lambda_min)$beta_gxe        
g_coefficients = gesso.coef(tune_model$fit, tune_model$lambda_min)$beta_g

gesso fit

Description

Fits gesso model over the two dimentional grid of hyperparmeters lambda_1 and lambda_2, returns estimated coefficients for each pair of hyperparameters.

Usage

gesso.fit(G, E, Y, C = NULL, normalize = TRUE, normalize_response = FALSE,
          grid = NULL, grid_size = 20, grid_min_ratio = NULL, 
          alpha = NULL, family = "gaussian", weights = NULL,
          tolerance = 1e-3, max_iterations = 5000, 
          min_working_set_size = 100,
          verbose = FALSE)
gesso.fit(G, E, Y, C = NULL, normalize = TRUE, normalize_response = FALSE,
          grid = NULL, grid_size = 20, grid_min_ratio = NULL, 
          alpha = NULL, family = "gaussian", weights = NULL,
          tolerance = 1e-3, max_iterations = 5000, 
          min_working_set_size = 100,
          verbose = FALSE)

Arguments

`G`	matrix of main effects of size `n x p`, variables organized by columns
`E`	vector of environmental measurments
`Y`	outcome vector. Set `family="gaussian"` for the continuous outcome and `family="binomial"` for the binary outcome with 0/1 levels
`C`	matrix of confounders of size `n x m`, variables organized by columns
`normalize`	`TRUE` to normalize matrix `G` and vector `E`
`normalize_response`	`TRUE` to normalize vector `Y`
`grid`	grid sequence for tuning hyperparameters, we use the same grid for `lambda_1` and `lambda_2`
`grid_size`	specify `grid_size` to generate grid automatically. Grid is generated by calculating `max_lambda` from the data (smallest lambda such that all the coefficients are zero). `min_lambda` is calculated as a product of `max_lambda` and `grid_min_ratio`. The program then generates `grid_size` values equidistant on the log10 scale from `min_lambda` to `max_lambda`
`grid_min_ratio`	parameter to determine `min_lambda` (smallest value for the grid of lambdas), default is 0.1 for p > n, 0.01 otherwise
`alpha`	if `NULL` independent 2D grid is used for (`lambda_1`, `lambda_2`), else 1D grid is used where `lambda_2` = `alpha` * `lambda_1`, i.e. (`lambda_1`, `alpha` * `lambda_1`)
`family`	`"gaussian"` for continuous outcome and `"binomial"` for binary
`tolerance`	tolerance for the dual gap convergence criterion
`max_iterations`	maximum number of iterations
`min_working_set_size`	minimum size of the working set
`weights`	inner fitting parameter
`verbose`	`TRUE` to print messages

Value

A list of estimated coefficients and other model fit metrics for each pair of hyperparameters (lambda_1, lambda_2)

`beta_0`	vector of estimated intercept values of size `lambda_1`*`lambda_2`
`beta_e`	vector of estimated environment coefficients of size `lambda_1`*`lambda_2`
`beta_g`	matrix of estimated main effects coefficients organized by rows, size (`lambda_1`*`lambda_2`) by `p`
`beta_gxe`	matrix of estimated interactions coefficients organized by rows, size (`lambda_1`*`lambda_2`) by `p`
`beta_c`	matrix of estimated confounders coefficients organized by rows, size (`lambda_1`*`lambda_2`) by `m`, where `m` is the number of confounders
`num_iterations`	number of iterations until convergence for each fit
`working_set_size`	maximum number of variables in the working set for each fit
`has_converged`	1 if the model converged within given `max_iterations`, 0 otherwise
`objective_value`	objective function (loss) value for each fit
`beta_g_nonzero`	number of estimated non-zero main effects for each fit
`beta_gxe_nonzero`	number of estimated non-zero interactions for each fit
`lambda_1`	`lambda_1` path values, decreasing
`lambda_2`	`lambda_2` path values, oscillating
`grid`	vector of values used for hyperparameters tuning

Examples

data = data.gen()
fit = gesso.fit(G=data$G_train, E=data$E_train, Y=data$Y_train, normalize=TRUE)
plot(fit$beta_g_nonzero, pch=19, cex=0.4, 
     ylab="num of non-zero features", xlab="lambdas path")
points(fit$beta_gxe_nonzero, pch=19, cex=0.4, col="red")
data = data.gen()
fit = gesso.fit(G=data$G_train, E=data$E_train, Y=data$Y_train, normalize=TRUE)
plot(fit$beta_g_nonzero, pch=19, cex=0.4, 
     ylab="num of non-zero features", xlab="lambdas path")
points(fit$beta_gxe_nonzero, pch=19, cex=0.4, col="red")

Predict new outcome vector

Description

Predict new outcome vector based on the new data and estimated model coefficients.

Usage

gesso.predict(beta_0, beta_e, beta_g, beta_gxe, new_G, new_E, 
                   beta_c=NULL, new_C=NULL, family = "gaussian")
gesso.predict(beta_0, beta_e, beta_g, beta_gxe, new_G, new_E, 
                   beta_c=NULL, new_C=NULL, family = "gaussian")

Arguments

`beta_0`	estimated intercept value
`beta_e`	estimated environmental coefficient value
`beta_g`	a vector of estimated main effect coefficients
`beta_gxe`	a vector of estimated interaction coefficients
`new_G`	matrix of main effects, variables organized by columns
`new_E`	vector of environmental measurments
`beta_c`	a vector of estimated confounders coefficients
`new_C`	matrix of confounders, variables organized by columns
`family`	set `family="gaussian"` for the continuous outcome and `family="binomial"` for the binary outcome with 0/1 levels

Value

Returns a vector of predicted values

Examples

data = data.gen()
tune_model = gesso.cv(data$G_train, data$E_train, data$Y_train)
coefficients = gesso.coef(tune_model$fit, tune_model$lambda_min)
beta_0 = coefficients$beta_0; beta_e = coefficients$beta_e                   
beta_g = coefficients$beta_g; beta_gxe = coefficients$beta_gxe     

new_G = data$G_test; new_E = data$E_test
new_Y = gesso.predict(beta_0, beta_e, beta_g, beta_gxe, new_G, new_E)
cor(new_Y, data$Y_test)^2
data = data.gen()
tune_model = gesso.cv(data$G_train, data$E_train, data$Y_train)
coefficients = gesso.coef(tune_model$fit, tune_model$lambda_min)
beta_0 = coefficients$beta_0; beta_e = coefficients$beta_e                   
beta_g = coefficients$beta_g; beta_gxe = coefficients$beta_gxe     

new_G = data$G_test; new_E = data$E_test
new_Y = gesso.predict(beta_0, beta_e, beta_g, beta_gxe, new_G, new_E)
cor(new_Y, data$Y_test)^2

Selection metrics

Description

Calculates principal selection metrics for the binary zero/non-zero classification problem (sensitivity, specificity, precision, auc).

Usage

selection.metrics(true_b_g, true_b_gxe, estimated_b_g, estimated_b_gxe)
selection.metrics(true_b_g, true_b_gxe, estimated_b_g, estimated_b_gxe)

Arguments

`true_b_g`	vector of true main effect coefficients
`true_b_gxe`	vector of true interaction coefficients
`estimated_b_g`	vector of estimated main effect coefficients
`estimated_b_gxe`	vector of estimated interaction coefficients

Value

A list of principal selection metrics

`b_g_non_zero`	number of non-zero main effects
`b_gxe_non_zero`	number of non-zero interactions
`mse_b_g`	mean squared error for estimation of main effects effect sizes
`mse_b_gxe`	mean squared error for estimation of interactions effect sizes
`sensitivity_g`	recall of the non-zero main effects
`specificity_g`	recall of the zero main effects
`precision_g`	precision with respect to non-zero main effects
`sensitivity_gxe`	recall of the non-zero interactions
`specificity_gxe`	recall of the zero interactions
`precision_gxe`	precision with respect to non-zero interactions
`auc_g`	area under the curve for zero/non-zero binary classification problem for main effects
`auc_gxe`	area under the curve for zero/non-zero binary classification problem for interactions

Examples

data = data.gen()
model = gesso.cv(data$G_train, data$E_train, data$Y_train)
gxe_coefficients = gesso.coef(model$fit, model$lambda_min)$beta_gxe                
g_coefficients = gesso.coef(model$fit, model$lambda_min)$beta_g  
selection.metrics(data$Beta_G, data$Beta_GxE, g_coefficients, gxe_coefficients)

data = data.gen()
model = gesso.cv(data$G_train, data$E_train, data$Y_train)
gxe_coefficients = gesso.coef(model$fit, model$lambda_min)$beta_gxe                
g_coefficients = gesso.coef(model$fit, model$lambda_min)$beta_g  
selection.metrics(data$Beta_G, data$Beta_GxE, g_coefficients, gxe_coefficients)

Package 'gesso'

Help Index

Hierarchical GxE Interactions in a Regularized Regression Model

Description

Author(s)

References

Data Generation

Description

Usage

Arguments

Value

Examples

Get model coefficients

Description

Usage

Arguments

Value

Examples

Get model coefficients with specified number of non-zero interactions

Description

Usage

Arguments

Value

Examples

Cross-Validation

Description

Usage

Arguments

Value

Examples

gesso fit

Description

Usage

Arguments

Value

Examples

Predict new outcome vector

Description

Usage

Arguments

Value

Examples

Selection metrics

Description

Usage

Arguments

Value

Examples