gesso
This vignette is a tutorial about how to use the gesso package.
The package fits a regularized regression model that we call gesso for the joint selection of gene-environment (GxE) interactions. The model focuses on a single environmental exposure and induces a main-effect-before-interaction hierarchical structure. We developed and implemented an efficient fitting algorithm and screening rules that can discard large numbers of irrelevant predictors with high accuracy.
The gesso model induces hierarchical selection of the GxE interaction terms via an overlapped group lasso structure. The model has two tuning parameters, λ1 and λ2, allowing flexible, data-dependent control over group sparsity and additional sparsity of the interaction terms.
The response (outcome) variable can be either continuous or binary (0/1). For a binary response, an IRLS procedure is used together with the custom screening rules we developed.
The package supports sparse matrices (dgCMatrix) and the (filebacked) big.matrix format from the bigmemory package for large or out-of-RAM datasets. The package also allows adding unpenalized covariates to the model.
NOTE: RcppThread cannot be installed with gcc version 4.8.5 (https://github.com/tnagler/RcppThread/issues/13). In this case, we recommend updating gcc to a more recent version.
Installation can take a couple of minutes because of the requirement to install the dependent packages (dplyr, Rcpp, RcppEigen, RcppThread, BH, bigmemory).
## install.packages("devtools")
library(devtools)
devtools::install_github("NataliaZemlianskaia/gesso")
Attaching libraries for the examples below
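A minimal set of library calls for the examples below (glmnet is used for the Lasso comparison, dplyr for tibble(), ggplot2 for the working set plots, and bigmemory for the filebacked example):

library(gesso)
library(dplyr)      # tibble() for lambda pairs
library(glmnet)     # standard Lasso comparison
library(ggplot2)    # working set size plot
library(bigmemory)  # filebacked big.matrix example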
First, we generate the dataset using the data.gen() function from the package. The function allows us to generate a binary (0/1) matrix of genotypes G of a given sample_size and number of features p, a binary vector E of environmental measurements, and a response variable Y ∼ g(β0 + βE E + βG G + βG×E G×E), either quantitative or binary depending on the family parameter.
We can specify the number of non-zero main effects (n_g_non_zero) and interactions (n_gxe_non_zero) we want to generate. We also specify the strong_hierarchical mode for our dataset, which means that (1) if an interaction effect is generated as non-zero, its respective genetic main effect is also generated as non-zero (βG×E ≠ 0 → βG ≠ 0), and (2) the effect sizes of the main effects are larger than those of the interaction effects (|βG| ≥ |βG×E|).
family = "gaussian"
sample_size = 180; p = 400; n_g_non_zero = 10; n_gxe_non_zero = 5
data = data.gen(seed=1, sample_size=sample_size, p=p,
n_g_non_zero=n_g_non_zero,
n_gxe_non_zero=n_gxe_non_zero,
mode="strong_hierarchical",
family=family)
Let’s look at the dataset we generated: data$G_train, data$E_train, and data$Y_train.
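A sketch of inspection commands (assuming the list elements returned by data.gen()) that produce the output below:

dim(data$G_train)        # dimensions of the genotype matrix
data$G_train[1:6, 1:5]   # first rows and columns of G
data$E_train[1:10]       # first entries of the environmental vector
head(data$Y_train)       # first entries of the response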
## [1] 180 400
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0 0 0 0 0
## [2,] 0 0 0 0 0
## [3,] 0 0 0 0 0
## [4,] 1 0 0 1 0
## [5,] 0 0 0 0 1
## [6,] 1 0 0 0 0
## [1] 0 0 0 0 0 0 0 1 0 1
## [1] -9.546546 -2.713549 5.875298 1.984317 2.432890 6.798349
We generated the model coefficients data$Beta_G and data$Beta_GxE such that we got sum(data$Beta_G != 0) = 10 non-zero main effects and sum(data$Beta_GxE != 0) = 5 non-zero interaction effects, as we specified.
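The counts shown below can be obtained with:

c(sum(data$Beta_G != 0), sum(data$Beta_GxE != 0))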
## [1] 10 5
We also imposed the strong_hierarchical relationship between the main effects and the interaction effects:
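One way to display the non-zero main effects next to their interaction coefficients (an illustrative snippet, not necessarily the original command):

cbind(data$Beta_G[data$Beta_G != 0],
      data$Beta_GxE[data$Beta_G != 0])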
## [,1] [,2]
## [1,] -3 0.0
## [2,] -3 0.0
## [3,] 3 0.0
## [4,] 3 0.0
## [5,] 3 0.0
## [6,] 3 1.5
## [7,] 3 -1.5
## [8,] -3 -1.5
## [9,] -3 1.5
## [10,] 3 1.5
To tune the model over a 2D grid of hyper-parameters (lambda_1 and lambda_2) we use the function gesso.cv(), where we specify:
- G, E, Y and the family parameter for the response variable; matrix G should be a numeric matrix, vector E should be a numeric vector, and vector Y can be either a continuous or a binary 0/1 numeric vector
- tolerance for the convergence of the fitting algorithm
- grid_size to automatically generate grid values for cross-validation
- grid_min_ratio, an important parameter that controls the sparsity of the coefficients vector; for substantially high-dimensional problems where p >> n, or when the goal is to select only a few interactions, we recommend setting grid_min_ratio to 0.1
- nfolds to specify how many folds to use in cross-validation
- parallel = TRUE to enable parallel cross-validation (TRUE by default)
- seed to set the random seed that controls the random fold assignment
- normalize = TRUE to normalize matrix G (such that the column-wise sd is equal to 1) and vector E (TRUE by default)
- normalize_response = TRUE to normalize vector Y (FALSE by default)

There is an option to use a custom grid in the gesso.cv() function by specifying the grid parameter instead of the grid_size parameter. The user can also specify the alpha parameter to tune only over lambda_1 while assuming lambda_2 = lambda_1 * alpha, as sketched below.
The verbose parameter is set to TRUE by default in gesso.cv(), but we can set it to FALSE to avoid printing messages about partial execution time.
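A minimal sketch of these two options (the grid and alpha values here are illustrative only, not recommendations):

custom_grid = 10^seq(-3, 0, length.out=20)   # user-defined lambda grid
tune_model_grid = gesso.cv(G=data$G_train, E=data$E_train, Y=data$Y_train,
                           family=family, grid=custom_grid,
                           parallel=TRUE, nfolds=3, seed=1)

# tune over lambda_1 only, with lambda_2 fixed to lambda_1 * alpha
tune_model_alpha = gesso.cv(G=data$G_train, E=data$E_train, Y=data$Y_train,
                            family=family, grid_size=20, alpha=0.5,
                            parallel=TRUE, nfolds=3, seed=1)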
start = Sys.time()
tune_model = gesso.cv(G=data$G_train, E=data$E_train, Y=data$Y_train,
family=family, grid_size=20, tolerance=1e-4,
grid_min_ratio=1e-2,
parallel=TRUE, nfolds=3,
normalize=TRUE,
normalize_response=TRUE,
seed=1,
                      max_iterations=10000)
Sys.time() - start
## Compute grid:
## Time difference of 0.0005173683 secs
## Parallel cv:
## Time difference of 1.47404 secs
## Fit on the full dataset:
## Time difference of 0.7043262 secs
## Time difference of 2.185213 secs
gesso.coef() function
To obtain the interaction and main effect coefficients corresponding to the model with a particular pair of tuning parameters, we use the gesso.coef() function. We need to specify a model fit object and a pair of lambda = (lambda_1, lambda_2) values organized in a tibble (e.g. lambda = tibble(lambda_1=tune_model$grid[1], lambda_2=tune_model$grid[1])).
Below we set fit = tune_model$fit, the model fit on the full dataset, and lambda = tune_model$lambda_min, the pair of tuning parameters corresponding to the minimum cross-validated error that the gesso.cv() function returns.
coefficients = gesso.coef(fit=tune_model$fit, lambda=tune_model$lambda_min)
gxe_coefficients = coefficients$beta_gxe
g_coefficients = coefficients$beta_g
Check if all non-zero interaction features were recovered by the model
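The comparison below (true non-zero interaction coefficients next to their estimates) can be obtained with, for example:

cbind(data$Beta_GxE[data$Beta_GxE != 0],
      gxe_coefficients[data$Beta_GxE != 0])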
## [,1] [,2]
## [1,] 1.5 1.4029323
## [2,] -1.5 -1.0089132
## [3,] -1.5 -1.1648283
## [4,] 1.5 0.7450604
## [5,] 1.5 1.3221858
Check that the largest estimated interaction effects correspond to the true non-zero coefficients
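One way to check this (an illustrative snippet) is to order the true coefficients by the magnitude of the estimates:

(data$Beta_GxE[order(abs(gxe_coefficients), decreasing=TRUE)])[1:10]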
## [1] 1.5 1.5 -1.5 -1.5 1.5 0.0 0.0 0.0 0.0 0.0
Calculate the principal selection metrics with the selection.metrics() function available in the package.
selection_gesso = selection.metrics(true_b_g=data$Beta_G, true_b_gxe=data$Beta_GxE,
estimated_b_g=g_coefficients,
estimated_b_gxe=gxe_coefficients)
cbind(selection_gesso)
## selection_gesso
## b_g_non_zero 46
## b_gxe_non_zero 22
## mse_b_g 0.1957665
## mse_b_gxe 0.4391993
## sensitivity_g 1
## specificity_g 0.9076923
## precision_g 0.2173913
## sensitivity_gxe 1
## specificity_gxe 0.956962
## precision_gxe 0.2272727
## auc_g 1
## auc_gxe 1
Compare with the standard Lasso model (we use the glmnet package).
set.seed(1)
tune_model_glmnet = cv.glmnet(x=cbind(data$E_train, data$G_train,
data$G_train * data$E_train),
y=data$Y_train,
nfolds=3,
family=family)
coef_glmnet = coef(tune_model_glmnet, s=tune_model_glmnet$lambda.min)
g_glmnet = coef_glmnet[3: (p + 2)]
gxe_glmnet = coef_glmnet[(p + 3): (2 * p + 2)]
cbind(data$Beta_GxE[data$Beta_GxE != 0], gxe_glmnet[data$Beta_GxE != 0])
## [,1] [,2]
## [1,] 1.5 1.35563472
## [2,] -1.5 0.00000000
## [3,] -1.5 0.00000000
## [4,] 1.5 0.03736643
## [5,] 1.5 0.93958681
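As above, we can check which true coefficients correspond to the largest glmnet interaction estimates (again an illustrative snippet):

(data$Beta_GxE[order(abs(gxe_glmnet), decreasing=TRUE)])[1:10]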
## [1] 1.5 1.5 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
selection_glmnet = selection.metrics(data$Beta_G, data$Beta_GxE, g_glmnet, gxe_glmnet)
cbind(selection_gesso, selection_glmnet)
## selection_gesso selection_glmnet
## b_g_non_zero 46 25
## b_gxe_non_zero 22 29
## mse_b_g 0.1957665 0.2679464
## mse_b_gxe 0.4391993 1.181034
## sensitivity_g 1 1
## specificity_g 0.9076923 0.9615385
## precision_g 0.2173913 0.4
## sensitivity_gxe 1 0.6
## specificity_gxe 0.956962 0.9341772
## precision_gxe 0.2272727 0.1034483
## auc_g 1 1
## auc_gxe 1 0.7756962
We see that the gesso model outperforms the Lasso model in terms of the detection of interaction terms in all important selection metrics, including GxE detection AUC (auc_gxe), sensitivity (sensitivity_gxe), and precision (precision_gxe). The estimation bias of the interaction coefficients is also substantially lower for the gesso model (mse_b_gxe).
gesso.coefnum() function
To control how parsimonious our model is with respect to the selected non-zero interaction terms, we can use the gesso.coefnum() function. We obtain the interaction and main effect coefficients corresponding to the target GxE model, where instead of specifying the tuning parameters lambda we set the target_b_gxe_non_zero parameter: the number of non-zero interactions we want to include in the model, interpreted as at most that many (less_than=TRUE, the default) or at least that many (less_than=FALSE).
The target model is selected based on the averaged cross-validation (cv) results (tune_model$cv_result): for each pair of parameters lambda = (lambda_1, lambda_2) in the grid and for each cv fold we obtain the number of non-zero estimated interaction terms, then average the cv results by lambda and choose the tuning parameters corresponding to the minimum average cv loss that have at most (or at least) target_b_gxe_non_zero non-zero interaction terms. The returned coefficients are obtained by fitting the model on the full dataset with the selected tuning parameters.
Note that the number of estimated non-zero interactions will only approximately reflect the numbers obtained on the cv datasets.
coefficients = gesso.coefnum(cv_model=tune_model, target_b_gxe_non_zero=5)
gxe_coefficients = coefficients$beta_gxe
g_coefficients = coefficients$beta_g
Calculate principal selection metrics
selection_gesso = selection.metrics(data$Beta_G, data$Beta_GxE, g_coefficients,
gxe_coefficients)
cbind(selection_gesso, selection_glmnet)
## selection_gesso selection_glmnet
## b_g_non_zero 34 25
## b_gxe_non_zero 5 29
## mse_b_g 0.341325 0.2679464
## mse_b_gxe 1.153749 1.181034
## sensitivity_g 1 1
## specificity_g 0.9384615 0.9615385
## precision_g 0.2941176 0.4
## sensitivity_gxe 0.8 0.6
## specificity_gxe 0.9974684 0.9341772
## precision_gxe 0.8 0.1034483
## auc_g 1 1
## auc_gxe 0.8992405 0.7756962
Here we see that the number of non-zero interactions in the model (b_gxe_non_zero) is 5 (vs 22 non-zero terms in the model corresponding to the minimal cross-validated error), leading to substantially increased precision and decreased sensitivity for GxE detection.
With gesso.coef() we obtain the coefficients corresponding to the best model in terms of minimal cross-validated error and use the gesso.predict() function to apply our estimated model to a new test dataset, which can also be generated by the data.gen() function.
We calculate the test R-squared to assess model performance.
coefficients = gesso.coef(tune_model$fit, tune_model$lambda_min)
beta_0 = coefficients$beta_0; beta_e = coefficients$beta_e
beta_g = coefficients$beta_g; beta_gxe = coefficients$beta_gxe
new_G = data$G_test; new_E = data$E_test
new_Y = gesso.predict(beta_0, beta_e, beta_g, beta_gxe, new_G, new_E)
test_R2_gesso = cor(new_Y, data$Y_test)^2
Compare with the standard Lasso (we use the glmnet package).
new_Y_glmnet = predict(tune_model_glmnet, newx=cbind(new_E, new_G, new_G * new_E),
s=tune_model_glmnet$lambda.min)
test_R2_glmnet = cor(new_Y_glmnet[,1], data$Y_test)^2
cbind(test_R2_gesso, test_R2_glmnet)
## test_R2_gesso test_R2_glmnet
## [1,] 0.983552 0.9662947
The package allows adding unpenalized covariates to the model (for example, important demographic adjustment variables like age and gender). In this example, we first generate data with additional covariates (by specifying the n_confounders parameter in data.gen()) and then show how to fit the model: the user just needs to pass a numeric matrix C of covariates organized by columns.
family = "gaussian"
sample_size = 180; p = 400; n_g_non_zero = 10; n_gxe_non_zero = 5
n_confounders = 2
grid = 10^seq(-3, log10(1), length.out = 20)
data = data.gen(seed=1, sample_size=sample_size, p=p,
n_g_non_zero=n_g_non_zero,
n_gxe_non_zero=n_gxe_non_zero,
mode="strong_hierarchical",
family=family,
n_confounders=n_confounders)
tune_model = gesso.cv(G=data$G_train, E=data$E_train, Y=data$Y_train,
C=data$C_train,
family=family, grid=grid, tolerance=1e-4,
parallel=TRUE, nfolds=3,
normalize=TRUE,
normalize_response=TRUE,
verbose=FALSE,
seed=1)
To convert a data matrix to the sparse format, use the as() function.
G_train_sparse = as(data$G_train, "dgCMatrix")
start = Sys.time()
fit = gesso.fit(G=G_train_sparse, E=data$E_train, Y=data$Y_train,
tolerance=1e-4,
grid_size=20, grid_min_ratio=1e-1,
normalize=TRUE,
normalize_response=TRUE)
time_sparse = difftime(Sys.time(), start, units="secs"); time_sparse
## Time difference of 0.09664106 secs
Filebacked matrix (bigmemory package) option example
For out-of-RAM objects, the function attach.big.matrix("g_train.desc") can be used to load the data, given that the files have already been created in the correct format (g_train.desc, g_train.bin).
## how to create filebacked matrix
G_train = filebacked.big.matrix(nrow=dim(data$G_train)[1],
ncol=dim(data$G_train)[2],
backingpath="./",
backingfile="g_train.bin",
descriptorfile="g_train.desc")
for (i in 1:dim(data$G_train)[2]){
G_train[, i] = data$G_train[, i]
}
## how to attach filebacked matrix
G_train = attach.big.matrix("g_train.desc")
## is.filebacked(G_train) should return TRUE
fit_bm = gesso.fit(G_train, data$E_train, data$Y_train, family=family)
tune_model_bm = gesso.cv(G_train, data$E_train, data$Y_train, family=family)
We use the same dataset we generated to demonstrate how our screening rules work.
Working sets (WS) are the sets of variables left after applying our screening rules to the full set of predictors. A histogram of the working set sizes shows that most of the time we only have to fit a handful of variables.
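A minimal sketch of such a histogram, using the working_set_size vector returned by the gesso.fit() call above:

hist(fit$working_set_size,
     breaks=30,
     xlab="working set size",
     main="Working set sizes over the (lambda_1, lambda_2) grid")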
The 2-dimensional plot below shows log10(WS size) for each (λ1, λ2) fit.
df = data.frame(lambda_1_factor = factor(fit$lambda_1),
lambda_2_factor = factor(fit$lambda_2),
ws = fit$working_set_size)
log_0 = function(x){
return(ifelse(x == 0, 0, log10(x)))
}
ggplot(df, aes(lambda_1_factor, lambda_2_factor, fill=log_0(ws))) +
scale_fill_distiller(palette = "RdBu") +
scale_x_discrete("lambda_1", breaks=c(1)) +
scale_y_discrete("lambda_2", breaks=c(1)) +
labs(fill='log WS') +
geom_tile()