groupWQS
VignetteWeighted Quantile Sum (WQS) regression (Carrico et al., 2015) was developed to assess the effect of chemical mixtures (usually from the environment) on the risk of an adverse event such as cancer. The method allows one to identify “bad actor” chemicals through non-zero weights where the data are highly correlated and high dimensional (typical in chemical concentration data). Studies have shown it to be more accurate in identifying these chemicals (via sensitivity and specificity) than traditional regression and regularization methods such as LASSO, adaptive LASSO, and elastic net (Czarnota et al., 2015 Cancer Informatics). WQS regression treats all chemicals as being part of one group and having the same direction of association with the outcome. Grouped Weighted Quantile Sum (GWQS) regression (Czarnota and Wheeler, 2016) extends this method to allow one to place chemicals into groups such that different magnitudes and direction of associations can be determined for each pre-defined group of chemicals. For example, some types of chemicals may have a harmful effect (i.e. increase risk of cancer) while others may have a protective effect.
Specifically, GWQS uses data with i=1,…,C components split between j=1,…,K groups. Within each of these K groups, the components are scored into quantiles that can be plausibly combined into an index. The weights wi are constrained to be between 0 and 1 and sum to 1, which in turn addresses potential issues with collinearity and high dimensionality. The general form of GWQS regression can be expressed as:
$$ \begin{equation} \tag{1} g(\mu)=\beta_{0}+\sum_{j=1}^{K}\left[\beta_{j}\left(\sum_{i=1}^{c_{j}} w_{j i} q_{j i}\right)\right]+z^{T} \phi \end{equation} $$
where wji represents the weight for the ith chemical component qji, and the summation $\sum^{c_i}_{i=1}w_{ji}q_{ji}$ represents a weighted index for the set of cj chemicals of interest within group j. The vector z is a vector of covariates for which to adjust. On the right hand side, g is a monotonic and differentiable link function that relates to the mean μ. The link function used depends on the type of outcome (e.g. continuous or binary). For inference, the equation (1) is run for B bootstraps where in each bootstrap sample the significance of the estimated vector of weights is evaluated through the significance (P≤0.05) of β̂j, the parameter estimate of weighted index j. The estimated index for each of the j chemical groups is obtained by $GWQS_j = \sum^{c_i}_{i=1}\bar{w_{ji}}q_{ji}$ where $\bar{w_{ji}}$ is the weighted average of the β̂j test statistics for component i. Significance of the group effects is assessed using the following model with the test dataset:
$$ \tag{2} g(\mu)=\beta_{0}+\sum_{j=1}^{K}\left[GWQS_j\right]+z^{T} \phi $$
Nonlinear optimization of (1) using the solnp
function
from R is used to estimate the unknown weights and beta parameters. This
function uses an augmented Lagrange multiplier method with a sequential
quadratic programming interior algorithm.
The main function of groupWQS
is gwqs.fit
,
which implements GWQS regression. The functions make.X
and
make.x.s
are used to prepare data for regression, where
make.X
organizes user data into a matrix of desired groups
while make.x.s
forms a vector detailing the order and size
of each group in the exposure matrix. The output of these two functions
are used as arguments in the main gwqs.fit
function. The
model fit object that is returned can be fed into the
plot.weights
function, which visualizes the weight
estimated for individual chemicals in each of the regressed groups. We
provide an example analysis below to illustrate the usage of
groupWQS
.
groupWQS
The package groupWQS
requires a dataframe where
component variables used to construct indices are named. As an example,
we will look at simdata
:
data(simdata)
head(simdata)
#> pcb_118 pcb_138 pcb_153 pcb_180 pcb_192 as cu pb
#> 1 2.360824 2.700214 3.536807 1.3630520 3.568884 4.332258 3.4998226 7.6088824
#> 2 2.788382 1.353622 2.328022 3.3261689 -2.020155 -1.146443 0.8174605 -1.6830696
#> 3 6.246413 3.166612 6.851131 7.2149506 10.504976 5.042383 6.2631718 7.2252882
#> 4 3.642265 2.881070 3.760282 4.3231974 3.368709 2.814906 3.9192663 2.1143266
#> 5 3.481870 3.820329 3.178467 0.9451579 3.786788 6.331203 2.2228106 4.8262313
#> 6 4.088892 4.799032 5.467218 4.2547060 6.078460 2.622098 1.8577607 0.6898549
#> sn carbaryl propoxur methoxychlor diazinon chlorpyrifos Y
#> 1 3.975121 4.7416076 4.203351 4.598234 5.5571762 5.4269398 0
#> 2 1.710898 -0.4558587 2.255870 2.751266 2.6322118 -0.6704027 0
#> 3 4.857086 7.2872371 5.274585 6.739046 12.0197435 8.1984625 0
#> 4 3.798077 0.7455287 3.111616 2.940376 0.6781079 1.5744175 0
#> 5 4.409647 0.6547444 2.309184 3.652926 0.6819997 1.7293713 1
#> 6 4.374542 4.3322814 4.672852 4.023438 6.7486293 5.9936270 1
These data represent chemical concentrations for 14 suspected
carcinogens found in the environment, and the Y
variable
represents our case-control outcomes, 1 for cancer and 0 for
cancer-free.
To prepare our data and let gwqs.fit
know how many
groups we want in our model, as well as which variables belong to which
group, we use two functions:
make.X
make.X
organizes component variables into a matrix
suitable for analysis. To specify variable groups, we make a list of
string vectors, each vector corresponding to a group and each string
corresponding to a variable name. This list is then used as an argument
in make.X
.
group_list <- list(c("pcb_118", "pcb_138", "pcb_153", "pcb_180", "pcb_192"),
c("as", "cu", "pb", "sn"),
c("carbaryl", "propoxur", "methoxychlor", "diazinon", "chlorpyrifos"))
X <- make.X(simdata, num.groups = 3, groups = group_list)
head(X)
#> pcb_118 pcb_138 pcb_153 pcb_180 pcb_192 as cu
#> [1,] 2.360824 2.700214 3.536807 1.3630520 3.568884 4.332258 3.4998226
#> [2,] 2.788382 1.353622 2.328022 3.3261689 -2.020155 -1.146443 0.8174605
#> [3,] 6.246413 3.166612 6.851131 7.2149506 10.504976 5.042383 6.2631718
#> [4,] 3.642265 2.881070 3.760282 4.3231974 3.368709 2.814906 3.9192663
#> [5,] 3.481870 3.820329 3.178467 0.9451579 3.786788 6.331203 2.2228106
#> [6,] 4.088892 4.799032 5.467218 4.2547060 6.078460 2.622098 1.8577607
#> pb sn carbaryl propoxur methoxychlor diazinon
#> [1,] 7.6088824 3.975121 4.7416076 4.203351 4.598234 5.5571762
#> [2,] -1.6830696 1.710898 -0.4558587 2.255870 2.751266 2.6322118
#> [3,] 7.2252882 4.857086 7.2872371 5.274585 6.739046 12.0197435
#> [4,] 2.1143266 3.798077 0.7455287 3.111616 2.940376 0.6781079
#> [5,] 4.8262313 4.409647 0.6547444 2.309184 3.652926 0.6819997
#> [6,] 0.6898549 4.374542 4.3322814 4.672852 4.023438 6.7486293
#> chlorpyrifos
#> [1,] 5.4269398
#> [2,] -0.6704027
#> [3,] 8.1984625
#> [4,] 1.5744175
#> [5,] 1.7293713
#> [6,] 5.9936270
We should note that the data in X
must be chemical
concentrations, which will be converted to quantiles during
analysis.
make.x.s
make.x.s
creates a vector which tells
gwqs.fit
the size and order of our groups. Conveniently, it
uses the same arguments as make.X
.
With these steps all chemical concentration data have been prepared for analysis. To complete data preparation we define our outcome object
and split our data into training and validation sets. The training data will be used to estimate the weights in the indices and the validation data will be used to estimate the exposure effects and perform significance testing.
To fit a model we will use the gwqs.fit
function. First
we pass the objects defined above (note: If arguments
are not supplied for y.train
and x.train
the
y
and x
data passed will be used for both
training and validation). In addition to the x
and
y
parameters there are z
and
z.train
parameters for covariates, but as we are not
including covariates in this model these paramters will be left empty.
The B
parameter tells the function how many bootstrap
samples to use in estimation and n.quantiles
specifies the
number of quantiles to use. The func
parameter tells the
function which outcome type to fit the model for (continuous and binary
outcomes are currently supported). In the example below, five quantiles
(quintiles) are used for the chemicals.
results <- gwqs.fit(y = Y.valid, y.train = Y.train, x = X.valid, x.train = X.train, x.s = x.s,
B=100, n.quantiles = 5, func = "binary")
If there are convergence problems after model fitting the
tol
and delta
parameters can be adjusted. A
larger tol
value will relax the criteria for convergence
and a larger delta
will increase the step size for the
bootstrap procedure; both will speed up convergence.
summary(results$fit)$coefficients
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -0.34496522 0.27666370 -1.246876 0.2124431005
#> GWQS1 -0.12867603 0.10493642 -1.226229 0.2201126778
#> GWQS2 0.01886734 0.10967855 0.172024 0.8634186734
#> GWQS3 0.36015377 0.09277427 3.882044 0.0001035823
The results above are the regression coefficients for the estimated exposure indices. This is in the validation set.
Next is the AIC for the model we have fit.
exp(confint(results$fit))
#> Waiting for profiling to be done...
#> 2.5 % 97.5 %
#> (Intercept) 0.4097283 1.215472
#> GWQS1 0.7144035 1.078959
#> GWQS2 0.8215665 1.264150
#> GWQS3 1.1982785 1.725168
And finally are the 95% CIs for the odds ratios of the indicies. We can see from the results that of our three groups only the third, GWQS3, is statistically significant at the 0.05 level, and has a 95% CI of (1.2, 1.7) for its odds ratio.
The weights for individiual chemical components in each group are
listed in the object returned by gwqs.fit
. To view them
graphically in descending order of importance, we can use the
plot.weights
function. Besides the model fit object, we
just need to create a string vector naming our groups. This vector’s
order matters - names must match their corresponding index in the group
list defined above.
Our one significant index was a group of insecticide compounds. Of the five chemicals it appears methoxychlor was mostly responsible for the association GWQS3 had with the outcome variable.