This vignette provides instructions on how to perform marginal modeling for exposure data with repeated measures and non-detects. Environmental exposure and biomonitoring data from environmental and occupational studies often exhibit right-skewed distributions and left-censored. Left censoring occurs when laboratory instruments have a limit of detection (LOD) below which measurements are not provided.
To conduct regression models for repeated measures data with non-detects, it is necessary to install the R package:
After building and installing the package, you can view the vignette by using the command:
To impute values that are below the LOD, the function
Fillin()
can be used. This section demonstrates the
function using an example from Ganser and Hewett (2010). The available
substitution methods for this function include: “None”, “LOD”, “LOD2”,
“LODS2”, and “QQplot”. A detailed description of each substitution
method can be found in the Reference Manual of the Documentation.
y <- c(0,0,0,3.06,4.41,7.23,8.29,9.52,19.94,20.25)
## Limit of detection (LOD) = 3
lod <- 3
Fillin(y, lod, "BetaMean")
#> [1] 1.762431 1.762431 1.762431 3.060000 4.410000 7.230000 8.290000
#> [8] 9.520000 19.940000 20.250000
Fillin(y, lod, "BetaGM")
#> [1] 1.482424 1.482424 1.482424 3.060000 4.410000 7.230000 8.290000
#> [8] 9.520000 19.940000 20.250000
Conventional statistical analyses often assume that responses or outcome data follow a normal distribution. If the underlying distribution is log-normal, data transformation using the natural logarithm required. To analyze such data, marginal models utilizing estimation methods such as generalized estimating equations (GEE), quadratic inference functions (QIF), and generalized method of moments (GMM) are employed. These methods incorporate the imputation of measurements below the LOD using single and multiple value imputation techniques (Chen et al., 2024).
The 15th dataset from a simulation study includes 100 subjects (sample size is 100), each with three repeated measurements. The independent variables or covariates (x1 and x2) are simulated from a Bernoulli distribution with a parameter value of p = 0.5 and a uniform distribution U(0, 1), respectively. Correlated errors for repeated measures models are accounted for and assumed to follow a multivariate normal distribution, MVN(0, R(α)). A first-order autoregressive (AR-1) correlation structure with a correlation parameter of α = 0.7 is incorporated. The true values of 1, 1, and 1 correspond to the marginal intercept and two slopes.
To conduct regression analysis using the 15th simulated dataset,
assume the LOD is 2; thus, any measurements below 2 would be imputed
using substitution approaches, such as “QQplot”. For repeated measures
in a cluster study, an “exchangeable” correlation structure is used,
while “AR-1” is appropriate for longitudinal datasets where subjects are
measured over time. The example below employs the function
Modified.GEE()
for the GEE method. The atomic vector
“typed” specifies the types of time-dependent covaraites, with the
length of the vector equal to the number of regression parameters,
excluding the intercept. For time-independent covariates or those in a
cluster study, “1” is assigned, thus “c(1,1)” is used for “typed”. The
maximum number of iterations is set to 1,000.
id=as.matrix(as.vector(t(simdata15$id)))
y=as.matrix(as.vector(t(simdata15$y)))
x1=as.matrix(as.vector(t(simdata15$x1)))
x2=as.matrix(as.vector(t(simdata15$x2)))
x=cbind(x1,x2)
## LOD = 2 is equivalent to detection proportion = 56.3% (censoring proportion = 43.7%).
lod=2
## Intercept is not included in the "x" and "typed".
## Modified.GEE(id, y, x, lod, substitue, corstr, typetd, maxiter)
Modified.GEE(id, y, x, lod, "QQplot", "AR-1", c(1,1), 1000)
#> Results include typical asymptotic standard error (SE) estimates and bias-corrected SE estimates that use covariance inflation corrections.
#> Covariance inflation corrections are based on the corrections of Mancl and DeRouen (MD) (2001), and Kauermann and Carroll (KC) (2001).
#> Last covariance inflation correction averages the above corrections (AVG) (Ford and Westgate, 2017, 2018).
Parameter | Estimate | Standard Error | Wald Statistic | p-value | TECM | CIC |
---|---|---|---|---|---|---|
Typical Asymptotic SE | ||||||
0 | 1.41889291637309 | 0.117296637893661 | 12.0966205157511 | 0 | 0.08268899246403 | 6.63582649603307 |
1 | 0.708428945190931 | 0.120271120631658 | 5.89026643694926 | 5.55713504102329e-08 | 0 | 0 |
2 | 1.14012150968324 | 0.233378123963834 | 4.88529726059466 | 4.07412042346955e-06 | 0 | 0 |
Bias-Corrected SE (MD) | ||||||
0 | 1.41889291637309 | 0.121569957897582 | 11.671410773774 | 0 | 0.0891285922808028 | 7.1239067842054 |
1 | 0.708428945190931 | 0.124580417761904 | 5.6865192613567 | 1.36804927164391e-07 | 0 | 0 |
2 | 1.14012150968324 | 0.242547020447278 | 4.70062055423626 | 8.57351781924365e-06 | 0 | 0 |
Bias-Corrected SE (KC) | ||||||
0 | 1.41889291637309 | 0.119412406036418 | 11.8822906552972 | 0 | 0.085845453509029 | 6.8752373440091 |
1 | 0.708428945190931 | 0.122403870291727 | 5.78763517446404 | 8.76355334966661e-08 | 0 | 0 |
2 | 1.14012150968324 | 0.237914739625835 | 4.79214323364871 | 5.94118008701017e-06 | 0 | 0 |
Bias-Corrected SE (AVG) | ||||||
0 | 1.41889291637309 | 0.120496011092954 | 11.7754347509356 | 0 | 0.085845453509029 | 6.99957206410725 |
1 | 0.708428945190931 | 0.12349693913641 | 5.73640893567756 | 1.0986445742045e-07 | 0 | 0 |
2 | 1.14012150968324 | 0.240242045091072 | 4.74572013092478 | 7.15943746354419e-06 | 0 | 0 |
Estimated correlation | 0.664821305215864 | 0 | 0 | 0 | 0 | 0 |
Estimated dispersion | 0.511033918398329 | 0 | 0 | 0 | 0 | 0 |
The QIF method is performed using the function
Modified.GEE()
, which typically demonstrates efficiency
advantages over GEE in large sample sizes.
## Gets initial estimates for the QIF approach through independence structure
initial=glm(y ~ x1 + x2, data=simdata15, family=gaussian)
beta_initial=as.matrix(initial$coefficients)
## Intercept is not included in the "x" and "typed".
## Modified.QIF(id, y, x, lod, substitue, corstr, beta, typetd, maxiter)
Modified.QIF(id, y, x, lod, "QQplot", "exchangeable", beta_initial, c(1,1), 1000)
#> Results include typical asymptotic standard error (SE) estimates and bias-corrected SE estimates that use covariance inflation corrections.
#> Covariance inflation corrections are based on the corrections of Mancl and DeRouen (MD) (2001), and Kauermann and Carroll (KC) (2001).
#> Last covariance inflation correction averages the above corrections (AVG) (Ford and Westgate, 2017, 2018).
Parameter | Estimate | Standard Error | Wald Statistic | p-value | TECM | CIC | |
---|---|---|---|---|---|---|---|
Typical Asymptotic SE | |||||||
(Intercept) | 0 | 1.43745433084147 | 0.163267610463611 | 8.8042835119575 | 5.10702591327572e-14 | 0.123896235470461 | 7.60882967081081 |
x1 | 1 | 0.71367264050939 | 0.14893769126106 | 4.79175307785897 | 5.95052545859787e-06 | 0 | 0 |
x2 | 2 | 1.12391823394195 | 0.273966215007961 | 4.10239720218527 | 8.51430721415802e-05 | 0 | 0 |
Bias-Corrected SE (MD) | |||||||
(Intercept) | 0 | 1.43745433084147 | 0.123035028606317 | 11.6832933443774 | 0 | 0.0911928228535924 | 5.86921377221323 |
x1 | 1 | 0.71367264050939 | 0.126351479859633 | 5.6483124796182 | 1.61730083991785e-07 | 0 | 0 |
x2 | 2 | 1.12391823394195 | 0.24513365359884 | 4.58492017493949 | 1.3554024130169e-05 | 0 | 0 |
Bias-Corrected SE (KC) | |||||||
(Intercept) | 0 | 1.43745433084147 | 0.12084505122269 | 11.8950202453269 | 0 | 0.0878152589014621 | 5.66351137004652 |
x1 | 1 | 0.71367264050939 | 0.124140764428436 | 5.74889838801342 | 1.0398148631019e-07 | 0 | 0 |
x2 | 2 | 1.12391823394195 | 0.240417975832863 | 4.67485107986805 | 9.49951150097661e-06 | 0 | 0 |
Bias-Corrected SE (AVG) | |||||||
(Intercept) | 0 | 1.43745433084147 | 0.121944956167059 | 11.7877309240427 | 0 | 0.0878152589014621 | 5.76636257112988 |
x1 | 1 | 0.71367264050939 | 0.125250999707778 | 5.69793967452917 | 1.30116281304993e-07 | 0 | 0 |
x2 | 2 | 1.12391823394195 | 0.242787264112316 | 4.6292306066846 | 1.13820322262814e-05 | 0 | 0 |
Another estimation method, GMM, is employed using the function
Modified.GMM()
.
## Gets initial estimates for the GMM approach through independence structure
initial=glm(y ~ x1 + x2, data=simdata15, family=gaussian)
beta_initial=as.matrix(initial$coefficients)
## Intercept is not included in the "x" and "typed".
## Modified.GMM(id, y, x, lod, substitue, beta, maxiter)
Modified.GMM(id, y, x, lod, "QQplot", beta_initial, 1000)
#> Results include typical asymptotic standard error (SE) estimates and bias-corrected SE estimates that use covariance inflation corrections.
#> Covariance inflation corrections are based on the corrections of Mancl and DeRouen (MD) (2001), and Kauermann and Carroll (KC) (2001).
#> Last covariance inflation correction averages the above corrections (AVG) (Ford and Westgate, 2017, 2018).
Parameter | Estimate | Standard Error | Wald Statistic | p-value | TECM | CIC | |
---|---|---|---|---|---|---|---|
Typical Asymptotic SE | |||||||
(Intercept) | 0 | 1.43745433084147 | 0.118697699482703 | 12.1102122206753 | 0 | 0.0845688751172119 | 5.46551267184986 |
x1 | 1 | 0.71367264050939 | 0.121974803192421 | 5.85098415271505 | 6.6182999303166e-08 | 0 | 0 |
x2 | 2 | 1.12391823394195 | 0.23580050602341 | 4.76639449548242 | 6.5895608287736e-06 | 0 | 0 |
Bias-Corrected SE (MD) | |||||||
(Intercept) | 0 | 1.43745433084147 | 0.123035028606303 | 11.6832933443787 | 0 | 0.0911928228535799 | 5.86921377221281 |
x1 | 1 | 0.71367264050939 | 0.126351479859632 | 5.64831247961822 | 1.61730083991785e-07 | 0 | 0 |
x2 | 2 | 1.12391823394195 | 0.245133653598822 | 4.58492017493984 | 1.3554024130169e-05 | 0 | 0 |
Bias-Corrected SE (KC) | |||||||
(Intercept) | 0 | 1.43745433084147 | 0.120845051222677 | 11.8950202453283 | 0 | 0.0878152589014506 | 5.66351137004614 |
x1 | 1 | 0.71367264050939 | 0.124140764428435 | 5.74889838801346 | 1.0398148631019e-07 | 0 | 0 |
x2 | 2 | 1.12391823394195 | 0.240417975832846 | 4.67485107986838 | 9.49951150097661e-06 | 0 | 0 |
Bias-Corrected SE (AVG) | |||||||
(Intercept) | 0 | 1.43745433084147 | 0.121944956167046 | 11.787730924044 | 0 | 0.0878152589014506 | 5.76636257112948 |
x1 | 1 | 0.71367264050939 | 0.125250999707777 | 5.69793967452921 | 1.30116281304993e-07 | 0 | 0 |
x2 | 2 | 1.12391823394195 | 0.242787264112298 | 4.62923060668493 | 1.13820322262814e-05 | 0 | 0 |
To handle time-dependent covariates, we use the 58th dataset from a simulation study, which includes 100 subjects (sample size is 100), each with three measurements collected over time. Detailed model mechanism are described in the setting II for type III time-dependent covariate on page 90 of Lai and Small (2007).
data(simdata58)
head(simdata58)
#> y int x1 id visit
#> 1 2.13315183 1 -0.1061468 1 1
#> 2 -0.93398126 1 -0.1406264 1 2
#> 3 0.38660420 1 -0.2905621 1 3
#> 4 0.70977961 1 0.1533462 2 1
#> 5 -0.07722418 1 0.3018084 2 2
#> 6 0.71894327 1 -0.1685652 2 3
Failure to account for time-dependency in covariates can lead to
inefficient regression parameter estimation. The function
Selected.GEE()
allows the selection of a type of
time-dependent covariate through a marginal mean regression model using
the GEE estimation method for longitudinal data with values below the
LOD. The selection approach applied to choose a type of time-dependency
is the empirical mean squared error (MSE) minimization criterion (Chen
and Westgate, 2017, 2019). Given the longitudinal nature of the dataset,
“AR-1” is used as the working correlation structure.
id=as.matrix(as.vector(t(simdata58$id)))
y=as.matrix(as.vector(t(simdata58$y)))
x1=as.matrix(as.vector(t(simdata58$x1)))
## LOD = 0.05 is equivalent to detection proportion = 50.7% (censoring proportion = 49.3%).
lod=0.05
## Intercept is not included in the "x".
## Selected.GEE(id, y, x, lod, substitue, corstr, maxiter)
Selected.GEE(id, y, x1, lod, "MIWithID", "AR-1", 1000)
#> Selected type of time-dependent covariate
#> 3
#> Results based on the empirical MSE minimization criterion (EMMC)
Type | EMMC |
---|---|
1 | 0.0015363 |
2 | 0.0014186 |
3 | 0.0011418 |
Once the type of time-dependent covaraite is selected, i.e., in this
case, type 3, it can be updated in the “typed” vector
of the function Modified.GEE()
.
id=as.matrix(as.vector(t(simdata58$id)))
y=as.matrix(as.vector(t(simdata58$y)))
x1=as.matrix(as.vector(t(simdata58$x1)))
Modified.GEE(id, y, x1, lod, "MIWithID", "AR-1", c(3), 1000)
#> Results include typical asymptotic standard error (SE) estimates and bias-corrected SE estimates that use covariance inflation corrections.
#> Covariance inflation corrections are based on the corrections of Mancl and DeRouen (MD) (2001), and Kauermann and Carroll (KC) (2001).
#> Last covariance inflation correction averages the above corrections (AVG) (Ford and Westgate, 2017, 2018).
Parameter | Estimate | Standard Error | Wald Statistic | p-value | TECM | CIC |
---|---|---|---|---|---|---|
Typical Asymptotic SE | ||||||
0 | 0.452702945566748 | 0.0373466589720912 | 12.1216450956175 | 0 | 0.00249541946313904 | 2.18313710952326 |
1 | 0.250741030624437 | 0.0331759932294629 | 7.55790576909568 | 2.19827489544855e-11 | 0 | 0 |
Bias-Corrected SE (MD) | ||||||
0 | 0.452702945566748 | 0.0378597367787422 | 11.9573717115999 | 0 | 0.00258300405003841 | 2.26107917431158 |
1 | 0.250741030624437 | 0.0339064061953308 | 7.3950931036438 | 4.83537654361044e-11 | 0 | 0 |
Bias-Corrected SE (KC) | ||||||
0 | 0.452702945566748 | 0.037602008338319 | 12.0393288968402 | 0 | 0.00253863545272408 | 2.22158843525397 |
1 | 0.250741030624437 | 0.0335369113313834 | 7.47656897043457 | 3.26088045454753e-11 | 0 | 0 |
Bias-Corrected SE (AVG) | ||||||
0 | 0.452702945566748 | 0.0377310926162406 | 11.9981403711562 | 0 | 0.00253863545272408 | 2.24133380478277 |
1 | 0.250741030624437 | 0.0337221648380693 | 7.43549626272428 | 3.97779587046898e-11 | 0 | 0 |
Estimated correlation | 0.140270243814468 | 0 | 0 | 0 | 0 | 0 |
Estimated dispersion | 0.365242693950514 | 0 | 0 | 0 | 0 | 0 |
The QIF method can be performed using the function
Selected.QIF()
, and the selected type can be then updated
in the “typed” vector of the function Modified.QIF()
.
## Gets initial estimates for the QIF approach through independence structure
initial=glm(y ~ x1, data=simdata58, family=gaussian)
beta_initial=as.matrix(initial$coefficients)
## Intercept is not included in the "x" and "typed".
## Selected.QIF(id, y, x, lod, substitue, corstr, beta, maxiter)
Selected.QIF(id, y, x1, lod, "MIWithID", "AR-1", beta_initial, 1000)
#> Selected type of time-dependent covariate
#> 3
#> Results based on the empirical MSE minimization criterion (EMMC)
Type | EMMC |
---|---|
1 | 0.0027988 |
2 | 0.0024224 |
3 | 0.0010959 |
When original or transformed data do not follow a known distribution, modeling the conditional mean of the outcome variable may not be ideal, as the estimated mean and standard deviation can be sensitive to large values. In such cases, quantile regression serves as an alternative analytical method. It does not assume an underlying distribution, offers advantages for skewed data, and is robust to outliers. Using regression models, substitution or fill-in approaches such as single and multiple value imputation techniques can be employed to impute measurements below the LOD (Chen et al., 2021).
To conduct regression analysis using the 15th simulated dataset, where the LOD is set to 2, any measurements below 2 are imputed using substitution approaches such as “LOD2”. The median or 50th quantile (τ = 0.5) is used. In cluster studies, an “exchangeable” structure is applied to the working correlation, while an “AR-1” structure is used for longitudinal datasets where subjects are measured over time.
y=as.matrix(as.vector(t(simdata15$y)))
x1=as.matrix(as.vector(t(simdata15$x1)))
x2=as.matrix(as.vector(t(simdata15$x2)))
x=cbind(matrix(1,length(x1),1),x1,x2)
## LOD = 2 is equivalent to detection proportion = 56.3% (censoring proportion = 43.7%).
lod=2
## Median or 50th quantile is given.
tau=0.5
## Intercept is included in the "x" but not in the "typed".
## Quantile.FWZ(y, x, lod, substitue, tau, corstr, typetd, data)
Quantile.FWZ(y, x, lod, "LOD2", tau, "exchangeable", c(1,1), simdata15)
#> Quantile level and correlation structure are provided in the column of Intercept.
#> Intercept X2 X3
#> Quantile Level 0.5 0 0
#> Structure exchangeable 0 0
#> Estimate 0.7361 1.4998 0.7495
#> Standard Error 0.1007 0.353 0.7583
#> 95% CI Lower 0.5363 0.7993 -0.7553
#> 95% CI Upper 0.9359 2.2003 2.2543
#> P-value 0 8e-04 0.2564
To handle time-dependent covariates, we use the 58th simulated
dataset, which includes 100 subjects (sample size is 100), each with
three repeated measurements. The “AR-1” working correlation structure is
embedded in the function Quantile.select.FWZ()
because
time-dependent covariates are typically associated with longitudinal
studies. A detailed description of the selection process can be found in
Chen et al. (2024).
y=as.matrix(as.vector(t(simdata58$y)))
x1=as.matrix(as.vector(t(simdata58$x1)))
x=cbind(matrix(1,length(x1),1),x1)
## LOD = 0.05 is equivalent to detection proportion = 50.7% (censoring proportion = 49.3%).
lod=0.05
## 95th quantile is given.
tau=0.95
## Intercept is included in the "x".
## Quantile.select.FWZ(y, x, lod, substitue, tau, data)
Quantile.select.FWZ(y, x, lod, "LOD2", tau, simdata58)
#> beta.0 beta.1
#> Quantile Level 0.0000 0.9500
#> Type of Time-Dependency 0.0000 2.0000
#> Estimate 1.5715 0.5074
#> Standard Error 0.0932 0.0365
#> 95% CI Lower 1.3866 0.4350
#> 95% CI Upper 1.7564 0.5798
#> P-value 0.0000 0.0000
The selection criterion identifies type 2 as the
appropriate covariate type. The quantile regression result with the
updated covariate type of time-dependency is displayed. Additionally,
this selected type of time-dependent covariate (in this case,
type 2) can also be updated in the “typed” parameter of
the function Quantile.FWZ()
, enabling analysts to conduct
multivariable analysis.
y=as.matrix(as.vector(t(simdata58$y)))
x1=as.matrix(as.vector(t(simdata58$x1)))
x=cbind(matrix(1,length(x1),1),x1)
## LOD = 0.05 is equivalent to detection proportion = 50.7% (censoring proportion = 49.3%).
lod=0.05
## 95th quantile is given.
tau=0.95
## Intercept is included in the "x" but not in the "typed".
## Quantile.FWZ(y, x, lod, substitue, tau, corstr, typetd, data)
Quantile.FWZ(y, x, lod, "LOD2", tau, "AR-1", c(2), simdata58)
#> Quantile level and correlation structure are provided in the column of Intercept.
#> Intercept X2
#> Quantile Level 0.95 0
#> Structure AR-1 0
#> Estimate 1.5685 0.507
#> Standard Error 0.0907 0.0363
#> 95% CI Lower 1.3885 0.435
#> 95% CI Upper 1.7485 0.579
#> P-value 0 0
Chen IC, Bertke SJ, Curwin BD. Quantile regression for exposure data with repeated measures in the presence of non-detects. J Expo Sci Environ Epidemiol. 2021; 31(6): 1057–1066.
Chen IC, Bertke SJ, Estill CF. Compare the marginal effects for environmental exposure and biomonitoring data with repeated measurements and values below the limit of detection. J Expo Sci Environ Epidemiol. 2024. https://doi.org/10.1038/s41370-024-00640-7
Chen IC, Bertke SJ, Dahm MM. Quantile regression for longitudinal data with values below the limit of detection and time-dependent covariates – application to modeling carbon nanotube and nanofiber exposures. Ann Work Expo Health. 2024. https://doi.org/10.1093/annweh/wxae068
Chen IC, Westgate PM. Improved methods for the marginal analysis of longitudinal data in the presence of time-dependent covariates. Statistics in Medicine. 2017; 36(16): 2533–2546.
Chen IC, Westgate PM. A novel approach to selecting classification types for time-dependent covariates in the marginal analysis of longitudinal data. Statistical Methods in Medical Research. 2019; 28(10-11): 3176–3186.
Ganser GH, Hewett P. An accurate substitution method for analyzing censored data. J Occup Environ Hyg. 2010; 7(4): 233–44.
Lai TL, Small DS. Marginal regression analysis of longitudinal data with time-dependent covariates: a generalized method-of-moments approach. Journal of the Royal Statistical Society: Series B. 2007; 69(1): 79–99.