--- title: "User Guide for the R Package: _marlod_" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{User Guide for the R Package: _marlod_} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ## 1. Introduction This vignette provides instructions on how to perform marginal modeling for exposure data with repeated measures and non-detects. Environmental exposure and biomonitoring data from environmental and occupational studies often exhibit right-skewed distributions and left-censored. Left censoring occurs when laboratory instruments have a limit of detection (LOD) below which measurements are not provided. To conduct regression models for repeated measures data with non-detects, it is necessary to install the R package: ```{r setup} library(marlod) ``` After building and installing the package, you can view the vignette by using the command: ```r browseVignettes(package = 'marlod') ``` ## 2. Substitution Methods To impute values that are below the LOD, the function `Fillin()` can be used. This section demonstrates the function using an example from Ganser and Hewett (2010). The available substitution methods for this function include: "None", "LOD", "LOD2", "LODS2", and "QQplot". A detailed description of each substitution method can be found in the Reference Manual of the Documentation. ```{r} y <- c(0,0,0,3.06,4.41,7.23,8.29,9.52,19.94,20.25) ## Limit of detection (LOD) = 3 lod <- 3 Fillin(y, lod, "BetaMean") Fillin(y, lod, "BetaGM") ``` ## 3. Outcome Data: Normal or Log-Normal Distribution Conventional statistical analyses often assume that responses or outcome data follow a normal distribution. If the underlying distribution is log-normal, data transformation using the natural logarithm required. To analyze such data, marginal models utilizing estimation methods such as generalized estimating equations (GEE), quadratic inference functions (QIF), and generalized method of moments (GMM) are employed. These methods incorporate the imputation of measurements below the LOD using single and multiple value imputation techniques (Chen _et al_., 2024). ### 3-1. Simulated Dataset 15 The 15th dataset from a simulation study includes 100 subjects (sample size is 100), each with three repeated measurements. The independent variables or covariates (x1 and x2) are simulated from a Bernoulli distribution with a parameter value of $p = 0.5$ and a uniform distribution $U(0, 1)$, respectively. Correlated errors for repeated measures models are accounted for and assumed to follow a multivariate normal distribution, $MVN(0, R(\alpha))$. A first-order autoregressive (AR-1) correlation structure with a correlation parameter of $\alpha = 0.7$ is incorporated. The true values of 1, 1, and 1 correspond to the marginal intercept and two slopes. ```{r} data(simdata15) head(simdata15) ``` ### 3-2. Marginal Modeling (Mean Regression Model) To conduct regression analysis using the 15th simulated dataset, assume the LOD is 2; thus, any measurements below 2 would be imputed using substitution approaches, such as "QQplot". For repeated measures in a cluster study, an "exchangeable" correlation structure is used, while "AR-1" is appropriate for longitudinal datasets where subjects are measured over time. The example below employs the function `Modified.GEE()` for the GEE method. The atomic vector "typed" specifies the types of time-dependent covaraites, with the length of the vector equal to the number of regression parameters, excluding the intercept. For time-independent covariates or those in a cluster study, "1" is assigned, thus "c(1,1)" is used for "typed". The maximum number of iterations is set to 1,000. ```{r} id=as.matrix(as.vector(t(simdata15$id))) y=as.matrix(as.vector(t(simdata15$y))) x1=as.matrix(as.vector(t(simdata15$x1))) x2=as.matrix(as.vector(t(simdata15$x2))) x=cbind(x1,x2) ## LOD = 2 is equivalent to detection proportion = 56.3% (censoring proportion = 43.7%). lod=2 ## Intercept is not included in the "x" and "typed". ## Modified.GEE(id, y, x, lod, substitue, corstr, typetd, maxiter) Modified.GEE(id, y, x, lod, "QQplot", "AR-1", c(1,1), 1000) ``` The QIF method is performed using the function `Modified.GEE()`, which typically demonstrates efficiency advantages over GEE in large sample sizes. ```{r} ## Gets initial estimates for the QIF approach through independence structure initial=glm(y ~ x1 + x2, data=simdata15, family=gaussian) beta_initial=as.matrix(initial$coefficients) ## Intercept is not included in the "x" and "typed". ## Modified.QIF(id, y, x, lod, substitue, corstr, beta, typetd, maxiter) Modified.QIF(id, y, x, lod, "QQplot", "exchangeable", beta_initial, c(1,1), 1000) ``` Another estimation method, GMM, is employed using the function `Modified.GMM()`. ```{r} ## Gets initial estimates for the GMM approach through independence structure initial=glm(y ~ x1 + x2, data=simdata15, family=gaussian) beta_initial=as.matrix(initial$coefficients) ## Intercept is not included in the "x" and "typed". ## Modified.GMM(id, y, x, lod, substitue, beta, maxiter) Modified.GMM(id, y, x, lod, "QQplot", beta_initial, 1000) ``` ### 3-3. Time-Dependent Covariates To handle time-dependent covariates, we use the 58th dataset from a simulation study, which includes 100 subjects (sample size is 100), each with three measurements collected over time. Detailed model mechanism are described in the setting II for type III time-dependent covariate on page 90 of Lai and Small (2007). ```{r} data(simdata58) head(simdata58) ``` Failure to account for time-dependency in covariates can lead to inefficient regression parameter estimation. The function `Selected.GEE()` allows the selection of a type of time-dependent covariate through a marginal mean regression model using the GEE estimation method for longitudinal data with values below the LOD. The selection approach applied to choose a type of time-dependency is the empirical mean squared error (MSE) minimization criterion (Chen and Westgate, 2017, 2019). Given the longitudinal nature of the dataset, "AR-1" is used as the working correlation structure. ```{r} id=as.matrix(as.vector(t(simdata58$id))) y=as.matrix(as.vector(t(simdata58$y))) x1=as.matrix(as.vector(t(simdata58$x1))) ## LOD = 0.05 is equivalent to detection proportion = 50.7% (censoring proportion = 49.3%). lod=0.05 ## Intercept is not included in the "x". ## Selected.GEE(id, y, x, lod, substitue, corstr, maxiter) Selected.GEE(id, y, x1, lod, "MIWithID", "AR-1", 1000) ``` Once the type of time-dependent covaraite is selected, i.e., in this case, **type 3**, it can be updated in the "typed" vector of the function `Modified.GEE()`. ```{r} id=as.matrix(as.vector(t(simdata58$id))) y=as.matrix(as.vector(t(simdata58$y))) x1=as.matrix(as.vector(t(simdata58$x1))) Modified.GEE(id, y, x1, lod, "MIWithID", "AR-1", c(3), 1000) ``` The QIF method can be performed using the function `Selected.QIF()`, and the selected type can be then updated in the "typed" vector of the function `Modified.QIF()`. ```{r} ## Gets initial estimates for the QIF approach through independence structure initial=glm(y ~ x1, data=simdata58, family=gaussian) beta_initial=as.matrix(initial$coefficients) ## Intercept is not included in the "x" and "typed". ## Selected.QIF(id, y, x, lod, substitue, corstr, beta, maxiter) Selected.QIF(id, y, x1, lod, "MIWithID", "AR-1", beta_initial, 1000) ``` ## 4. Outcome Data: Unknown Distribution When original or transformed data do not follow a known distribution, modeling the conditional mean of the outcome variable may not be ideal, as the estimated mean and standard deviation can be sensitive to large values. In such cases, quantile regression serves as an alternative analytical method. It does not assume an underlying distribution, offers advantages for skewed data, and is robust to outliers. Using regression models, substitution or fill-in approaches such as single and multiple value imputation techniques can be employed to impute measurements below the LOD (Chen _et al_., 2021). ### 4-1. Marginal Modeling (Quantile Regression Model) To conduct regression analysis using the 15th simulated dataset, where the LOD is set to 2, any measurements below 2 are imputed using substitution approaches such as "LOD2". The median or 50th quantile ($\tau = 0.5$) is used. In cluster studies, an "exchangeable" structure is applied to the working correlation, while an "AR-1" structure is used for longitudinal datasets where subjects are measured over time. ```{r} y=as.matrix(as.vector(t(simdata15$y))) x1=as.matrix(as.vector(t(simdata15$x1))) x2=as.matrix(as.vector(t(simdata15$x2))) x=cbind(matrix(1,length(x1),1),x1,x2) ## LOD = 2 is equivalent to detection proportion = 56.3% (censoring proportion = 43.7%). lod=2 ## Median or 50th quantile is given. tau=0.5 ## Intercept is included in the "x" but not in the "typed". ## Quantile.FWZ(y, x, lod, substitue, tau, corstr, typetd, data) Quantile.FWZ(y, x, lod, "LOD2", tau, "exchangeable", c(1,1), simdata15) ``` ### 4-2. Time-Dependent Covariates To handle time-dependent covariates, we use the 58th simulated dataset, which includes 100 subjects (sample size is 100), each with three repeated measurements. The "AR-1" working correlation structure is embedded in the function `Quantile.select.FWZ()` because time-dependent covariates are typically associated with longitudinal studies. A detailed description of the selection process can be found in Chen _et al_. (2024). ```{r} y=as.matrix(as.vector(t(simdata58$y))) x1=as.matrix(as.vector(t(simdata58$x1))) x=cbind(matrix(1,length(x1),1),x1) ## LOD = 0.05 is equivalent to detection proportion = 50.7% (censoring proportion = 49.3%). lod=0.05 ## 95th quantile is given. tau=0.95 ## Intercept is included in the "x". ## Quantile.select.FWZ(y, x, lod, substitue, tau, data) Quantile.select.FWZ(y, x, lod, "LOD2", tau, simdata58) ``` The selection criterion identifies **type 2** as the appropriate covariate type. The quantile regression result with the updated covariate type of time-dependency is displayed. Additionally, this selected type of time-dependent covariate (in this case, **type 2**) can also be updated in the "typed" parameter of the function `Quantile.FWZ()`, enabling analysts to conduct multivariable analysis. ```{r} y=as.matrix(as.vector(t(simdata58$y))) x1=as.matrix(as.vector(t(simdata58$x1))) x=cbind(matrix(1,length(x1),1),x1) ## LOD = 0.05 is equivalent to detection proportion = 50.7% (censoring proportion = 49.3%). lod=0.05 ## 95th quantile is given. tau=0.95 ## Intercept is included in the "x" but not in the "typed". ## Quantile.FWZ(y, x, lod, substitue, tau, corstr, typetd, data) Quantile.FWZ(y, x, lod, "LOD2", tau, "AR-1", c(2), simdata58) ``` ## 5. References - Chen IC, Bertke SJ, Curwin BD. Quantile regression for exposure data with repeated measures in the presence of non-detects. _J Expo Sci Environ Epidemiol._ 2021; 31(6): 1057–1066. - Chen IC, Bertke SJ, Estill CF. Compare the marginal effects for environmental exposure and biomonitoring data with repeated measurements and values below the limit of detection. _J Expo Sci Environ Epidemiol._ 2024. https://doi.org/10.1038/s41370-024-00640-7 - Chen IC, Bertke SJ, Dahm MM. Quantile regression for longitudinal data with values below the limit of detection and time-dependent covariates – application to modeling carbon nanotube and nanofiber exposures. _Ann Work Expo Health._ 2024. https://doi.org/10.1093/annweh/wxae068 - Chen IC, Westgate PM. Improved methods for the marginal analysis of longitudinal data in the presence of time-dependent covariates. _Statistics in Medicine._ 2017; 36(16): 2533–2546. - Chen IC, Westgate PM. A novel approach to selecting classification types for time-dependent covariates in the marginal analysis of longitudinal data. _Statistical Methods in Medical Research._ 2019; 28(10-11): 3176–3186. - Ganser GH, Hewett P. An accurate substitution method for analyzing censored data. _J Occup Environ Hyg_. 2010; 7(4): 233–44. - Lai TL, Small DS. Marginal regression analysis of longitudinal data with time-dependent covariates: a generalized method-of-moments approach. _Journal of the Royal Statistical Society: Series B._ 2007; 69(1): 79–99.