Title: | Semiparametric Likelihood Estimation with Errors in Variables |
---|---|
Description: | Efficient regression analysis under general two-phase sampling, where Phase I includes error-prone data and Phase II contains validated data on a subset. |
Authors: | Sarah Lotspeich [aut], Ran Tao [aut, cre], Joey Sherrill [prg], Jiangmei Xiong [ctb] |
Maintainer: | Ran Tao <[email protected]> |
License: | GPL (>= 2) |
Version: | 1.0.5 |
Built: | 2024-12-25 07:13:33 UTC |
Source: | CRAN |
Performs cross-validation to calculate the average predicted log likelihood for the linear2ph function. This function can be used to select the B-spline basis that yields the largest average predicted log likelihood.
cv_linear2ph( Y_unval = NULL, Y = NULL, X_unval = NULL, X = NULL, Z = NULL, Bspline = NULL, data = NULL, nfolds = 5, MAX_ITER = 2000, TOL = 1e-04, verbose = FALSE )
Y_unval |
Specifies the column of the error-prone outcome that is continuous. Subjects with missing values of |
Y |
Specifies the column that stores the validated value of |
X_unval |
Specifies the columns of the error-prone covariates. Subjects with missing values of |
X |
Specifies the columns that store the validated values of |
Z |
Specifies the columns of the accurately measured covariates. Subjects with missing values of |
Bspline |
Specifies the columns of the B-spline basis. Subjects with missing values of |
data |
Specifies the name of the dataset. This argument is required. |
nfolds |
Specifies the number of cross-validation folds. The default value is 5. |
MAX_ITER |
Specifies the maximum number of iterations in the EM algorithm. The default number is 2000. |
TOL |
Specifies the convergence criterion in the EM algorithm. The default value is 1e-04. |
verbose |
If |
avg_pred_loglike |
Stores the average predicted log likelihood. |
pred_loglike |
Stores the predicted log likelihood in each fold. |
converge |
Stores the convergence status of the EM algorithm in each run. |
rho = 0.3
p = 0.3
n = 100
n2 = 40
alpha = 0.3
beta = 0.4

### generate data
simX = rnorm(n)
epsilon = rnorm(n)
simY = alpha+beta*simX+epsilon
error = MASS::mvrnorm(n, mu=c(0,0), Sigma=matrix(c(1, rho, rho, 1), nrow=2))

simS = rbinom(n, 1, p)
simU = simS*error[,2]
simW = simS*error[,1]
simY_tilde = simY+simW
simX_tilde = simX+simU

id_phase2 = sample(n, n2)

simY[-id_phase2] = NA
simX[-id_phase2] = NA

# cubic basis
nsieves = c(5, 10)
pred_loglike = rep(NA, length(nsieves))
for (i in 1:length(nsieves)) {
    nsieve = nsieves[i]
    Bspline = splines::bs(simX_tilde, df=nsieve, degree=3,
      Boundary.knots=range(simX_tilde), intercept=TRUE)
    colnames(Bspline) = paste("bs", 1:nsieve, sep="")
    # cubic basis

    data = data.frame(Y_tilde=simY_tilde, X_tilde=simX_tilde, Y=simY, X=simX, Bspline)
    ### generate data

    res = cv_linear2ph(Y="Y", X="X", Y_unval="Y_tilde", X_unval="X_tilde",
      Bspline=colnames(Bspline), data=data, nfolds = 5)
    pred_loglike[i] = res$avg_pred_loglike
}

data.frame(nsieves, pred_loglike)
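Once cross-validation has produced one average predicted log likelihood per candidate basis size, selecting the basis amounts to taking the candidate with the largest value. The following is a minimal sketch of that step; `select_nsieve()` is a hypothetical helper, not a function from the package, and the log-likelihood values are placeholders used only to make the snippet runnable.

```r
# Hypothetical helper (not part of the package): return the candidate basis
# size with the largest average predicted log likelihood.
select_nsieve <- function(nsieves, pred_loglike) {
  stopifnot(length(nsieves) == length(pred_loglike))
  ok <- !is.na(pred_loglike)  # ignore candidates that produced no value
  if (!any(ok)) stop("no usable cross-validation results")
  nsieves[ok][which.max(pred_loglike[ok])]
}

# Placeholder values for illustration only, not real output
cv_results <- data.frame(nsieves = c(5, 10, 15),
                         pred_loglike = c(-152.4, -149.8, NA))
best_nsieve <- select_nsieve(cv_results$nsieves, cv_results$pred_loglike)
print(best_nsieve)
```

Dropping `NA` entries before calling `which.max()` keeps a failed run for one candidate from blocking the comparison among the others.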
Performs efficient semiparametric estimation for general two-phase measurement error models when there are errors in both the outcome and covariates.
linear2ph( Y_unval = NULL, Y = NULL, X_unval = NULL, X = NULL, Z = NULL, Bspline = NULL, data = NULL, hn_scale = 1, noSE = FALSE, TOL = 1e-04, MAX_ITER = 1000, verbose = FALSE )
Y_unval |
Column name of the error-prone or unvalidated continuous outcome. Subjects with missing values of |
Y |
Column name that stores the validated value of |
X_unval |
Specifies the columns of the error-prone covariates. Subjects with missing values of |
X |
Specifies the columns that store the validated values of |
Z |
Specifies the columns of the accurately measured covariates. Subjects with missing values of |
Bspline |
Specifies the columns of the B-spline basis. Subjects with missing values of |
data |
Specifies the name of the dataset. This argument is required. |
hn_scale |
Specifies the scale of the perturbation constant in the variance estimation. For example, if |
noSE |
If |
TOL |
Specifies the convergence criterion in the EM algorithm. The default value is 1e-04. |
MAX_ITER |
Maximum number of iterations in the EM algorithm. The default number is 1000. |
verbose |
If |
coefficients |
Stores the analysis results. |
sigma |
Stores the residual standard error. |
covariance |
Stores the covariance matrix of the regression coefficient estimates. |
converge |
In parameter estimation, if the EM algorithm converges, then |
converge_cov |
In variance estimation, if the EM algorithm converges, then |
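A covariance matrix of the coefficient estimates is enough to derive standard errors and Wald-type confidence intervals. The sketch below shows the arithmetic in base R. It assumes a vector of estimates `est` whose order matches the rows of the covariance matrix; both objects are filled with made-up numbers rather than package output.

```r
# Illustrative inputs (made-up numbers, not package output)
est <- c(Intercept = 0.31, X = 0.42)
covariance <- matrix(c(0.0100, 0.0012,
                       0.0012, 0.0144), nrow = 2,
                     dimnames = list(names(est), names(est)))

se <- sqrt(diag(covariance))  # standard errors from the diagonal
z  <- qnorm(0.975)            # two-sided 95% critical value
wald <- data.frame(estimate = est,
                   se       = se,
                   lower    = est - z * se,
                   upper    = est + z * se,
                   p_value  = 2 * pnorm(-abs(est / se)))
print(round(wald, 4))
```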
Tao, R., Mercaldo, N. D., Haneuse, S., Maronge, J. M., Rathouz, P. J., Heagerty, P. J., & Schildcrout, J. S. (2021). Two-wave two-phase outcome-dependent sampling designs, with applications to longitudinal binary data. Statistics in Medicine, 40(8), 1863–1876. https://doi.org/10.1002/sim.8876
See also cv_linear2ph() to calculate the average predicted log likelihood of this function.
rho = -.3
p = 0.3
hn_scale = 1
nsieve = 20

n = 100
n2 = 40
alpha = 0.3
beta = 0.4
set.seed(12345)

### generate data
simX = rnorm(n)
epsilon = rnorm(n)
simY = alpha+beta*simX+epsilon
error = MASS::mvrnorm(n, mu=c(0,0), Sigma=matrix(c(1, rho, rho, 1), nrow=2))

simS = rbinom(n, 1, p)
simU = simS*error[,2]
simW = simS*error[,1]
simY_tilde = simY+simW
simX_tilde = simX+simU

id_phase2 = sample(n, n2)

simY[-id_phase2] = NA
simX[-id_phase2] = NA

# # histogram basis
# Bspline = matrix(NA, nrow=n, ncol=nsieve)
# cut_x_tilde = cut(simX_tilde, breaks=quantile(simX_tilde, probs=seq(0, 1, 1/nsieve)),
#   include.lowest = TRUE)
# for (i in 1:nsieve) {
#     Bspline[,i] = as.numeric(cut_x_tilde == names(table(cut_x_tilde))[i])
# }
# colnames(Bspline) = paste("bs", 1:nsieve, sep="")
# # histogram basis

# # linear basis
# Bspline = splines::bs(simX_tilde, df=nsieve, degree=1,
#   Boundary.knots=range(simX_tilde), intercept=TRUE)
# colnames(Bspline) = paste("bs", 1:nsieve, sep="")
# # linear basis

# # quadratic basis
# Bspline = splines::bs(simX_tilde, df=nsieve, degree=2,
#   Boundary.knots=range(simX_tilde), intercept=TRUE)
# colnames(Bspline) = paste("bs", 1:nsieve, sep="")
# # quadratic basis

# cubic basis
Bspline = splines::bs(simX_tilde, df=nsieve, degree=3,
  Boundary.knots=range(simX_tilde), intercept=TRUE)
colnames(Bspline) = paste("bs", 1:nsieve, sep="")
# cubic basis

data = data.frame(Y_tilde=simY_tilde, X_tilde=simX_tilde, Y=simY, X=simX, Bspline)

res = linear2ph(Y="Y", X="X", Y_unval="Y_tilde", X_unval="X_tilde",
  Bspline=colnames(Bspline), data=data, hn_scale=0.1)
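The example switches between histogram, linear, quadratic, and cubic bases by commenting blocks in and out. One way to make that choice a parameter is sketched below. `make_basis()` is a hypothetical helper, not part of the package; it treats `degree = 0` as the histogram (indicator) basis and otherwise calls `splines::bs()` with the same arguments as the example.

```r
# Hypothetical helper (not part of the package): build a sieve basis of size
# `nsieve` from the error-prone covariate. degree = 0 gives the histogram
# (indicator) basis; degree >= 1 gives a B-spline basis via splines::bs().
make_basis <- function(x, nsieve, degree = 3) {
  if (degree == 0) {
    # Quantile-based bins, one indicator column per bin
    bins <- cut(x,
                breaks = quantile(x, probs = seq(0, 1, length.out = nsieve + 1)),
                include.lowest = TRUE)
    B <- sapply(levels(bins), function(lv) as.numeric(bins == lv))
  } else {
    B <- splines::bs(x, df = nsieve, degree = degree,
                     Boundary.knots = range(x), intercept = TRUE)
  }
  # Strip classes and attributes so the result is a plain numeric matrix
  B <- matrix(as.numeric(B), nrow = length(x), ncol = nsieve)
  colnames(B) <- paste0("bs", seq_len(nsieve))
  B
}

set.seed(1)
x_tilde <- rnorm(100)
B_hist  <- make_basis(x_tilde, nsieve = 5, degree = 0)
B_cubic <- make_basis(x_tilde, nsieve = 5, degree = 3)

# With an intercept column, each row of either basis sums to 1
print(range(rowSums(B_hist)))
print(range(rowSums(B_cubic)))
```

The column names returned by the helper can be passed to the `Bspline` argument after binding the matrix to the analysis data frame, as the example does with `colnames(Bspline)`.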
This function returns the sieve maximum likelihood estimators (SMLE) for the logistic regression model from Lotspeich et al. (2021).
logistic2ph( Y_unval = NULL, Y = NULL, X_unval = NULL, X = NULL, Z = NULL, Bspline = NULL, data = NULL, hn_scale = 1, noSE = FALSE, TOL = 1e-04, MAX_ITER = 1000, verbose = FALSE )
Y_unval |
Column name of the error-prone or unvalidated binary outcome. This argument is required. |
Y |
Column name that stores the validated value of |
X_unval |
Specifies the columns of the error-prone covariates. This argument is required. |
X |
Specifies the columns that store the validated values of |
Z |
Specifies the columns of the accurately measured covariates. This argument is optional. |
Bspline |
Specifies the columns of the B-spline basis. This argument is required. |
data |
Specifies the name of the dataset. This argument is required. |
hn_scale |
Specifies the scale of the perturbation constant in the variance estimation. For example, if |
noSE |
If |
TOL |
Specifies the convergence criterion in the EM algorithm. The default value is 1e-04. |
MAX_ITER |
Maximum number of iterations in the EM algorithm. The default number is 1000. |
verbose |
If |
coefficients |
Stores the analysis results. |
outcome_err_coefficients |
Stores the outcome error model results. |
Bspline_coefficients |
Stores the final B-spline coefficient estimates. |
covariance |
Stores the covariance matrix of the regression coefficient estimates. |
converge |
In parameter estimation, if the EM algorithm converges, then |
converge_cov |
In variance estimation, if the EM algorithm converges, then |
converge_msg |
In parameter estimation, if the EM algorithm does not converge, then |
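Logistic regression coefficients are conventionally reported on the log-odds scale, so odds ratios and their confidence limits come from exponentiating the estimates and the Wald interval endpoints. The sketch below assumes that scale and uses made-up estimates and standard errors; the names `log_or` and `se` are illustrative and do not refer to fields of the returned object.

```r
# Illustrative log-odds estimates and standard errors (made-up numbers)
log_or <- c(X = -0.20, Z = 0.35)
se     <- c(X = 0.08,  Z = 0.15)

z <- qnorm(0.975)  # two-sided 95% critical value
odds_ratios <- data.frame(OR    = exp(log_or),
                          lower = exp(log_or - z * se),
                          upper = exp(log_or + z * se))
print(round(odds_ratios, 3))
```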
Lotspeich, S. C., Shepherd, B. E., Amorim, G. G. C., Shaw, P. A., & Tao, R. (2021). Efficient odds ratio estimation under two-phase sampling using error-prone data from a multi-national HIV research cohort. Biometrics, biom.13512. https://doi.org/10.1111/biom.13512
A simulated dataset constructed to imitate the Vanderbilt Comprehensive Care Clinic (VCCC) patient records, which have been fully validated and therefore contain validated and unvalidated versions of all variables. This makes the VCCC cohort a good candidate for illustration. Because the actual data are confidential, the data presented here are a mock-up, but their structure and features, such as means and variability, closely resemble the real dataset.
mock.vccc
A data frame with 2087 rows and 10 variables:
patient ID
viral load at antiretroviral therapy (ART) initiation, error-prone outcome, continuous
viral load at antiretroviral therapy (ART) initiation, validated outcome, continuous
having an AIDS-defining event (ADE) within one year of ART initiation, error-prone outcome, binary
having an AIDS-defining event (ADE) within one year of ART initiation, validated outcome, binary
CD4 count at ART initiation, error-prone covariate, continuous
CD4 count at ART initiation, validated covariate, continuous
whether patient is ART naive at enrollment, error-free covariate, binary
sex of patient (1 indicates male, 0 indicates female), error-free covariate, binary
age of patient, error-free covariate, continuous
https://www.vanderbilthealth.com/clinic/comprehensive-care-clinic
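Because a fully validated dataset carries both versions of each error-prone variable, a two-phase design can be imitated by hiding the validated values outside a chosen subsample, in the same way the simulated examples set unvalidated subjects to `NA`. The sketch below does this on toy data. The column names (`X_val`, `X_unval`, `Y_val`, `Y_unval`) and the sample sizes are hypothetical stand-ins, not the columns of `mock.vccc`.

```r
set.seed(2024)

# Toy fully validated data; column names are hypothetical stand-ins
n <- 200
full <- data.frame(id = seq_len(n), X_val = rnorm(n))
full$X_unval <- full$X_val + rbinom(n, 1, 0.3) * rnorm(n)  # error in about 30% of rows
full$Y_val   <- 0.3 + 0.4 * full$X_val + rnorm(n)
full$Y_unval <- full$Y_val + rbinom(n, 1, 0.3) * rnorm(n)

# Phase II: a simple random sample of n2 subjects keeps its validated values
n2 <- 50
phase2 <- sample(n, n2)
two_phase <- full
two_phase[-phase2, c("X_val", "Y_val")] <- NA

# Error-prone columns stay complete; validated columns exist only in Phase II
print(table(validated = !is.na(two_phase$Y_val)))
```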