In this vignette, we provide a demonstration of the R Package COMMA (correcting misclassified mediation analysis). This package provides methods for estimating a mediation analysis when the binary mediator is potentially misclassified. Technical details about estimation are not included in this demonstration. For additional information on the methods used in this R Package, please consult ``Effect estimation in the presence of a misclassified binary mediator’’ by Kimberly A. Hochstedler Webb and Martin T. Wells.
Let X denote a predictor of interest. C denotes a matrix of covariates. Y is the outcome variable of interest. The relationship between X and Y may be mediated by M. M = m denotes an observation’s true mediator value, with m ∈ {1, 2}. We do not observe M directly. Instead, we observe M*, a potentially misclassified (noisy) version of M. Given M, a subject’s observed mediator value M* depends on a set of predictors Z. The true mediator, observed mediator, and outcome mechanisms are provided below.
True mediator mechanism: logit{P(M = 1|X, C; β)} = β0 + βXX + βCC $$\text{Observed mediator mechanisms: } \text{logit}\{ P(M^* = 1 | M = 1, \boldsymbol{Z} ; \boldsymbol{\gamma}) \} = \gamma_{110} + \boldsymbol{\gamma_{11Z} Z}, \\ \text{logit}\{ P(M^* = 1 | M = 2, \boldsymbol{Z} ; \boldsymbol{\gamma}) \} = \gamma_{120} + \boldsymbol{\gamma_{12Z} Z}$$ Outcome mechanism: E(Y|X, C, M) = θ0 + θXX + θCC + θMM + θXMXM If we have a Bernoulli outcome that we model with a logit link, the outcome mechanism can be logit{P(Y = 1|X, C, M)} = θ0 + θXX + θCC + θmM + θXMXM.
We begin this demonstration by generating data using the
COMMA_data()
function. The binary mediator simulated by
this scheme is subject to misclassification. The predictor related to
the true outcome mechanism is “x” and the predictor related to the
observation mechanism is “z”.
library(COMMA)
library(dplyr)
set.seed(20240422)
sample_size <- 10000
n_cat <- 2 # Number of categories in the binary mediator
# Data generation settings
x_mu <- 0
x_sigma <- 1
z_shape <- 1
z_scale <- 1
c_shape <- 1
# True parameter values (gamma terms set the misclassification rate)
true_beta <- matrix(c(1, -2, .5), ncol = 1)
true_gamma <- matrix(c(1.8, 1, -1.5, -1), nrow = 2, byrow = FALSE)
true_theta <- matrix(c(1, 1.5, -2.5, -.2), ncol = 1)
# Generate data.
my_data <- COMMA_data(sample_size, x_mu, x_sigma, z_shape, c_shape,
interaction_indicator = FALSE,
outcome_distribution = "Normal",
true_beta, true_gamma, true_theta)
# Save list elements as vectors.
Mstar = my_data[["obs_mediator"]]
Mstar_01 <- ifelse(Mstar == 1, 1, 0)
outcome = my_data[["outcome"]]
x_matrix = my_data[["x"]]
z_matrix = my_data[["z"]]
c_matrix = my_data[["c"]]
We propose estimation methods using the Expectation-Maximization algorithm (EM) and an ordinary least squares (OLS) correction procedure. The proposed predictive value weighting (PVW) approach detailed in Webb and Wells (2024) is currently only available for Bernoulli outcomes in COMMA. Each method checks and corrects instances of label switching, as described in Webb and Wells (2024). In the code below, we provide functions for implementing these methods.
# Supply starting values for all parameters.
beta_start <- coef(glm(Mstar_01 ~ x_matrix + c_matrix,
family = "binomial"(link = "logit")))
gamma_start <- matrix(rep(1,4), ncol = 2, nrow = 2, byrow = FALSE)
theta_start <- coef(lm(outcome ~ x_matrix + Mstar_01 + c_matrix))
# Estimate parameters using the EM-Algorithm.
EM_results <- COMMA_EM(Mstar, outcome, outcome_distribution = "Normal",
interaction_indicator = FALSE,
x_matrix, z_matrix, c_matrix,
beta_start, gamma_start, theta_start, sigma_start = 1)
EM_results$True_Value <- c(true_beta, c(true_gamma), true_theta, 1)
EM_results$Estimates <- round(EM_results$Estimates, 3)
EM_results
## Parameter Estimates Convergence True_Value
## 1 beta_0 0.928 TRUE 1.0
## 2 beta_1 -1.942 TRUE -2.0
## 3 beta_2 0.491 TRUE 0.5
## 4 gamma11 1.749 TRUE 1.8
## 5 gamma21 1.197 TRUE 1.0
## 6 gamma12 -1.541 TRUE -1.5
## 7 gamma22 -0.830 TRUE -1.0
## 8 theta_0 1.017 TRUE 1.0
## 9 theta_x1 1.502 TRUE 1.5
## 10 theta_m -2.525 TRUE -2.5
## 11 theta_c1 -0.199 TRUE -0.2
## 12 sigma 0.993 TRUE 1.0
# Estimate parameters using the OLS correction.
OLS_results <- COMMA_OLS(Mstar, outcome,
x_matrix, z_matrix, c_matrix,
beta_start, gamma_start, theta_start)
OLS_results$True_Value <- c(true_beta, c(true_gamma), true_theta[c(1,3,2,4)])
OLS_results$Estimates <- round(OLS_results$Estimates, 3)
OLS_results
## Parameter Estimates Convergence Method True_Value
## 1 beta1 0.877 TRUE OLS 1.0
## 2 beta2 -1.794 TRUE OLS -2.0
## 3 beta3 0.493 TRUE OLS 0.5
## 4 gamma11 1.788 TRUE OLS 1.8
## 5 gamma21 1.361 TRUE OLS 1.0
## 6 gamma12 -1.484 TRUE OLS -1.5
## 7 gamma22 -1.487 TRUE OLS -1.0
## 8 theta0 0.882 TRUE OLS 1.0
## 9 theta_m -2.326 TRUE OLS -2.5
## 10 theta_x 1.581 TRUE OLS 1.5
## 11 theta_c1 -0.203 TRUE OLS -0.2
NormalY_results <- data.frame(Parameter = rep(c("beta_x", "theta_x", "theta_m"),
3),
Method = c(rep("EM", 3), rep("OLS", 3),
rep("Naive", 3)),
Estimate = c(EM_results$Estimates[c(2, 9, 10)],
OLS_results$Estimates[c(2, 9, 10)],
beta_start[2], theta_start[c(2,3)]),
True_Value = rep(c(true_beta[2],
true_theta[2], true_theta[3]),
3))
ggplot(data = NormalY_results) +
geom_point(aes(x = Method, y = Estimate, color = Method),
size = 3) +
geom_hline(aes(yintercept = True_Value),
linetype = "dashed") +
facet_wrap(~Parameter) +
theme_minimal() +
scale_color_manual(values = c("#DA86A5", "#785E95", "#409DBE")) +
ggtitle("Parameter estimates from EM, OLS, and Naive approaches",
subtitle = "Normal outcome model with a misclassified mediator")
Next, we generate data with a Bernoulli outcome using the
COMMA_data()
function. Once again, the binary mediator
simulated by this scheme is subject to misclassification. The predictor
related to the true outcome mechanism is “x” and the predictor related
to the observation mechanism is “z”.
library(COMMA)
library(dplyr)
set.seed(20240423)
sample_size <- 10000
n_cat <- 2 # Number of categories in the binary mediator
# Data generation settings
x_mu <- 0
x_sigma <- 1
z_shape <- 1
z_scale <- 1
c_shape <- 1
# True parameter values (gamma terms set the misclassification rate)
true_beta <- matrix(c(1, -2, .5), ncol = 1)
true_gamma <- matrix(c(1.8, 1, -1.5, -1), nrow = 2, byrow = FALSE)
true_theta <- matrix(c(1, 1.5, -2.5, -.2, .5), ncol = 1)
# Generate data.
my_data <- COMMA_data(sample_size, x_mu, x_sigma, z_shape, c_shape,
interaction_indicator = TRUE,
outcome_distribution = "Bernoulli",
true_beta, true_gamma, true_theta)
# Save list elements as vectors.
Mstar = my_data[["obs_mediator"]]
Mstar_01 <- ifelse(Mstar == 1, 1, 0)
outcome = my_data[["outcome"]]
x_matrix = my_data[["x"]]
z_matrix = my_data[["z"]]
c_matrix = my_data[["c"]]
We propose estimation methods using the Expectation-Maximization algorithm (EM) and a predictive value weighting (PVW) approach. The ordinary least squares (OLS) correction procedure is only appropriate for Normal outcome models. Each method checks and corrects instances of label switching, as described in Webb and Wells (2024). In the code below, we provide functions for implementing these methods.
# Supply starting values for all parameters.
beta_start <- coef(glm(Mstar_01 ~ x_matrix + c_matrix,
family = "binomial"(link = "logit")))
gamma_start <- matrix(rep(1,4), ncol = 2, nrow = 2, byrow = FALSE)
xm_interaction <- x_matrix * c_matrix
theta_start <- coef(glm(outcome ~ x_matrix + Mstar_01 + c_matrix +
xm_interaction,
family = "binomial"(link = "logit")))
# Estimate parameters using the EM-Algorithm.
EM_results <- COMMA_EM(Mstar, outcome, outcome_distribution = "Bernoulli",
interaction_indicator = TRUE,
x_matrix, z_matrix, c_matrix,
beta_start, gamma_start, theta_start)
EM_results$True_Value <- c(true_beta, c(true_gamma), true_theta)
EM_results$Estimates <- round(EM_results$Estimates, 3)
EM_results
## Parameter Estimates Convergence True_Value
## 1 beta_0 1.113 TRUE 1.0
## 2 beta_1 -2.012 TRUE -2.0
## 3 beta_2 0.493 TRUE 0.5
## 4 gamma11 1.698 TRUE 1.8
## 5 gamma21 1.036 TRUE 1.0
## 6 gamma12 -1.664 TRUE -1.5
## 7 gamma22 -1.550 TRUE -1.0
## 8 theta_0 1.067 TRUE 1.0
## 9 theta_x 1.565 TRUE 1.5
## 10 theta_m -2.538 TRUE -2.5
## 11 theta_c1 -0.205 TRUE -0.2
## 12 theta_xm 0.427 TRUE 0.5
PVW_results <- COMMA_PVW(Mstar, outcome, outcome_distribution = "Bernoulli",
interaction_indicator = TRUE,
x_matrix, z_matrix, c_matrix,
beta_start, gamma_start, theta_start)
PVW_results$True_Value <- c(true_beta, c(true_gamma), true_theta)
PVW_results$Estimates <- round(PVW_results$Estimates, 3)
PVW_results
## Parameter Estimates Convergence Method True_Value
## 1 beta_1 1.127 TRUE PVW 1.0
## 2 beta_2 -2.008 TRUE PVW -2.0
## 3 beta_3 0.489 TRUE PVW 0.5
## 4 gamma11 1.704 TRUE PVW 1.8
## 5 gamma21 1.016 TRUE PVW 1.0
## 6 gamma12 -1.740 TRUE PVW -1.5
## 7 gamma22 -1.517 TRUE PVW -1.0
## 8 theta_0 0.998 TRUE PVW 1.0
## 9 theta_x1 1.678 TRUE PVW 1.5
## 10 theta_m -2.463 TRUE PVW -2.5
## 11 theta_c1 -0.204 TRUE PVW -0.2
## 12 theta_xm 0.288 TRUE PVW 0.5
BernoulliY_results <- data.frame(Parameter = rep(c("beta_x", "theta_x",
"theta_m", "theta_xm"),
3),
Method = c(rep("EM", 4), rep("PVW", 4),
rep("Naive", 4)),
Estimate = c(EM_results$Estimates[c(2, 9, 10, 12)],
PVW_results$Estimates[c(2, 9, 10, 12)],
beta_start[2],
theta_start[c(2,3,5)]),
True_Value = rep(c(true_beta[2],
true_theta[2],
true_theta[3],
true_theta[5]),
3))
ggplot(data = BernoulliY_results) +
geom_point(aes(x = Method, y = Estimate, color = Method),
size = 3) +
geom_hline(aes(yintercept = True_Value),
linetype = "dashed") +
facet_wrap(~Parameter) +
theme_minimal() +
scale_color_manual(values = c("#DA86A5", "#785E95", "#ECA698")) +
ggtitle("Parameter estimates from EM, PVW, and Naive approaches",
subtitle = "Bernoulli outcome model with a misclassified mediator")