unmconf is a package that employs a fully Bayesian hierarchical
framework for modeling in the presence of unmeasured confounders using
JAGS (Just Another Gibbs Sampler). The package handles both internal and
external validation scenarios for up to two unmeasured confounders.
Bayesian data analysis can be summarized in the following four steps:
specifying the data model and prior, estimating model parameters,
evaluating sampling quality and model fit, and summarizing and
interpreting results. Users new to Markov chain Monte Carlo (MCMC)
software who wish to fit models involving unmeasured confounding face
two challenges: understanding the syntax, and understanding how these
programs handle missing data. The primary objective of unmconf is to
address these challenges by providing a function that resembles glm() on
the front end while seamlessly implementing the necessary JAGS code on
the back end. Further functions simplify the workflow of acquiring data,
modeling data, running diagnostic checks on the model, and analyzing
results. With this package, users can perform robust fully Bayesian
analyses, even without previous familiarity with JAGS syntax or data
processing intricacies.
For the statistical model, we denote the continuous or discrete response by Y, the binary main exposure variable by X, the vector of p other perfectly observed covariates by C, and the unmeasured confounder, which is related to both Y and X, by U. When there is more than one unmeasured confounder, we denote them U1, U2, and so forth; these unmeasured confounders can be either binary or normally distributed.
In the scenario where there is a single unmeasured confounder, the joint distribution can be factorized as f(y, u|x, c) = f(y|x, u, c)f(u|x, c). With a second unmeasured confounder, the joint distribution can be factorized as f(y, u1, u2|x, c) = f(y|x, c, u1, u2)f(u1|x, c, u2)f(u2|x, c), giving the Bayesian unmeasured confounding model:
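$$
\begin{aligned}
Y \mid x, c, u_1, u_2 &\sim D_Y\left(g_Y^{-1}(v'\beta + u'\lambda),\ \xi_y\right) \\
U_1 \mid x, c, u_2 &\sim D_{U_1}\left(g_{U_1}^{-1}(v'\gamma + \zeta u_2),\ \xi_{u_1}\right) \\
U_2 \mid x, c &\sim D_{U_2}\left(g_{U_2}^{-1}(v'\delta),\ \xi_{u_2}\right)
\end{aligned}
$$
where v′ = [x c′]′ denotes the vector of the main exposure variable and
all of the perfectly observed covariates and u′ = [u1, u2]′ denotes the
vector of two unmeasured confounders in the response model.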
This model is completed by the specification of a link function g* and
some family of distributions D*. Additional parameters for certain
distributions, if any, are denoted ξy, ξu1, ξu2.
Examples of these would be σ² for the variance of a normal distribution
or α for the shape parameter in the gamma distribution. unmconf allows
the user to work with a response from the normal, Poisson, gamma, or
binomial distribution and unmeasured confounder(s) from the normal or
binomial distribution. The package supports identity (normal), log
(Poisson or gamma), and logit (Bernoulli) link functions.
Prior distributions for the model parameters are jointly defined as
π(θ), where θ = (β, λ, γ, ζ, δ, ξ). The default prior structure is weakly
informative. When the response is discrete, the regression coefficients
have a relatively non-informative prior with a mean of 0 and precision
(inverse of the variance) of 0.1. When the response is continuous, the
regression coefficients have a relatively non-informative prior with a
mean of 0 and precision of 0.001. To further customize the analysis,
users can specify custom priors using the priors argument within the
modeling function, unm_glm(). The format for specifying custom priors is
c("{parameter}[{covariate}]" = "{distribution}").
An example of this is below.
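As a minimal sketch of that format, the named character vector below places priors on two coefficients. Each name follows the {parameter}[{covariate}] pattern and each value is a JAGS distribution string, where dnorm() takes a mean and a precision. The particular parameters and hyperparameter values are assumptions chosen for illustration, not package defaults.

```r
# Illustrative custom priors; the values are assumptions for this example.
# Names follow "{parameter}[{covariate}]", values are JAGS distribution strings.
priors <- c(
  "lambda[u1]" = "dnorm(0.5, 1)",  # effect of u1 on the response
  "beta[x]"    = "dnorm(0, 0.1)"   # exposure effect in the response model
)

names(priors)
```

A vector of this form is what would be supplied through the priors argument of unm_glm().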
Families specified for the response and unmeasured confounder(s) may
introduce nuisance parameters, which require prior distributions as
well. The precision parameter, τ*, on a normal response or normal
unmeasured confounder has a Gamma(0.001, 0.001) default prior. Priors
can also be elicited in terms of σ* or σ*² through priors. The nuisance
parameter, αy, for a gamma response has a gamma prior with both shape
and rate set to 0.1. These nuisance parameters are tracked and their
posterior summaries are reported by default, but this can be modified.
The function runm() is used to generate data. It is not a required step
in the workflow of this package: a user who wishes to analyze an
existing data set can skip it. runm() can be used to perform a
simulation study, if desired, and data are generated as follows. The
perfectly measured covariates and unmeasured confounder(s) are generated
independently of one another. The user can specify these variables'
families and their respective distributions as named lists in runm().
Then, the data frame is generated using the named vector of response
model coefficients, βR, and treatment model coefficients, βT, that the
user provides in the function. runm() models the following:
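$$
\begin{aligned}
P(X = 1 \mid c, u_1, u_2) &= \text{logit}^{-1}(w'\beta_T) \\
E(Y \mid x, c, u_1, u_2) &= g_Y^{-1}(z'\beta_R)
\end{aligned}
$$
where w′ = [c′ u1 u2]′
and z′ = [x c′ u1 u2]′.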
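The generation scheme described above can be sketched in base R. This is not the implementation of runm(); it is an illustration that assumes a logistic model for the binary exposure and a normal response with identity link, and it reuses the coefficient values from the worked example in this vignette. All object names are hypothetical.

```r
set.seed(1)
n <- 100

# Perfectly measured covariates and unmeasured confounders, drawn independently.
z1 <- rnorm(n, mean = 0, sd = 1)
z2 <- rbinom(n, size = 1, prob = 0.3)
z3 <- rnorm(n, mean = 0, sd = 2)
u1 <- rnorm(n, mean = 0, sd = 1)
u2 <- rbinom(n, size = 1, prob = 0.3)

# Treatment model: binary exposure, logistic link assumed for this sketch.
w <- cbind(int = 1, z1, z2, z3, u1, u2)
beta_T <- c(int = -1, z1 = 0.4, z2 = 0.5, z3 = 0.4, u1 = 0.75, u2 = 0.75)
x <- rbinom(n, size = 1, prob = plogis(drop(w %*% beta_T)))

# Response model: normal response, identity link, standard deviation 1.
z <- cbind(w, x = x)
beta_R <- c(beta_T, x = 0.75)
y <- rnorm(n, mean = drop(z %*% beta_R), sd = 1)

sim <- data.frame(z1, z2, z3, u1, u2, x, y)
head(sim)
```

Because u1 and u2 enter both linear predictors, omitting them from an analysis of sim would confound the estimated effect of x on y, which is the situation the package is designed to address.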
All arguments in runm() other than n have a default value. So, in its
simplest form, one can generate a data set consisting of 100
observations by calling runm(100). The default arguments can be
customized if a particular data generation structure is desired. An
example of internal and external validation data is provided.
When the validation type is internal (the default), the data are first
generated for a sample of size n. Internal validation is available when
the unmeasured confounder is ascertained for a (typically small) subset
of the main study data. In certain cases, only internal validation data
may be accessible, and information on the unmeasured confounder is known
for only a subset of the patients. A researcher can vary the argument
missing_prop to determine how large validation studies need to be in a
simulation. missing_prop sets the proportion of the unmeasured
confounder's observations that are assigned NA. A more detailed example
is below; the result is assigned to df and is the data frame used
throughout the remainder of this vignette.
library("unmconf")
library("bayesplot")
library("ggplot2"); theme_set(theme_minimal())
set.seed(13L)
df <-
  runm(n = 100,
       type = "int",
       missing_prop = .75,
       covariate_fam_list = list("norm", "bin", "norm"),
       covariate_param_list = list(c(mean = 0, sd = 1), c(.3), c(0, 2)),
       unmeasured_fam_list = list("norm", "bin"),
       unmeasured_param_list = list(c(mean = 0, sd = 1), c(.3)),
       treatment_model_coefs =
         c("int" = -1, "z1" = .4, "z2" = .5, "z3" = .4,
           "u1" = .75, "u2" = .75),
       response_model_coefs =
         c("int" = -1, "z1" = .4, "z2" = .5, "z3" = .4,
           "u1" = .75, "u2" = .75, "x" = .75),
       response = "norm",
       response_param = c("si_y" = 1))
rbind(head(df, 5), tail(df, 5))
#> # A tibble: 10 × 7
#> z1 z2 z3 u1 u2 x y
#> <dbl> <int> <dbl> <dbl> <int> <int> <dbl>
#> 1 0.554 0 -5.69 0.614 1 0 -2.97
#> 2 -0.280 1 3.43 0.413 0 0 -1.28
#> 3 1.78 1 -2.46 -0.459 1 0 -0.993
#> 4 0.187 0 -0.628 -0.673 0 0 -2.36
#> 5 1.14 0 -0.140 0.193 0 1 0.230
#> 6 -0.256 0 1.00 NA NA 0 0.527
#> 7 -1.23 0 -0.866 NA NA 0 -0.479
#> 8 0.214 1 -0.680 NA NA 0 -0.822
#> 9 0.0672 0 -4.11 NA NA 0 -2.92
#> 10 0.857 1 -0.360 NA NA 0 0.00750
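The effect of missing_prop can be illustrated with a short base R sketch. It assumes that the observed values sit in the leading rows and the masked values in the trailing rows, which matches the printed head and tail of df but is not confirmed here as the package's rule. The names u1_int and n_obs are hypothetical.

```r
set.seed(13)
n <- 100
missing_prop <- 0.75

# A fully observed confounder, as it would exist before masking.
u1 <- rnorm(n)

# Keep the first (1 - missing_prop) share of rows as the validation subset
# and set the remainder to NA (row ordering is an assumption of this sketch).
n_obs <- round(n * (1 - missing_prop))
u1_int <- u1
u1_int[seq_len(n) > n_obs] <- NA

# Proportion of the confounder that is now missing.
mean(is.na(u1_int))
```

Repeating a simulation over a grid of missing_prop values is one way to gauge how small the validation subset can be before inference degrades.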
For external validation, the sample size argument can be a vector of
length 2 to represent the number of observations in the main study data
and external validation data, respectively. The sample size argument can
also be of length 1, in which case the sample size is split in half
between the two types of data (the main study data receive the
additional observation if n is odd). In the main study data, the
unmeasured confounders are missing for all observations. The external
data fully observe the unmeasured confounders but typically carry no
information on the exposure-unmeasured confounder relationship. Thus,
informative priors on the bias parameters for this relationship, γx and
δx in the above model, are needed to achieve convergence.
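The sample size rule for external validation can be written out as a small helper. The function split_n() below is hypothetical and only restates the rule in the text; it is not a function from the package.

```r
# Hypothetical helper restating the sample size rule for external validation.
split_n <- function(n) {
  if (length(n) == 2) {
    # Sizes given explicitly: main study first, external validation second.
    return(c(main = n[1], external = n[2]))
  }
  # Single size: split in half, main study takes the extra observation if odd.
  c(main = ceiling(n / 2), external = floor(n / 2))
}

split_n(101)
split_n(c(500, 100))
```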
If a user has a main study data set and an external validation data set
that were collected separately from one another, they can be joined
together through dplyr::bind_rows(). This function binds the data sets
by row and keeps all of the columns, even if the columns do not match
between data sets. Say that a researcher has a main study data set with
a binary outcome, a binary treatment, three perfectly measured standard
normal covariates, and two unmeasured confounders that are completely
missing. Further, say that the researcher also has an external
validation data set with the same outcome, two of the perfectly measured
covariates from the main study data, and the two unmeasured confounders
completely observed. If the third covariate that is missing from the
external validation data is deemed unimportant for modeling, then one
can still perform inference using this data. An example of this scenario
is below.
# Main Study Data
M <- tibble::tibble(
  "y" = rbinom(100, 1, .5),
  "x" = rbinom(100, 1, .3),
  "z1" = rnorm(100, 0, 1),
  "z2" = rnorm(100, 0, 1),
  "z3" = rnorm(100, 0, 1),
  "u1" = NA,
  "u2" = NA
)
# External Validation Data
EV <- tibble::tibble(
  "y" = rbinom(100, 1, .5),
  "z1" = rnorm(100, 0, 1),
  "z2" = rnorm(100, 0, 1),
  "u1" = rnorm(100, 0, 1),
  "u2" = rnorm(100, 0, 1)
)
# Stack both sources; x is absent from EV, so its NA values are set to 0
df_ext <- dplyr::bind_rows(M, EV) |>
  dplyr::mutate(x = ifelse(is.na(x), 0, x))
rbind(head(df_ext, 5), tail(df_ext, 5))
#> # A tibble: 10 × 7
#> y x z1 z2 z3 u1 u2
#> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 0 1 0.761 -0.313 -0.952 NA NA
#> 2 1 0 -0.695 -0.942 -0.107 NA NA
#> 3 0 0 0.564 0.0399 -1.15 NA NA
#> 4 1 1 0.706 1.96 -0.744 NA NA
#> 5 1 0 0.418 -0.251 -0.0683 NA NA
#> 6 0 0 1.60 -0.232 NA 0.676 -0.584
#> 7 1 0 -1.41 -0.621 NA -2.38 0.700
#> 8 1 0 0.988 -0.214 NA -1.18 -0.206
#> 9 1 0 -0.561 -0.506 NA 0.474 -1.15
#> 10 1 0 0.448 0.930 NA -0.958 -0.533
The main focus of this vignette is unm_glm(), which fits the Bayesian
unmeasured confounding model and draws posterior samples through MCMC
iterations. Upon acquiring all the relevant information, unm_glm()
carries out two main tasks: it constructs a JAGS model, and it
pre-processes the data for use by JAGS. This simplifies the process of
performing a fully Bayesian analysis, as it spares users from needing to
be familiar with JAGS syntax for both the model and data processing. The
primary aim of the unmconf package, akin to rstanarm and brms, is to
offer a user-friendly interface for Bayesian analysis, using programming
techniques familiar to R users. Users can input model information into
unm_glm() much as they would for the standard stats::glm() function,
providing arguments like formula, family, and data.
The R language provides a straightforward syntax for denoting linear
models, typically written as response ~ terms, where the coefficients
are implicitly represented. Like other R functions for model fitting,
users have the option to use the . - {vars} syntax instead of listing
all predictors. For instance, if the user wants to model the first
unmeasured confounder, smoking, given predictors age, weight, height,
and salary, they can use either
form2 = smoking ~ age + weight + height + salary
or
form2 = smoking ~ . - {response variable}.
To estimate the unmeasured confounder(s), we often model them
conditional on some or all of the perfectly measured covariates and the
treatment. Once estimated, these unmeasured confounder(s) become
predictors in the higher levels of the model structure. On the
right-hand side of the ~, we define the linear combination of predictors
that models the response variable. Additionally, the predictors can
include polynomial terms (e.g., poly(z, 2)) and interaction effects
(e.g., x*z).
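How R expands these formula shortcuts can be checked with stats::terms(), independently of unmconf. The data frame dat and its values below are made up for the illustration; only the column names matter.

```r
# Toy data frame; the values are arbitrary and only the column names matter.
dat <- data.frame(
  smoking = c(0, 1, 0),
  age     = c(30, 45, 52),
  weight  = c(70, 82, 91),
  height  = c(170, 180, 175),
  salary  = c(40, 55, 60),
  y       = c(1.2, 0.4, 2.1)
)

# "." expands to every column except the left-hand side; "- y" then drops y.
dot_labels <- attr(terms(smoking ~ . - y, data = dat), "term.labels")
dot_labels

# x * z expands to both main effects plus their interaction.
int_labels <- attr(terms(y ~ x * z), "term.labels")
int_labels

# poly(z, 2) is kept as a single term that generates two columns when fitted.
poly_labels <- attr(terms(y ~ poly(z, 2)), "term.labels")
poly_labels
```

The first result lists the same four predictors as the explicit formula smoking ~ age + weight + height + salary.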
unm_glm() also accepts arguments that control MCMC computation of the
posterior distribution and are passed to coda.samples, such as n.iter,
n.adapt, thin, and n.chains. These arguments have default values, but
the user is encouraged to supply their own, given the lack of
convergence that is sometimes observed when validation sample sizes are
small or priors are particularly diffuse.
For instance, consider the Bayesian unmeasured confounding model for the
generated data set above, df, with a normal response, normal first
unmeasured confounder, binary second unmeasured confounder, and three
perfectly observed covariates. Further, let's say that, through expert
opinion, we have prior information that the effect on y from u1, λu1, is
normally distributed with a mean of 0.5 and standard deviation of 1.
JAGS parameterizes the normal distribution with precision rather than
standard deviation, so we would use τu1 = 1/σ²u1 = 1
for the prior distribution λu1 ∼ N(.5, τu1 = 1).
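The conversion between standard deviation and precision is easy to get wrong when eliciting priors, so a pair of helper functions can be convenient. The helpers below are hypothetical and not part of unmconf; they only encode τ = 1/σ².

```r
# Hypothetical helpers: convert between standard deviation and JAGS precision.
sd_to_precision <- function(sd) 1 / sd^2
precision_to_sd <- function(tau) 1 / sqrt(tau)

# Prior for lambda[u1]: mean 0.5, standard deviation 1, hence precision 1.
tau_u1 <- sd_to_precision(1)
prior_lambda_u1 <- sprintf("dnorm(%s, %s)", 0.5, tau_u1)
prior_lambda_u1

# A standard deviation of 2 corresponds to a precision of 0.25, and back again.
sd_to_precision(2)
precision_to_sd(0.25)
```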
Using unm_glm(), we fit:
unm_mod <-
  unm_glm(form1 = y ~ x + z1 + z2 + z3 + u1 + u2,  # y ~ .,
          form2 = u1 ~ x + z1 + z2 + z3 + u2,      # u1 ~ . - y,
          form3 = u2 ~ x + z1 + z2 + z3,           # u2 ~ . - y - u1,
          family1 = gaussian(),
          family2 = gaussian(),
          family3 = binomial(),
          priors = c("lambda[u1]" = "dnorm(.5, 1)"),
          n.iter = 10000, n.adapt = 4000, thin = 1,
          data = df)
Next, we fit the Bayesian unmeasured confounding model to the external
validation data set above, df_ext. Suppose that, through expert opinion,
we have knowledge of the relationship between the treatment and both
unmeasured confounders, where each relationship is normally distributed.
As before, JAGS parameterizes the normal distribution with precision
rather than standard deviation, so the second argument of each prior is
a precision. From expert opinion, we get the prior distributions
γx ∼ N(1.1, 0.9) and δx ∼ N(1.1, 4.5).
Using unm_glm(), we fit:
unm_mod_ext <-
  unm_glm(form1 = y ~ x + z1 + z2 + u1 + u2,  # y ~ . - z3,
          form2 = u1 ~ x + z1 + z2 + u2,      # u1 ~ . - y - z3,
          form3 = u2 ~ x + z1 + z2,           # u2 ~ . - y - u1 - z3,
          family1 = binomial(),
          family2 = gaussian(),
          family3 = gaussian(),
          priors = c("gamma[x]" = "dnorm(1.1, 0.9)",
                     "delta[x]" = "dnorm(1.1, 4.5)"),
          n.iter = 4000, n.adapt = 2000, thin = 1,
          data = df_ext)
By leveraging unm_glm(), users can conveniently implement complex
Bayesian models, particularly those involving unmeasured confounders,
without grappling with the intricacies of JAGS syntax or handling
missing data. The Bayesian unmeasured confounding model structure of
unmconf is currently set up to work with at most two unmeasured
confounders. Instances may arise where the user wants to work with more
than two. To accommodate this, the user can extract the JAGS code that
unm_glm() generates, either through the argument code_only = TRUE in the
function itself or through the separate function, jags_code(). With the
model produced by unmconf as a starting point, the user should be able
to follow the JAGS syntax and add the extra layer(s) to the model
structure along with the respective prior distribution(s). Limiting the
package to two unmeasured confounders keeps the built-in models
reasonably reliable even when users do not check for convergence, which
reduces the risk of reporting results from a poorly fitting model. Both
routes return the model's JAGS code.
After the model is fit and before using the MCMC samples for inference,
it is necessary for users to assess whether the chains have converged
appropriately. Hierarchical models with unmeasured confounding often run
into convergence issues. To aid in chain convergence, we heavily
increased the burn-in length and MCMC iterations from the default values
of 1000 and 2000 in the function unm_glm() to 6000 and 10000,
respectively. Additional checks are that the posterior kernel density
plots appear relatively smooth in shape and that the trace, or history,
plots of the chains show very similar values across the iterations
(i.e., they "mix" well and the chains intermingle).
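A numerical companion to these visual checks is the potential scale reduction factor, which compares between-chain and within-chain variability and should be close to 1 when chains agree. The base R sketch below implements the basic, non-split version on simulated chains; the function rhat() is hypothetical, and the r_hat column reported by unm_summary() may be computed by a more refined variant.

```r
# Basic (non-split) potential scale reduction factor.
# `chains` is a matrix with one column per chain and one row per iteration.
rhat <- function(chains) {
  n <- nrow(chains)
  W <- mean(apply(chains, 2, var))   # average within-chain variance
  B <- n * var(colMeans(chains))     # between-chain variance
  sqrt(((n - 1) / n * W + B / n) / W)
}

set.seed(1)
# Four well-mixed chains drawn from the same distribution.
good <- matrix(rnorm(4000), ncol = 4)
# Same chains, but the fourth is shifted and so explores a different region.
bad <- good + rep(c(0, 0, 0, 3), each = 1000)

rhat(good)  # close to 1
rhat(bad)   # clearly above 1
```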
The bayesplot package provides a variety of ggplot2-based plotting
functions for use after fitting Bayesian models. bayesplot::mcmc_hist()
plots a histogram of the MCMC draws from all chains, and
bayesplot::mcmc_trace() draws a trace plot of the chains. Given that βx
is our parameter of interest, we display only the histogram and trace
plot of this parameter below. The histogram appears smooth and without
any jaggedness. The chains in the trace plot appear to mix well, as one
cannot differentiate between them or identify patterns across the
iterations for this model. Thus, there is no lack of convergence evident
here. All parameters in the model met convergence standards.
bayesplot::mcmc_hist(unm_mod, pars = "beta[x]")
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Once the posterior samples have been computed, the unm_glm() function
returns an R object as a list of the posterior samples, where the length
of the list matches the number of chains. Printing the returned object,
in our example unm_mod, outputs the model's call and a named vector of
coefficients at each level of the Bayesian unmeasured confounding model.
A more formal model summary comes through unm_summary(), which obtains
and prints a summary table of the results. Every parameter is summarized
by the mean of the posterior distribution along with its two-sided 95%
credible interval based on quantiles. When the data set was generated
via runm(), adding the data argument appends a column to the summary
table consisting of the true parameter values that were assigned.
unm_summary(unm_mod, df) |>
head(10)
#> # A tibble: 10 × 9
#> param true_value mean sd `2.5%` `50%` `97.5%` r_hat ess
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 beta[1] NA -0.954 0.275 -1.50 -0.954 -0.416 1.01 3477.
#> 2 beta[x] 0.75 0.997 0.431 0.147 0.998 1.84 1.00 3305.
#> 3 beta[z1] 0.4 0.511 0.205 0.114 0.508 0.920 1.00 4571.
#> 4 beta[z2] 0.5 0.261 0.385 -0.483 0.255 1.03 1.00 6084.
#> 5 beta[z3] 0.4 0.381 0.0970 0.184 0.382 0.568 1.00 3376.
#> 6 delta[1] NA -1.38 0.768 -2.99 -1.34 0.0308 1.02 1531.
#> 7 delta[x] NA 1.10 1.04 -0.896 1.09 3.24 1.02 2074.
#> 8 delta[z1] NA 0.384 0.592 -0.752 0.377 1.57 1.01 2606.
#> 9 delta[z2] NA 1.51 1.18 -0.707 1.48 3.93 1.01 2009.
#> 10 delta[z3] NA -0.517 0.256 -1.07 -0.499 -0.0711 1.01 2763.
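The mean and quantile columns of such a table are simple functions of the posterior draws. The base R sketch below uses simulated draws as a stand-in for one parameter's MCMC output; the draws and the object names are hypothetical and unrelated to the fitted model above.

```r
set.seed(1)
# Stand-in for the pooled MCMC draws of a single parameter (simulated here).
draws <- rnorm(10000, mean = 1, sd = 0.4)

# Posterior mean plus the quantiles that bound a two-sided 95% credible interval.
post_summary <- c(mean = mean(draws),
                  quantile(draws, probs = c(0.025, 0.5, 0.975)))
post_summary
```

The interval from the 2.5% to the 97.5% quantile is the quantile-based 95% credible interval described above.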
For model comparison, the deviance information criterion (DIC) and
penalized expected deviance are provided through unm_dic(). DIC, a
Bayesian analogue of AIC, is calculated by adding the "effective number
of parameters" to the expected deviance and is computed through a
wrapper around rjags::dic.samples().
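In symbols, writing D(θ) = -2 log f(y | θ) for the deviance, the criterion described above can be expressed as follows. The second line gives the classical definition of the effective number of parameters; rjags may estimate this quantity differently, so it is shown here only as a reference point.

```latex
\begin{aligned}
\mathrm{DIC} &= \bar{D} + p_D, \qquad \bar{D} = E_{\theta \mid y}\left[D(\theta)\right], \\
p_D &= \bar{D} - D(\bar{\theta}) \quad \text{(classical definition, with } \bar{\theta} \text{ the posterior mean)}.
\end{aligned}
```

When comparing candidate models fitted to the same data, smaller values of DIC indicate a better trade-off between fit and complexity.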
As mentioned above, unmeasured confounders can be considered parameters
in the Bayesian paradigm and are therefore estimated when performing
MCMC. Yet unm_summary() does not report the estimates of these
"parameters" from the model fit. For this, unm_backfill() pairs with the
original data set to impute the missing values of the unmeasured
confounders with their posterior estimates. The ten observations below
are rows 16 through 25 of df. Columns u1_observed and u2_observed are
logical variables added to the original data frame to indicate which
values were originally observed and which were imputed. Both unmeasured
confounders were observed in the rows shown, so both columns are TRUE
throughout; wherever one of these columns is FALSE, the corresponding
value of u1 or u2 is the imputed posterior estimate.
unm_backfill(df, unm_mod)[16:25, ]
#> # A tibble: 10 × 9
#> z1 z2 z3 u1 u2 x y u1_observed u2_observed
#> <dbl> <int> <dbl> <dbl> <dbl> <int> <dbl> <lgl> <lgl>
#> 1 -0.194 0 -1.97 0.0284 0 0 -1.53 TRUE TRUE
#> 2 1.40 0 0.568 -0.706 1 1 1.42 TRUE TRUE
#> 3 0.101 0 0.0589 0.635 0 1 0.430 TRUE TRUE
#> 4 -0.114 1 1.60 1.17 0 1 1.40 TRUE TRUE
#> 5 0.702 0 -2.61 -0.308 1 0 0.189 TRUE TRUE
#> 6 0.263 0 5.54 -0.704 0 1 0.442 TRUE TRUE
#> 7 1.84 0 -0.691 -1.27 1 0 -0.174 TRUE TRUE
#> 8 0.357 1 2.01 0.0956 0 0 2.89 TRUE TRUE
#> 9 -1.05 1 5.63 -0.447 1 1 1.55 TRUE TRUE
#> 10 0.620 0 1.66 -1.71 1 1 1.37 TRUE TRUE
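The backfilling idea can be sketched in base R on a toy example. The posterior estimates below are made-up numbers standing in for whatever posterior summary the model produces; this is not the implementation of unm_backfill().

```r
# Toy confounder with two observed and two missing values.
toy <- data.frame(u1 = c(0.61, 0.41, NA, NA))

# Hypothetical posterior estimates for the missing entries (made-up values).
post_est_u1 <- c(NA, NA, 0.12, -0.30)

# Flag which values were observed, then fill the gaps with the estimates.
filled <- toy
filled$u1_observed <- !is.na(toy$u1)
filled$u1 <- ifelse(filled$u1_observed, toy$u1, post_est_u1)
filled
```

Rows with u1_observed equal to FALSE now carry the estimate in place of NA, mirroring the layout of the table above.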
To visualize the results from the model fit, the quantile-based
posterior credible intervals can be drawn using
bayesplot::mcmc_intervals(). A credible interval plot from the worked
example is below. This plots the credible intervals for all parameters
from the posterior draws of all the chains. We modify the default
setting for the outer credible interval to be 0.95 for comparison with
the output results from unm_summary(). The light blue circle in the
middle of each parameter's interval portrays the posterior median. The
bold, dark blue line displays the 50% credible interval and the thin,
blue line covers the 95% credible interval. For simulation studies,
where the true parameter values are known, calling the argument
true_params will add a layer of light red circles to each credible
interval to illustrate the true value used to generate the data. Here,
the gamma and delta parameters do not have a true value because we
generated the unmeasured confounders u1 and u2 as independent
normal/Bernoulli random variables.
bayesplot::mcmc_intervals(unm_mod, prob_outer = .95,
                          regex_pars = "(beta|lambda|gamma|delta|zeta).+") +
  geom_point(
    aes(value, name), data = tibble::enframe(attr(df, "params")) |>
      dplyr::mutate(name = gsub("int", "1", name)),
    color = "red", fill = "pink", size = 4, shape = 21
  )