BayesPostEst contains functions to generate postestimation quantities
after estimating Bayesian regression models. The package was inspired by
a set of functions written originally for Johannes Karreth’s workshop on
Bayesian modeling at the ICPSR Summer Program. It has grown to include
new functions (see mcmcReg) and will continue to grow to support Bayesian
postestimation. For now, the package focuses mostly on generalized
linear regression models for binary outcomes (logistic and probit
regression). More details on the package philosophy, its functions, and
related packages can be found in Scogin et al.
(2019).
To install the latest release on CRAN:
The latest development version on GitHub can be installed with:
Once you have installed the package, you can access it by calling:
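A minimal sketch of the install-and-load steps. The `install.packages()` call is left commented out so the block has no side effects, and the `requireNamespace()` guard is our own addition for illustration rather than something the package requires.

```r
# Install the CRAN release once (uncomment to run):
# install.packages("BayesPostEst")

# Attach the package only if it is available, so a script fails gracefully
pkg_available <- requireNamespace("BayesPostEst", quietly = TRUE)
if (pkg_available) {
  library("BayesPostEst")
} else {
  message("BayesPostEst is not installed; run install.packages(\"BayesPostEst\").")
}
```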
After the package is loaded, run ?BayesPostEst to see the help file.
Most functions in this package work with posterior distributions of parameters. These distributions need to be converted into a matrix. All functions in the package do this automatically for posterior draws generated by JAGS, BUGS, MCMCpack, rstan, and rstanarm. For posterior draws generated by other tools, users must convert these objects into a matrix, where rows represent iterations and columns represent parameters.
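As a sketch of the required layout for draws from other tools, the block below builds a matrix with one row per iteration and one named column per parameter. The draws are simulated stand-ins and the object names (`draws`, `mcmcmat`) are hypothetical.

```r
set.seed(1)

# Hypothetical output from some other sampler: 1000 iterations, 4 parameters
draws <- data.frame(
  intercept    = rnorm(1000, -0.46, 0.08),
  female       = rnorm(1000,  0.24, 0.11),
  neuroticism  = rnorm(1000,  0.06, 0.11),
  extraversion = rnorm(1000,  0.52, 0.11)
)

# Rows are iterations, columns are parameters
mcmcmat <- as.matrix(draws)
dim(mcmcmat)
colnames(mcmcmat)
```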
This vignette uses the Cowles
dataset (Cowles and Davis 1987) from the carData package
(Fox, Weisberg, and Price 2018).
This data frame contains information on 1421 individuals in the following variables: neuroticism, extraversion, sex, and volunteer.
Before proceeding, we convert the two factor variables sex and volunteer into numeric variables. We also mean-center and standardize the two continuous variables by dividing each by two standard deviations (Gelman and Hill 2007).
df$female <- (as.numeric(df$sex) - 2) * (-1)
df$volunteer <- as.numeric(df$volunteer) - 1
df$extraversion <- (df$extraversion - mean(df$extraversion)) / (2 * sd(df$extraversion))
df$neuroticism <- (df$neuroticism - mean(df$neuroticism)) / (2 * sd(df$neuroticism))
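To see what these transformations do, the following sketch applies them to a small made-up data frame (`toy` is hypothetical and only mimics the structure of the Cowles data). After dividing by two standard deviations, the rescaled variable has mean 0 and standard deviation 0.5.

```r
# Toy stand-in with the same factor coding as the Cowles data
toy <- data.frame(
  sex = factor(c("female", "male", "male", "female"), levels = c("female", "male")),
  volunteer = factor(c("no", "yes", "no", "yes"), levels = c("no", "yes")),
  extraversion = c(8, 12, 15, 10)
)

# Factor levels are stored as 1, 2; recode so that female = 1 and male = 0
toy$female <- (as.numeric(toy$sex) - 2) * (-1)
# Recode no/yes to 0/1
toy$volunteer <- as.numeric(toy$volunteer) - 1
# Center, then divide by two standard deviations
toy$extraversion <- (toy$extraversion - mean(toy$extraversion)) / (2 * sd(toy$extraversion))

toy[, c("sex", "female", "volunteer", "extraversion")]
c(mean = mean(toy$extraversion), sd = sd(toy$extraversion))
```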
We estimate a Bayesian generalized linear model with the inverse logit link function, where
$$\Pr(\text{Volunteering}_i) = \text{logit}^{-1}(\beta_1 + \beta_2 \text{Female}_i + \beta_3 \text{Neuroticism}_i + \beta_4 \text{Extraversion}_i)$$
BayesPostEst functions accommodate GLM estimates for both logit and probit link functions. The examples proceed with the logit link function. If we had estimated a probit regression, the corresponding argument link in relevant function calls would need to be set to link = "probit". Otherwise, it is set to link = "logit" by default.
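The two link choices differ only in how a linear predictor is mapped to a probability. A brief base R illustration, using arbitrary example values for the linear predictor `eta`:

```r
eta <- c(-2, 0, 2)          # example linear predictor values

p_logit  <- plogis(eta)     # inverse logit: 1 / (1 + exp(-eta))
p_probit <- pnorm(eta)      # inverse probit: standard normal CDF

round(cbind(eta, p_logit, p_probit), 3)
```

Both functions map 0 to a probability of 0.5, but the probit curve approaches 0 and 1 faster, which is why coefficients from the two models are not directly comparable.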
To use BayesPostEst, we first estimate a Bayesian regression model. This vignette demonstrates five tools for doing so: JAGS (via the R2jags and rjags packages), MCMCpack, and the two Stan interfaces rstan and rstanarm.
First, we prepare the data for JAGS (Plummer 2017). Users need to combine all variables into a list and add any other elements the model requires, in this case N, the number of observations.
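A minimal sketch of this step on a made-up data frame. The names `toy` and `dl_toy` are hypothetical; in the vignette the same pattern would be applied to the recoded data frame.

```r
# Small stand-in for the recoded data
toy <- data.frame(
  volunteer    = c(0, 1, 0, 1),
  female       = c(1, 0, 0, 1),
  neuroticism  = c(-0.3, 0.1, 0.4, -0.2),
  extraversion = c(-0.5, 0.1, 0.6, -0.2)
)

# Each column becomes one named list element, as JAGS expects
dl_toy <- as.list(toy)
# Add the number of observations used in the model's for loop
dl_toy$N <- nrow(toy)

str(dl_toy)
```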
We then write the JAGS model into the working directory.
mod.jags <- paste("
model {
  for (i in 1:N){
    volunteer[i] ~ dbern(p[i])
    logit(p[i]) <- mu[i]
    mu[i] <- b[1] + b[2] * female[i] + b[3] * neuroticism[i] + b[4] * extraversion[i]
  }
  for(j in 1:4){
    b[j] ~ dnorm(0, 0.1)
  }
}
")
writeLines(mod.jags, "mod.jags")
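Note that JAGS parameterizes dnorm by mean and precision (the inverse of the variance), not by standard deviation. A quick calculation shows what the prior dnorm(0, 0.1) implies on the standard deviation scale:

```r
precision <- 0.1
prior_sd  <- 1 / sqrt(precision)   # standard deviation implied by the precision
round(prior_sd, 2)
```

The implied standard deviation is a little above 3, so this prior is only weakly informative for coefficients on standardized predictors.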
We then define the parameters for which we wish to retain posterior distributions and provide starting values.
params.jags <- c("b")
inits1.jags <- list("b" = rep(0, 4))
inits.jags <- list(inits1.jags, inits1.jags, inits1.jags, inits1.jags)
Now, fit the model using the R2jags package.
library("R2jags")
set.seed(123)
fit.jags <- jags(data = dl, inits = inits.jags,
parameters.to.save = params.jags, n.chains = 4, n.iter = 2000,
n.burnin = 1000, model.file = "mod.jags")
#> Compiling model graph
#> Resolving undeclared variables
#> Allocating nodes
#> Graph information:
#> Observed stochastic nodes: 1421
#> Unobserved stochastic nodes: 4
#> Total graph size: 6864
#>
#> Initializing model
The same data and model can be used to fit the model using the rjags package:
library("rjags")
mod.rjags <- jags.model(file = "mod.jags", data = dl, inits = inits.jags,
n.chains = 4, n.adapt = 1000)
#> Compiling model graph
#> Resolving undeclared variables
#> Allocating nodes
#> Graph information:
#> Observed stochastic nodes: 1421
#> Unobserved stochastic nodes: 4
#> Total graph size: 6864
#>
#> Initializing model
fit.rjags <- coda.samples(model = mod.rjags,
variable.names = params.jags,
n.iter = 2000)
We estimate the same model using MCMCpack (Martin, Quinn, and Park 2011).
We write the same model in the Stan language.
mod.stan <- paste("
data {
  int<lower=0> N;
  int<lower=0,upper=1> volunteer[N];
  vector[N] female;
  vector[N] neuroticism;
  vector[N] extraversion;
}
parameters {
  vector[4] b;
}
model {
  volunteer ~ bernoulli_logit(b[1] + b[2] * female + b[3] * neuroticism + b[4] * extraversion);
  for(i in 1:4){
    b[i] ~ normal(0, 3);
  }
}
")
writeLines(mod.stan, "mod.stan")
We then load rstan (Stan Development Team 2019)…
… and estimate the model, re-using the data in list format created for JAGS earlier.
Lastly, we use the rstanarm interface (Goodrich et al. 2019) to estimate the same model again.
BayesPostEst contains functions to generate regression tables from objects created by the following packages: R2jags, runjags, rjags, R2WinBUGS, MCMCpack, rstan, rstanarm, and brms. This includes the following object classes: jags, rjags, bugs, mcmc, mcmc.list, stanreg, stanfit, brmsfit. The package contains two different functions to produce regression tables: mcmcTab and mcmcReg. Each has its own advantages, which we discuss in depth below.
mcmcTab generates a table summarizing the posterior distributions of all parameters contained in the model object. This table can then be used to summarize parameter quantities. By default, mcmcTab generates a data frame with one row per parameter and columns containing the median, standard deviation, and 95% credible interval of each parameter’s posterior distribution.
mcmcTab(fit.jags)
#> Variable Median SD Lower Upper
#> 1 b[1] -0.456 0.083 -0.624 -0.299
#> 2 b[2] 0.235 0.112 0.017 0.459
#> 3 b[3] 0.062 0.113 -0.156 0.285
#> 4 b[4] 0.518 0.110 0.300 0.732
#> 5 deviance 1909.435 2.908 1906.515 1917.313
mcmcTab(fit.rjags)
#> Variable Median SD Lower Upper
#> 1 b[1] -0.460 0.082 -0.622 -0.299
#> 2 b[2] 0.237 0.112 0.021 0.452
#> 3 b[3] 0.063 0.112 -0.156 0.278
#> 4 b[4] 0.519 0.110 0.305 0.734
mcmcTab(fit.MCMCpack)
#> Variable Median SD Lower Upper
#> 1 (Intercept) -0.463 0.083 -0.612 -0.304
#> 2 female 0.231 0.110 0.016 0.435
#> 3 neuroticism 0.058 0.112 -0.147 0.321
#> 4 extraversion 0.509 0.102 0.320 0.718
mcmcTab(fit.rstanarm)
#> Variable Median SD Lower Upper
#> 1 (Intercept) -0.459 0.084 -0.624 -0.293
#> 2 female 0.236 0.111 0.014 0.449
#> 3 neuroticism 0.065 0.111 -0.149 0.280
#> 4 extraversion 0.521 0.109 0.306 0.730
Users can add a column to the table that calculates the percent of posterior draws that have the same sign as the median of the posterior distribution.
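The quantity in that column can be computed by hand from a vector of draws. The sketch below uses simulated draws as a stand-in for one coefficient's posterior; it illustrates the definition and is not the package's internal code.

```r
set.seed(2)
draws <- rnorm(4000, mean = 0.24, sd = 0.11)   # simulated stand-in posterior

# Share of draws whose sign matches the sign of the posterior median
pr_same_sign <- mean(sign(draws) == sign(median(draws)))
pr_same_sign
```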
Users can also define a “region of practical equivalence” (ROPE; Kruschke 2013, 2018). This region is a band of values around 0 that are “practically equivalent” to 0 or no effect. For this to be useful, all parameters (e.g. regression coefficients) must be on the same scale because mcmcTab accepts only one definition of ROPE for all parameters. Users can standardize regression coefficients to achieve this. Because we standardized variables earlier, the coefficients (except the intercept) are on a similar scale and we define the ROPE to be between -0.1 and 0.1.
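To make the ROPE idea concrete, the following sketch computes the share of draws inside and outside the band from -0.1 to 0.1. The draws are simulated and the variable names are hypothetical; the calculation only illustrates the concept.

```r
set.seed(3)
draws <- rnorm(4000, mean = 0.06, sd = 0.11)   # simulated stand-in posterior
rope  <- c(-0.1, 0.1)

pr_in_rope  <- mean(draws >= rope[1] & draws <= rope[2])  # practically equivalent to 0
pr_out_rope <- 1 - pr_in_rope                             # outside the band

c(inside = pr_in_rope, outside = pr_out_rope)
```

A large share inside the ROPE suggests the coefficient is hard to distinguish from no effect at the chosen threshold.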
The mcmcReg function serves as an interface to texreg and produces more polished and publication-ready tables than mcmcTab. mcmcReg writes tables in HTML or LaTeX format. mcmcReg can produce tables with multiple models with each model in a column and supports flexible renaming of parameters. However, these tables are more similar to standard frequentist regression tables, so they do not have a way to incorporate the percent of posterior draws that have the same sign as the median of the posterior distribution or a ROPE like mcmcTab is able to. Uncertainty intervals can be either standard credible intervals or highest posterior density intervals (Kruschke 2015) using the hpdi argument, and their level can be set with the ci argument (default 95%). Separately calculated goodness of fit statistics can be included with the gof argument.
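The difference between the two interval types is easiest to see on a skewed distribution. The sketch below compares an equal-tailed interval with a simple empirical highest posterior density interval, computed as the shortest window containing about 95% of the sorted draws. The helper `hpd_interval` is hypothetical and written for illustration; it is not the routine mcmcReg uses internally.

```r
# Shortest interval spanning roughly `prob` of the sorted draws
hpd_interval <- function(x, prob = 0.95) {
  x <- sort(x)
  n <- length(x)
  gap <- max(1, min(n - 1, round(n * prob)))  # index span of each candidate window
  starts <- seq_len(n - gap)
  widths <- x[starts + gap] - x[starts]
  i <- which.min(widths)                      # narrowest window
  c(lower = x[i], upper = x[i + gap])
}

set.seed(5)
draws <- rgamma(4000, shape = 2, rate = 1)    # skewed stand-in posterior

equal_tailed <- quantile(draws, c(0.025, 0.975))
hpd <- hpd_interval(draws)

rbind(equal_tailed = unname(equal_tailed), hpd = unname(hpd))
```

For symmetric posteriors the two intervals nearly coincide; for skewed ones the highest posterior density interval is shorter and shifted toward the mode.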
| | Model 1 |
|---|---|
| b[1] | -0.46* |
| | [-0.62; -0.30] |
| b[2] | 0.24* |
| | [0.02; 0.46] |
| b[3] | 0.06 |
| | [-0.16; 0.29] |
| b[4] | 0.52* |
| | [0.30; 0.73] |
| deviance | 1910.06* |
| | [1906.52; 1917.31] |
| * 0 outside 95% credible interval. | |
mcmcReg supports limiting the parameters included in the table via the pars argument. By default, all parameters saved in the model object will be included. In the case of fit.jags, this includes the deviance estimate. If we wish to exclude it, we can specify pars = 'b', which will capture b[1] through b[4] using regular expression matching.
| | Model 1 |
|---|---|
| b[1] | -0.46* |
| | [-0.62; -0.30] |
| b[2] | 0.24* |
| | [0.02; 0.46] |
| b[3] | 0.06 |
| | [-0.16; 0.29] |
| b[4] | 0.52* |
| | [0.30; 0.73] |
| * 0 outside 95% credible interval. | |
If we only wish to exclude a single coefficient (here b[2]), we can do this by explicitly specifying the parameters we wish to include as a vector. Note that in this example we have to escape the square brackets in pars because they are reserved characters in regular expressions.
mcmcReg(fit.jags, pars = c('b\\[1\\]', 'b\\[3\\]', 'b\\[4\\]'),
        format = 'html', regex = TRUE, doctype = FALSE)
| | Model 1 |
|---|---|
| b[1] | -0.46* |
| | [-0.62; -0.30] |
| b[3] | 0.06 |
| | [-0.16; 0.29] |
| b[4] | 0.52* |
| | [0.30; 0.73] |
| * 0 outside 95% credible interval. | |
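The need for escaping can be checked with base R's grepl() on a vector of parameter names. This sketch shows only how the patterns behave; the name vector is typed by hand to mirror the parameters above.

```r
par_names <- c("b[1]", "b[2]", "b[3]", "b[4]", "deviance")

# A bare "b" matches every name that contains the letter b
grepl("b", par_names)

# Unescaped brackets form a character class, so this looks for "b1"
grepl("b[1]", par_names)

# Escaped brackets are matched literally
grepl("b\\[1\\]", par_names)

# Alternatively, switch off regular expressions entirely
grepl("b[1]", par_names, fixed = TRUE)
```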
mcmcReg also supports partial regular expression matching of multiple parameter family names as demonstrated below.
| | Model 1 |
|---|---|
| b[1] | -0.46* |
| | [-0.62; -0.30] |
| b[2] | 0.24* |
| | [0.02; 0.46] |
| b[3] | 0.06 |
| | [-0.16; 0.29] |
| b[4] | 0.52* |
| | [0.30; 0.73] |
| deviance | 1910.06* |
| | [1906.52; 1917.31] |
| * 0 outside 95% credible interval. | |
mcmcReg supports custom coefficient names to support publication-ready tables. The simplest option is via the coefnames argument. Note that the number of parameters and the number of custom coefficient names must match, so it is a good idea to use pars in tandem with coefnames.
mcmcReg(fit.jags, pars = 'b',
        coefnames = c('(Constant)', 'Female', 'Neuroticism', 'Extraversion'),
        format = 'html', regex = TRUE, doctype = FALSE)
| | Model 1 |
|---|---|
| (Constant) | -0.46* |
| | [-0.62; -0.30] |
| Female | 0.24* |
| | [0.02; 0.46] |
| Neuroticism | 0.06 |
| | [-0.16; 0.29] |
| Extraversion | 0.52* |
| | [0.30; 0.73] |
| * 0 outside 95% credible interval. | |
A more flexible way to include custom coefficient names is via the custom.coef.map argument, which accepts a named list, with names as parameter names in the model and values as the custom coefficient names.
mcmcReg(fit.jags, pars = 'b',
        custom.coef.map = list('b[1]' = '(Constant)',
                               'b[2]' = 'Female',
                               'b[3]' = 'Neuroticism',
                               'b[4]' = 'Extraversion'),
        format = 'html', regex = TRUE, doctype = FALSE)
| | Model 1 |
|---|---|
| (Constant) | -0.46* |
| | [-0.62; -0.30] |
| Female | 0.24* |
| | [0.02; 0.46] |
| Neuroticism | 0.06 |
| | [-0.16; 0.29] |
| Extraversion | 0.52* |
| | [0.30; 0.73] |
| * 0 outside 95% credible interval. | |
The advantage of custom.coef.map is that it can flexibly reorder and omit coefficients from the table based on their positions within the list. Notice in the code below that deviance does not have to be included in pars because its absence from custom.coef.map omits it from the resulting table.
mcmcReg(fit.jags,
        custom.coef.map = list('b[2]' = 'Female',
                               'b[4]' = 'Extraversion',
                               'b[1]' = '(Constant)'),
        format = 'html', doctype = FALSE)
| | Model 1 |
|---|---|
| Female | 0.24* |
| | [0.02; 0.46] |
| Extraversion | 0.52* |
| | [0.30; 0.73] |
| (Constant) | -0.46* |
| | [-0.62; -0.30] |
| * 0 outside 95% credible interval. | |
However, it is important to remember that mcmcReg will look for the parameter names in the model object, so be sure to inspect it for the correct parameter names. This is important because stan_glm will produce a model object with variable names instead of indexed parameter names.
mcmcReg accepts multiple model objects and will produce a table with one model per column. To produce a table from multiple models, pass a list of models as the mod argument to mcmcReg.
Note, however, that all model objects must be of the same class, so it is not possible to generate a table from a jags object and a stanfit object.
When including multiple models, supplying scalars or vectors to arguments will result in them being applied to each model equally. Treating models differentially is possible by supplying a list of scalars or vectors instead.
mcmcReg(list(fit.rstanarm, fit.rstanarm),
        pars = list(c('female', 'extraversion'), 'neuroticism'),
        format = 'html', doctype = FALSE)
| | Model 1 | Model 2 |
|---|---|---|
| female | 0.24* | |
| | [0.01; 0.45] | |
| extraversion | 0.52* | |
| | [0.31; 0.73] | |
| neuroticism | | 0.07 |
| | | [-0.15; 0.28] |
| * Null hypothesis value outside 95% credible interval. | | |
texreg arguments
Although custom.coef.map is not an argument to mcmcReg, it works because mcmcReg supports all standard texreg arguments (a few have been overridden, but they are explicit arguments to mcmcReg). This introduces a high level of control over the output of mcmcReg, as e.g. models can be renamed.
| | Binary Outcome |
|---|---|
| (Intercept) | -0.46* |
| | [-0.62; -0.29] |
| female | 0.24* |
| | [0.01; 0.45] |
| neuroticism | 0.07 |
| | [-0.15; 0.28] |
| extraversion | 0.52* |
| | [0.31; 0.73] |
| * 0 outside 95% credible interval. | |
mcmcAveProb
To evaluate the relationship between covariates and a binary outcome,
this function calculates the predicted probability (Pr(y = 1)) at
pre-defined values of one covariate of interest (x), while all other covariates are
held at a “typical” value. This follows suggestions outlined in King, Tomz, and Wittenberg (2000) and elsewhere,
which are commonly adopted by users of GLMs. The mcmcAveProb function by default calculates the median value of all covariates other than x as “typical” values.
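The logic of the average-case approach can be sketched in a few lines of base R. Everything below is a stand-in: `X` is a simulated model matrix, `B` a simulated matrix of posterior draws, and the code illustrates the idea rather than reproducing the internals of mcmcAveProb.

```r
set.seed(4)
n <- 200
# Stand-in model matrix: intercept, binary covariate, two continuous covariates
X <- cbind(1, rbinom(n, 1, 0.5), rnorm(n, 0, 0.5), rnorm(n, 0, 0.5))
# Stand-in posterior draws: one row per iteration, one column per coefficient
B <- cbind(rnorm(1000, -0.46, 0.08), rnorm(1000, 0.24, 0.11),
           rnorm(1000,  0.06, 0.11), rnorm(1000, 0.52, 0.11))

xcol   <- 2          # column of the covariate of interest
xrange <- c(0, 1)    # values at which to evaluate it

typical <- apply(X, 2, median)   # all covariates at their medians

pp <- sapply(xrange, function(x) {
  case <- typical
  case[xcol] <- x                # overwrite the covariate of interest
  plogis(as.vector(B %*% case))  # one predicted probability per draw
})
colnames(pp) <- paste0("x=", xrange)

# Median and 95% credible interval at each value of x
apply(pp, 2, quantile, probs = c(0.025, 0.5, 0.975))
```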
Before moving on, we show how to create a matrix of posterior draws of coefficients to pass on to these functions. Eventually, each function will contain code similar to the first section of mcmcTab to do this as part of the function.
mcmcmat.jags <- as.matrix(coda::as.mcmc(fit.jags))
mcmcmat.MCMCpack <- as.matrix(fit.MCMCpack)
mcmcmat.rstanarm <- as.matrix(fit.rstanarm)
Next, we generate the model matrix to pass on to the function. A model matrix contains as many columns as estimated regression coefficients. The first column is a vector of 1s (corresponding to the intercept); the remaining columns are the observed values of covariates in the model. Note: the order of columns in the model matrix must correspond to the order of columns in the matrix of posterior draws.
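A minimal sketch of building such a matrix with base R's model.matrix(), shown on a made-up data frame (`toy` and `mm_toy` are hypothetical names). The column order follows the order of terms in the formula, which is what needs to line up with the columns of the posterior draws.

```r
toy <- data.frame(
  volunteer    = c(0, 1, 0, 1, 1),
  female       = c(1, 0, 0, 1, 1),
  neuroticism  = c(-0.3, 0.1, 0.4, -0.2, 0.0),
  extraversion = c(-0.5, 0.1, 0.6, -0.2, 0.0)
)

# First column is the intercept (all 1s), then covariates in formula order
mm_toy <- model.matrix(volunteer ~ female + neuroticism + extraversion, data = toy)

colnames(mm_toy)
head(mm_toy)
```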
We can now generate predicted probabilities for different values of a covariate of interest.
First, we generate full posterior distributions of the predicted probability of volunteering for a typical female and a typical male. In this function and mcmcObsProb, users specify the range of x (here 0 and 1) and the column number of x, which must be the same in the matrix of posterior draws and in the model matrix.
aveprob.female.jags <- mcmcAveProb(modelmatrix = mm,
mcmcout = mcmcmat.jags[, 1:ncol(mm)],
xcol = 2,
xrange = c(0, 1),
link = "logit",
ci = c(0.025, 0.975),
fullsims = TRUE)
Users can then visualize this posterior distribution using the
ggplot2 and ggridges
packages.
library("ggplot2")
library("ggridges")
ggplot(data = aveprob.female.jags,
aes(y = factor(x), x = pp)) +
stat_density_ridges(quantile_lines = TRUE,
quantiles = c(0.025, 0.5, 0.975), vline_color = "white") +
scale_y_discrete(labels = c("Male", "Female")) +
ylab("") +
xlab("Estimated probability of volunteering") +
labs(title = "Probability based on average-case approach") +
theme_minimal()
For continuous variables of interest, users may want to set fullsims = FALSE to obtain the median predicted probability along the range of x as well as a lower and upper bound of choice (here, the 95% credible interval).
aveprob.extra.jags <- mcmcAveProb(modelmatrix = mm,
mcmcout = mcmcmat.jags[, 1:ncol(mm)],
xcol = 4,
xrange = seq(min(df$extraversion), max(df$extraversion), length.out = 20),
link = "logit",
ci = c(0.025, 0.975),
fullsims = FALSE)
Users can then plot the resulting probabilities using any plotting functions, such as ggplot2.
ggplot(data = aveprob.extra.jags,
aes(x = x, y = median_pp)) +
geom_ribbon(aes(ymin = lower_pp, ymax = upper_pp), fill = "gray") +
geom_line() +
xlab("Extraversion") +
ylab("Estimated probability of volunteering") +
ylim(0, 1) +
labs(title = "Probability based on average-case approach") +
theme_minimal()
mcmcObsProb
As an alternative to probabilities for “typical” cases, Hanmer and Kalkan (2013) suggest calculating predicted probabilities for all observed cases and then deriving an “average effect”. In their words, the goal of this postestimation “is to obtain an estimate of the average effect in the population … rather than seeking to understand the effect for the average case.”
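The observed-case logic differs from the average-case logic in one step: instead of evaluating a single typical case, the covariate of interest is set to x for every observation and the resulting probabilities are averaged over observations within each draw. A base R sketch with simulated stand-ins (`X` and `B` are hypothetical, and this is not the package's internal code):

```r
set.seed(4)
n <- 200
# Stand-in model matrix and posterior draws
X <- cbind(1, rbinom(n, 1, 0.5), rnorm(n, 0, 0.5), rnorm(n, 0, 0.5))
B <- cbind(rnorm(1000, -0.46, 0.08), rnorm(1000, 0.24, 0.11),
           rnorm(1000,  0.06, 0.11), rnorm(1000, 0.52, 0.11))

xcol   <- 2
xrange <- c(0, 1)

pp_obs <- sapply(xrange, function(x) {
  X_x <- X
  X_x[, xcol] <- x                  # set x for all observed cases
  probs <- plogis(X_x %*% t(B))     # n x draws matrix of probabilities
  colMeans(probs)                   # average over observations, per draw
})
colnames(pp_obs) <- paste0("x=", xrange)

apply(pp_obs, 2, quantile, probs = c(0.025, 0.5, 0.975))
```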
We first calculate the average “effect” of sex on volunteering, again generating a full posterior distribution. Again, xcol represents the position of the covariate of interest, and xrange specifies the values for which Pr(y = 1) is to be calculated.
obsprob.female.jags <- mcmcObsProb(modelmatrix = mm,
mcmcout = mcmcmat.jags[, 1:ncol(mm)],
xcol = 2,
xrange = c(0, 1),
link = "logit",
ci = c(0.025, 0.975),
fullsims = TRUE)
Users can again plot the resulting densities.
ggplot(data = obsprob.female.jags,
aes(y = factor(x), x = pp)) +
stat_density_ridges(quantile_lines = TRUE,
quantiles = c(0.025, 0.5, 0.975), vline_color = "white") +
scale_y_discrete(labels = c("Male", "Female")) +
ylab("") +
xlab("Estimated probability of volunteering") +
labs(title = "Probability based on observed-case approach") +
theme_minimal()
For this continuous predictor, we use fullsims = FALSE.
obsprob.extra.jags <- mcmcObsProb(modelmatrix = mm,
mcmcout = mcmcmat.jags[, 1:ncol(mm)],
xcol = 4,
xrange = seq(min(df$extraversion), max(df$extraversion), length.out = 20),
link = "logit",
ci = c(0.025, 0.975),
fullsims = FALSE)
We then plot the resulting probabilities across observed cases.
ggplot(data = obsprob.extra.jags,
aes(x = x, y = median_pp)) +
geom_ribbon(aes(ymin = lower_pp, ymax = upper_pp), fill = "gray") +
geom_line() +
xlab("Extraversion") +
ylab("Estimated probability of volunteering") +
ylim(0, 1) +
labs(title = "Probability based on observed-case approach") +
theme_minimal()
mcmcFD
To summarize typical effects across covariates, we generate “first differences” (Long 1997; King, Tomz, and Wittenberg 2000). This quantity represents, for each covariate, the difference in predicted probabilities for cases with low and high values of the respective covariate. For each of these differences, all other variables are held constant at their median.
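As an illustration of the calculation, the sketch below computes first differences by hand from simulated stand-ins. The choice of "low" and "high" values is an assumption made for this example (0 and 1 for binary covariates, the 25th and 75th percentiles otherwise); consult ?mcmcFD for the values the function actually uses.

```r
set.seed(4)
n <- 200
# Stand-in model matrix and posterior draws
X <- cbind(1, rbinom(n, 1, 0.5), rnorm(n, 0, 0.5), rnorm(n, 0, 0.5))
B <- cbind(rnorm(1000, -0.46, 0.08), rnorm(1000, 0.24, 0.11),
           rnorm(1000,  0.06, 0.11), rnorm(1000, 0.52, 0.11))

fd <- sapply(2:ncol(X), function(j) {
  lo <- hi <- apply(X, 2, median)      # other covariates at their medians
  vals <- X[, j]
  if (all(vals %in% c(0, 1))) {        # binary covariate: compare 0 and 1
    lo[j] <- 0
    hi[j] <- 1
  } else {                             # continuous: compare quartiles (assumed)
    q <- quantile(vals, c(0.25, 0.75))
    lo[j] <- q[1]
    hi[j] <- q[2]
  }
  as.vector(plogis(B %*% hi) - plogis(B %*% lo))  # one difference per draw
})
colnames(fd) <- c("female", "neuroticism", "extraversion")

apply(fd, 2, quantile, probs = c(0.025, 0.5, 0.975))
```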
fdfull.jags <- mcmcFD(modelmatrix = mm,
mcmcout = mcmcmat.jags[, 1:ncol(mm)],
link = "logit",
ci = c(0.025, 0.975),
fullsims = TRUE)
summary(fdfull.jags)
#> female neuroticism extraversion
#> Min. :-0.04353 Min. :-0.068797 Min. :0.003177
#> 1st Qu.: 0.03919 1st Qu.:-0.002028 1st Qu.:0.070524
#> Median : 0.05731 Median : 0.011002 Median :0.081951
#> Mean : 0.05762 Mean : 0.011168 Mean :0.081990
#> 3rd Qu.: 0.07595 3rd Qu.: 0.024618 3rd Qu.:0.093703
#> Max. : 0.15385 Max. : 0.093961 Max. :0.141439
The posterior distribution can be summarized as above, or users can directly obtain a summary when setting fullsims to FALSE.
fdsum.jags <- mcmcFD(modelmatrix = mm,
mcmcout = mcmcmat.jags[, 1:ncol(mm)],
link = "logit",
ci = c(0.025, 0.975),
fullsims = FALSE)
fdsum.jags
#> median_fd lower_fd upper_fd VarName VarID
#> female 0.05731141 0.004246989 0.11143110 female 1
#> neuroticism 0.01100154 -0.027664877 0.05055301 neuroticism 2
#> extraversion 0.08195073 0.047543253 0.11530027 extraversion 3
Users can plot the median and credible intervals of the summary of the first differences.
mcmcFD objects
To make use of the full posterior distribution of first differences, we provide a dedicated plotting method, plot.mcmcFD, which returns a ggplot2 object that can be further customized. The function is modeled after Figure 1 in Karreth (2018).
Users can specify a region of practical equivalence and print the
percent of posterior draws to the right or left of the ROPE. If ROPE is
not specified, the figure automatically prints the percent of posterior
draws to the left or right of 0.
The user can further customize the plot.
mcmcRocPrc
One way to assess model fit is to calculate the area under the Receiver Operating Characteristic (ROC) and Precision-Recall curves. A short description of these curves and their utility for model assessment is provided in Beger (2016). The mcmcRocPrc function produces an object with four elements: the area under the ROC curve, the area under the PR curve, and two data frames to plot each curve. When fullsims is set to FALSE, the elements represent the median of the posterior distribution of each quantity.
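For intuition, the area under the ROC curve equals the probability that a randomly chosen positive case receives a higher predicted probability than a randomly chosen negative case. The sketch below computes it from ranks in base R. The helper `auc_roc` is hypothetical and written for illustration; it is not how mcmcRocPrc is implemented.

```r
# Rank-based (Mann-Whitney) area under the ROC curve
auc_roc <- function(y, p) {
  r  <- rank(p)                 # average ranks handle ties
  n1 <- sum(y == 1)             # number of positive cases
  n0 <- sum(y == 0)             # number of negative cases
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

y <- c(0, 0, 1, 1)              # observed outcomes
p <- c(0.1, 0.4, 0.35, 0.8)     # predicted probabilities

auc_roc(y, p)
```

In this small example three of the four positive-negative pairs are ordered correctly, so the area is 0.75. Applying the same function to the predictions from each posterior draw would yield a posterior distribution of the area.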
mcmcRocPrc currently requires an “rjags” object (a model fitted in R2jags) as input. Future package versions will generalize this input to allow for model objects fit with any of the other packages used in BayesPostEst.
fitstats <- mcmcRocPrc(object = fit.jags,
yname = "volunteer",
xnames = c("female", "neuroticism", "extraversion"),
curves = TRUE,
fullsims = FALSE)
Users can then print the area under each curve:
Users can also plot the ROC curve…
ggplot(data = as.data.frame(fitstats, what = "roc"), aes(x = x, y = y)) +
geom_line() +
geom_abline(intercept = 0, slope = 1, color = "gray") +
labs(title = "ROC curve") +
xlab("1 - Specificity") +
ylab("Sensitivity") +
theme_minimal()
… as well as the precision-recall curve.
ggplot(data = as.data.frame(fitstats, what = "prc"), aes(x = x, y = y)) +
geom_line() +
labs(title = "Precision-Recall curve") +
xlab("Recall") +
ylab("Precision") +
theme_minimal()
To plot the posterior distribution of the area under the curves, users set the fullsims argument to TRUE. Unless a user wishes to plot credible intervals around the ROC and PR curves themselves, we recommend keeping curves at FALSE to avoid long computation time.
fitstats.fullsims <- mcmcRocPrc(object = fit.jags,
yname = "volunteer",
xnames = c("female", "neuroticism", "extraversion"),
curves = FALSE,
fullsims = TRUE)
We can then plot the posterior density of the area under each curve.