The task of model selection
targets the question: If there are several competing models, how do I
choose the most appropriate one? This vignette1 outlines the model
selection tools implemented in {RprobitB}
.
For illustration, we revisit the probit model of travelers deciding between two fictional train route alternatives from the vignette on model fitting:
form <- choice ~ price + time + change + comfort | 0
data <- prepare_data(form = form, choice_data = train_choice, id = "deciderID", idc = "occasionID")
model_train <- fit_model(
data = data,
scale = "price := -1",
R = 1000, B = 500, Q = 10
)
As a competing model, we consider explaining the choices only by the
alternative’s price, i.e. the probit model with the formula
choice ~ price | 0
:
The update()
function helps to estimate a new version of
model_train
with new specifications. Here, only
form
has been changed.
model_selection()
function{RprobitB}
provides the convenience function
model_selection()
, which takes an arbitrary number of
RprobitB_fit
objects and returns a matrix of model
selection criteria:
model_selection(model_train, model_train_sparse)
#> model_train model_train_sparse
#> npar 4 1
#> LL -1727.71 -1865.86
#> AIC 3463.42 3733.72
#> BIC 3487.35 3739.70
Specifying criteria
is optional. Per default,
criteria = c("npar", "LL", "AIC", "BIC")
.2 The available model
selection criteria are explained in the following.
npar
"npar"
yields the number of model parameters, which is
computed by the npar()
method:
Here, model_train
has 4 parameters (a coefficient for
price, time, change, and comfort, respectively), while
model_train_sparse
has only a single price coefficient.
LL
If "LL"
is included in criteria
,
model_selection()
returns the model’s log-likelihood
values. They can also be directly accessed via the logLik()
method:3
AIC
Including "AIC"
yields the Akaike’s Information
Criterion (Akaike
1974), which is computed as −2 ⋅ LL + k ⋅ npar, where LL is the model’s
log-likelihood value, k is
the penalty per parameter with k = 2 per default for the classical
AIC, and npar is the number of
parameters in the fitted model.
Alternatively, the AIC()
method also returns the AIC
values:
AIC(model_train, model_train_sparse, k = 2)
#> df AIC
#> model_train 4 3463.419
#> model_train_sparse 1 3733.720
The AIC quantifies the trade-off between over- and under-fitting,
where smaller values are preferred. Here, the increase in goodness of
fit justifies the additional 3 parameters of
model_train
.
BIC
Similar to the AIC, "BIC"
yields the Bayesian
Information Criterion (Schwarz 1978), which is defined as
−2 ⋅ LL + log (nobs) ⋅ npar, where
LL is the model’s
log-likelihood value, nobs is the
number of data points (which can be accessed via the nobs()
method), and npar is the number of parameters in the fitted model. The
interpretation is analogue to AIC.
{RprobitB}
also provided a method for the
BIC
value:
WAIC
(with se(WAIC)
and
pWAIC
)WAIC is short for Widely Applicable (or Watanabe-Akaike) Information
Criterion (Watanabe
and Opper 2010). As for AIC and BIC, the smaller the WAIC
value the better the model. Including "WAIC"
in
criteria
yields the WAIC value, its standard error
se(WAIC)
, and the effective number of parameters
pWAIC
, see below.
The WAIC is defined as
−2 ⋅ lppd + 2 ⋅ pWAIC,
where lppd stands for log pointwise predictive density and pWAIC is a penalty term proportional to the variance in the posterior distribution that is sometimes called effective number of parameters, see McElreath (2020) p. 220 for a reference.
The lppd is approximated as follows. Let psi = Pr (yi ∣ θs) be the probability of observation yi (here the single choices) given the s-th set θs of parameter samples from the posterior. Then
lppd = ∑ilog (S−1∑spsi). The penalty term is computed as the sum over the variances in log-probability for each observation: pWAIC = ∑i𝕍θlog (psi). The WAIC has a standard error of $$\sqrt{n \cdot \mathbb{V}_i \left[-2 \left(\text{lppd} - \mathbb{V}_{\theta} \log (p_{si}) \right)\right]},$$ where n is the number of choices.
Before computing the WAIC of an object, the probabilities psi
must be computed via the compute_p_si()
function: 4
Afterwards, the WAIC can be accessed as follows:
MMLL
"MMLL"
in criteria
stands for marginal
model log-likelihood. The model’s marginal likelihood Pr (y ∣ M) for a model
M and data y is required for the computation of
Bayes factors, see below. In general, the term has no closed form and
must be approximated numerically.
{RprobitB}
uses the posterior Gibbs samples derived from
the mcmc()
function to approximate the model’s marginal
likelihood via the posterior harmonic mean estimator (Newton and Raftery
1994): Let S denote
the number of available posterior samples θ1, …, θS.
Then, $$\Pr(y\mid M) =
\left(\mathbb{E}_\text{posterior} 1/\Pr(y\mid \theta,M) \right)^{-1}
\approx \left( \frac{1}{S} \sum_s 1/\Pr(y\mid \theta_s,M) \right) ^{-1}
= \tilde{\Pr}(y\mid M).$$
By the law of large numbers, $\tilde{\Pr}(y\mid M) \to \Pr(y\mid M)$ almost surely as S → ∞.
As for the WAIC, computing the MMLL relies on the
probabilities psi = Pr (yi ∣ θs),
which must first be computed via the compute_p_si()
function. Afterwards, the mml()
function can be called with
an RprobitB_fit
object as input. The function returns the
RprobitB_fit
object, where the marginal likelihood value is
saved as the entry "mml"
and the marginal log-likelihood
value as the attribute "mmll"
:5
There are two options for improving the approximation: As for the WAIC, you can use more posterior samples. Alternatively, you can combine the posterior harmonic mean estimate with the prior arithmetic mean estimator (Hammersley and Handscomb 1964): For this approach, S samples θ1, …, θS are drawn from the model’s prior distribution. Then,
$$\Pr(y\mid M) = \mathbb{E}_\text{prior} \Pr(y\mid \theta,M) \approx \frac{1}{S} \sum_s \Pr(y\mid \theta_s,M) = \tilde{\Pr}(y\mid M).$$
Again, it hols by the law of large numbers, that $\tilde{\Pr}(y\mid M) \to \Pr(y\mid M)$ almost surely as S → ∞. The final approximation of the model’s marginal likelihood than is a weighted sum of the posterior harmonic mean estimate and the prior arithmetic mean estimate, where the weights are determined by the sample sizes.
To use the prior arithmetic mean estimator, call the
mml()
function with a specification of the number of prior
draws S
and set recompute = TRUE
:6
Note that the prior arithmetic mean estimator works well if the prior
and posterior distribution have a similar shape and strong overlap, as
Gronau et al. (2017) points out. Otherwise, most of
the sampled prior values result in a likelihood value close to zero,
thereby contributing only marginally to the approximation. In this case,
a very large number S
of prior samples is required.
BF
The Bayes factor is an index of relative posterior model plausibility
of one model over another (Marin and Robert 2014). Given data
y
and two models mod0
and mod1
, it is defined as
$$ BF(\texttt{mod0},\texttt{mod1}) = \frac{\Pr(\texttt{mod0} \mid \texttt{y})}{\Pr(\texttt{mod1} \mid \texttt{y})} = \frac{\Pr(\texttt{y} \mid \texttt{mod0} )}{\Pr(\texttt{y} \mid \texttt{mod1})} / \frac{\Pr(\texttt{mod0})}{\Pr(\texttt{mod1})}. $$
The ratio Pr (mod0
)/Pr (mod1
)
expresses the factor by which mod0
a priori is assumed to be
the correct model. Per default, {RprobitB}
sets Pr (mod0
) = Pr (mod1
) = 0.5.
The front part Pr (y
∣ mod0
)/Pr (y
∣ mod1
)
is the ratio of the marginal model likelihoods. A
value of BF(mod0
, mod1
) > 1
means that the model mod0
is more strongly supported by the data under consideration than mod1
.
Adding "BF"
to the criteria
argument of
model_selection
yields the Bayes factors. We find a
decisive evidence (Jeffreys 1998) in favor of
model_train
.
pred_acc
Finally, adding "pred_acc"
to the criteria
argument for the model_selection()
function returns the
share of correctly predicted choices. From the output of
model_selection()
above (or alternatively the one in the
following) we deduce that model_train
correctly predicts
about 6% of the choices more than model_train_sparse
:7
This vignette is built using R 4.4.2 with the
{RprobitB}
1.1.4 package.↩︎
To show the model formulas in the output of
model_selection()
, add the argument
add_form = TRUE
.↩︎
The log-likelihood values are per default computed at
the point estimates derived from the Gibbs sample means.
logLik()
has the argument par_set
, where
alternative statistics for the point estimate can be specified.↩︎
This computation has been outsourced because it is very
time consuming. For example, the computation for
model_train
took about 5 minutes. To decrease the
computation time, the function offers parallelization via specifying the
number ncores
of parallel CPU cores.↩︎
Note that the marginal likelihood value is very small. The given representation is required so that the value is not rounded to 0 by the computer.↩︎
The mml()
function offers parallelization
via specifying the number ncores
of parallel CPU cores.↩︎
See the vignette on choice prediction for more nuanced performance comparison in terms of sensitivity and specificity.↩︎