In this vignette we explore the fit of Gaussian mixture models and the pivotal relabelling method proposed in Egidi et al. (2018b) through the pivmet package. First of all, we load the package:
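library(pivmet)
library(bayesmix)  # also loaded here: the Fishery example below uses its fish dataset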
The pivmet package provides a simple framework to (i) fit univariate and multivariate mixture models within a Bayesian framework and detect the pivotal units, via the piv_MCMC function; and (ii) perform the relabelling step via the piv_rel function.
There are two main functions designed for this task.

The function piv_MCMC:

performs MCMC sampling for Gaussian mixture models using the underlying rjags or rstan packages (chosen by the user through the optional argument software, set to rjags by default). More precisely, the package uses: the JAGSrun function of the bayesmix package for univariate mixture models; the run.jags function of the runjags package for bivariate mixture models; and the stan function of the rstan package for both univariate and bivariate mixture models.
Post-processes the chains and randomly switches their values (Frühwirth-Schnatter 2001).
Builds a co-association matrix for the N units. After H MCMC iterations, the function implements a clustering procedure with k groups; the user may choose between agglomerative and divisive hierarchical clustering through the optional argument clustering. Using the latent formulation for mixture models, we denote by $[z_i]_h$ the group allocation of the i-th unit at the h-th iteration. In such a way, a co-association matrix C for the N statistical units is built, where the single cell $c_{ip}$ is the fraction of times unit i and unit p belong to the same group across the H MCMC iterations:

$$c_{ip} = \frac{n_{ip}}{H}=\frac{1}{H} \sum_{h=1}^{H}|[z_i]_h=[z_p]_h|,$$

where $|\cdot|$ denotes the event indicator and $n_{ip}$ is the number of times that units i and p belong to the same group along the sampling.
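As an illustration of this formula (a minimal sketch, not the package's internal code), C can be computed from an H × N matrix Z of stored group allocations, where Z[h, i] contains $[z_i]_h$:

# Sketch: co-association matrix from an H x N allocation matrix Z,
# where Z[h, i] stores the group of unit i at iteration h
coassociation <- function(Z) {
  H <- nrow(Z)
  N <- ncol(Z)
  C <- matrix(0, N, N)
  for (h in 1:H) {
    # outer(...) is 1 when units i and p share a group at iteration h
    C <- C + outer(Z[h, ], Z[h, ], "==")
  }
  C / H  # fraction of iterations in which each pair is co-clustered
}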
Extracts the pivots, one for each group, which are (pairwise) separated units with (posterior) probability one (that is, the posterior probability that any two of them are in the same group is approximately zero). We denote them by $i_1, \ldots, i_k$. The user may choose among four procedures for extracting the pivotal units with the optional argument piv.criterion. For group j containing $J_j$ units, one can choose:

the unit $i^*$ that maximizes $\sum_{p \in J_j} c_{ip}$ ("maxsumint", the default method);

the unit $i^*$ that maximizes $\sum_{p \in J_j} c_{ip} - \sum_{p \notin J_j} c_{ip}$ ("maxsumdiff");

the unit $i^*$ that minimizes $\sum_{p \notin J_j} c_{ip}$ ("minsumnoint").
These three methods are applied by the internal function piv_sel. Alternatively, when k < 5, the user can set piv.criterion="MUS" (Egidi et al. 2018a), which performs a sequential search of identity submatrices within the matrix C via the internal function MUS.
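To fix ideas, the default criterion can be sketched in a few lines of R; this is an illustration only (not the piv_sel code), assuming a co-association matrix C as above and a vector cl of the N cluster labels:

# Sketch of "maxsumint": within group j, pick the unit with the
# largest co-association with the other members of its group
pivot_maxsumint <- function(C, cl, j) {
  units_j <- which(cl == j)
  sums <- rowSums(C[units_j, units_j, drop = FALSE])
  units_j[which.max(sums)]  # index of the pivotal unit
}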
The function piv_rel:

takes as input the MCMC output from piv_MCMC and returns the relabelled chains and the corresponding posterior estimates.

Consider first a univariate Gaussian mixture model with k components,

$$y_i \sim \sum_{j=1}^{k}\eta_{j}\mathcal{N}(\mu_{j}, \sigma^2_{j}), \quad i=1,\ldots,N,$$

where μ = (μ1, μ2, …, μk) is the vector of the means parameters and z = (z1, z2, …, zN) is an i.i.d. vector of latent variables taking values in {1, 2, …, k} and denoting the group membership of each statistical unit.
Suppose now that yi ∈ ℝ2 and assume that:

$$\boldsymbol{y}_i \sim
\sum_{j=1}^{k}\eta_{j}\mathcal{N}_{2}(\boldsymbol{\mu}_{j},
\boldsymbol{\Sigma})$$

where μj is the mean vector for group j, Σ is a positive-definite covariance matrix and the mixture weights ηj = P(zi = j) are defined as in the one-dimensional case. We may generate Gaussian mixture data through the function piv_sim, specifying the sample size N, the desired number of groups k and the k × 2 matrix of the k mean vectors. The argument W handles the weights for a nested mixture, in which the j-th component is in turn modelled as a two-component mixture with covariance matrices $\Sigma_{p1}$ and $\Sigma_{p2}$, respectively.
set.seed(500)
N <- 200
k <- 3
D <- 2
nMC <- 2000
M1 <- c(-10,8)
M2 <- c(10,.1)
M3 <- c(30,8)
# matrix of input means
Mu <- rbind(M1,M2,M3)
# covariance matrices for the two subgroups
Sigma.p1 <- diag(D)
Sigma.p2 <- (10^2)*diag(D)
# subgroups' weights
W <- c(0.2,0.8)
# simulate data
sim <- piv_sim(N = N, k = k, Mu = Mu,
Sigma.p1 = Sigma.p1, Sigma.p2 = Sigma.p2, W = W)
The function piv_MCMC requires only three mandatory arguments: the data object y, the number of components k and the number of MCMC iterations, nMC. By default, it performs Gibbs sampling using the runjags package. If software="rjags", for bivariate data the priors are specified as:

$$\boldsymbol{\mu}_j \sim \mathcal{N}_2(\boldsymbol{\mu}_0, S_2), \quad \Sigma^{-1} \sim \mbox{Wishart}(S_3, 3), \quad \boldsymbol{\eta} \sim \mbox{Dirichlet}(\boldsymbol{\alpha}),$$

where α is a k-dimensional vector and S2 and S3 are positive definite matrices. By default, μ0 = 0, α = (1, …, 1), and S2 and S3 are diagonal matrices with all diagonal elements equal to 1e+05. Different values can be specified for the hyperparameters μ0, S2, S3 and α:

priors = list(mu_0 = c(1,1), S2 = ..., S3 = ..., alpha = ...),

with the constraint that S2 and S3 be positive definite, and α a vector of dimension k with nonnegative elements.
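For instance, with the bivariate data simulated above, a call with user-specified hyperparameters could look as follows (the values are purely illustrative):

# Illustrative hyperparameter values only (not the defaults):
# S2 and S3 must be positive definite, alpha nonnegative of length k
priors <- list(mu_0 = c(1, 1),
               S2 = diag(2),
               S3 = diag(2),
               alpha = rep(1, k))
# res <- piv_MCMC(y = sim$y, k = k, nMC = nMC, priors = priors)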
If software="rstan", the function performs Hamiltonian Monte Carlo (HMC) sampling. In this case the priors are specified as:

$$\boldsymbol{\mu}_j \sim \mathcal{N}_2(\boldsymbol{\mu}_0, LD^*L^{T}), \quad L \sim \mbox{LKJ}(\epsilon), \quad D^*_1, D^*_2 \sim \mbox{HalfCauchy}(0, \sigma_d).$$

The covariance matrix is expressed in terms of the LDL decomposition as $LD^*L^{T}$, a variant of the classical Cholesky decomposition, where L is a 2 × 2 lower unit triangular matrix and D* is a 2 × 2 diagonal matrix. The Cholesky correlation factor L is assigned an LKJ prior with ϵ degrees of freedom, which, combined with priors on the standard deviations of each component, induces a prior on the covariance matrix; as ϵ → ∞ the magnitude of the correlations between components decreases, whereas ϵ = 1 leads to a uniform prior distribution for L. By default, the hyperparameters are μ0 = 0, σd = 2.5 and ϵ = 1. Different values can be chosen with the argument priors=list(mu_0=c(1,2), sigma_d = 4, epsilon = 2).
We fit the model using rjags
with 2000 MCMC iterations
and default priors:
res <- piv_MCMC(y = sim$y, k= k, nMC =nMC,
piv.criterion = "maxsumdiff")
#> Compiling rjags model...
#> Calling the simulation using the rjags method...
#> Note: the model did not require adaptation
#> Burning in the model for 1000 iterations...
#> Running the model for 2000 iterations...
#> Simulation complete
#> Note: Summary statistics were not produced as there are >50 monitored
#> variables
#> [To override this behaviour see ?add.summary and ?runjags.options]
#> Finished running the simulation
#> Calculating summary statistics...
#> Calculating the Gelman-Rubin statistic for 13 variables....
#> Note: Unable to calculate the multivariate psrf
#>
#> JAGS model summary statistics from 8000 samples (chains = 4; adapt+burnin = 2000):
#>
#> Lower95 Median Upper95 Mean SD Mode MCerr
#> mu[1,1] -14.343 -11.136 9.1136 -8.8402 6.231 -- 0.29258
#> mu[2,1] -13.986 9.6654 34.629 10.775 65.87 -- 1.5044
#> mu[3,1] 7.7701 30.33 34.192 25.702 9.1842 -- 0.40985
#> mu[1,2] 0.50365 6.6321 9.8362 6.1283 2.3399 -- 0.045983
#> mu[2,2] -12.795 1.3264 9.7896 1.2133 64.74 -- 1.2724
#> mu[3,2] -0.87714 5.0132 8.8753 4.5723 2.6562 -- 0.073429
#> tau[1,1] 0.006311 0.017749 0.023976 0.017049 0.0045357 -- 0.00021408
#> tau[2,1] -0.0027099 0.00075815 0.0043552 0.00075746 0.001809 -- 0.000055876
#> tau[1,2] -0.0027099 0.00075815 0.0043552 0.00075746 0.001809 -- 0.000055876
#> tau[2,2] 0.0086464 0.011031 0.013847 0.011174 0.0013948 -- 0.000032515
#> eta[1] 0.21445 0.34258 0.59209 0.3644 0.093704 -- 0.0035267
#> eta[2] 8.4891e-06 0.37217 0.48482 0.33774 0.12236 -- 0.0042874
#> eta[3] 0.17376 0.27011 0.47148 0.29786 0.086707 -- 0.0026089
#>
#> MC%ofSD SSeff AC.10 psrf
#> mu[1,1] 4.7 454 0.49328 1.8804
#> mu[2,1] 2.3 1917 0.29832 1.3093
#> mu[3,1] 4.5 502 0.49684 3.6855
#> mu[1,2] 2 2589 0.21361 1.5507
#> mu[2,2] 2 2589 0.24792 1.2906
#> mu[3,2] 2.8 1308 0.2364 1.7746
#> tau[1,1] 4.7 449 0.4167 1.1463
#> tau[2,1] 3.1 1048 0.14735 1.0111
#> tau[1,2] 3.1 1048 0.14735 1.0111
#> tau[2,2] 2.3 1840 0.15556 1.0179
#> eta[1] 3.8 706 0.38422 1.4652
#> eta[2] 3.5 815 0.376 1.486
#> eta[3] 3 1105 0.3325 1.9083
#>
#> Total time taken: 8.6 seconds
Once we obtain the posterior estimates, label switching is likely to occur. For this reason, we need to relabel our chains as explained above. To relabel the chains, the function piv_rel can be used; it only needs the mcmc = res argument.
Relabelled outputs can be displayed through the function piv_plot, with different options for the argument type:

chains: plots the relabelled chains;

hist: plots the point estimates against the histogram of the data.

The optional argument par takes four possible alternative choices: mean, sd, weight and all, for the means, the standard deviations, the weights or all three of these parameters, respectively. By default, par="all".
rel <- piv_rel(mcmc=res)
piv_plot(y = sim$y, mcmc = res, rel_est = rel, par = "mean", type = "chains")
#> Description: traceplot of the raw MCMC chains and the relabelled chains for the means parameters (coordinate 1 and 2). Each colored chain corresponds to one of the k distinct parameters of the mixture model. Overlapping chains may reveal that the MCMC sample is not able to distinguish between the components.
piv_plot(y = sim$y, mcmc = res, rel_est = rel, type = "hist")
#> Description: 3d histogram of the data along with the posterior estimates of the relabelled means (triangle points)
The Fishery dataset has been previously used by Titterington, Smith, and Makov (1985) and Papastamoulis (2016) and consists of 256 snapper
length measurements. It is contained in the bayesmix
package (Grün 2011). We may display the
histogram of the data, along with an estimated kernel density.
data(fish)
y <- fish[,1]
hist(y, breaks=40, prob = TRUE, cex.lab=1.6,
main ="Fishery data", cex.main =1.7,
col="navajowhite1", border="navajowhite1")
lines(density(y), lty=1, lwd=3, col="blue")
We assume a mixture model with k = 5 groups:

$$y_i \sim \sum_{j=1}^{k}\eta_{j}\mathcal{N}(\mu_{j}, \sigma^2_{j}),$$

where μj, σj are the mean and the standard deviation of group j, respectively. Moreover, assume that z1, …, zN is an unobserved latent sequence of i.i.d. random variables following the multinomial distribution with weights η = (η1, …, ηk), such that:

P(zi = j) = ηj,

where ηj is the mixture weight assigned to group j.
We fit our model by simulating H = 15000 samples from the posterior
distribution of (z, μ, σ, η).
In the univariate case, if the argument software="rjags" is selected (the default option), Gibbs sampling is performed by the package bayesmix, and the priors are:

$$\mu_j \sim \mathcal{N}(\mu_0, 1/B_0), \quad \sigma^2_j \sim \mbox{invGamma}(\nu_0/2, \nu_0 S_0/2), \quad S_0 \sim \mbox{Gamma}(g_0/2, G_0/2), \quad \boldsymbol{\eta} \sim \mbox{Dirichlet}(\boldsymbol{\alpha}),$$

with default values: B0 = 0.1, ν0 = 20, $g_0 = 10^{-16}$, $G_0 = 10^{-16}$ and α = (1, 1, …, 1). The users may specify their own hyperparameters with the priors argument, declaring a named list such as priors = list(mu_0 = 2, alpha = rep(2, k), ...).
If software="rstan" is selected, the priors are:

$$\mu_j \sim \mathcal{N}(\mu_0, 1/B_{0inv}), \quad \sigma_j \sim \mbox{Lognormal}(\mu_\sigma, \tau_\sigma), \quad \eta_j \sim \mbox{Uniform}(0,1),$$

where the vector of the weights η = (η1, …, ηk) is a k-simplex. Default hyperparameter values are: μ0 = 0, B0inv = 0.1, μσ = 0 and τσ = 2. Here also, the users may choose their own hyperparameter values in the following way: priors = list(mu_sigma = 0, tau_sigma = 1, ...).
We fit the model using the rjags method, and we set the burn-in period to 7500.
k <- 5
nMC <- 15000
res <- piv_MCMC(y = y, k = k, nMC = nMC,
burn = 0.5*nMC, software = "rjags")
#> Compiling model graph
#> Declaring variables
#> Resolving undeclared variables
#> Allocating nodes
#> Graph information:
#> Observed stochastic nodes: 256
#> Unobserved stochastic nodes: 268
#> Total graph size: 1050
#>
#> Initializing model
#>
#>
#> Call:
#> JAGSrun(y = y, model = mod.mist.univ, control = control)
#>
#> Markov Chain Monte Carlo (MCMC) output:
#> Start = 7501
#> End = 22500
#> Thinning interval = 1
#>
#> Empirical mean, standard deviation and 95% CI for eta
#> Mean SD 2.5% 97.5%
#> eta[1] 0.2030 0.20230 0.009156 0.5672
#> eta[2] 0.2006 0.13765 0.009236 0.5198
#> eta[3] 0.1674 0.12353 0.018239 0.5051
#> eta[4] 0.1217 0.05231 0.071350 0.1988
#> eta[5] 0.3073 0.22254 0.009916 0.5752
#>
#> Empirical mean, standard deviation and 95% CI for mu
#> Mean SD 2.5% 97.5%
#> mu[1] 8.386 2.7806 3.377 12.341
#> mu[2] 8.413 2.1134 5.178 12.347
#> mu[3] 8.369 1.5851 5.201 10.862
#> mu[4] 3.472 0.6172 3.126 5.368
#> mu[5] 7.358 2.7863 5.069 12.292
#>
#> Empirical mean, standard deviation and 95% CI for sigma2
#> Mean SD 2.5% 97.5%
#> sigma2[1] 0.4473 0.2510 0.1932 1.1808
#> sigma2[2] 0.4387 0.2108 0.2009 1.0234
#> sigma2[3] 0.4526 0.2415 0.2021 1.1223
#> sigma2[4] 0.2606 0.1128 0.1280 0.5479
#> sigma2[5] 0.4132 0.2442 0.2035 1.1182
First of all, we may access the true number of iterations by typing:
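res$true.iter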
We may get a glimpse of whether label switching occurred by looking at the traceplots of the mean parameters, μj. To do this, we apply the function piv_rel to relabel the chains and obtain useful inferences; the only argument for this function is the MCMC result just obtained with piv_MCMC. The function piv_plot displays some graphical tools, both traceplots (argument type="chains") and histograms along with the final relabelled means (argument type="hist"). For both plot types, the function returns a printed message explaining how to interpret the results.
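rel <- piv_rel(mcmc = res)
piv_plot(y = y, mcmc = res, rel_est = rel, par = "mean", type = "chains")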
#> Description: traceplot of the raw MCMC chains and the relabelled chains for the means parameters. Each colored chain corresponds to one of the k distinct parameters of the mixture model. Overlapping chains may reveal that the MCMC sample is not able to distinguish between the components.
piv_plot(y=y, res, rel, type="hist")
#> Description: histograms of the data along with the estimated posterior means (red points) from raw MCMC and relabelling algorithm. The blue line is the estimated density curve.
The first plot displays the traceplots for the parameters μ. From the left plot, showing the raw outputs as given by the Gibbs sampling, we note that label switching clearly occurred. Our algorithm seems able to reorder the means μj and the weights ηj, for j = 1, …, k. Of course, an MCMC sampler which does not switch the labels would be ideal, but nearly impossible to program. However, we can assess how two different samplers perform by repeating the analysis above with software="rstan" in the piv_MCMC function.
# stan code not evaluated here
res2 <- piv_MCMC(y = y, k = k, nMC = 3000,
software = "rstan")
rel2 <- piv_rel(res2)
piv_plot(y=y, res2, rel2, par = "mean", type="chains")
With the rstan option, we can use the bayesplot functions on the stanfit object stored in the output:
# stan code not evaluated here
library(bayesplot)  # provides mcmc_intervals()
posterior <- as.array(res2$stanfit)
mcmc_intervals(posterior, regex_pars = c("mu"))
Regardless of the chosen software, we may extract the underlying JAGS/Stan model code by typing:
cat(res$model)
#> var
#> b0,
#> B0inv,
#> nu0Half,
#> g0Half,
#> g0G0Half,
#> k,
#> N,
#> eta[5],
#> mu[5],
#> tau[5],
#> nu0S0Half,
#> S0,
#> e[5],
#> y[256],
#> S[256];
#>
#> model {
#> for (i in 1:N) {
#> y[i] ~ dnorm(mu[S[i]],tau[S[i]]);
#> S[i] ~ dcat(eta[]);
#> }
#> for (j in 1:k) {
#> mu[j] ~ dnorm(b0,B0inv);
#> tau[j] ~ dgamma(nu0Half,nu0S0Half);
#> }
#> S0 ~ dgamma(g0Half,g0G0Half);
#> nu0S0Half <- nu0Half * S0;
#>
#> eta[] ~ ddirch(e[]);
#> }
In order to estimate the number of clusters in the data, we can now fit sparse finite mixtures as proposed by Frühwirth-Schnatter and Malsiner-Walli (2019) by assuming:

η ∼ Dirichlet(e0),

where the smaller e0, the smaller the number of clusters a posteriori. The function also allows for a Gamma prior on e0, with hyperparameters ae, be that may be chosen by the users.
res3 <- piv_MCMC(y = y, k = k, nMC = nMC, sparsity = TRUE,
priors = list(alpha = rep(0.001, k))) # sparse on eta
#> Compiling model graph
#> Declaring variables
#> Resolving undeclared variables
#> Allocating nodes
#> Graph information:
#> Observed stochastic nodes: 256
#> Unobserved stochastic nodes: 268
#> Total graph size: 1050
#>
#> Initializing model
#>
#>
#> Call:
#> JAGSrun(y = y, model = mod.mist.univ, control = control)
#>
#> Markov Chain Monte Carlo (MCMC) output:
#> Start = 7501
#> End = 22500
#> Thinning interval = 1
#>
#> Empirical mean, standard deviation and 95% CI for eta
#> Mean SD 2.5% 97.5%
#> eta[1] 3.674e-05 0.0008714 0.00000 1.086e-12
#> eta[2] 7.515e-01 0.1190594 0.48700 9.288e-01
#> eta[3] 3.925e-04 0.0058635 0.00000 3.997e-11
#> eta[4] 2.480e-01 0.1190932 0.07107 5.130e-01
#> eta[5] 2.549e-05 0.0007312 0.00000 2.014e-14
#>
#> Empirical mean, standard deviation and 95% CI for mu
#> Mean SD 2.5% 97.5%
#> mu[1] 5.677 3.1578 -0.4744 11.816
#> mu[2] 5.425 0.2327 4.9587 5.846
#> mu[3] 5.613 3.1466 -0.5620 11.721
#> mu[4] 8.394 0.8106 6.9791 10.047
#> mu[5] 5.633 3.1208 -0.4693 11.684
#>
#> Empirical mean, standard deviation and 95% CI for sigma2
#> Mean SD 2.5% 97.5%
#> sigma2[1] 2.438 1.1416 0.9608 5.312
#> sigma2[2] 1.814 0.3876 1.1119 2.623
#> sigma2[3] 2.436 1.1367 0.9649 5.276
#> sigma2[4] 3.002 0.8899 1.5451 4.944
#> sigma2[5] 2.440 1.1286 0.9741 5.291
#> [1] "MCMC has not never been able to identify the required number of groups and the process has been interrupted"
barplot(table(res3$nclusters), xlab= expression(K["+"]),
col = "blue", border = "red", main = expression(paste("p(",K["+"], "|y)")),
cex.main=3, yaxt ="n", cex.axis=2.4, cex.names=2.4,
cex.lab=2)