Dealing with label switching: relabelling in Bayesian mixture models by pivotal units

In this vignette we explore the fit of Gaussian mixture models and the relabelling pivotal method proposed in Egidi et al. (2018b) through the pivmet package. First of all, we load the package:

library(pivmet)
library(bayesmix)
library(bayesplot)

The pivmet package provides a simple framework to (i) fit univariate and multivariate mixture models according to a Bayesian flavour and detect the pivotal units, via the piv_MCMC function; (ii) perform the relabelling step via the piv_rel function.

There are two main functions used for this task.

The function piv_MCMC:

  • performs MCMC sampling for Gaussian mixture models using the underlying rjags or rstan packages (chosen by the users through the optional function argument software, by default set to rjags). Precisely the package uses: the JAGSrun function of the bayesmix package for univariate mixture models; the run.jags function of the runjags package for bivariate mixture models; the stan function of the rstan package for both univariate and bivariate mixture models.

  • Post-processes the chains and randomly switches their values (Frühwirth-Schnatter 2001).

  • Builds a co-association matrix for the N units. After H MCMC iterations, the function implements a clustering procedure with k groups, the user may choose among agglomerative or divisive hierarchical clustering through the optional argument clustering. Using the latent formulation for mixture models, we denote with [zi]h the group allocation of the i-th unit at the h-th iteration. In such a way, a co-association matrix C for the N statistical units is built, where the single cell cip is the fraction of times the unit i and the unit p belong to the same group along the H MCMC iterations:

$$c_{ip} = \frac{n_{ip}}{H}=\frac{1}{H} \sum_{h=1}^{H}|[z_i]_h=[z_p]_h|,$$ where |⋅| denotes the event indicator and nip is the number of times the units ip belong to the same group along the sampling.

  • Extracts the pivots, one for each group, which are (pairwise) separated units with (posterior) probability one (that is, the posterior probability of any two of them being in the same group is approximately zero). We denote them by i1, …, ik. The user may choose among four procedures for extracting the pivotal units with the optional argument piv.criterion. For group j containing Jj units, one can choose:

  • i* that maximizes p ∈ Jjcip ("maxsumint", default method);

  • i* that maximizes p ∈ Jjcip − ∑p ∉ Jjcip ("maxsumdiff");

  • i* that minimizes p ∉ Jjcip ("minsumnoint").

These three methods are applied by the internal function piv_sel. Alternatively, when k < 5, the user can set piv.criterion="MUS" (Egidi et al. 2018a) which performs a sequential search of identity submatrices within the matrix C via the internal function MUS.

The function piv_rel:

  • performs the relabelling algorithm using the k pivotal units as groups identifiers. The pivotal units previously detected play a central role, yielding to the following relabelling for the h-th iteration:

where μ = (μ1, μ2, …, μk) is the vector of the means parameters and z = (z1, z2, …, zN) an i.i.d. vector of latent variables taking values in {1, 2, …, k} and denoting the group membership of each statistical unit.

piv_rel takes as input the MCMC output from piv_MCMC and returns the relabelled chains and the corresponding posterior estimates.

Example: bivariate Gaussian data

Suppose now that yi ∈ ℝ2 and assume that:

$$\boldsymbol{y}_i \sim \sum_{j=1}^{k}\eta_{j}\mathcal{N}_{2}(\boldsymbol{\mu}_{j}, \boldsymbol{\Sigma})$$ where μj is the mean vector for group j, Σ is a positive-definite covariance matrix and the mixture weight ηj = P(zi = j) as for the one-dimensional case. We may generate Gaussian mixture data through the function piv_sim, specifying the sample size N, the desired number of groups k and the k × 2 matrix for the k mean vectors. The argument W handles the weights for a nested mixture, in which the j-th component is in turn modelled as a two-component mixture, with covariance matrices Σp1, Σp2, respectively.

set.seed(500)
N  <- 200
k  <- 3
D <- 2
nMC <- 2000
M1 <- c(-10,8)
M2 <- c(10,.1)
M3 <- c(30,8)
# matrix of input means
Mu <- rbind(M1,M2,M3)
# covariance matrices for the two subgroups
Sigma.p1 <- diag(D)
Sigma.p2 <- (10^2)*diag(D)
# subgroups' weights
W   <- c(0.2,0.8)
# simulate data
sim <- piv_sim(N = N, k = k, Mu = Mu,
 Sigma.p1 = Sigma.p1, Sigma.p2 = Sigma.p2, W = W)

The function piv_MCMC requires only three mandatory arguments: the data object y, the number of components k and the number of MCMC iterations, nMC. By default, it performs Gibbs sampling using the runjags package. If software="rjags", for bivariate data the priors are specified as:

where α is a k-dimensional vector and S2 and S3 are positive definite matrices. By default, μ0 = 0, α = (1, …, 1) and S2 and S3 are diagonal matrices, with diagonal elements equal to 1e+05. Different values can be specified for the hyperparameters μ0, S2, S3 and α: priors =list(mu_0 = c(1,1), S2 = ..., S3 = ..., alpha = ...)}, with the constraint for S2 and S3 to be positive definite, and α a vector of dimension k with nonnegative elements.

If software="rstan", the function performs Hamiltonian Monte Carlo (HMC) sampling. In this case the priors are specified as:

The covariance matrix is expressed in terms of the LDL decomposition as LD*LT, a variant of the classical Cholesky decomposition, where L is a 2 × 2 lower unit triangular matrix and D* is a 2 × 2 diagonal matrix. The Cholesky correlation factor L is assigned a LKJ prior with ϵ degrees of freedom, which, combined with priors on the standard deviations of each component, induces a prior on the covariance matrix; as ϵ → ∞ the magnitude of correlations between components decreases, whereas ϵ = 1 leads to a uniform prior distribution for L. By default, the hyperparameters are μ0 = 0, σd = 2.5, ϵ = 1. Different values can be chosen with the argument: priors=list(mu_0=c(1,2), sigma_d = 4, epsilon = 2).

We fit the model using rjags with 2000 MCMC iterations and default priors:

res <- piv_MCMC(y = sim$y, k= k, nMC =nMC, 
                piv.criterion = "maxsumdiff")
#> Compiling rjags model...
#> Calling the simulation using the rjags method...
#> Note: the model did not require adaptation
#> Burning in the model for 1000 iterations...
#> Running the model for 2000 iterations...
#> Simulation complete
#> Note: Summary statistics were not produced as there are >50 monitored
#> variables
#> [To override this behaviour see ?add.summary and ?runjags.options]
#> FALSEFinished running the simulation
#> Calculating summary statistics...
#> Calculating the Gelman-Rubin statistic for 13 variables....
#> 
#> JAGS model summary statistics from 8000 samples (chains = 4; adapt+burnin = 2000):
#>                                                                              
#>             Lower95     Median   Upper95      Mean        SD Mode       MCerr
#> mu[1,1]     -14.135    -11.156    10.605   -7.2316    8.4021   --     0.29515
#> mu[2,1]     -13.142     8.6973    12.232    4.7909    8.7763   --     0.37256
#> mu[3,1]       27.94     31.344    34.531    31.218      1.71   --    0.053275
#> mu[1,2]    -0.66507     6.4625    9.3868    5.6309    2.8147   --    0.054661
#> mu[2,2]     -2.3335     1.0361    8.8759    1.6809    3.2758   --    0.083314
#> mu[3,2]      2.4337     5.6202    9.1944    5.6425    1.7334   --    0.041323
#> tau[1,1]   0.011142   0.018278   0.02511  0.018071 0.0035245   --  0.00013041
#> tau[2,1] -0.0027288 0.00073078 0.0044106 0.0008006 0.0018083   -- 0.000053753
#> tau[1,2] -0.0027288 0.00073078 0.0044106 0.0008006 0.0018083   -- 0.000053753
#> tau[2,2]  0.0086119   0.010997    0.0137  0.011128 0.0014085   -- 0.000026624
#> eta[1]      0.23767    0.34305   0.49384   0.35462  0.071855   --   0.0025272
#> eta[2]      0.25762    0.40613   0.52024   0.39523  0.076722   --   0.0023231
#> eta[3]      0.17945    0.24846   0.32666   0.25015  0.037895   --  0.00081852
#>                                       
#>          MC%ofSD SSeff    AC.10   psrf
#> mu[1,1]      3.5   810  0.40177 3.1435
#> mu[2,1]      4.2   555   0.3735 3.0259
#> mu[3,1]      3.1  1030  0.20395 1.0128
#> mu[1,2]      1.9  2652  0.18752 2.3172
#> mu[2,2]      2.5  1546  0.36457 1.5882
#> mu[3,2]      2.4  1760 0.095081 1.0036
#> tau[1,1]     3.7   730  0.26706 1.0139
#> tau[2,1]       3  1132  0.14037 1.0156
#> tau[1,2]       3  1132  0.14037 1.0156
#> tau[2,2]     1.9  2799  0.18954  1.022
#> eta[1]       3.5   808  0.31168 1.4705
#> eta[2]         3  1091  0.30329 1.4358
#> eta[3]       2.2  2143 0.044233 1.0014
#> 
#> Total time taken: 8.1 seconds

Once we obtain posterior estimates, label switching is likely to occurr. For such a reason, we need to relabel our chains as explained above. In order to relabel the chains, the function piv_rel can be used, which only needs the mcmc = res argument. Relabelled outputs can be displayed through the function piv_plot, with different options for the argument type:

  • chains: plot the relabelled chains;
  • hist: plot the point estimates against the histogram of the data.

The optional argument par takes four possible alternative choices: mean, sd, weight and all for the means, standard deviations, weights or all the three mentioned parameters, respectively. By default, par="all".

rel <- piv_rel(mcmc=res)
piv_plot(y = sim$y, mcmc = res, rel_est = rel, par = "mean", type = "chains")
#> Description: traceplot of the raw MCMC chains and the relabelled chains for the means parameters (coordinate 1 and 2). Each colored chain corresponds to one of the k distinct parameters of the mixture model. Overlapping chains may reveal that the MCMC sample is not able to distinguish between the components.
piv_plot(y = sim$y, mcmc = res, rel_est = rel, type = "hist")
#> Description: 3d histogram of the data along with the posterior estimates of the relabelled means (triangle points)

Example: fishery data

The Fishery dataset has been previously used by Titterington, Smith, and Makov (1985) and Papastamoulis (2016) and consists of 256 snapper length measurements. It is contained in the bayesmix package (Grün 2011). We may display the histogram of the data, along with an estimated kernel density.

data(fish)
y <- fish[,1]
hist(y, breaks=40, prob = TRUE, cex.lab=1.6,
             main ="Fishery data", cex.main =1.7,
             col="navajowhite1", border="navajowhite1")
 lines(density(y), lty=1, lwd=3, col="blue")

We assume a mixture model with k = 5 groups:

where μj, σj are the mean and the standard deviation of group j, respectively. Moreover, assume that z1, …, zn is an unobserved latent sequence of i.i.d. random variables following the multinomial distribution with weights η = (η1, …, ηk), such that:

P(zi = j) = ηj, where ηj is the mixture weight assigned to the group j.

We fit our model by simulating H = 15000 samples from the posterior distribution of (z, μ, σ, η). In the univariate case, if the argument software="rjags" is selected (the default option), Gibbs sampling is performed by the package bayesmix, and the priors are:

with default values: B0 = 0.1, ν0 = 20, g0 = 10−16, G0 = 10−16, α = (1, 1, …, 1). The users may specify their own hyperparameters with the priors arguments, declaring a names list such as: priors = list(mu_0=2, alpha = rep(2, k), ...).

If software="rstan" is selected, the priors are:

where the vector of the weights η = (η1, …, ηk) is a k-simplex. Default hyperparameters values are: μ0 = 0, B0inv = 0.1, μσ = 0, τσ = 2. Here also, the users may choose their own hyperparameters values in the following way: priors = list(mu_sigma = 0, tau_sigma = 1,...).

We fit the model using the rjags method, and we set the burnin period to 7500.

k <- 5
nMC <- 15000
res <- piv_MCMC(y = y, k = k, nMC = nMC, 
                burn = 0.5*nMC, software = "rjags")
#> Compiling model graph
#>    Declaring variables
#>    Resolving undeclared variables
#>    Allocating nodes
#> Graph information:
#>    Observed stochastic nodes: 256
#>    Unobserved stochastic nodes: 268
#>    Total graph size: 1050
#> 
#> Initializing model
#> 
#> 
#> Call:
#> JAGSrun(y = y, model = mod.mist.univ, control = control)
#> 
#> Markov Chain Monte Carlo (MCMC) output:
#> Start = 7501 
#> End = 22500 
#> Thinning interval = 1 
#> 
#>  Empirical mean, standard deviation and 95% CI for eta 
#>          Mean      SD     2.5%  97.5%
#> eta[1] 0.2030 0.20230 0.009156 0.5672
#> eta[2] 0.2006 0.13765 0.009236 0.5198
#> eta[3] 0.1674 0.12353 0.018239 0.5051
#> eta[4] 0.1217 0.05231 0.071350 0.1988
#> eta[5] 0.3073 0.22254 0.009916 0.5752
#> 
#>  Empirical mean, standard deviation and 95% CI for mu 
#>        Mean     SD  2.5%  97.5%
#> mu[1] 8.386 2.7806 3.377 12.341
#> mu[2] 8.413 2.1134 5.178 12.347
#> mu[3] 8.369 1.5851 5.201 10.862
#> mu[4] 3.472 0.6172 3.126  5.368
#> mu[5] 7.358 2.7863 5.069 12.292
#> 
#>  Empirical mean, standard deviation and 95% CI for sigma2 
#>             Mean     SD   2.5%  97.5%
#> sigma2[1] 0.4473 0.2510 0.1932 1.1808
#> sigma2[2] 0.4387 0.2108 0.2009 1.0234
#> sigma2[3] 0.4526 0.2415 0.2021 1.1223
#> sigma2[4] 0.2606 0.1128 0.1280 0.5479
#> sigma2[5] 0.4132 0.2442 0.2035 1.1182

First of all, we may access the true number of iterations by tiping:

res$true.iter
#> [1] 7421

We may have a glimpse if label switching ocurred or not by looking at the traceplot for the mean parameters, μj. To do this, we apply the function piv_rel to relabel the chains and obtain useful inferences; the only argument for this function is the MCMC result just obtained with piv_MCMC. The function piv_plot displays some graphical tools, both traceplots (argument type="chains") and histograms along with the final relabelled means (argument type="hist"). For both plot ttpes, the function returns a printed message explaining how to interpret the results.

rel <- piv_rel(mcmc=res)
piv_plot(y=y, res, rel, par = "mean", type="chains")

#> Description: traceplot of the raw MCMC chains and the relabelled chains for the means parameters. Each colored chain corresponds to one of the k distinct parameters of the mixture model. Overlapping chains may reveal that the MCMC sample is not able to distinguish between the components.
piv_plot(y=y, res, rel, type="hist")

#> Description: histograms of the data along with the estimated posterior means (red points) from raw MCMC and relabelling algorithm. The blue line is the estimated density curve.

The first plot displays the traceplots for the parameters μ. From the left plot showing the raw outputs as given by the Gibbs sampling, we note that label switching clearly occurred. Our algorithm seems able to reorder the mean μj and the weights ηj, for j = 1, …, k. Of course, a MCMC sampler which does not switch the labels would ideal, but nearly impossible to program. However, we could assess how two diferent sampler perform, by repeating the analysis above by selecting software="rstan" in the piv_MCMC function.

# stan code not evaluated here
res2 <- piv_MCMC(y = y, k = k, nMC = 3000, 
                 software = "rstan")
rel2 <- piv_rel(res2)
piv_plot(y=y, res2, rel2, par = "mean", type="chains")

With the rstan option, we can use the bayesplot functions on the argument:

# stan code not evaluated here
posterior <- as.array(res2$stanfit)
mcmc_intervals(posterior, regex_pars = c("mu"))

Regardless of the software that we chose, we may extract the JAGS/Stan model by typing:

cat(res$model)
#> var 
#>  b0,
#>      B0inv,
#>      nu0Half,
#>      g0Half,
#>      g0G0Half,
#>      k,
#>      N,
#>      eta[5], 
#>  mu[5], 
#>  tau[5], 
#>  nu0S0Half,
#>      S0,
#>      e[5], 
#>  y[256], 
#>  S[256]; 
#> 
#> model    {
#>  for (i in 1:N) {
#>      y[i] ~ dnorm(mu[S[i]],tau[S[i]]);
#>      S[i] ~ dcat(eta[]);
#>  }
#>  for (j in 1:k) {
#>      mu[j] ~ dnorm(b0,B0inv);
#>      tau[j] ~ dgamma(nu0Half,nu0S0Half);
#>  }
#>  S0 ~ dgamma(g0Half,g0G0Half);
#>  nu0S0Half <- nu0Half * S0;
#> 
#>  eta[] ~ ddirch(e[]);
#> }

In order to estimate the number of clusters in the data, we can now fit sparse finite mixtures as proposed by Frühwirth-Schnatter and Malsiner-Walli (2019) by assuming:

η ∼ Dirichlet(e0),

where the smaller e0 and the the smaller the number of clusters a-posteriori. The function allows for a Gamma prior on e0 with hyperparameters ae, be that may be chosen by the users.

res3 <- piv_MCMC(y = y, k = k, nMC = nMC, sparsity = TRUE,
                 priors = list(alpha = rep(0.001, k))) # sparse on eta
#> Compiling model graph
#>    Declaring variables
#>    Resolving undeclared variables
#>    Allocating nodes
#> Graph information:
#>    Observed stochastic nodes: 256
#>    Unobserved stochastic nodes: 268
#>    Total graph size: 1050
#> 
#> Initializing model
#> 
#> 
#> Call:
#> JAGSrun(y = y, model = mod.mist.univ, control = control)
#> 
#> Markov Chain Monte Carlo (MCMC) output:
#> Start = 7501 
#> End = 22500 
#> Thinning interval = 1 
#> 
#>  Empirical mean, standard deviation and 95% CI for eta 
#>             Mean        SD    2.5%     97.5%
#> eta[1] 3.674e-05 0.0008714 0.00000 1.086e-12
#> eta[2] 7.515e-01 0.1190594 0.48700 9.288e-01
#> eta[3] 3.925e-04 0.0058635 0.00000 3.997e-11
#> eta[4] 2.480e-01 0.1190932 0.07107 5.130e-01
#> eta[5] 2.549e-05 0.0007312 0.00000 2.014e-14
#> 
#>  Empirical mean, standard deviation and 95% CI for mu 
#>        Mean     SD    2.5%  97.5%
#> mu[1] 5.677 3.1578 -0.4744 11.816
#> mu[2] 5.425 0.2327  4.9587  5.846
#> mu[3] 5.613 3.1466 -0.5620 11.721
#> mu[4] 8.394 0.8106  6.9791 10.047
#> mu[5] 5.633 3.1208 -0.4693 11.684
#> 
#>  Empirical mean, standard deviation and 95% CI for sigma2 
#>            Mean     SD   2.5% 97.5%
#> sigma2[1] 2.438 1.1416 0.9608 5.312
#> sigma2[2] 1.814 0.3876 1.1119 2.623
#> sigma2[3] 2.436 1.1367 0.9649 5.276
#> sigma2[4] 3.002 0.8899 1.5451 4.944
#> sigma2[5] 2.440 1.1286 0.9741 5.291
#> [1] "MCMC has not never been able to identify the required number of groups and the process has been interrupted"
barplot(table(res3$nclusters), xlab= expression(K["+"]),
        col = "blue", border = "red", main = expression(paste("p(",K["+"], "|y)")),
        cex.main=3, yaxt ="n", cex.axis=2.4, cex.names=2.4,
        cex.lab=2)

References

Egidi, Leonardo, Roberta Pappadà, Francesco Pauli, and Nicola Torelli. 2018a. “Maxima Units Search (MUS) Algorithm: Methodology and Applications.” In Studies in Theoretical and Applied Statistics, edited by C. Perna, M. Pratesi, and A. Ruiz-Gazen, 71–81. Springer.
———. 2018b. “Relabelling in Bayesian Mixture Models by Pivotal Units.” Statistics and Computing 28 (4): 957–69.
Frühwirth-Schnatter, Sylvia. 2001. “Markov Chain Monte Carlo Estimation of Classical and Dynamic Switching and Mixture Models.” Journal of the American Statistical Association 96 (453): 194–209.
Frühwirth-Schnatter, Sylvia, and Gertraud Malsiner-Walli. 2019. “From Here to Infinity: Sparse Finite Versus Dirichlet Process Mixtures in Model-Based Clustering.” Advances in Data Analysis and Classification 13 (1): 33–64.
Grün, B. 2011. bayesmix: Bayesian Mixture Models with JAGS. R Package Version 0.7-2.” https://CRAN.R-project.org/package=bayesmix.
Papastamoulis, Panagiotis. 2016. label.switching: An R Package for Dealing with the Label Switching Problem in MCMC Outputs.” Journal of Statistical Software, Code Snippets 69 (1): 1–24. https://doi.org/10.18637/jss.v069.c01.
Titterington, D Michael, Adrian FM Smith, and Udi E Makov. 1985. Statistical Analysis of Finite Mixture Distributions. Wiley.