--- title: "Dealing with label switching: relabelling in Bayesian mixture models by pivotal units" author: "Leonardo Egidi" date: "`r Sys.Date()`" output: rmarkdown::html_vignette bibliography: ref.bib vignette: > %\VignetteIndexEntry{Dealing with label switching: relabelling in Bayesian mixture models by pivotal units} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` In this vignette we explore the fit of Gaussian mixture models and the relabelling pivotal method proposed in @egidi2018relabelling through the `pivmet` package. First of all, we load the package: ```{r load, warning =FALSE, message = FALSE} library(pivmet) library(bayesmix) library(bayesplot) ``` The `pivmet` \code{R} package provides a simple framework to (i) fit univariate and multivariate mixture models according to a Bayesian flavour and detect the pivotal units, via the `piv_MCMC` function; (ii) perform the relabelling step via the `piv_rel` function. There are two main functions used for this task. The function `piv_MCMC`: + **performs MCMC sampling** for Gaussian mixture models using the underlying `rjags` or `rstan` packages (chosen by the users through the optional function argument `software`, by default set to `rjags`). Precisely the package uses: the `JAGSrun` function of the `bayesmix` package for univariate mixture models; the `run.jags` function of the `runjags` package for bivariate mixture models; the `stan` function of the `rstan` package for both univariate and bivariate mixture models. + **Post-processes the chains** and randomly switches their values [@fruhwirth2001markov]. + **Builds a co-association matrix for the $N$ units**. After $H$ MCMC iterations, the function implements a clustering procedure with $k$ groups, the user may choose among agglomerative or divisive hierarchical clustering through the optional argument `clustering`. Using the latent formulation for mixture models, we denote with $[z_i]_h$ the group allocation of the $i$-th unit at the $h$-th iteration. In such a way, a co-association matrix $C$ for the $N$ statistical units is built, where the single cell $c_{ip}$ is the fraction of times the unit $i$ and the unit $p$ belong to the same group along the $H$ MCMC iterations: $$c_{ip} = \frac{n_{ip}}{H}=\frac{1}{H} \sum_{h=1}^{H}|[z_i]_h=[z_p]_h|,$$ where $|\cdot|$ denotes the event indicator and $n_{ip}$ is the number of times the units $i, \ p$ belong to the same group along the sampling. + **Extracts the pivots**, one for each group, which are (pairwise) separated units with (posterior) probability one (that is, the posterior probability of any two of them being in the same group is approximately zero). We denote them by $i_1,\ldots,i_k$. The user may choose among four procedures for extracting the pivotal units with the optional argument `piv.criterion`. For group $j$ containing $J_j$ units, one can choose: - $i^{*}$ that maximizes $\sum_{p \in J_j}c_{ip}$ (`"maxsumint"`, default method); - $i^{*}$ that maximizes $\sum_{p \in J_j}c_{ip}- \sum_{p \not\in J_j}c_{ip}$ (`"maxsumdiff"`); - $i^{*}$ that minimizes $\sum_{p \not\in J_j}c_{ip}$ (`"minsumnoint"`). These three methods are applied by the internal function `piv_sel`. Alternatively, when $k <5$, the user can set `piv.criterion="MUS"` [@egidi2018mus] which performs a sequential search of identity submatrices within the matrix $C$ via the internal function `MUS`. 
The function `piv_rel`:

+ **Performs the relabelling algorithm** using the $k$ pivotal units as group identifiers. The pivotal units previously detected play a central role, yielding the following relabelling for the $h$-th iteration:

  \begin{align*}
  [\mu_{j}]_h &= [\mu_{z_{i_{j}}}]_h, \\
  [z_{i}]_h &= j \quad \mbox{for } i : [z_i]_h=[z_{i_{j}}]_h,
  \end{align*}

  where $\boldsymbol{\mu}=(\mu_1,\mu_2,\ldots,\mu_k)$ is the vector of the mean parameters and $\boldsymbol{z}=(z_1,z_2,\ldots,z_N)$ is an i.i.d. vector of latent variables taking values in $\{1,2,\ldots,k\}$ and denoting the group membership of each statistical unit. `piv_rel` takes as input the MCMC output from `piv_MCMC` and returns the relabelled chains and the corresponding posterior estimates.

## Example: bivariate Gaussian data

Suppose now that $\boldsymbol{y}_i \in \mathbb{R}^2$ and assume that:

$$\boldsymbol{y}_i \sim \sum_{j=1}^{k}\eta_{j}\mathcal{N}_{2}(\boldsymbol{\mu}_{j}, \boldsymbol{\Sigma}),$$

where $\boldsymbol{\mu}_j$ is the mean vector for group $j$, $\boldsymbol{\Sigma}$ is a positive-definite covariance matrix and the mixture weight is $\eta_j = P(z_i=j)$, as in the one-dimensional case. We may generate Gaussian mixture data through the function `piv_sim`, specifying the sample size $N$, the desired number of groups $k$ and the $k \times 2$ matrix of the $k$ mean vectors. The argument `W` handles the weights of a nested mixture, in which the $j$-th component is in turn modelled as a two-component mixture with covariance matrices $\boldsymbol{\Sigma}_{p1}$ and $\boldsymbol{\Sigma}_{p2}$, respectively.

```{r nested, fig.align ='center'}
set.seed(500)
N <- 200
k <- 3
D <- 2
nMC <- 2000
M1 <- c(-10, 8)
M2 <- c(10, .1)
M3 <- c(30, 8)
# matrix of input means
Mu <- rbind(M1, M2, M3)
# covariance matrices for the two subgroups
Sigma.p1 <- diag(D)
Sigma.p2 <- (10^2)*diag(D)
# subgroups' weights
W <- c(0.2, 0.8)
# simulate data
sim <- piv_sim(N = N, k = k, Mu = Mu,
               Sigma.p1 = Sigma.p1,
               Sigma.p2 = Sigma.p2, W = W)
```

The function `piv_MCMC` requires only three mandatory arguments: the data object `y`, the number of components `k` and the number of MCMC iterations `nMC`. By default, it performs Gibbs sampling using the `runjags` package. If `software = "rjags"`, for bivariate data the priors are:

\begin{align}
\boldsymbol{\mu}_j \sim & \mathcal{N}_2(\boldsymbol{\mu}_0, S_2)\\
\Sigma^{-1} \sim & \mbox{Wishart}(S_3, 3)\\
\boldsymbol{\eta} \sim & \mbox{Dirichlet}(\boldsymbol{\alpha}),
\end{align}

where $\boldsymbol{\alpha}$ is a $k$-dimensional vector and $S_2$ and $S_3$ are positive-definite matrices. By default, $\boldsymbol{\mu}_0=\boldsymbol{0}$, $\boldsymbol{\alpha}=(1,\ldots,1)$, and $S_2$ and $S_3$ are diagonal matrices with diagonal elements equal to 1e+05. Different values can be specified for the hyperparameters $\boldsymbol{\mu}_0, S_2, S_3$ and $\boldsymbol{\alpha}$ via `priors = list(mu_0 = c(1,1), S2 = ..., S3 = ..., alpha = ...)`, with the constraint that $S_2$ and $S_3$ be positive definite and $\boldsymbol{\alpha}$ a vector of dimension $k$ with nonnegative elements. If `software = "rstan"`, the function performs Hamiltonian Monte Carlo (HMC) sampling. In this case the priors are:

\begin{align}
\boldsymbol{\mu}_j \sim & \mathcal{N}_2(\boldsymbol{\mu}_0, LD^*L^{T})\\
L \sim & \mbox{LKJ}(\epsilon)\\
D^*_{1,1}, D^*_{2,2} \sim & \mbox{HalfCauchy}(0, \sigma_d).
\end{align}

The covariance matrix is expressed in terms of the decomposition $LD^*L^{T}$, a variant of the classical Cholesky decomposition, where $L$ is a $2 \times 2$ lower unit triangular matrix and $D^*$ is a $2 \times 2$ diagonal matrix. The Cholesky correlation factor $L$ is assigned an LKJ prior with $\epsilon$ degrees of freedom which, combined with the priors on the standard deviations of each component, induces a prior on the covariance matrix; as $\epsilon \rightarrow \infty$ the magnitude of the correlations between components decreases, whereas $\epsilon=1$ leads to a uniform prior distribution for $L$. By default, the hyperparameters are $\boldsymbol{\mu}_0=\boldsymbol{0}$, $\sigma_d=2.5$ and $\epsilon=1$. Different values can be chosen with the argument `priors = list(mu_0 = c(1,2), sigma_d = 4, epsilon = 2)`.

We fit the model using `rjags` with 2000 MCMC iterations and default priors:

```{r mcmc, message = FALSE, warning = FALSE}
res <- piv_MCMC(y = sim$y, k = k, nMC = nMC,
                piv.criterion = "maxsumdiff")
```

Once we obtain posterior estimates, label switching is likely to occur. For this reason, we need to relabel the chains as explained above. To do so, we use the function `piv_rel`, which only needs the `mcmc = res` argument. Relabelled outputs can then be displayed through the function `piv_plot`, with different options for the argument `type`:

- `chains`: plot the relabelled chains;
- `hist`: plot the point estimates against the histogram of the data.

The optional argument `par` takes four alternative values: `mean`, `sd`, `weight` and `all`, for the means, the standard deviations, the weights or all three sets of parameters, respectively. By default, `par = "all"`.

```{r pivotal_rel, fig.show='hold', fig.align='center',fig.width=7}
rel <- piv_rel(mcmc=res)
piv_plot(y = sim$y, mcmc = res, rel_est = rel, par = "mean", type = "chains")
piv_plot(y = sim$y, mcmc = res, rel_est = rel, type = "hist")
```

## Example: fishery data

The fishery dataset, previously used by @titterington1985statistical and @papastamoulis2016label, consists of 256 snapper length measurements and is contained in the `bayesmix` package [@grun2011bayesmix]. We may display the histogram of the data, along with an estimated kernel density:

```{r fish_hist, fig.align ='center', fig.width=5.5}
data(fish)
y <- fish[,1]
hist(y, breaks = 40, prob = TRUE, cex.lab = 1.6,
     main = "Fishery data", cex.main = 1.7,
     col = "navajowhite1", border = "navajowhite1")
lines(density(y), lty = 1, lwd = 3, col = "blue")
```

We assume a mixture model with $k=5$ groups:

\begin{equation}
y_i \sim \sum_{j=1}^k \eta_j \mathcal{N}(\mu_j, \sigma^2_j), \ \ i=1,\ldots,n,
\label{eq:fishery}
\end{equation}

where $\mu_j$ and $\sigma_j$ are the mean and the standard deviation of group $j$, respectively. Moreover, assume that $z_1,\ldots,z_n$ is an unobserved sequence of i.i.d. latent random variables following the multinomial distribution with weights $\boldsymbol{\eta}=(\eta_{1},\dots,\eta_{k})$, such that:

$$P(z_i=j)=\eta_j,$$

where $\eta_j$ is the mixture weight assigned to group $j$. We fit the model by simulating $H=15000$ samples from the posterior distribution of $(\boldsymbol{z}, \boldsymbol{\mu}, \boldsymbol{\sigma}, \boldsymbol{\eta})$.
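Before fitting, the latent formulation above can be made concrete with a short simulation sketch; the parameter values below are made up purely for illustration and play no role in the analysis that follows:

```{r latent_sketch, eval = FALSE}
# Illustration only: generate data from the latent formulation above,
# with made-up parameter values (these play no role in the analysis).
set.seed(123)
n   <- 256
eta <- rep(1/5, 5)                  # mixture weights (a 5-simplex)
mu  <- c(3.5, 5.0, 6.5, 8.0, 9.5)   # hypothetical component means
sig <- rep(0.5, 5)                  # hypothetical component sds

z     <- sample(1:5, n, replace = TRUE, prob = eta)  # P(z_i = j) = eta_j
y_sim <- rnorm(n, mean = mu[z], sd = sig[z])         # y_i | z_i = j
```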
In the univariate case, if the argument `software = "rjags"` is selected (the default option), Gibbs sampling is performed by the package `bayesmix`, and the priors are:

\begin{eqnarray}
\mu_j \sim & \mathcal{N}(\mu_0, 1/B_0)\\
\sigma_j \sim & \mbox{invGamma}(\nu_0/2, \nu_0 S_0/2)\\
\boldsymbol{\eta} \sim & \mbox{Dirichlet}(\boldsymbol{\alpha})\\
S_0 \sim & \mbox{Gamma}(g_0/2, g_0 G_0/2),
\end{eqnarray}

with default values $B_0=0.1$, $\nu_0 =20$, $g_0 = 10^{-16}$, $G_0 = 10^{-16}$ and $\boldsymbol{\alpha}=(1,1,\ldots,1)$. The users may specify their own hyperparameters through the `priors` argument, declaring a named list such as `priors = list(mu_0 = 2, alpha = rep(2, k), ...)`. If `software = "rstan"` is selected, the priors are:

\begin{eqnarray}
\mu_j & \sim \mathcal{N}(\mu_0, 1/B_{0inv})\\
\sigma_j & \sim \mbox{Lognormal}(\mu_{\sigma}, \tau_{\sigma})\\
\eta_j & \sim \mbox{Uniform}(0,1),
\end{eqnarray}

where the vector of weights $\boldsymbol{\eta}=(\eta_1,\ldots,\eta_k)$ is a $k$-simplex. The default hyperparameter values are $\mu_0=0$, $B_{0inv}=0.1$, $\mu_{\sigma}=0$ and $\tau_{\sigma}=2$. Here also, the users may choose their own hyperparameter values in the following way: `priors = list(mu_sigma = 0, tau_sigma = 1, ...)`.

We fit the model using the `rjags` method, setting the burn-in period to 7500 iterations:

```{r fish_data}
k <- 5
nMC <- 15000
res <- piv_MCMC(y = y, k = k, nMC = nMC,
                burn = 0.5*nMC, software = "rjags")
```

First of all, we may access the true number of iterations by typing:

```{r true_iter}
res$true.iter
```

We can get a first idea of whether label switching occurred by looking at the traceplots of the mean parameters $\mu_j$. To this end, we apply the function `piv_rel` to relabel the chains and obtain useful inferences; the only argument of this function is the MCMC result just obtained with `piv_MCMC`. The function `piv_plot` then displays some graphical tools, both traceplots (argument `type = "chains"`) and histograms along with the final relabelled means (argument `type = "hist"`). For both plot types, the function returns a printed message explaining how to interpret the results.

```{r fish_rel, fig.align= 'center', fig.width=7}
rel <- piv_rel(mcmc=res)
piv_plot(y=y, res, rel, par = "mean", type="chains")
piv_plot(y=y, res, rel, type="hist")
```

The first plot displays the traceplots for the parameters $\boldsymbol{\mu}$. From the left plot, showing the raw output of the Gibbs sampler, we note that label switching clearly occurred. Our algorithm seems able to reorder the means $\mu_j$ and the weights $\eta_j$, for $j=1,\ldots,k$. Of course, an MCMC sampler that never switches the labels would be ideal, but it is nearly impossible to program.
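Beyond the plots, numerical summaries can be read directly off the relabelled chains. The sketch below assumes that the relabelled mean draws are returned in a component named `rel_mean`, an (iterations $\times$ $k$) matrix; since this name is an assumption here, `str()` is a safe way to check the actual structure first:

```{r rel_summaries, eval = FALSE}
# Inspect the structure of the relabelled output first.
str(rel, max.level = 1)

# Assuming (see text above) the relabelled mean chains are stored in
# rel$rel_mean, posterior point estimates follow directly:
colMeans(rel$rel_mean)
```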
We can, however, assess how two different samplers perform by repeating the analysis above with `software = "rstan"` in the `piv_MCMC` function.

```{r stan, eval = FALSE, fig.align= 'center', fig.width=7}
# stan code not evaluated here
res2 <- piv_MCMC(y = y, k = k, nMC = 3000, software = "rstan")
rel2 <- piv_rel(res2)
piv_plot(y=y, res2, rel2, par = "mean", type="chains")
```

With the `rstan` option, we can use the `bayesplot` functions on the `stanfit` object:

```{r bayesplot, eval = FALSE, fig.align= 'center', fig.width=5}
# stan code not evaluated here
posterior <- as.array(res2$stanfit)
mcmc_intervals(posterior, regex_pars = c("mu"))
```

Regardless of the chosen software, we may extract the underlying JAGS/Stan model code by typing:

```{r model_code}
cat(res$model)
```

In order to estimate the number of clusters in the data, we can now fit **sparse finite mixtures** as proposed by @fruhwirth2019here, by assuming:

$$\boldsymbol{\eta} \sim \text{Dirichlet}(e_0),$$

where the smaller $e_0$, the smaller the number of clusters a posteriori. The function also allows a Gamma prior on $e_0$, with hyperparameters $a_e, b_e$ that may be chosen by the users.

```{r sparsity, message =FALSE, warning = FALSE}
res3 <- piv_MCMC(y = y, k = k, nMC = nMC,
                 sparsity = TRUE,
                 priors = list(alpha = rep(0.001, k))) # sparse prior on eta
barplot(table(res3$nclusters), xlab = expression(K["+"]),
        col = "blue", border = "red",
        main = expression(paste("p(", K["+"], "|y)")),
        cex.main = 3, yaxt = "n", cex.axis = 2.4,
        cex.names = 2.4, cex.lab = 2)
```

## References