Title: Pivotal Methods for Bayesian Relabelling and k-Means Clustering
Description: Collection of pivotal algorithms for: relabelling the MCMC chains in order to undo the label switching problem in Bayesian mixture models; fitting sparse finite mixtures; initializing the centers of the classical k-means algorithm in order to obtain a better clustering solution. For further details see Egidi, Pappadà, Pauli and Torelli (2018b) <ISBN:9788891910233>.
Authors: Leonardo Egidi [aut, cre], Roberta Pappadà [aut], Francesco Pauli [aut], Nicola Torelli [aut]
Maintainer: Leonardo Egidi <[email protected]>
License: GPL-2
Version: 0.6.0
Built: 2024-11-26 06:38:50 UTC
Source: CRAN
Perform the Maxima Units Search (MUS) algorithm on a large and sparse matrix in order to find a set of pivotal units through a sequential search in the given matrix.
MUS(C, clusters, prec_par = 10)
C | A N x N co-association matrix, i.e. a symmetric matrix whose generic element is the fraction of times each pair of units is assigned to the same cluster across a clustering ensemble.
clusters | A vector of integers from 1:k indicating the cluster to which each unit is allocated.
prec_par | Optional argument. The maximum number of candidate pivots for each group. Default is 10.
Consider $H$ distinct partitions of a set of $N$ $D$-dimensional statistical units into $k$ groups determined by some clustering technique. A $N \times N$ co-association matrix $C$ with generic element

$$c_{ij} = \frac{n_{ij}}{H}$$

can be constructed, where $n_{ij}$ is the number of times the $i$-th and the $j$-th unit are assigned to the same cluster with respect to the clustering ensemble. Units which are very distant from each other are likely to have zero co-occurrences; as a consequence, $C$ is a square symmetric matrix expected to contain a non-negligible number of zeros. The main task of the MUS algorithm is to detect submatrices of small rank from the co-association matrix and extract those units (the pivots) such that the $k \times k$ submatrix of $C$ determined by only the pivotal row and column indexes is identical or nearly identical to the identity matrix. Practically, the resulting units have the desirable property of being representative of the group they belong to.
With the argument prec_par the user may increase the power of the underlying MUS algorithm (see Egidi et al. 2018 for details). Given the default value 10, the function internally computes an effective prec_par for each group as $\min(10, n_j)$, where $n_j$ is the number of units belonging to group $j$.
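As an illustration of this rule and of the MUS property above, both quantities can be checked by hand. A minimal sketch, assuming a co-association matrix C, a membership vector clusters, a cap prec_par, and a vector pivots returned by MUS() (this mirrors the description above, not the package's internal code):

# Effective number of candidate pivots per group: min(prec_par, group size)
prec_par_eff <- pmin(prec_par, table(clusters))

# The k x k submatrix of C indexed by the pivots should be
# (nearly) the identity matrix
round(C[pivots, pivots], 2)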
pivots | A vector of integers in 1:N denoting the indices of the pivotal units, one for each group.
prec_par | The effective number of alternative pivots considered for each group. See Details.
Leonardo Egidi [email protected], Roberta Pappadà
Egidi, L., Pappadà, R., Pauli, F., Torelli, N. (2018). Maxima Units Search (MUS) algorithm: methodology and applications. In: Perna, C., Pratesi, M., Ruiz-Gazen, A. (eds.) Studies in Theoretical and Applied Statistics, Springer Proceedings in Mathematics and Statistics 227, pp. 71–81.
# Data generated from a mixture of three bivariate Gaussian distributions
## Not run:
library(mvtnorm)  # for rmvnorm()
N <- 620
centers <- 3
n1 <- 20
n2 <- 100
n3 <- 500
x <- matrix(NA, N, 2)
truegroup <- c(rep(1, n1), rep(2, n2), rep(3, n3))
x[1:n1, ] <- rmvnorm(n1, c(1, 5), sigma = diag(2))
x[(n1 + 1):(n1 + n2), ] <- rmvnorm(n2, c(4, 0), sigma = diag(2))
x[(n1 + n2 + 1):(n1 + n2 + n3), ] <- rmvnorm(n3, c(6, 6), sigma = diag(2))

# Build a similarity matrix from clustering ensembles
H <- 1000
a <- matrix(NA, H, N)
for (h in 1:H) {
  a[h, ] <- kmeans(x, centers)$cluster
}
sim_matr <- matrix(NA, N, N)
for (i in 1:(N - 1)) {
  for (j in (i + 1):N) {
    sim_matr[i, j] <- sum(a[, i] == a[, j]) / H
    sim_matr[j, i] <- sim_matr[i, j]
  }
}

# Obtain a clustering solution via kmeans with multiple random seeds
cl <- KMeans(x, centers)$cluster

# Find three pivots
mus_alg <- MUS(C = sim_matr, clusters = cl)
## End(Not run)
Perform classical k-means clustering on a data matrix using pivots as initial centers.
piv_KMeans(
  x,
  centers,
  alg.type = c("kmeans", "hclust"),
  method = "average",
  piv.criterion = c("MUS", "maxsumint", "minsumnoint", "maxsumdiff"),
  H = 1000,
  iter.max = 10,
  nstart = 10,
  prec_par = 10
)
x | A N x D data matrix, or a data frame.
centers | The number of groups for the data partition.
alg.type | The clustering algorithm used for the initial partition of the N units into centers groups. Possible choices are "kmeans" (default) and "hclust".
method | If alg.type = "hclust", the agglomeration method to be used. Default is "average".
piv.criterion | The pivotal criterion used for identifying one pivot for each group. Possible choices are: "MUS", "maxsumint", "minsumnoint", "maxsumdiff".
H | The number of distinct k-means runs used for building the N x N co-association matrix. Default is 1000.
iter.max | If alg.type = "kmeans", the maximum number of iterations allowed in kmeans. Default is 10.
nstart | If alg.type = "kmeans", the number of different starting random seeds. Default is 10.
prec_par | If piv.criterion = "MUS", the maximum number of candidate pivots for each group. Default is 10.
The function implements a modified version of k-means which aims at improving the clustering solution starting from a careful seeding. In particular, it performs a pivot-based initialization step using pivotal methods to find the initial centers for the clustering procedure. The starting point consists of multiple runs of the classical k-means, obtained by selecting nstart > 1 in the kmeans function with a fixed number of clusters, in order to build the co-association matrix of data units; one pivotal unit per group is then used as an initial center, as sketched below.
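A minimal sketch of this seeding idea, assuming a numeric data matrix x and k groups (this is a conceptual illustration using piv_sel from this package, not the function's internal code; the column ordering of the pivots matrix follows the piv_sel Value section):

# Build a co-association matrix from H k-means runs
H <- 100
N <- nrow(x)
k <- 3
a <- t(replicate(H, kmeans(x, k)$cluster))   # H x N matrix of labels
C <- matrix(0, N, N)
for (i in 1:(N - 1)) for (j in (i + 1):N) {
  C[i, j] <- mean(a[, i] == a[, j])
  C[j, i] <- C[i, j]
}

# Pick one pivot per group (first criterion column, "maxsumint")
piv <- piv_sel(C, clusters = kmeans(x, k)$cluster)$pivots[, 1]

# Seed the final k-means with the pivotal units as initial centers
fit <- kmeans(x, centers = x[piv, ])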
A list with components
cluster | A vector of integers indicating the cluster to which each point is allocated.
centers | A matrix of cluster centers (centroids).
coass | The co-association matrix built from ensemble clustering.
pivots | The pivotal units identified by the selected pivotal criterion.
totss | The total sum of squares.
withinss | The within-cluster sum of squares for each cluster.
tot.withinss | The within-cluster sum of squares summed across clusters.
betweenss | The between-cluster sum of squared distances.
size | The number of points in each cluster.
iter | The number of (outer) iterations.
ifault | integer: indicator of a possible algorithm problem (for experts).
Leonardo Egidi [email protected], Roberta Pappadà
Egidi, L., Pappadà, R., Pauli, F., Torelli, N. (2018). K-means seeding via MUS algorithm. Conference Paper, Book of Short Papers, SIS2018, ISBN: 9788891910233.
# Data generated from a mixture of three bivariate Gaussian distributions
## Not run:
library(mvtnorm)  # for rmvnorm()
N <- 620
k <- 3
n1 <- 20
n2 <- 100
n3 <- 500
x <- matrix(NA, N, 2)
truegroup <- c(rep(1, n1), rep(2, n2), rep(3, n3))
x[1:n1, ] <- rmvnorm(n1, c(1, 5), sigma = diag(2))
x[(n1 + 1):(n1 + n2), ] <- rmvnorm(n2, c(4, 0), sigma = diag(2))
x[(n1 + n2 + 1):(n1 + n2 + n3), ] <- rmvnorm(n3, c(6, 6), sigma = diag(2))

# Apply piv_KMeans with MUS as pivotal criterion
res <- piv_KMeans(x, k)

# Apply piv_KMeans with maxsumdiff as pivotal criterion
res2 <- piv_KMeans(x, k, piv.criterion = "maxsumdiff")

# Plot the data and the clustering solution
par(mfrow = c(1, 2), pty = "s")
colors_cluster <- c("grey", "darkolivegreen3", "coral")
colors_centers <- c("black", "darkgreen", "firebrick")
graphics::plot(x, col = colors_cluster[truegroup],
               bg = colors_cluster[truegroup], pch = 21,
               xlab = "x[,1]", ylab = "x[,2]",
               cex.lab = 1.5, main = "True data", cex.main = 1.5)
graphics::plot(x, col = colors_cluster[res$cluster],
               bg = colors_cluster[res$cluster], pch = 21,
               xlab = "x[,1]", ylab = "x[,2]",
               cex.lab = 1.5, main = "piv_KMeans", cex.main = 1.5)
points(x[res$pivots, 1], x[res$pivots, 2], pch = 24,
       col = colors_centers, bg = colors_centers, cex = 1.5)
points(res$centers, col = colors_centers[1:k], pch = 8, cex = 2)
## End(Not run)
Perform MCMC JAGS sampling or HMC Stan sampling for Gaussian mixture models, post-process the chains and apply a clustering technique to the MCMC sample. Pivotal units for each group are selected among four alternative criteria.
piv_MCMC(
  y,
  k,
  nMC,
  priors,
  piv.criterion = c("MUS", "maxsumint", "minsumnoint", "maxsumdiff"),
  clustering = c("diana", "hclust"),
  software = c("rjags", "rstan"),
  burn = 0.5 * nMC,
  chains = 4,
  cores = 1,
  sparsity = FALSE
)
y | N-dimensional data vector for univariate data; a N x D matrix for multivariate data.
k | Number of mixture components.
nMC | Number of MCMC iterations for the JAGS/Stan function execution.
priors | Input prior hyperparameters (see Details for default options).
piv.criterion | The pivotal criterion used for identifying one pivot for each group. Possible choices are: "MUS", "maxsumint", "minsumnoint", "maxsumdiff". The "MUS" criterion is available only if k <= 4.
clustering | The algorithm adopted for partitioning the N units into k groups. Possible choices are "diana" (default) and "hclust".
software | The selected MCMC method to fit the model: "rjags" for the JAGS method, "rstan" for the Stan method. Default is "rjags".
burn | The burn-in period (only if software = "rjags" is selected). Default is 0.5 * nMC.
chains | A positive integer specifying the number of Markov chains. The default is 4.
cores | The number of cores to use when executing the Markov chains in parallel (only if software = "rstan"). Default is 1.
sparsity | Allows for sparse finite mixtures. Default is FALSE.
The function fits univariate and multivariate Bayesian Gaussian mixture models of the form (here for univariate only):

$$(Y_i \mid Z_i = j) \sim \mathcal{N}(\mu_j, \sigma_j^2),$$

where the $Z_i$, $i = 1, \ldots, N$, are i.i.d. random variables, $j = 1, \ldots, k$, $\sigma_j^2$ is the group variance, $Z_i \in \{1, \ldots, k\}$ are the latent group allocations, and

$$\Pr(Z_i = j) = \eta_j.$$

The likelihood of the model is then

$$L(y; \mu, \eta, \sigma) = \prod_{i=1}^{N} \sum_{j=1}^{k} \eta_j \, \mathcal{N}(y_i; \mu_j, \sigma_j^2),$$

where $(\mu, \sigma) = (\mu_1, \ldots, \mu_k, \sigma_1, \ldots, \sigma_k)$ are the component-specific parameters and $\eta = (\eta_1, \ldots, \eta_k)$ the mixture weights. Let $\nu$ denote a permutation of $\{1, \ldots, k\}$, and let $\mu_\nu = (\mu_{\nu(1)}, \ldots, \mu_{\nu(k)})$, $\sigma_\nu = (\sigma_{\nu(1)}, \ldots, \sigma_{\nu(k)})$, $\eta_\nu = (\eta_{\nu(1)}, \ldots, \eta_{\nu(k)})$ be the corresponding permutations of $\mu$, $\sigma$ and $\eta$. Denoting by $V$ the set of all the permutations of the indexes $\{1, \ldots, k\}$, the likelihood above is invariant under any permutation $\nu \in V$, that is

$$L(y; \mu, \eta, \sigma) = L(y; \mu_\nu, \eta_\nu, \sigma_\nu).$$

As a consequence, the model is unidentified with respect to an arbitrary permutation of the labels. When Bayesian inference for the model is performed, if the prior distribution $p_0(\mu, \eta, \sigma)$ is invariant under a permutation of the indices, then so is the posterior. That is, if $p_0(\mu, \eta, \sigma) = p_0(\mu_\nu, \eta_\nu, \sigma_\nu)$, then the posterior distribution is multimodal with (at least) $k!$ modes.
Depending on the selected software, the model parametrization
changes in terms of the prior choices.
Precisely, the JAGS philosophy with the underlying Gibbs sampling
is to use noninformative priors, and conjugate priors are
preferred for computational speed.
Conversely, Stan adopts weakly informative priors,
with no need to explicitly use the conjugacy.
For univariate mixtures, when software="rjags" the specification is the same as the function BMMmodel of the bayesmix package (an independence prior with a hierarchical prior on the scale):

$$\mu_j \sim \mathcal{N}(\mu_0, 1/B_{0inv}), \qquad \sigma_j^{-2} \sim \textrm{Gamma}(\nu_0/2, \nu_0 S_0/2),$$

$$S_0 \sim \textrm{Gamma}(g_0/2, G_0/2), \qquad \eta \sim \textrm{Dirichlet}(1, \ldots, 1),$$

with default hyperparameter values set in accordance with the default specification:

priors=list(kind = "independence", parameter = "priorsFish", hierarchical = "tau")

(see bayesmix for further details and choices).
When software="rstan"
, the prior specification is:
with default values: .
The users may specify new hyperparameter values with the argument:
priors=list(mu_0=1, B0inv=0.2, mu_sigma=3, tau_sigma=5)
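For instance, a full call overriding these hyperparameters could look as follows; a minimal sketch, assuming a univariate data vector y and k = 3 groups:

res_stan <- piv_MCMC(y = y, k = 3, nMC = 2000, software = "rstan",
                     priors = list(mu_0 = 1, B0inv = 0.2,
                                   mu_sigma = 3, tau_sigma = 5))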
For multivariate mixtures, when software="rjags" the prior specification is the following:

$$\boldsymbol{\mu}_j \sim \mathcal{N}_D(\boldsymbol{\mu}_0, S_2), \qquad \Sigma^{-1} \sim \textrm{Wishart}(S_3, D+1), \qquad \eta \sim \textrm{Dirichlet}(\boldsymbol{\alpha}),$$

where $\boldsymbol{\alpha}$ is a $k$-dimensional vector and $S_2$ and $S_3$ are positive definite matrices. By default, $\boldsymbol{\mu}_0 = \boldsymbol{0}$, $\boldsymbol{\alpha} = (1, \ldots, 1)$, and $S_2$ and $S_3$ are diagonal matrices, with diagonal elements equal to 1e+05. The user may specify other values for the hyperparameters $\boldsymbol{\mu}_0$, $S_2$, $S_3$ and $\boldsymbol{\alpha}$ via the priors argument in such a way:

priors=list(mu_0 = c(1,1), S2 = ..., S3 = ..., alpha = ...)

with the constraint that $S_2$ and $S_3$ are positive definite, and $\boldsymbol{\alpha}$ a vector of dimension $k$ with nonnegative elements.
When software="rstan"
, the prior specification is:
The covariance matrix is expressed in terms of the LDL decomposition as ,
a variant of the classical Cholesky decomposition, where
is a
lower unit triangular matrix and
is a
diagonal matrix.
The Cholesky correlation factor
is assigned a LKJ prior with
degrees of freedom, which,
combined with priors on the standard deviations of each component, induces a prior on the covariance matrix;
as
the magnitude of correlations between components decreases,
whereas
leads to a uniform prior distribution for
.
By default, the hyperparameters are
,
.
The user may propose some different values with the argument:
priors=list(mu_0=c(1,2), sigma_d = 4, epsilon =2)
If software="rjags"
the function performs JAGS sampling using the bayesmix
package
for univariate Gaussian mixtures, and the runjags
package for multivariate Gaussian mixtures. If software="rstan"
the function performs
Hamiltonian Monte Carlo (HMC) sampling via the rstan
package (see the vignette and the Stan project
for any help).
After MCMC sampling, this function
clusters the units in k
groups,
calls the piv_sel()
function and yields the
pivots obtained from one among four different
methods (the user may specify one among them via piv.criterion
argument):
"maxsumint"
, "minsumnoint"
, "maxsumdiff"
and "MUS"
(available only if k <= 4
)
(see the vignette for thorough details). Due to computational reasons
clarified in the Details section of the function piv_rel
, the
length of the MCMC chains will be minor or equal than the input
argument nMC
; this length, corresponding to the value
true.iter
returned by the procedure, is the number of
MCMC iterations for which
the number of JAGS/Stan groups exactly coincides with the prespecified
number of groups k
.
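Given a fitted object, these quantities can be inspected directly; a minimal usage sketch (here res denotes the output of a piv_MCMC call, as in the examples below):

res <- piv_MCMC(y = y, k = 3, nMC = 2000)
res$true.iter   # retained iterations (fitted number of groups equals k)
res$pivots      # indices of the pivotal units found by the criterion
cat(res$model)  # inspect the underlying JAGS/Stan model code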
The function gives the MCMC output, the clustering solutions and the pivotal indexes. Here is a complete list of outputs.
true.iter | The number of MCMC iterations for which the number of JAGS/Stan groups exactly coincides with the prespecified number of groups k.
Mu | An estimate of the groups' means.
groupPost | A true.iter x N matrix with values from 1:k indicating the post-processed group allocation vector.
mcmc_mean | If y is a vector, a true.iter x k matrix with the post-processed MCMC chains for the mean parameters; if y is a matrix, a true.iter x D x k array.
mcmc_sd | If y is a vector, a true.iter x k matrix with the post-processed MCMC chains for the sd parameters; if y is a matrix, a true.iter x D matrix.
mcmc_weight | A true.iter x k matrix with the post-processed MCMC chains for the weight parameters.
mcmc_mean_raw | If y is a vector, a (nMC - burn) x k matrix with the raw MCMC chains for the mean parameters as given by JAGS/Stan; if y is a matrix, a (nMC - burn) x D x k array.
mcmc_sd_raw | If y is a vector, a (nMC - burn) x k matrix with the raw MCMC chains for the sd parameters; if y is a matrix, a (nMC - burn) x D matrix.
mcmc_weight_raw | A (nMC - burn) x k matrix with the raw MCMC chains for the weight parameters.
C | The N x N co-association matrix built from the MCMC sample.
grr | The vector of cluster membership returned by "diana" or "hclust".
pivots | The vector of indices of pivotal units identified by the selected pivotal criterion.
model | The JAGS/Stan model code. Apply the cat() function for a nice visualization of the code.
stanfit | An object of S4 class stanfit (only if software = "rstan").
Leonardo Egidi [email protected]
Egidi, L., Pappadà, R., Pauli, F. and Torelli, N. (2018). Relabelling in Bayesian Mixture Models by Pivotal Units. Statistics and Computing, 28(4), 957-969.
### Bivariate simulation
## Not run:
N <- 200
k <- 4
D <- 2
nMC <- 1000
M1 <- c(-.5, 8)
M2 <- c(25.5, .1)
M3 <- c(49.5, 8)
M4 <- c(63.0, .1)
Mu <- rbind(M1, M2, M3, M4)
Sigma.p1 <- diag(D)
Sigma.p2 <- 20 * diag(D)
W <- c(0.2, 0.8)
sim <- piv_sim(N = N, k = k, Mu = Mu,
               Sigma.p1 = Sigma.p1, Sigma.p2 = Sigma.p2, W = W)

## rjags (default)
res <- piv_MCMC(y = sim$y, k = k, nMC = nMC)

## rstan
res_stan <- piv_MCMC(y = sim$y, k = k, nMC = nMC, software = "rstan")

# changing priors
res2 <- piv_MCMC(y = sim$y,
                 priors = list(mu_0 = c(1, 1),
                               S2 = matrix(c(0.002, 0, 0, 0.1), 2, 2, byrow = TRUE),
                               S3 = matrix(c(0.1, 0, 0, 0.1), 2, 2, byrow = TRUE)),
                 k = k, nMC = nMC)
## End(Not run)

### Fishery data (bayesmix package)
## Not run:
library(bayesmix)
data(fish)
y <- fish[, 1]
k <- 5
nMC <- 5000
res <- piv_MCMC(y = y, k = k, nMC = nMC)

# changing priors
res2 <- piv_MCMC(y = y,
                 priors = list(kind = "condconjugate",
                               parameter = "priorsRaftery",
                               hierarchical = "tau"),
                 k = k, nMC = nMC)
## End(Not run)
Plot and visualize MCMC outputs and posterior relabelled chains/estimates.
piv_plot(
  y,
  mcmc,
  rel_est,
  par = c("mean", "sd", "weight", "all"),
  type = c("chains", "hist")
)
y | Data vector or matrix.
mcmc | The output of the raw MCMC sampling, as provided by piv_MCMC.
rel_est | Pivotal estimates as provided by piv_rel.
par | The parameters for which estimates are displayed. Choose among: "mean", "sd", "weight", "all".
type | Type of plots required. Choose among: "chains", "hist".
Leonardo Egidi [email protected]
# Fishery data
## Not run:
library(bayesmix)
data(fish)
y <- fish[, 1]
N <- length(y)
k <- 5
nMC <- 5000
res <- piv_MCMC(y = y, k = k, nMC = nMC)
rel <- piv_rel(mcmc = res)
piv_plot(y = y, mcmc = res, rel_est = rel, type = "chains")
piv_plot(y = y, mcmc = res, rel_est = rel, type = "hist")
## End(Not run)
This function performs the pivotal relabelling procedure described in Egidi et al. (2018) and returns the relabelled posterior estimates.
piv_rel(mcmc)
mcmc | The output of the MCMC sampling from piv_MCMC.
Prototypical models in which the label switching problem arises are mixture models, as explained in the Details section of the piv_MCMC function. These models are unidentified with respect to an arbitrary permutation of the labels $\{1, \ldots, k\}$. Relabelling means permuting the labels at each iteration of the Markov chain in such a way that the relabelled chain can be used to draw inferences on component-specific parameters.

We assume here that an MCMC sample is obtained for the posterior distribution of a Gaussian mixture model, for instance via the piv_MCMC function, with a prior distribution which is labelling invariant. Furthermore, suppose that we can find $k$ units, one for each group, which are (pairwise) separated with (posterior) probability one (that is, the posterior probability of any two of them being in the same group is zero). It is then straightforward to use these $k$ units, called pivots in what follows and denoted by the indexes $i_1, \ldots, i_k$, to identify the groups and to relabel the chains: for each MCMC iteration $h = 1, \ldots, H$ ($H$ corresponds to the argument nMC) and group $j = 1, \ldots, k$, set

$$[\mu_j]_h = [\mu_{Z_{i_j}}]_h, \qquad [Z_i]_h = j \ \textrm{ for } i : [Z_i]_h = [Z_{i_j}]_h.$$

The applicability of this strategy is limited by the existence of the pivots, which is not guaranteed. The existence of the pivots is a requirement of the method, meaning that its use is restricted to those chains, or those parts of a chain, for which the pivots are present. First, although the model is based on a mixture of $k$ components, each iteration of the chain may imply a different number of non-empty groups. Let then

$$k_h = \#\{j : [Z_i]_h = j \ \textrm{ for some } i\}$$

be the number of non-empty groups at iteration $h$, where $\#A$ is the cardinality of the set $A$. Hence, the relabelling procedure outlined above can be used only for the subset of the chain for which $k_h = k$; let it be

$$\mathcal{H}_k = \{h : k_h = k\},$$

which corresponds to the argument true.iter given by piv_MCMC. This means that the resulting relabelled chain is not a sample (of size $H$) from the posterior distribution, but a sample (of size $\#\mathcal{H}_k$) from the posterior distribution conditional on there being (exactly) $k$ non-empty groups. Even if $k$ non-empty groups are available, however, there may not be $k$ perfectly separated units. Let us define

$$\mathcal{H}_k^{*} = \{h \in \mathcal{H}_k : \exists\, j', j'' \ \textrm{ s.t. } [Z_{i_{j'}}]_h = [Z_{i_{j''}}]_h\},$$

that is, the set of iterations where (at least) two pivots are in the same group. In order for the pivot method to be applicable, we need to exclude the iterations in $\mathcal{H}_k^{*}$; that is, we can perform the pivot relabelling on $\mathcal{H}_k \setminus \mathcal{H}_k^{*}$, corresponding to the argument final_it.
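After relabelling, the retained iterations and the relabelled chains can be inspected directly; a minimal sketch, assuming res is the output of a piv_MCMC call on univariate data:

rel <- piv_rel(mcmc = res)
rel$final_it            # number of iterations on which relabelling was performed
rel$final_it_p          # their proportion of the valid chain
colMeans(rel$rel_mean)  # relabelled posterior means of the k components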
This function gives the relabelled posterior estimates (both means and medians) obtained from the Markov chains of the MCMC sampling.
final_it | The final number of valid MCMC iterations, as explained in Details.
final_it_p | The proportion of final valid MCMC iterations.
rel_mean | The relabelled chains of the means: a final_it x k matrix for univariate data, a final_it x D x k array for multivariate data.
rel_sd | The relabelled chains of the sd's: a final_it x k matrix for univariate data, a final_it x D matrix for multivariate data.
rel_weight | The relabelled chains of the weights: a final_it x k matrix.
Leonardo Egidi [email protected]
Egidi, L., Pappadà, R., Pauli, F. and Torelli, N. (2018). Relabelling in Bayesian Mixture Models by Pivotal Units. Statistics and Computing, 28(4), 957-969.
# Univariate simulation
## Not run:
N <- 250
nMC <- 2500
k <- 3
p <- rep(1/k, k)
x <- 3
stdev <- cbind(rep(1, k), rep(20, k))
Mu <- seq(-trunc(k/2) * x, trunc(k/2) * x, length = k)
W <- c(0.2, 0.8)
sim <- piv_sim(N = N, k = k, Mu = Mu, stdev = stdev, W = W)
res <- piv_MCMC(y = sim$y, k = k, nMC = nMC)
rel <- piv_rel(mcmc = res)
## End(Not run)

# Bivariate simulation
## Not run:
N <- 200
k <- 3
D <- 2
nMC <- 5000
M1 <- c(-.5, 8)
M2 <- c(25.5, .1)
M3 <- c(49.5, 8)
Mu <- matrix(rbind(M1, M2, M3), c(k, 2))
Sigma.p1 <- diag(D)
Sigma.p2 <- 20 * diag(D)
W <- c(0.2, 0.8)
sim <- piv_sim(N = N, k = k, Mu = Mu,
               Sigma.p1 = Sigma.p1, Sigma.p2 = Sigma.p2, W = W)
res <- piv_MCMC(y = sim$y, k = k, nMC = nMC)
rel <- piv_rel(mcmc = res)
piv_plot(y = sim$y, mcmc = res, rel_est = rel, type = "chains")
piv_plot(y = sim$y, mcmc = res, rel_est = rel, type = "hist")
## End(Not run)
Find pivotal units from a data partition and a co-association matrix C according to three different methods.
piv_sel(C, clusters)
C | A N x N co-association matrix, i.e. a matrix whose elements are the co-occurrences of pairs of units in the same group among H distinct partitions.
clusters | A vector of integers from 1:k indicating a partition of the N units into, say, k groups.
Given a set of $N$ observations $y = (y_1, \ldots, y_N)$ (where each $y_i$ may be a $D$-dimensional vector, $D \geq 1$), consider clustering methods to obtain $H$ distinct partitions into $k$ groups. The matrix C is the co-association matrix, with generic element

$$c_{ij} = \frac{n_{ij}}{H},$$

where $n_{ij}$ is the number of times the pair $(y_i, y_j)$ is assigned to the same cluster among the $H$ partitions. Let $j$ be the group containing the units $\mathcal{J}_j$; the user may choose the pivot $i^{*} \in \mathcal{J}_j$ that maximizes one of the quantities

$$\sum_{p \in \mathcal{J}_j} c_{i^{*} p}$$

or

$$\sum_{p \in \mathcal{J}_j} c_{i^{*} p} - \sum_{p \notin \mathcal{J}_j} c_{i^{*} p}.$$

These methods give the unit that maximizes the global within similarity ("maxsumint") and the unit that maximizes the difference between global within and between similarities ("maxsumdiff"), respectively. Alternatively, we may choose $i^{*} \in \mathcal{J}_j$ which minimizes

$$\sum_{p \notin \mathcal{J}_j} c_{i^{*} p},$$

obtaining the most distant unit among the members that minimize the global dissimilarity between one group and all the others ("minsumnoint"). See the vignette for further details.
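Since the three criteria act directly on C, they can be reimplemented in a few lines for illustration. A minimal sketch, assuming a co-association matrix C and a membership vector clusters as above (pick_pivots is a hypothetical helper mirroring the formulas, not the package's internal code):

# For each group, score every member unit against the three criteria
pick_pivots <- function(C, clusters) {
  diag(C) <- 0  # ignore self-association
  k <- max(clusters)
  sapply(1:k, function(j) {
    members <- which(clusters == j)
    within  <- rowSums(C[members, members, drop = FALSE])   # sum of c_ip, p in group j
    between <- rowSums(C[members, -members, drop = FALSE])  # sum of c_ip, p outside group j
    c(maxsumint   = members[which.max(within)],
      minsumnoint = members[which.min(between)],
      maxsumdiff  = members[which.max(within - between)])
  })
}
# Each column gives the three candidate pivots for one group,
# e.g. pick_pivots(C, km$cluster) with the iris example below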
pivots | A matrix with k rows and three columns containing the indices of the pivotal units selected by "maxsumint", "minsumnoint" and "maxsumdiff", respectively.
Leonardo Egidi [email protected]
Egidi, L., Pappadà, R., Pauli, F. and Torelli, N. (2018). Relabelling in Bayesian Mixture Models by Pivotal Units. Statistics and Computing, 28(4), 957-969.
# Iris data
data(iris)
# select the columns of variables
x <- iris[, 1:4]
N <- nrow(x)
H <- 1000
a <- matrix(NA, H, N)

# Perform H k-means partitions
for (h in 1:H) {
  a[h, ] <- kmeans(x, centers = 3)$cluster
}

# Build the co-association matrix
C <- matrix(NA, N, N)
for (i in 1:(N - 1)) {
  for (j in (i + 1):N) {
    C[i, j] <- sum(a[, i] == a[, j]) / H
    C[j, i] <- C[i, j]
  }
}

km <- kmeans(x, centers = 3)

# Apply three pivotal criteria to the co-association matrix
ris <- piv_sel(C, clusters = km$cluster)

graphics::plot(iris[, 1], iris[, 2], xlab = "Sepal.Length",
               ylab = "Sepal.Width", col = km$cluster)

# Add the pivots chosen by the maxsumdiff criterion
points(x[ris$pivots[, 3], 1:2], col = 1:3, cex = 2, pch = 8)
Simulate observations from a nested Gaussian mixture model with k pre-specified components under uniform group probabilities 1/k, where each group is in turn drawn from a further level consisting of two subgroups.
piv_sim(
  N,
  k,
  Mu,
  stdev,
  Sigma.p1 = diag(2),
  Sigma.p2 = 100 * diag(2),
  W = c(0.5, 0.5)
)
N | The desired sample size.
k | The desired number of mixture components.
Mu | The input mean vector of length k for univariate Gaussian mixtures; a k x D matrix for multivariate mixtures.
stdev | For univariate mixtures, the k x 2 matrix of the standard deviations for the two subgroups.
Sigma.p1 | The D x D covariance matrix for the first subgroup (multivariate mixtures only). Default is diag(2).
Sigma.p2 | The D x D covariance matrix for the second subgroup (multivariate mixtures only). Default is 100 * diag(2).
W | The vector for the mixture weights of the two subgroups.
The function allows simulating values from a double (nested) univariate Gaussian mixture:

$$(Y_i \mid Z_i = j) \sim \sum_{s=1}^{2} p_{s}\, \mathcal{N}(\mu_j, \sigma^2_{js}),$$

or from a multivariate nested Gaussian mixture:

$$(Y_i \mid Z_i = j) \sim \sum_{s=1}^{2} p_{s}\, \mathcal{N}_D(\boldsymbol{\mu}_j, \Sigma_s),$$

where $\sigma^2_{js}$ is the variance for group $j$ and subgroup $s$ (stdev is the argument for specifying the k x 2 matrix of standard deviations for univariate mixtures); $\Sigma_s$ is the covariance matrix for subgroup $s$, $s = 1, 2$, where the two matrices are specified by Sigma.p1 and Sigma.p2, respectively; $\mu_j$ and $\boldsymbol{\mu}_j$ are the mean input vector and matrix, respectively, specified by the argument Mu; W is the vector of dimension 2 containing the subgroup weights $(p_1, p_2)$.
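The nested structure can be made concrete with a direct two-stage draw; a minimal sketch of the univariate sampling scheme described above (this mirrors the generative model, not the package's internal code; the values are illustrative):

# One nested draw: pick a group uniformly, then a subgroup with weights W,
# then sample from the corresponding normal component
k <- 3
Mu <- c(-3, 0, 3)                      # group means
stdev <- cbind(rep(1, k), rep(20, k))  # k x 2 subgroup standard deviations
W <- c(0.5, 0.5)                       # subgroup weights

group    <- sample(1:k, 1)             # uniform group probabilities 1/k
subgroup <- sample(1:2, 1, prob = W)
y <- rnorm(1, mean = Mu[group], sd = stdev[group, subgroup])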
y | The N x D matrix (or N-dimensional vector) of simulated data.
true.group | A vector of integers from 1:k indicating the values of the latent group allocations.
subgroups | A matrix indicating the subgroup allocation of each simulated unit within its group.
# Bivariate mixture simulation with three components
N <- 2000
k <- 3
D <- 2
M1 <- c(-45, 8)
M2 <- c(45, .1)
M3 <- c(100, 8)
Mu <- rbind(M1, M2, M3)
Sigma.p1 <- diag(D)
Sigma.p2 <- 20 * diag(D)
W <- c(0.2, 0.8)
sim <- piv_sim(N = N, k = k, Mu = Mu,
               Sigma.p1 = Sigma.p1, Sigma.p2 = Sigma.p2, W = W)
graphics::plot(sim$y, xlab = "y[,1]", ylab = "y[,2]")