Title: | Anthology of Mixture Analysis Tools |
---|---|
Description: | Fits finite Bayesian mixture models with a random number of components. The MCMC algorithm implemented is based on point processes as proposed by Argiento and De Iorio (2019) <arXiv:1904.09733> and offers a more computationally efficient alternative to reversible jump. Different mixture kernels can be specified: univariate Gaussian, multivariate Gaussian, univariate Poisson, and multivariate Bernoulli (latent class analysis). For the parameters characterising the mixture kernel, we specify conjugate priors, with possibly user specified hyper-parameters. We allow for different choices for the prior on the number of components: shifted Poisson, negative binomial, and point masses (i.e. mixtures with fixed number of components). |
Authors: | Priscilla Ong [aut, edt], Raffaele Argiento [aut], Bruno Bodin [aut, cre], Maria De Iorio [aut] |
Maintainer: | Bruno Bodin <[email protected]> |
License: | MIT + file LICENSE |
Version: | 1.1.0 |
Built: | 2024-12-30 08:25:15 UTC |
Source: | CRAN |
Given an AM_mcmc_output
object, this function returns the clustering matrix.
AM_clustering(fit)
AM_clustering(fit)
fit |
an |
The clustering matrix is an M by n matrix. Each of the M rows represents a clustering of n items using cluster labels. Items i and j are in the same cluster if fit[m,i] == fit[m,j] for the mth clustering.
A numeric clustering matrix
fit = AM_demo_uvp_poi()$fit ccm <- AM_clustering(fit)
fit = AM_demo_uvp_poi()$fit ccm <- AM_clustering(fit)
Given an AM_mcmc_output
object, this function returns the co-clustering matrix.
AM_coclustering(fit)
AM_coclustering(fit)
fit |
an |
The co-clustering matrix is produced by the simultaneous clustering of the rows and columns. Each entry denotes the (posterior) probability
that items and
are together. This technique is also known as
bi-clustering and block clustering (Govaert and Nadif 2013), and is useful for understanding the number of clusters in the dataset.
A numeric co-clustering matrix
fit = AM_demo_uvp_poi()$fit ccm <- AM_coclustering(fit)
fit = AM_demo_uvp_poi()$fit ccm <- AM_coclustering(fit)
AM_mcmc_fit
output produced by the multivariate bernoulli modelThis function allows us to generate a sample output of fitting the multivariate Bernoulli model. No arguments are needed to be passed.
The purpose of this function is to serve as a demo for users to understand the model's output, without diving too deep into details. By default,
this demo generates a sample dataset of dimension 500x4, where the MCMC sampler is specified to run for 2000 iterations, with a burn-in of 1000, and a thinning interval of 10. All possible outputs
that can be produced by AM_mcmc_fit
are returned (see return value below).
AM_demo_mvb_poi()
AM_demo_mvb_poi()
A list containing the following items:
the vector (or matrix) containing the synthetic data used to fit the model.
the vector containing the final cluster assignment of each observation.
an AM_mcmc_output
object, which is the typical output of AM_mcmc_fit
.
mvb_output <- AM_demo_mvb_poi()
mvb_output <- AM_demo_mvb_poi()
AM_mcmc_fit
output produced by the multivariate gaussian modelThis function allows us to generate a sample output of fitting the multivariate Gaussian model. No arguments are needed to be passed.
The purpose of this function is to serve as a demo for users to understand the model's output, without diving too deep into details. By default,
this demo generates a sample dataset of dimension 500x2, where the MCMC sampler is specified to run for 2000 iterations, with a burn-in of 1000, and a thinning interval of 10. All possible outputs
that can be produced by AM_mcmc_fit
are returned (see return value below).
AM_demo_mvn_poi()
AM_demo_mvn_poi()
A list containing the following items:
the vector (or matrix) containing the synthetic data used to fit the model.
the vector containing the final cluster assignment of each observation.
an AM_mcmc_output
object, which is the typical output of AM_mcmc_fit
.
mvn_output <- AM_demo_mvn_poi()
mvn_output <- AM_demo_mvn_poi()
AM_mcmc_fit
output produced by the univariate Gaussian modelThis function allows us to generate a sample output of fitting the univariate gaussian model. No arguments are needed to be passed.
The purpose of this function is to serve as a demo for users to understand the model's output, without diving too deep into details. By default,
this demo generates a sample dataset of dimension 500x1, where the MCMC sampler is specified to run for 2000 iterations, with a burn-in of 1000, and a thinning interval of 10. All possible outputs
that can be produced by AM_mcmc_fit
are returned (see return value below).
AM_demo_uvn_poi()
AM_demo_uvn_poi()
A list containing the following items:
the vector (or matrix) containing the synthetic data used to fit the model.
the vector containing the final cluster assignment of each observation.
an AM_mcmc_output
object, which is the typical output of AM_mcmc_fit
.
mvn_output <- AM_demo_uvn_poi()
mvn_output <- AM_demo_uvn_poi()
AM_mcmc_fit
output produced by the univariate Poisson modelThis function allows us to generate a sample output of fitting the univariate poisson model. No arguments are needed to be passed.
The purpose of this function is to serve as a demo for users to understand the model's output, without diving too deep into details. By default,
this demo generates a sample dataset of dimension 500x1, where the MCMC sampler is specified to run for 2000 iterations, with a burn-in of 1000, and a thinning interval of 10. All possible outputs
that can be produced by AM_mcmc_fit
are returned (see return value below).
AM_demo_uvp_poi()
AM_demo_uvp_poi()
A list containing the following items:
the vector (or matrix) containing the synthetic data used to fit the model.
the vector containing the final cluster assignment of each observation.
an AM_mcmc_output
object, which is the typical output of AM_mcmc_fit
.
mvn_output <- AM_demo_uvn_poi()
mvn_output <- AM_demo_uvn_poi()
This function computes the hyperparameters of a Normal Inverse-Gamma distribution using an empirical Bayes approach. More information about how these hyperparameters are determined can be found here: Bayes and empirical Bayes: do they merge? (Petrone et al. 2012).
AM_emp_bayes_uninorm(y, scEmu = 1, scEsig2 = 3, CVsig2 = 3)
AM_emp_bayes_uninorm(y, scEmu = 1, scEsig2 = 3, CVsig2 = 3)
y |
The data y. If y is univariate, a vector is expected. Otherwise, y should be a matrix. |
scEmu |
a positive value (default=1) such that marginally E( |
scEsig2 |
a positive value (default=3) such that marginally E( |
CVsig2 |
The coefficient of variation of |
an object of class AM_mix_hyperparams
, in which hyperparameters m0
, k0
,
nu0
and sig02
are specified. To understand the usage of these hyperparameters, please refer to
AM_mix_hyperparams_uninorm
.
AM_mcmc_output
objectGiven an AM_mcmc_output
object, as well as the target variable names,
AM_extract will return a list of the variables of interest.
AM_extract(object, targets, iterations = NULL, debug = FALSE)
AM_extract(object, targets, iterations = NULL, debug = FALSE)
object |
an |
targets |
List of variables to extract (ie. K, M, mu). |
iterations |
Can specify particular iterations to extracts, NULL for all. |
debug |
Activate log to. |
Due to the complexity of AntMAN outputs, AM_mcmc_output
object can be difficult
to handle. The AM_extract function eases access of particular variables within the
AM_mcmc_output
object. Variables of varying dimension are expected to result from the transdimensional moves. When considering such
variables, the extracted list would correspond to an nx1 list, where n refers to the number of extracted iterations. Each of these nx1 entries consists
of another list of dimension mx1, where m specifies the number of components inferred for that iteration.
a list of variables specified in targets
.
hyperparameter of the weights prior to match
,
where
is user-specifiedOnce a fixed value of the number of components is specified, this function adopts a bisection method to find the value of
such that the induced distribution on the number of clusters is centered around a user specifed value
, i.e. the function uses
a bisection method to solve for
(Argiento and Iorio 2019). The user can provide a lower
and
an upper
bound for the possible values of
. The default values are
and
.
A default value for the tolerance is
. Moreover, after a maximum number of iteration (default is 31), the function
stops warning that convergence has not been reached.
AM_find_gamma_Delta( n, Mstar, Kstar = 6, gam_min = 1e-04, gam_max = 10, tolerance = 0.1 )
AM_find_gamma_Delta( n, Mstar, Kstar = 6, gam_min = 1e-04, gam_max = 10, tolerance = 0.1 )
n |
sample size. |
Mstar |
number of components of the mixture. |
Kstar |
mean number of clusters the user wants to specify. |
gam_min |
lower bound of the interval in which |
gam_max |
upper bound of the interval in which |
tolerance |
Level of tolerance for the method. |
A value of gamma
such that
n <- 82 Mstar <- 12 gam_de <- AM_find_gamma_Delta(n,Mstar,Kstar=6, gam_min=1e-4,gam_max=10, tolerance=0.1) prior_K_de <- AM_prior_K_Delta(n,gam_de,Mstar) prior_K_de%*%1:n
n <- 82 Mstar <- 12 gam_de <- AM_find_gamma_Delta(n,Mstar,Kstar=6, gam_min=1e-4,gam_max=10, tolerance=0.1) prior_K_de <- AM_prior_K_Delta(n,gam_de,Mstar) prior_K_de%*%1:n
hyperparameter of the weights
prior to match
, where
is user-specifiedOnce the prior on the number of mixture components M is assumed to be a Negative Binomial with
parameter r>0
and 0<p<1
, with mean is 1+ r*p/(1-p), this function adopts a bisection method
to find the value of gamma
such that the induced distribution on the number of clusters is centered around a
user specifed value , i.e. the function uses a bisection method to solve for
(Argiento and Iorio 2019).
The user can provide a lower
and an upper
bound for the possible values of
. The default values
are
and
. A defaault value for the tolerance is
. Moreover, after a
maximum number of iteration (default is 31), the function stops warning that convergence has not bee reached.
AM_find_gamma_NegBin( n, r, p, Kstar = 6, gam_min = 0.001, gam_max = 10000, tolerance = 0.1 )
AM_find_gamma_NegBin( n, r, p, Kstar = 6, gam_min = 0.001, gam_max = 10000, tolerance = 0.1 )
n |
The sample size. |
r |
The dispersion parameter |
p |
The probability of failure parameter |
Kstar |
The mean number of clusters the user wants to specify. |
gam_min |
The lower bound of the interval in which |
gam_max |
The upper bound of the interval in which |
tolerance |
Level of tolerance of the method. |
A value of gamma
such that
n <- 82 r <- 1 p <- 0.8571 gam_nb= AM_find_gamma_NegBin(n,r,p,Kstar=6, gam_min=0.001,gam_max=10000, tolerance=0.1) prior_K_nb= AM_prior_K_NegBin(n,gam_nb, r, p) prior_K_nb%*%1:n
n <- 82 r <- 1 p <- 0.8571 gam_nb= AM_find_gamma_NegBin(n,r,p,Kstar=6, gam_min=0.001,gam_max=10000, tolerance=0.1) prior_K_nb= AM_prior_K_NegBin(n,gam_nb, r, p) prior_K_nb%*%1:n
hyperparameter of the weights prior to match
, where
is user-specifiedOnce the prior on the number of mixture components M is assumed to be a Shifted Poisson of parameter Lambda
,
this function adopts a bisection method to find the value of such that the induced distribution
on the number of clusters is centered around a user specifed value
, i.e. the function uses a bisection
method to solve for
(Argiento and Iorio 2019). The user can provide a lower
and an upper
bound for the possible values of
. The default values are
and
.
A defaault value for the tolerance is
. Moreover, after a maximum number of iteration (default is 31),
the function stops warning that convergence has not bee reached.
AM_find_gamma_Pois( n, Lambda, Kstar = 6, gam_min = 1e-04, gam_max = 10, tolerance = 0.1 )
AM_find_gamma_Pois( n, Lambda, Kstar = 6, gam_min = 1e-04, gam_max = 10, tolerance = 0.1 )
n |
The sample size. |
Lambda |
The parameter of the Shifted Poisson for the number of components of the mixture. |
Kstar |
The mean number of clusters the user wants to specify. |
gam_min |
The lower bound of the interval in which |
gam_max |
The upper bound of the interval in which |
tolerance |
Level of tolerance of the method. |
A value of gamma
such that
n <- 82 Lam <- 11 gam_po <- AM_find_gamma_Pois(n,Lam,Kstar=6, gam_min=0.0001,gam_max=10, tolerance=0.1) prior_K_po <- AM_prior_K_Pois(n,gam_po,Lam) prior_K_po%*%1:n
n <- 82 Lam <- 11 gam_po <- AM_find_gamma_Pois(n,Lam,Kstar=6, gam_min=0.0001,gam_max=10, tolerance=0.1) prior_K_po <- AM_prior_K_Pois(n,gam_po,Lam) prior_K_po%*%1:n
Output type of return values from AM_mcmc_parameters
.
The AM_mcmc_fit
function performs a Gibbs sampling in order to estimate the mixture comprising the sample data y
.
The mixture selected must be of a predefined type mix_kernel_hyperparams
(defined with AM_mix_hyperparams_*
functions, where star
*
denotes the chosen kernel).
Additionally, a prior distribution on the number of mixture components
must be specified through mix_components_prior
(generated with AM_mix_components_prior_*
functions, where *
denotes the chosen prior). Similarly,
a prior on the weights of the mixture should be specified through mix_weight_prior
(defined with AM_mix_weights_prior_*
functions). Finally, with mcmc_parameters
, the user sets
the MCMC parameters for the Gibbs sampler (defined with AM_mcmc_parameters
functions).
AM_mcmc_fit( y, mix_kernel_hyperparams, initial_clustering = NULL, init_K = NULL, fixed_clustering = NULL, mix_components_prior = AM_mix_components_prior_pois(), mix_weight_prior = AM_mix_weights_prior_gamma(), mcmc_parameters = AM_mcmc_parameters() )
AM_mcmc_fit( y, mix_kernel_hyperparams, initial_clustering = NULL, init_K = NULL, fixed_clustering = NULL, mix_components_prior = AM_mix_components_prior_pois(), mix_weight_prior = AM_mix_weights_prior_gamma(), mcmc_parameters = AM_mcmc_parameters() )
y |
input data, can be a vector or a matrix. |
mix_kernel_hyperparams |
is a configuration list, defined by *_mix_hyperparams functions, where * denotes the chosen kernel.
See |
initial_clustering |
is a vector CI of initial cluster assignement. If no clustering is specified (either as |
init_K |
initial value for the number of cluster. When this is specified, AntMAN intitialises the clustering assign usng K-means. |
fixed_clustering |
if specified, this is the vector CI containing the cluster assignments. This will remain unchanged for every iteration. |
mix_components_prior |
is a configuration list defined by AM_mix_components_prior_* functions, where * denotes the chosen prior.
See |
mix_weight_prior |
is a configuration list defined by AM_weight_prior_* functions, where * denotes the chosen prior specification.
See |
mcmc_parameters |
is a configuration list defined by AM_mcmc_parameters. See |
If no initial clustering is specified (either as init_K
or init_clustering
),
then every observation is allocated to a different cluster.
If init_K
is specified then AntMAN initialises the clustering through K-means.
Warning: if the user does not specify init_K or initial_cluster, the first steps can be be time-consuming because of default setting of the initial clustering.
The return value is an AM_mcmc_output
object.
AM_mcmc_fit( AM_sample_unipois()$y, AM_mix_hyperparams_unipois (alpha0=2, beta0=0.2), mcmc_parameters = AM_mcmc_parameters(niter=50, burnin=0, thin=1, verbose=0))
AM_mcmc_fit( AM_sample_unipois()$y, AM_mix_hyperparams_unipois (alpha0=2, beta0=0.2), mcmc_parameters = AM_mcmc_parameters(niter=50, burnin=0, thin=1, verbose=0))
This function generates an MCMC parameters list to be used as mcmc_parameters
argument within AM_mcmc_fit
.
AM_mcmc_parameters( niter = 5000, burnin = 2500, thin = 1, verbose = 1, output = c("CI", "K"), parallel = TRUE, output_dir = NULL )
AM_mcmc_parameters( niter = 5000, burnin = 2500, thin = 1, verbose = 1, output = c("CI", "K"), parallel = TRUE, output_dir = NULL )
niter |
Total number of MCMC iterations to be carried out. |
burnin |
Number of iterations to be considered as burn-in. Samples from this burn-in period are discarded. |
thin |
Thinning rate. This argument specifies how often a draw from the posterior distribution is stored after
burnin, i.e. one every -th samples is saved. Therefore, the toral number of MCMC samples saved is
( |
verbose |
A value from 0 to 4, that specifies the desired level of verbosity (0:None, 1:Warnings, 2:Debug, 3:Extras). |
output |
A list of parameters output to return. |
parallel |
Some of the algorithms can be run in parallel using OpenMP. When set to True, this parameter triggers the parallelism. |
output_dir |
Path to an output dir, where to store all the outputs. |
An AM_mcmc_configuration
Object. This is a list to be used as mcmc_parameters
argument with AM_mcmc_fit
.
AM_mcmc_parameters (niter=1000, burnin=10000, thin=50) AM_mcmc_parameters (niter=1000, burnin=10000, thin=50, output=c("CI","W","TAU"))
AM_mcmc_parameters (niter=1000, burnin=10000, thin=50) AM_mcmc_parameters (niter=1000, burnin=10000, thin=50, output=c("CI","W","TAU"))
Similar to AM_mcmc_fit
, the AM_mcmc_refit
function performs a Gibbs sampling in order to estimate
a mixture. However parameters will be reused from a previous result from AM_mcmc_fit
.
AM_mcmc_refit(y, fit, fixed_clustering, mcmc_parameters = AM_mcmc_parameters())
AM_mcmc_refit(y, fit, fixed_clustering, mcmc_parameters = AM_mcmc_parameters())
y |
input data, can be a vector or a matrix. |
fit |
previous output from |
fixed_clustering |
is a vector CI of cluster assignment that will remain unchanged for every iterations. |
mcmc_parameters |
is a configuration list defined by |
In practice this function will call AM_mcmc_fit(y, fixed_clustering = fixed_clustering, ...); with the same parameters as previously specified.
The return value is an AM_mcmc_output
object.
y = AM_sample_unipois()$y fit = AM_mcmc_fit( y , AM_mix_hyperparams_unipois (alpha0=2, beta0=0.2), mcmc_parameters = AM_mcmc_parameters(niter=20, burnin=0, thin=1, verbose=0)) eam = AM_coclustering(fit) cluster = AM_salso(eam, "binder") refit = AM_mcmc_refit(y , fit, cluster, mcmc_parameters = AM_mcmc_parameters(niter=20, burnin=0, thin=1, verbose=0));
y = AM_sample_unipois()$y fit = AM_mcmc_fit( y , AM_mix_hyperparams_unipois (alpha0=2, beta0=0.2), mcmc_parameters = AM_mcmc_parameters(niter=20, burnin=0, thin=1, verbose=0)) eam = AM_coclustering(fit) cluster = AM_salso(eam, "binder") refit = AM_mcmc_refit(y , fit, cluster, mcmc_parameters = AM_mcmc_parameters(niter=20, burnin=0, thin=1, verbose=0));
Object returned by AM_mix_components_prior_*
.
AM_mix_components_prior_dirac
,
AM_mix_components_prior_negbin
,
AM_mix_components_prior_pois
Generate a configuration object that assigns a Point mass prior to the number of mixture components.
This is the simplest option and it requires users to specify a value
such that
.
AM_mix_components_prior_dirac(Mstar)
AM_mix_components_prior_dirac(Mstar)
Mstar |
Fixed value |
An AM_mix_components_prior
object. This is a configuration list to be used as mix_components_prior
argument for AM_mcmc_fit
.
AM_mix_components_prior_dirac (Mstar=3)
AM_mix_components_prior_dirac (Mstar=3)
This generates a configuration object for a Shifted Negative Binomial prior on the number of mixture components such that
The hyperparameters (probability of success) and
(size) can either be fixed using
r
and p
or assigned appropriate prior distributions.
In the latter case, we assume and
. In AntMAN we assume the following
parametrization of the Gamma density:
AM_mix_components_prior_negbin( a_R = NULL, b_R = NULL, a_P = NULL, b_P = NULL, R = NULL, P = NULL, init_R = NULL, init_P = NULL )
AM_mix_components_prior_negbin( a_R = NULL, b_R = NULL, a_P = NULL, b_P = NULL, R = NULL, P = NULL, init_R = NULL, init_P = NULL )
a_R |
The shape parameter |
b_R |
The rate parameter |
a_P |
The parameter |
b_P |
The parameter |
R |
It allows to fix |
P |
It allows to fix |
init_R |
The initial value of |
init_P |
The inivial value of |
If no arguments are provided, the default is .
Additionally, when init_R and init_P are not specified, there are default values:
and
.
An AM_mix_components_prior
object. This is a configuration list to be used as mix_components_prior
argument for AM_mcmc_fit
.
AM_mix_components_prior_negbin (R=1, P=1) AM_mix_components_prior_negbin ()
AM_mix_components_prior_negbin (R=1, P=1) AM_mix_components_prior_negbin ()
This function generates a configuration object for a Shifted Poisson prior on the number of mixture components such that
The hyperparameter can either be fixed using
Lambda
or assigned a prior distribution with
a
and b
.
In AntMAN we assume the following parametrization of the Gamma density:
AM_mix_components_prior_pois(a = NULL, b = NULL, Lambda = NULL, init = NULL)
AM_mix_components_prior_pois(a = NULL, b = NULL, Lambda = NULL, init = NULL)
a |
The shape parameter |
b |
The rate parameter |
Lambda |
It allows to set the hyperparameter |
init |
The initial value for |
If no arguments are provided, the default is a prior distribution with a = 1
and b = 1
.
An AM_mix_components_prior
object. This is a configuration list to be used as mix_components_prior
argument for AM_mcmc_fit
.
components_prior = AM_mix_components_prior_pois (init=3, a=1, b=1)
components_prior = AM_mix_components_prior_pois (init=3, a=1, b=1)
Object type returned by AM_mix_hyperparams_*
commands.
AM_mix_hyperparams_unipois
, AM_mix_hyperparams_uninorm
, AM_mix_hyperparams_multiber
,
AM_mix_hyperparams_multinorm
Generate a configuration object that defines the prior hyperparameters for a mixture of multivariate Bernoulli.
If the dimension of the data is P, then the prior is a product of P independent Beta distributions, Beta(). Therefore,
the vectors of hyperparameters, a0 and b0, are P-dimensional. Default is (a0= c(1,....,1),b0= c(1,....,1)).
AM_mix_hyperparams_multiber(a0, b0)
AM_mix_hyperparams_multiber(a0, b0)
a0 |
The a0 hyperparameters. |
b0 |
The b0 hyperparameters. |
An AM_mix_hyperparams
object. This is a configuration list to be used as mix_kernel_hyperparams
argument for AM_mcmc_fit
.
AM_mix_hyperparams_multiber (a0= c(1,1,1,1),b0= c(1,1,1,1))
AM_mix_hyperparams_multiber (a0= c(1,1,1,1),b0= c(1,1,1,1))
Generate a configuration object that specifies a multivariate Normal mixture kernel, where users can specify the hyperparameters for the conjugate prior of the multivariate
Normal mixture. We assume that the data are d-dimensional vectors , where
are i.i.d
Normal random variables with mean
and covariance matrix
.
The conjugate prior is
where mu0
corresponds to ,
ka0
corresponds to ,
nu0
to , and
Lam0
to .
AM_mix_hyperparams_multinorm(mu0 = NULL, ka0 = NULL, nu0 = NULL, Lam0 = NULL)
AM_mix_hyperparams_multinorm(mu0 = NULL, ka0 = NULL, nu0 = NULL, Lam0 = NULL)
mu0 |
The hyperparameter |
ka0 |
The hyperparameter |
nu0 |
The hyperparameter |
Lam0 |
The hyperparameter |
Default is (mu0=c(0,..,0)
, ka0=1
, nu0=Dim+2
, Lam0=diag(Dim))
with Dim
is the dimension of the data y
.
We advise the user to set equal to at least the dimension of the data,
Dim
, plus 2.
An AM_mix_hyperparams
object. This is a configuration list to be used as mix_kernel_hyperparams
argument for AM_mcmc_fit
.
AM_mix_hyperparams_multinorm ()
AM_mix_hyperparams_multinorm ()
Generate a configuration object that specifies a univariate Normal mixture kernel, where users can specify the hyperparameters of the Normal-InverseGamma conjugate prior.
As such, the kernel is a Gaussian distribution with mean and variance
. The prior on
the Normal-InverseGamma:
AM_mix_hyperparams_uninorm(m0, k0, nu0, sig02)
AM_mix_hyperparams_uninorm(m0, k0, nu0, sig02)
m0 |
The |
k0 |
The |
nu0 |
The |
sig02 |
The |
corresponds
m0
,
corresponds
k0
,
corresponds
nu0
, and
corresponds
sig02
.
If hyperparameters are not specified, the default is m0=0
, k0=1
, nu0=3
, sig02=1
.
An AM_mix_hyperparams
object. This is a configuration list to be used as mix_kernel_hyperparams
argument for AM_mcmc_fit
.
#### This example ... data(galaxy) y_uvn = galaxy mixture_uvn_params = AM_mix_hyperparams_uninorm (m0=20.83146, k0=0.3333333, nu0=4.222222, sig02=3.661027) mcmc_params = AM_mcmc_parameters(niter=2000, burnin=500, thin=10, verbose=0) components_prior = AM_mix_components_prior_pois (init=3, a=1, b=1) weights_prior = AM_mix_weights_prior_gamma(init=2, a=1, b=1) fit <- AM_mcmc_fit( y = y_uvn, mix_kernel_hyperparams = mixture_uvn_params, mix_components_prior =components_prior, mix_weight_prior = weights_prior, mcmc_parameters = mcmc_params) summary (fit) plot (fit)
#### This example ... data(galaxy) y_uvn = galaxy mixture_uvn_params = AM_mix_hyperparams_uninorm (m0=20.83146, k0=0.3333333, nu0=4.222222, sig02=3.661027) mcmc_params = AM_mcmc_parameters(niter=2000, burnin=500, thin=10, verbose=0) components_prior = AM_mix_components_prior_pois (init=3, a=1, b=1) weights_prior = AM_mix_weights_prior_gamma(init=2, a=1, b=1) fit <- AM_mcmc_fit( y = y_uvn, mix_kernel_hyperparams = mixture_uvn_params, mix_components_prior =components_prior, mix_weight_prior = weights_prior, mcmc_parameters = mcmc_params) summary (fit) plot (fit)
Generate a configuration object that specifies a univariate Poisson mixture kernel, where users can
specify the hyperparameters of the conjugate Gamma prior, i.e. the kernel is a
and
.
In AntMAN we assume the following
parametrization of the Gamma density:
AM_mix_hyperparams_unipois(alpha0, beta0)
AM_mix_hyperparams_unipois(alpha0, beta0)
alpha0 |
The shape hyperparameter |
beta0 |
The rate hyperparameter |
Note that by default, alpha0=1 and beta0=1.
An AM_mix_hyperparams
object. This is a configuration list to be used as mix_kernel_hyperparams
argument for AM_mcmc_fit
.
AM_mix_hyperparams_unipois (alpha0=2, beta0=0.2)
AM_mix_hyperparams_unipois (alpha0=2, beta0=0.2)
Object type returned by AM_mix_weights_prior_*
commands.
for the Dirichlet mixture weights priorGenerate a configuration object to specify a prior on the hyperparameter for the Dirichlet prior on the
mixture weights.
We assume
. Alternatively, we can fix
to a specific value.
Default is
, where N is the number of observations.
In AntMAN we assume the following
parametrization of the Gamma density:
AM_mix_weights_prior_gamma(a = NULL, b = NULL, gamma = NULL, init = NULL)
AM_mix_weights_prior_gamma(a = NULL, b = NULL, gamma = NULL, init = NULL)
a |
The shape parameter a of the Gamma prior. |
b |
The rate parameter b of the Gamma prior. |
gamma |
It allows to fix |
init |
The init value for |
A AM_mix_weights_prior
object. This is a configuration list to be used as mix_weight_prior
argument for AM_mcmc_fit
.
AM_mix_weights_prior_gamma (a=1, b=1) AM_mix_weights_prior_gamma (a=1, b=1, init=1) AM_mix_weights_prior_gamma (gamma = 3) AM_mix_weights_prior_gamma ()
AM_mix_weights_prior_gamma (a=1, b=1) AM_mix_weights_prior_gamma (a=1, b=1, init=1) AM_mix_weights_prior_gamma (gamma = 3) AM_mix_weights_prior_gamma ()
Given an AM_mcmc_output
object, this function produces the autocorrelation function bars describing the MCMC results. AM_plot_chaincor makes use of bayesplot’s
plotting function mcmc_acf_bar (Gabry et al. 2019).
AM_plot_chaincor(x, tags = NULL, lags = NULL, title = "MCMC Results")
AM_plot_chaincor(x, tags = NULL, lags = NULL, title = "MCMC Results")
x |
An |
tags |
A list of variables to consider. This function only produces meaningful plots for variables that have fixed dimension across the draws. If not specified, plots pertaining to M and K will be produced.
This function is built upon bayesplot's |
lags |
An integer specifying the number of lags to plot. If no value is specified, the default number of lags shown is half the total number of iterations. |
title |
Title for the plot. |
A ggplot object.
AM_mcmc_output
objectGiven an AM_mcmc_output
object, AM_plot_density plots the posterior density of the specified variables of interest. AM_plot_density makes use
of bayesplot's plotting function mcmc_areas (Gabry et al. 2019).
AM_plot_density(x, tags = NULL, title = "MCMC Results")
AM_plot_density(x, tags = NULL, title = "MCMC Results")
x |
An |
tags |
A list of variables to consider. This function only produces meaningful plots for variables that have fixed dimension across the draws. |
title |
Title for the plot. |
a ggplot object visualising the posterior density of the specified variables.
Given an AM_mcmc_output
object, and the data the model was fit on, this function will produce a cluster frequency plot for the multivariate bernoulli model.
AM_plot_mvb_cluster_frequency( fit, y, x_lim_param = c(0.8, 7.2), y_lim_param = c(0, 1) )
AM_plot_mvb_cluster_frequency( fit, y, x_lim_param = c(0.8, 7.2), y_lim_param = c(0, 1) )
fit |
An |
y |
A matrix containing the y observations which produced fit. |
x_lim_param |
A vector with two elements describing the plot's x_axis scale, e.g. c(0.8, 7.2). |
y_lim_param |
A vector with two elements describing the plot's y_axis scale, e.g. c(0, 1). |
No return value. Called for side effects.
AM_mcmc_output
scatterplot matrixvisualise a matrix of plots describing the MCMC results. This function is built upon GGally's plotting function ggpairs (Schloerke et al. 2021).
AM_plot_pairs(x, tags = NULL, title = "MCMC Results")
AM_plot_pairs(x, tags = NULL, title = "MCMC Results")
x |
an |
tags |
A list of variables to consider for plotting. This function only produces meaningful plots for variables that have fixed dimension across the draws. If not specified, plots pertaining to M and K will be produced. |
title |
Title for the plot. |
Same as ggpairs function, a ggmatrix object that if called, will print.
AM_mcmc_output
objectGiven an AM_mcmc_output
object, AM_plot_pmf plots the posterior probability mass function of the specified variables.
AM_plot_pmf(x, tags = NULL, title = "MCMC Results")
AM_plot_pmf(x, tags = NULL, title = "MCMC Results")
x |
An |
tags |
A list of variables to consider. If not specified, the pmf of both M and K will be plotted. |
title |
Title for the plot. |
No return value. Called for side effects.
Given an AM_mcmc_output
object, this function will produce an image of the Similarity Matrix.
AM_plot_similarity_matrix(x, loss, ...)
AM_plot_similarity_matrix(x, loss, ...)
x |
An |
loss |
Loss function to minimise. Specify either "VI" or "binder". If not specified, the default loss to minimise is "binder". |
... |
All additional parameters wil lbe pass to the image command. |
No return value. Called for side effects.
AM_mcmc_output
objectGiven an AM_mcmc_output
object, AM_plot_traces
visualises the traceplots of the specified variables involved in the MCMC inference.
AM_plot_traces is built upon bayesplot's mcmc_trace (Gabry et al. 2019).
AM_plot_traces(x, tags = NULL, title = "MCMC Results")
AM_plot_traces(x, tags = NULL, title = "MCMC Results")
x |
An |
tags |
A list of variables to consider. This function only produces meaningful plots for variables that have fixed dimension across the draws. If not specified, plots pertaining to M and K will be produced. |
title |
Title for the plot |
No return value. Called for side effects.
Given an object of class AM_mcmc_fit
, AM_plot_values visualises the interval estimates of the specified variables involved in the MCMC inference.
AM_plot_values is built upon bayesplot's mcmc_intervals (Gabry et al. 2019).
AM_plot_values(x, tags = NULL, title = "MCMC Results")
AM_plot_values(x, tags = NULL, title = "MCMC Results")
x |
An |
tags |
A list of variables to consider. This function only produces meaningful plots for variables that have fixed dimension across the draws. If not specified, plots pertaining to M and K will be produced. |
title |
Title for the plot. |
No return value. Called for side effects.
Object type returned by AM_prior_*
commands.
AM_prior_K_Delta
, AM_prior_K_Pois
, AM_prior_K_NegBin
This function computes the prior on the number of clusters, i.e. occupied components of the mixture for a Finite Dirichlet process
when the prior on the component-weights of the mixture is a Dirichlet with parameter gamma
(i.e. when unnormalised weights
are distributed as Gamma(,1)). This function can be used when the number of components is fixed to
, i.e.
a Dirac prior assigning mass only to
is assumed. See (Argiento and Iorio 2019) There are no default values.
AM_prior_K_Delta(n, gamma, Mstar)
AM_prior_K_Delta(n, gamma, Mstar)
n |
The sample size. |
gamma |
The |
Mstar |
The number of component of the mixture. |
an AM_prior
object, that is a vector of length n, reporting the values V(n,k)
for k=1,...,n
.
n <- 82 gam_de <- 0.1743555 Mstar <- 12 prior_K_de <- AM_prior_K_Delta(n,gam_de, Mstar) plot(prior_K_de)
n <- 82 gam_de <- 0.1743555 Mstar <- 12 prior_K_de <- AM_prior_K_Delta(n,gam_de, Mstar) plot(prior_K_de)
This function computes the prior on the number of clusters, i.e. occupied component of the mixture for a Finite Dirichlet process when the
prior on the component-weights of the mixture is a Dirichlet with parameter gamma
(i.e. when unnormalized weights are distributed as
Gamma(,1)). This function can be used when the prior on the number of components is Negative Binomial with parameter
and
, with mean
. See (Argiento and Iorio 2019) for more details.
AM_prior_K_NegBin(n, gamma, r, p)
AM_prior_K_NegBin(n, gamma, r, p)
n |
The sample size. |
gamma |
The |
r |
The dispersion parameter |
p |
The probability of failure parameter |
There are no default values.
an AM_prior
object, that is a vector of length n, reporting the values V(n,k)
for k=1,...,n
.
n <- 50 gamma <- 1 r <- 0.1 p <- 0.91 gam_nb <- 0.2381641 prior_K_nb <- AM_prior_K_NegBin(n,gam_nb,r,p) plot(prior_K_nb)
n <- 50 gamma <- 1 r <- 0.1 p <- 0.91 gam_nb <- 0.2381641 prior_K_nb <- AM_prior_K_NegBin(n,gam_nb,r,p) plot(prior_K_nb)
This function computes the prior on the number of clusters, i.e. occupied components of the mixture for a Finite Dirichlet process when the prior on the component-weights of the mixture is a
Dirichlet with parameter gamma
(i.e. when unnormalized weights are distributed as Gamma(,1)). This function can be used when the prior on the number of
components is Shifted Poisson of parameter
Lambda
. See (Argiento and Iorio 2019) for more details.
AM_prior_K_Pois(n, gamma, Lambda)
AM_prior_K_Pois(n, gamma, Lambda)
n |
The sample size. |
gamma |
The |
Lambda |
The |
There are no default values.
an AM_prior
object, that is a vector of length n, reporting the values of the prior on the number of clusters induced by the prior on M
and w
, i.e. p^*_k
for k=1,...,n
. See (Argiento and Iorio 2019) for more details.
n <- 82 Lambda <- 10 gam_po <- 0.1550195 prior_K_po <- AM_prior_K_Pois(n,gam_po,Lambda) plot(prior_K_po)
n <- 82 Lambda <- 10 gam_po <- 0.1550195 prior_K_po <- AM_prior_K_Pois(n,gam_po,Lambda) plot(prior_K_po)
Heuristic partitioning to minimise the expected loss function
with respect to a given expected adjacency matrix. This function is built upon R-package salso's implementation of the
salso
function. See salso (Dahl et al. 2021) for more details.
AM_salso( eam, loss, maxNClusters = 0, nRuns = 16, maxZealousAttempts = 10, probSequentialAllocation = 0.5, nCores = 0 )
AM_salso( eam, loss, maxNClusters = 0, nRuns = 16, maxZealousAttempts = 10, probSequentialAllocation = 0.5, nCores = 0 )
eam |
a co-clustering/ clustering matrix. See salso for more information on which matrix to use. |
loss |
the recommended loss functions to be used are the "binder" or "VI". However, other loss functions that are supported can be found in the R-package salso's salso function. |
maxNClusters |
Maximum number of clusters to be considered. The actual number of clusters searched may be lower. Default is 0. |
nRuns |
Number of runs to try. |
maxZealousAttempts |
Maximum number of tries for zealous updates. See salso for more information. |
probSequentialAllocation |
The probability of using sequential allocation instead of random sampling via sample(1:K,ncol(x),TRUE), where K is maxNClusters. Default is 0.5. See salso for more information. argument. |
nCores |
Number of CPU cores to engage. Default is 0. |
A numeric vector describing the estimated partition. The integer values represent the cluster labels of each item respectively.
David B. Dahl and Devin J. Johnson and Peter Müller (2021). salso: Search Algorithms and Loss Functions for Bayesian Clustering. R package version 0.2.15.
AntMAN: Anthology of Mixture ANalysis tools AntMan is an R package fitting Finite Bayesian Mixture models with a random number of components. The MCMC algorithm behind AntMAN is based on point processes and offers a more computationally efficient alternative to the Reversible Jump. Different mixture kernels can be specified: univariate Gaussian, multivariate Gaussian, univariate Poisson, and multivariate Bernoulli (Latent Class Analysis). For the parameters characterising the mixture kernel, we specify conjugate priors, with possibly user specified hyper-parameters. We allow for different choices on the prior on the number of components: Shifted Poisson, Negative Binomial, and Point Masses (i.e. mixtures with fixed number of components).
The main function of the AntMAN package is AM_mcmc_fit
. AntMAN performs a Gibbs sampling in order to fit,
in a Bayesian framework, a mixture model of a predefined type mix_kernel_hyperparams
given a sample y
.
Additionally AntMAN allows the user to specify a prior on the number of components mix_components_prior
and on the weights mix_weight_prior
of the mixture.
MCMC parameters mcmc_parameters
need to be given as argument for the Gibbs sampler (number of interations, burn-in, ...).
Initial values for the number of clusters (init_K
) or a specific clustering allocation (init_clustering
) can also be user-specified.
Otherwise, by default, we initialise each element of the sample y
to a different cluster allocation. This choice can be computationally inefficient.
For example, in order to identify clusters over a population of patients given a set of medical assumptions:
mcmc = AM_mcmc_parameters(niter=20000) mix = AM_mix_hyperparams_multiber () fit = AM_mcmc_fit (mix, mcmc) summary (fit)
In this example AM_mix_hyperparams_multiber
is one of the possible mixtures to use.
AntMAN currently support four different mixtures :
AM_mix_hyperparams_unipois(alpha0, beta0) AM_mix_hyperparams_uninorm(m0, k0, nu0, sig02) AM_mix_hyperparams_multiber(a0, b0) AM_mix_hyperparams_multinorm(mu0, ka0, nu0, Lam0)
Additionally, three types of kernels on the prior number of components are available:
AM_mix_components_prior_pois() AM_mix_components_prior_negbin() AM_mix_components_prior_dirac()
For example, in the context of image segmentation, if we know that there are 10 colours present, a prior dirac can be used :
mcmc = AM_mcmc_parameters(niter=20000) mix = AM_mix_hyperparams_multinorm () prior_component = AM_mix_components_prior_dirac(10) # 10 colours present fit = AM_mcmc_fit (mix, prior_component, mcmc) summary (fit)
Picture of brain activities from a teenager consuming drugs.
brain
brain
A list that contains dim
a (W:width,H:height) pair, and pic
a data frame (W*H pixels image in RGB format).
https://www.flickr.com/photos/nida-nih/29741916012
Crowley TJ, Dalwani MS, Mikulich-Gilbertson SK, Young SE, Sakai JT, Raymond KM, et al. (2015) Adolescents' Neural Processing of Risky Decisions: Effects of Sex and Behavioral Disinhibition. PLoS ONE 10(7): e0132322. doi:10.1371/journal.pone.0132322
data(brain)
data(brain)
The carcinoma data from Agresti (2002, 542) consist of seven dichotomous variables representing the ratings by seven pathologists of 118 slides on the presence or absence of carcinoma. The purpose of studying this data is to model "interobserver agreement" by examining how subjects might be divided into groups depending upon the consistency of their diagnoses.
carcinoma
carcinoma
A data frame with 118 rows and 7 variables (from A to G).
Agresti A (2002). Categorical Data Analysis. John Wiley & Sons, Hoboken.
data(carcinoma)
data(carcinoma)
This data set considers the physical information of velocities (10^3 km/second) for 82 galaxies reported by Roeder (1990). These are drawn from six well-separated conic sections of the Corona Borealis region.
galaxy
galaxy
A data frame with X rows and Y variables.
A numeric vector giving the speed of galaxies (1000*(km/second))
Roeder, K. (1990). Density estimation with confidence sets exemplified by superclusters and voids in the galaxies, Journal of the American Statistical Association, 85: 617-624.
data(galaxy)
data(galaxy)
Given an AM_mcmc_output
object, this function plots some useful information about the MCMC results
regarding and
. Besides the PMFs, some of the diagnostic plots of the MCMC chain are visualised.
## S3 method for class 'AM_mcmc_output' plot(x, ...)
## S3 method for class 'AM_mcmc_output' plot(x, ...)
x |
an |
... |
all additional parameters are ignored. |
NULL. Called for side effects.
plot the prior on the number of clusters for a given AM_prior
object.
## S3 method for class 'AM_prior' plot(x, ...)
## S3 method for class 'AM_prior' plot(x, ...)
x |
an |
... |
all additional parameters are ignored. |
NULL. Called for side effects.
Usage frequency of the word "said" in the Brown corpus
said
said
A list with 500 observations on the frequency of said in different texts.
https://www.kaggle.com/nltkdata/brown-corpus
Francis, W., and Kucera, H. (1982) Frequency Analysis of English Usage, Houghton Mifflin Company, Boston.
data(said)
data(said)
Given an AM_mcmc_configuration
object, this function prints the summary information
of the specified mcmc configuration.
## S3 method for class 'AM_mcmc_configuration' summary(object, ...)
## S3 method for class 'AM_mcmc_configuration' summary(object, ...)
object |
an |
... |
all additional parameters are ignored |
NULL. Called for side effects.
Given an AM_mcmc_output
object, this function prints the summary information
pertaining to the given model output.
## S3 method for class 'AM_mcmc_output' summary(object, ...)
## S3 method for class 'AM_mcmc_output' summary(object, ...)
object |
a |
... |
all additional parameters are ignored |
NULL. Called for side effects.
Given an AM_mix_components_prior
object, this function prints the summary information
of the specified prior on the number of components.
## S3 method for class 'AM_mix_components_prior' summary(object, ...)
## S3 method for class 'AM_mix_components_prior' summary(object, ...)
object |
an |
... |
all additional parameters are ignored. |
NULL. Called for side effects.
Given an AM_mix_hyperparams
object, this function prints the summary information
of the specified mixture hyperparameters.
## S3 method for class 'AM_mix_hyperparams' summary(object, ...)
## S3 method for class 'AM_mix_hyperparams' summary(object, ...)
object |
an |
... |
all additional parameters are ignored. |
NULL. Called for side effects.
Given an AM_mix_weights_prior
object, this function prints the summary information
of the specified mixture weights prior.
## S3 method for class 'AM_mix_weights_prior' summary(object, ...)
## S3 method for class 'AM_mix_weights_prior' summary(object, ...)
object |
an |
... |
all additional parameters are ignored. |
NULL. Called for side effects.
Given an AM_prior
object, this function prints the summary information of the specified prior on the number of clusters.
## S3 method for class 'AM_prior' summary(object, ...)
## S3 method for class 'AM_prior' summary(object, ...)
object |
an |
... |
all additional parameters are ignored. |
NULL. Called for side effects.