Title: | Search Algorithms and Loss Functions for Bayesian Clustering |
---|---|
Description: | The SALSO algorithm is an efficient randomized greedy search method to find a point estimate for a random partition based on a loss function and posterior Monte Carlo samples. The algorithm is implemented for many loss functions, including the Binder loss and a generalization of the variation of information loss, both of which allow for unequal weights on the two types of clustering mistakes. Efficient implementations are also provided for Monte Carlo estimation of the posterior expected loss of a given clustering estimate. See Dahl, Johnson, Müller (2022) <doi:10.1080/10618600.2022.2069779>. |
Authors: | David B. Dahl [aut, cre] , Devin J. Johnson [aut] , Peter Müller [aut], Alex Crichton [ctb] (Rust crates: cfg-if, proc-macro2), Brendan Zabarauskas [ctb] (Rust crate: approx), David B. Dahl [ctb] (Rust crates: dahl-bellnumber, dahl-partition, dahl-salso, roxido, roxido_macro), David Tolnay [ctb] (Rust crates: proc-macro2, quote, syn, unicode-ident), Jim Turner [ctb] (Rust crate: ndarray), Josh Stone [ctb] (Rust crate: autocfg), R. Janis Goldschmidt [ctb] (Rust crate: matrixmultiply), Sean McArthur [ctb] (Rust crate: num_cpus), Stefan Lankes [ctb] (Rust crate: hermit-abi), The Cranelift Project Developers [ctb] (Rust crate: wasi), The CryptoCorrosion Contributors [ctb] (Rust crates: ppv-lite86, rand_chacha), The Rand Project Developers [ctb] (Rust crates: getrandom, rand, rand_chacha, rand_core, rand_pcg), The Rust Project Developers [ctb] (Rust crates: libc, num-bigint, num-complex, num-integer, num-traits, rand, rand_chacha, rand_core), Ulrik Sverdrup "bluss" [ctb] (Rust crate: ndarray), bluss [ctb] (Rust crates: matrixmultiply, rawpointer) |
Maintainer: | David B. Dahl <[email protected]> |
License: | MIT + file LICENSE | Apache License 2.0 |
Version: | 0.3.42 |
Built: | 2024-11-24 14:50:06 UTC |
Source: | CRAN |
These functions compute the Bell number (the number of partitions of a given number of items) or its natural logarithm.
bell(nItems)
lbell(nItems)
nItems | The size of the set {1, 2, ..., nItems}.
A numeric vector of length one giving the Bell number or its natural logarithm.
bell(12)
lbell(300)
all.equal( bell(5), exp(lbell(5)) )
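As an independent cross-check (not part of the package), Bell numbers can also be computed via the Bell-triangle recurrence:

# Sketch: Bell number via the Bell triangle. Each row starts with the last
# entry of the previous row, and each subsequent entry adds its left neighbor
# to the entry above it; the last entry of row n is the nth Bell number.
bell_triangle <- function(n) {
  row <- 1
  for (i in seq_len(n - 1)) {
    newrow <- numeric(i + 1)
    newrow[1] <- row[length(row)]
    for (j in seq_len(i)) newrow[j + 1] <- newrow[j] + row[j]
    row <- newrow
  }
  row[length(row)]
}
all.equal(bell_triangle(6), bell(6))  # both give 203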
This function provides a high-marginal-probability partition of a subset of the items, based on samples from a partition distribution, using the CHiPS greedy search method (Dahl, Page, Barrientos, 2024).
chips(x, threshold = 0, nRuns = 64, intermediateResults = TRUE, allCandidates = FALSE, nCores = 0)
x | A B-by-n matrix, where each of the B rows represents a clustering of n items using cluster labels.
threshold | The minimum marginal probability for the partial partition. Values closer to 1.0 will yield a partition of fewer items and values closer to 0.0 will yield a partition of more items.
nRuns | The number of runs to try, where the best result is returned.
intermediateResults | Should intermediate subset partitions be returned?
allCandidates | Should all the final subset partitions from multiple runs be returned?
nCores | The number of CPU cores to use, i.e., the number of simultaneous runs at any given time. A value of zero indicates to use all cores on the system.
If intermediateResults is FALSE, an integer vector giving the estimated subset partition, encoded using cluster labels with -1 indicating that an item is not allocated. If TRUE, a matrix with the intermediate subset partitions in the rows.
# For examples, use 'nCores = 1' per CRAN rules, but in practice omit this.
data(iris.clusterings)
draws <- iris.clusterings
chips(draws, threshold = 0, nRuns = 1)
chips(draws, nCores = 1)
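As an illustrative sketch (not from the package documentation), a higher threshold yields a smaller, more confidently clustered subset of items, with -1 marking items left unallocated:

# Sketch: request a high-confidence partial partition and inspect coverage.
est <- chips(draws, threshold = 0.9, nRuns = 1, intermediateResults = FALSE, nCores = 1)
allocated <- which(est != -1)  # items that were allocated to a cluster
length(allocated)              # how many of the items were clustered
table(est[allocated])          # cluster sizes among the allocated items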
This function provides a partition to summarize a partition distribution using the draws-based latent structure optimization (DLSO) method, which is also known as the least-squares clustering method (Dahl 2006). The method seeks to minimize an estimation criterion by picking the minimizer among the partitions supplied. The implementation currently supports the minimization of several partition estimation criteria. For details on these criteria, see partition.loss.
dlso(truth, loss = VI(), estimate = NULL)
truth | An integer vector of cluster labels for n items, or a matrix whose rows are samples from a partition distribution.
loss | The loss function to use, as indicated by a string (e.g., "VI", "binder") or a call to one of the loss functions (e.g., VI(), binder(a=1.4)); see partition.loss.
estimate | An integer vector of cluster labels having the same length as the partitions in truth, or a matrix whose rows are candidate partitions. If NULL, the candidates are the rows of truth.
An integer vector giving the estimated partition, encoded using cluster labels.
D. B. Dahl (2006), Model-Based Clustering for Expression Data via a Dirichlet Process Mixture Model, in Bayesian Inference for Gene Expression and Proteomics, Kim-Anh Do, Peter Müller, Marina Vannucci (Eds.), Cambridge University Press.
partition.loss, psm, summary.salso.estimate, salso
data(iris.clusterings)
dlso(iris.clusterings, loss=VI())
dlso(iris.clusterings, loss=binder())
# Compute expected loss using all draws, but pick the best among the first 10.
dlso(iris.clusterings, loss=VI(), estimate=iris.clusterings[1:10,])
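Because dlso picks the candidate with the smallest Monte Carlo expected loss, its result can be reproduced by scoring each candidate explicitly with partition.loss (described later in this document). A sketch:

# Sketch: reproduce dlso by explicitly scoring each candidate partition.
data(iris.clusterings)
candidates <- iris.clusterings[1:10,]
scores <- apply(candidates, 1, function(est) partition.loss(iris.clusterings, est, loss=VI()))
manual <- candidates[which.min(scores),]
all.equal(manual, dlso(iris.clusterings, loss=VI(), estimate=candidates),
          check.attributes=FALSE)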
This function produces a matrix whose rows provide all possible partitions of the set {1, 2, ..., nItems}. These partitions are provided as cluster labels, where two items are in the same subset (i.e., cluster) if their labels are equal.
enumerate.partitions(nItems)
nItems | The size of the set {1, 2, ..., nItems}.
A matrix of integers, where each row is a partition encoded as a vector of cluster labels.
enumerate.partitions(5)
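A quick consistency check against the Bell number:

# The number of partitions of a set of 5 items is bell(5) = 52.
all.equal(nrow(enumerate.partitions(5)), bell(5))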
This function produces a matrix whose rows provide all possible permutations of the set {1, 2, ..., nItems}.
enumerate.permutations(nItems)
nItems | The size of the set {1, 2, ..., nItems}.
A matrix of integers, where each row is a permutation.
enumerate.permutations(5)
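And the analogous consistency check against the factorial:

# The number of permutations of 5 items is factorial(5) = 120.
all.equal(nrow(enumerate.permutations(5)), factorial(5))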
Randomly generated clusterings for the iris dataset.
iris.clusterings
A 1000-by-150 matrix of 1000 randomly generated clusterings of the 150 observations in the iris dataset.
Source: Unknown.
Given two partitions truth and estimate, these functions compute the specified loss when using estimate to estimate truth. Smaller loss values indicate higher concordance between the partitions. These functions also compute a Monte Carlo estimate of the expectation of the specified loss based on samples or a pairwise similarity matrix, and they support computing approximations to the expectation of several losses. Supported criteria are described below. Some criteria only require the pairwise similarity matrix (as computed, for example, by psm) whereas others require samples from a partition distribution. For those criteria that only need the pairwise similarity matrix, posterior samples can still be provided in the truth argument and the pairwise similarity matrix will automatically be computed as needed.
partition.loss(truth, estimate, loss = VI())
binder(truth, estimate, a = 1)
RI(truth, estimate)
omARI(truth, estimate)
omARI.approx(truth, estimate)
ARI(truth, estimate)
VI(truth, estimate, a = 1)
VI.lb(truth, estimate)
NVI(truth, estimate)
ID(truth, estimate)
NID(truth, estimate)
truth | An integer vector of cluster labels for n items, or a matrix whose rows are samples from a partition distribution (in which case a Monte Carlo estimate of the expected loss is computed).
estimate | An integer vector of cluster labels having the same length as truth, or a matrix whose rows are such vectors.
loss | The loss function to use, as indicated by a string (e.g., "VI", "binder") or a call to one of the loss functions (e.g., VI(), binder(a=1.4)).
a | (Only used for Binder and VI loss) The argument a in [0, 2] controls the relative cost of the two types of clustering mistakes described below: the first type costs a and, without loss of generality, the second type costs 2 - a. When used with salso, a may instead be NULL or a list to have it chosen automatically (see the examples there).
The partition estimation criterion can be specified using the loss
argument, which is either a string or a result of calling the associated
functions. These losses are described below:
"binder"
Binder. Whereas high values of the Rand index
between
and
correspond to high concordance between the
partitions, the N-invariant Binder loss
for a partition
in
estimating
is
, meaning that low values
correspond to high concordance between the partitions. This package reports
the N-invariant Binder loss. The original Binder loss equals the N-invariant
Binder loss multiplied by
. Only the pairwise similarity matrix
is required for this loss, but samples can be provided. As originally
discussed by Binder (1978), two mistakes are possible: 1. Placing two items
in separate clusters when in truth they belong to the same cluster, and 2.
Placing two items in the same cluster when in truth they belong to separate
clusters. The default cost of the first mistake is one, but can be specified
with the argument
a
in . Without loss of generality, the cost of
the second mistake is set as
2-a
. For a discussion of general
weights, see Dahl, Johnson, and Müller (2021). For a discussion of the equal
weights case, see also Dahl (2006), Lau and Green (2007), Dahl and Newton
(2007), Fritsch and Ickstadt (2009), and Wade and Ghahramani (2018).
"omARI"
One Minus Adjusted Rand Index. Computes the expectation
of one minus the adjusted Rand index (Hubert and Arabie, 1985). Whereas high
values of the adjusted Rand index between and
correspond
to high concordance between the partitions, the loss associated with the
adjusted Rand index for a partition
in estimating
is one
minus the adjusted Rand index between the partitions, meaning that low values
correspond to high concordance between the partitions. Samples from a
partition distribution are required for this loss. See Fritsch and Ickstadt
(2009).
"omARI.approx"
Approximation of One Minus Adjusted Rand Index.
Computes the first-order approximation of the expectation of one minus the
adjusted Rand index. The adjusted Rand index involves a ratio and the
first-order approximation of the expectation is based on . Only the pairwise similarity matrix is required for this
criterion, but samples can be provided. See Fritsch and Ickstadt (2009).
"VI"
Variation of Information. Computes the expectation of the
(generalized) variation of information loss. Samples from a partition
distribution are required for this loss. See Meilă (2007), Wade and
Ghahramani (2018), and Rastelli and Friel (2018). The original variation of
information of Meilă (2007) has been extended to the generalized variation of
information of Dahl, Johnson, and Müller (2021) to allow for unequal
weighting of two possible mistakes: 1. Placing two items in separate clusters
when in truth they belong to the same cluster, and 2. Placing two items in
the same cluster when in truth they belong to separate clusters. The value
a
controls the cost of the first mistake and defaults to one, but can
be specified with the argument a
in . Without loss of generality,
the cost of the second mistake is controlled by
2-a
. See Dahl,
Johnson, Müller (2021).
"VI.lb"
Lower Bound of the Variation of Information. Computes the lower bound of the expectation of the variation of information loss, where the lower bound is obtained by Jensen's inequality. Only the pairwise similarity matrix is required for this criterion, but samples can be provided. See Wade and Ghahramani (2018).
"NVI"
Normalized Variation of Information. Computes the expectation of the normalized variation of information loss. Samples from a partition distribution are required for this loss. See Vinh, Epps, and Bailey (2010) and Rastelli and Friel (2018).
"ID"
Information Distance. Computes the expectation of the
information distance () loss. Samples from a partition
distribution are required for this loss. See Vinh, Epps, and Bailey (2010).
"NID"
Normalized Information Distance. Computes the expectation of the normalized information distance loss. Samples from a partition distribution are required for this loss. See Vinh, Epps, and Bailey (2010) and Rastelli and Friel (2018).
The functions RI and ARI are convenience functions. Note that:
binder(p1, p2, a=1) = ( 1 - RI(p1, p2) ) * (n-1)/n
omARI(p1, p2) = 1 - ARI(p1, p2)
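Because the Binder criterion depends on the samples only through the pairwise similarity matrix, its Monte Carlo expected loss can be computed directly from the output of psm. The following sketch assumes the N-invariant scaling described above (the sum of pairwise misclassification costs over all pairs i < j, multiplied by 2/n^2); it is an illustration, not the package's implementation:

# Sketch: expected N-invariant Binder loss computed from the pairwise
# similarity matrix alone.
data(iris.clusterings)
p <- psm(iris.clusterings, nCores=1)  # estimated Pr(items i and j co-cluster)
est <- iris.clusterings[1,]
same <- outer(est, est, "==")         # TRUE where the estimate co-clusters i and j
a <- 1.4
costs <- a * p * !same + (2 - a) * (1 - p) * same
diag(costs) <- 0                      # exclude self-pairs
manual <- sum(costs) / length(est)^2  # equals (2/n^2) * sum over pairs i < j
all.equal(manual, binder(iris.clusterings, est, a=1.4), check.attributes=FALSE)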
A numeric vector.
W. M. Rand (1971), Objective Criteria for the Evaluation of Clustering Methods. Journal of the American Statistical Association, 66: 846–850.
D. A. Binder (1978), Bayesian cluster analysis, Biometrika 65, 31-38.
L. Hubert and P. Arabie (1985), Comparing Partitions. Journal of Classification, 2, 193–218.
D. B. Dahl (2006), Model-Based Clustering for Expression Data via a Dirichlet Process Mixture Model, in Bayesian Inference for Gene Expression and Proteomics, Kim-Anh Do, Peter Müller, Marina Vannucci (Eds.), Cambridge University Press.
J. W. Lau and P. J. Green (2007), Bayesian model based clustering procedures, Journal of Computational and Graphical Statistics 16, 526-558.
M. Meilă (2007), Comparing Clusterings - an Information Based Distance. Journal of Multivariate Analysis, 98: 873–895.
D. B. Dahl and M. A. Newton (2007), Multiple Hypothesis Testing by Clustering Treatment Effects, Journal of the American Statistical Association, 102, 517-526.
A. Fritsch and K. Ickstadt (2009), An improved criterion for clustering based on the posterior similarity matrix, Bayesian Analysis, 4, 367-391.
N. X. Vinh, J. Epps, and J. Bailey (2010), Information Theoretic Measures for Clusterings Comparison: Variants, Properties, Normalization and Correction for Chance, Journal of Machine Learning Research, 11, 2837-2854.
S. Wade and Z. Ghahramani (2018), Bayesian cluster analysis: Point estimation and credible balls. Bayesian Analysis, 13:2, 559-626.
R. Rastelli and N. Friel (2018), Optimal Bayesian estimators for latent variable cluster models. Statistics and Computing, 28, 1169-1186.
D. B. Dahl, D. J. Johnson, and P. Müller (2022), Search Algorithms and Loss Functions for Bayesian Clustering, Journal of Computational and Graphical Statistics, 31(4), 1189-1201, doi:10.1080/10618600.2022.2069779.
# For examples, use 'nCores=1' per CRAN rules, but in practice omit this.
data(iris.clusterings)
partitions <- iris.clusterings[1:5,]
all.equal(partition.loss(partitions, partitions, loss=binder(a=1.4)),
          binder(partitions, partitions, a=1.4))
all.equal(partition.loss(partitions, partitions, loss=omARI()),
          omARI(partitions, partitions))
all.equal(partition.loss(partitions, partitions, loss=VI(a=0.8)),
          VI(partitions, partitions, a=0.8))
truth <- iris.clusterings[1,]
estimate <- iris.clusterings[2,]
VI(truth, estimate, a=1.0)
n <- length(truth)
all.equal(binder(truth, estimate), ( 1 - RI(truth, estimate) ) * (n-1) / n)
all.equal(omARI(truth, estimate), 1 - ARI(truth, estimate))
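For intuition on the a = 1 case, the variation of information between two given partitions can be computed from the contingency table using entropies, since VI = H(truth) + H(estimate) - 2 I(truth, estimate) = 2 H(joint) - H(truth) - H(estimate). The base-2 logarithm below is an assumption; compare against VI(truth, estimate, a=1) to confirm the convention used by the package:

# Sketch: variation of information from the joint label frequencies (a = 1).
vi_manual <- function(t, e) {
  joint <- table(t, e) / length(t)  # joint relative frequencies of label pairs
  H <- function(q) { q <- q[q > 0]; -sum(q * log2(q)) }
  2 * H(as.vector(joint)) - H(rowSums(joint)) - H(colSums(joint))
}
truth <- iris.clusterings[1,]
estimate <- iris.clusterings[2,]
vi_manual(truth, estimate)  # compare with VI(truth, estimate, a=1)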
This function produces one of four plots:
1. "heatmap": A heatmap showing the pairwise allocation probabilities that items are clustered together.
2. "mds": A scatter plot using classical multidimensional scaling (also known as principal coordinates analysis) with the exemplar (i.e., the most representative observation) of each cluster emphasized.
3. "pairs": Pairs plots of all the variables with the exemplar (i.e., the most representative observation) of each cluster emphasized.
4. "dendrogram": A dendrogram based on expected partition loss showing the relationships among the clusters, in which pairs of clusters are merged such that the increase in the expectation of the posterior loss is minimized.
## S3 method for class 'salso.summary'
plot(x, type = c("heatmap", "mds", "pairs", "dendrogram")[1], data = NULL, showLabels = TRUE, showIDs = length(x$estimate) <= 50, cexAdjustment = 0.7, ...)
x | An object returned by summary.salso.estimate.
type | A string equal to "heatmap", "mds", "pairs", or "dendrogram".
data | The data from which the partition estimation was obtained. This is required when type="pairs".
showLabels | Should the cluster labels be shown in the plot when type is "mds" or "pairs"?
showIDs | Should the IDs of the items be shown in the plot?
cexAdjustment | Scalar multiplier to adjust text size.
... | Arguments to be passed to methods, such as graphical parameters (see par).
NULL, invisibly.
salso, summary.salso.estimate, cmdscale
# For examples, use 'nCores=1' per CRAN rules, but in practice omit this.
data(iris.clusterings)
draws <- iris.clusterings
est <- salso(draws, nCores=1)
summ <- summary(est)
plot(summ, type="heatmap")
plot(summ, type="mds")
plot(summ, type="pairs", data=iris)
plot(summ, type="dendrogram")
If only one sample is provided, this function computes an adjacency matrix, i.e., a binary matrix whose (i, j) element is one if and only if elements i and j in the partition have the same cluster label. If multiple samples are provided (as rows of the x matrix), this function computes the n-by-n matrix whose (i, j) element gives the relative frequency (i.e., the estimated probability) that items i and j are in the same subset (i.e., cluster). This is the mean of the adjacency matrices of the provided samples.
psm(x, nCores = 0)
x | A B-by-n matrix, where each of the B rows represents a clustering of n items using cluster labels. For the bth clustering, items i and j are in the same cluster if x[b, i] == x[b, j].
nCores | The number of CPU cores to use, i.e., the number of simultaneous runs at any given time. A value of zero indicates to use all cores on the system.
An n-by-n symmetric matrix whose (i, j) element gives the relative frequency that items i and j are in the same subset (i.e., cluster).
# For examples, use 'nCores=1' per CRAN rules, but in practice omit this.
data(iris.clusterings)
partition <- iris.clusterings[1,]
psmatrix <- psm(partition, nCores=1)
psmatrix[1:6, 1:6]
dim(iris.clusterings)
probs <- psm(iris.clusterings, nCores=1)
dim(probs)
probs[1:6, 1:6]
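The description above translates directly into code: the pairwise similarity matrix is the elementwise average of the per-sample adjacency matrices. A minimal sketch for illustration (psm itself is implemented far more efficiently):

# Sketch: average the adjacency matrices of all samples.
adjacency <- function(labels) outer(labels, labels, "==") * 1  # binary matrix
manual <- Reduce(`+`, lapply(seq_len(nrow(iris.clusterings)),
                             function(b) adjacency(iris.clusterings[b,]))) / nrow(iris.clusterings)
all.equal(manual, probs, check.attributes=FALSE)  # 'probs' from the example above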
This function provides a partition to summarize a partition distribution using the SALSO greedy search method (Dahl, Johnson, and Müller, 2022). The implementation currently supports the minimization of several partition estimation criteria. For details on these criteria, see partition.loss.
salso(x, loss = VI(), maxNClusters = 0, nRuns = 16, maxZealousAttempts = 10, probSequentialAllocation = 0.5, nCores = 0, ...)
x | A B-by-n matrix, where each of the B rows represents a clustering of n items using cluster labels, or an n-by-n pairwise similarity matrix (for criteria that only require it).
loss | The loss function to use, as indicated by a string (e.g., "VI", "binder") or a call to one of the loss functions (e.g., VI(), binder(a=1.4)); see partition.loss.
maxNClusters | The maximum number of clusters that can be considered by the optimization algorithm, which has important implications for the interpretability of the resulting clustering and can greatly influence the RAM needed for the optimization algorithm. If the supplied value is zero and x is a matrix of clusterings, the optimization is constrained by the maximum number of clusters among the supplied clusterings; if x is a pairwise similarity matrix, the number of clusters is unconstrained.
nRuns | The number of runs to try, although the actual number may differ for the following reasons: 1. the actual number is a multiple of the number of cores specified by the nCores argument, and 2. the search may stop early if a time limit is reached.
maxZealousAttempts | The maximum number of attempts for zealous updates, in which entire clusters are destroyed and items are sequentially reallocated. While zealous updates may be helpful in optimization, they also take more CPU time which might be better used trying additional runs.
probSequentialAllocation | For the initial allocation, the probability of sequential allocation instead of a uniformly sampled random allocation.
nCores | The number of CPU cores to use, i.e., the number of simultaneous runs at any given time. A value of zero indicates to use all cores on the system.
... | Extra arguments not intended for the end user, including a time threshold in seconds after which the search is curtailed.
An integer vector giving the estimated partition, encoded using cluster labels.
D. B. Dahl, D. J. Johnson, and P. Müller (2022), Search Algorithms and Loss Functions for Bayesian Clustering, Journal of Computational and Graphical Statistics, 31(4), 1189-1201, doi:10.1080/10618600.2022.2069779.
partition.loss, psm, summary.salso.estimate, dlso
# For examples, use 'nCores=1' per CRAN rules, but in practice omit this.
data(iris.clusterings)
draws <- iris.clusterings
salso(draws, loss=VI(), nRuns=1, nCores=1)
salso(draws, loss=VI(a=0.7), nRuns=1, nCores=1)
salso(draws, loss=binder(), nRuns=1, nCores=1)
salso(iris.clusterings, binder(a=NULL), nRuns=4, nCores=1)
salso(iris.clusterings, binder(a=list(nClusters=3)), nRuns=4, nCores=1)
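Unlike dlso, salso searches over partitions beyond those sampled, so its estimate typically achieves an expected loss no larger than that of the best sampled partition. A quick check (a sketch; with nRuns=1 the inequality is expected but not guaranteed):

# Sketch: compare the expected VI loss of the SALSO estimate to the best draw.
est_salso <- salso(draws, loss=VI(), nRuns=1, nCores=1)
est_dlso <- dlso(draws, loss=VI())
partition.loss(draws, est_salso, loss=VI()) <= partition.loss(draws, est_dlso, loss=VI())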
This function assesses the quality of the clusters in a partition estimate. The result can then be plotted with plot.salso.summary. The current implementation of the calculation of these summaries is not terribly efficient and may be improved in the future.
## S3 method for class 'salso.estimate'
summary(object, alternative, orderingMethod = 1, ...)
object | An object returned by the salso function.
alternative | An optional argument specifying an alternative clustering to use instead of that provided by object.
orderingMethod | An integer giving the method to use to order the observations for a heatmap plot. Currently values 1 and 2 are supported.
... | Currently ignored.
A list containing the estimate, the pairwise similarity matrix, the mean pairwise similarity matrix, the score and mean pairwise similarity for each observation, the exemplar observation for each cluster, a dendrogram object, a vector for ordering the observations in the heatmap plot, the size of each cluster, and the number of clusters.
# For examples, use 'nCores=1' per CRAN rules, but in practice omit this.
data(iris.clusterings)
draws <- iris.clusterings
est <- salso(draws, nCores=1)
summ <- summary(est)
plot(summ, type="heatmap")
plot(summ, type="mds")
plot(summ, type="pairs", data=iris)
plot(summ, type="dendrogram")
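The components listed in the Value section can be accessed directly from the returned list; the component name estimate is used in the plot method shown earlier, and the remaining names can be inspected with names():

# Sketch: inspect the components of the summary object.
names(summ)          # the components described in the Value section
table(summ$estimate) # cluster sizes in the point estimate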