Title: Clustering via Quadratic Scoring
Description: Performs tuning of clustering models, methods and algorithms, including the problem of determining an appropriate number of clusters. Validation of cluster analysis results is performed via quadratic scoring using resampling methods, as in Coraggio, L. and Coretto, P. (2023) <doi:10.1016/j.jmva.2023.105181>.
Authors: Luca Coraggio [cre, aut] (Homepage: <https://luca-coraggio.com>), Pietro Coretto [aut] (Homepage: <https://pietro-coretto.github.io>)
Maintainer: Luca Coraggio <[email protected]>
License: GPL (>= 2)
Version: 1.2
Built: 2024-12-06 18:40:31 UTC
Source: CRAN
Data from Tables 1.1 and 1.2 (pp. 5-8) of Flury and Riedwyl (1988). There are six measurements made on 200 Swiss banknotes (the old Swiss 1000-franc bill). The banknotes belong to two classes of equal size: genuine and counterfeit.
data(banknote)
A data.frame of dimension 200x7 with the following variables:

- a factor with classes: genuine, counterfeit
- Length of bill (mm)
- Width of left edge (mm)
- Width of right edge (mm)
- Bottom margin width (mm)
- Top margin width (mm)
- Length of diagonal (mm)
Flury, B. and Riedwyl, H. (1988). Multivariate Statistics: A practical approach. London: Chapman & Hall.
Estimates the expected quadratic score for the clustering solutions provided by a list of candidate models, methods, or algorithmic settings.
bqs(data, methodset, B = 10, type = "smooth", oob = FALSE, ncores = detectCores() - 2, alpha = 0.05, rankby = ifelse(B == 0, "mean", "lq"), boot_na_share = 0.25, savescores = FALSE, saveparams = FALSE)
data: a numeric vector, matrix, or data frame of observations. Rows correspond to observations and columns correspond to variables/features. Categorical variables and NA values are not allowed.

methodset: a list of functions. A function in the list takes the data set as its first argument and returns a clustering solution conforming with the requirements described in Details.

B: an integer, the number of bootstrap replicates.

type: character string specifying the type of score. Possible values are "smooth", "hard" and "both".

oob: logical or character string specifying whether the out-of-bag bootstrap is performed (see Details).

ncores: an integer, it defines the number of cores used for parallel computing (see Details).

alpha: a number in (0,1), the level used for the bootstrap quantiles reported in the ranking (see print.bqs).

rankby: character string specifying how the scored solutions are ranked. Possible values are "mean", "lq" and "1se".

boot_na_share: a numeric value in (0,1). A clustering solution is excluded from the ranking if its share of unsuccessful bootstrap estimations exceeds boot_na_share.

savescores: logical; if TRUE, the individual bootstrap scores are stored in the output.

saveparams: logical; if TRUE, the cluster parameters estimated on each bootstrap sample are stored in the output.
The function implements the estimation and selection of an appropriate clustering based on the methodology proposed in Coraggio and Coretto (2023). In addition, we add the possibility of obtaining score estimates using out-of-bag bootstrap sampling alongside the empirical bootstrap-based estimates proposed in the aforementioned paper. Note that the out-of-bag bootstrap estimates are obtained using the same samples used for the empirical bootstrap; therefore, oob=TRUE adds only a small computational cost.
Choice of B. In theory B should be as large as possible; however, if the list of methods is large and the computational capacity is modest, a large B may require long run times. Coraggio and Coretto (2023) show experiments where changing from B=1000 to B=100 introduces only a marginal increase in variability. B=100 should be considered a lower bound. In the case where one has very large method lists, high-dimensional datasets and demanding methods, a possible strategy to reduce the computational cost is as follows (see the sketch below):

1. Set a small value of B, e.g., B=50 or even less.
2. Analyze the methods' ranking and identify those methods that report score values that are small compared to the top performers.
3. Narrow down the methodset list and repeat the bootstrap estimation with a value of B that is as large as possible relative to the available computational resources.
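A minimal sketch of this two-stage strategy (assuming a data matrix dat and a method list mlist built with mbind; the indices kept in stage 2 are hypothetical, and the method list is treated as a plain R list when subsetting):

# Stage 1: cheap pilot run with a small B
pilot <- bqs(dat, mlist, B = 25, rankby = "mean", ncores = 1)
pilot                        # inspect the ranking

# Stage 2: keep only the apparent top performers and re-estimate
# with a much larger B (indices below are hypothetical)
mlist2 <- mlist[c(1, 3, 7)]
final  <- bqs(dat, mlist2, B = 500, rankby = "lq", ncores = 1)
final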
Parallel computing. Bootstrap sampling is performed using foreach-based parallel computation via the doParallel parallel backend. Note that, depending on the system settings and on how the functions in methodset make use of parallelism and/or multi-threaded computing, increasing ncores may not produce the desired reduction in computing time. For instance, this happens when using linear algebra routines optimized for multi-threaded computing (e.g., OpenBLAS, Intel Math Kernel Library (MKL), and so on). These optimized shared libraries already implement multi-threading, and it is necessary to find the optimal trade-off between distributing processes over physical cores and multi-threading at the logical unit level of the same physical unit. Unfortunately, there is no universal recipe, and much depends on hardware architectures, operating systems, shared libraries, etc. We obtained the best results using OpenBLAS (the tagged serial build) and setting ncores= the number of physical cores.
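One possible way to implement this trade-off is to pin BLAS-level multi-threading to a single thread and distribute bqs over the physical cores (a sketch; the RhpcBLASctl package is an optional helper, not a dependency of this package, and dat and mlist are assumed to exist):

# force single-threaded BLAS so that parallel workers do not oversubscribe
RhpcBLASctl::blas_set_num_threads(1)
# one bqs worker per physical core
res <- bqs(dat, mlist, B = 100,
           ncores = parallel::detectCores(logical = FALSE))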
The methodset argument. The methodset argument accepts as input a function, a list, or the output of the mset functions: mset_user, mset_gmix, mset_kmeans, mset_pam. It is also possible to give any combination of these, concatenated with the mbind function. When passing a function, either as a single element or in a list, it must take the data set as its first argument, and must return in output at least a list named "params", conforming with the return value of clust2params, i.e. a list containing proportion, mean and cov elements representing the estimated clusters' parameters. A minimal example of such a function is sketched below.
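A minimal sketch of a user-defined function conforming to these requirements (single-linkage hierarchical clustering with K=2 is only an illustration; clust2params builds the required "params" element):

my_method <- function(data) {
  # any clustering producing hard labels works here
  cl <- cutree(hclust(dist(data), method = "single"), k = 2)
  # return at least the 'params' element required by bqs()
  list(cluster = cl, params = clust2params(data, cl))
}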
An S3 object of class bqs. Output components are as follows:

smooth: data.frame with the empirical bootstrap estimates of the smooth score (returned if type="smooth" or type="both").

hard: data.frame with the empirical bootstrap estimates of the hard score (returned if type="hard" or type="both").

oob_smooth: data.frame with the out-of-bag bootstrap estimates of the smooth score (returned if oob=TRUE and type="smooth" or "both").

oob_hard: data.frame with the out-of-bag bootstrap estimates of the hard score (returned if oob=TRUE and type="hard" or "both").

best_smooth: clustering produced by the best method, according to the specified rankby criterion, based on the smooth score.

best_hard: clustering produced by the best method, according to the specified rankby criterion, based on the hard score.

best_oob_smooth: clustering produced by the best method, according to the specified rankby criterion, based on the out-of-bag smooth score.

best_oob_hard: clustering produced by the best method, according to the specified rankby criterion, based on the out-of-bag hard score.

data: a list containing information about the input data.

B: number of bootstrap replicates.

methodset: the elements of the input methodset.

rankby: the ranking criterion.

raw: a list that allows tracing the bootstrap sampling in almost every stage.
Coraggio, Luca and Pietro Coretto (2023). Selecting the number of clusters, clustering models, and algorithms. A unifying approach based on the quadratic discriminant score. Journal of Multivariate Analysis, Vol. 196(105181), 1-20. doi:10.1016/j.jmva.2023.105181
mset_user, mset_gmix, mset_kmeans, pam, mbind, clust2params
# load data data("banknote") dat <- banknote[-1] ## set up methods ## see also help('mset_user') and related functions KM <- mset_kmeans(K = 3) GMIX <- mset_gmix(K=3, erc=c(1,100)) # combine tuned methods mlist <- mbind(KM, GMIX) # perform bootstrap # change B and ncores to a much larger value in real problems res <- bqs(dat, mlist, B = 3, type = "both", rankby="lq", ncores = 1, oob = TRUE, savescores = TRUE, saveparams = FALSE) res ## Not run: # The following example is more realistic but may take time # ---------------------------------------------------------- # load data data("banknote") dat <- banknote[-1] # set up kmeans, see help('mset_kmeans') KM <- mset_kmeans(K = 2:5) # set up Gaussian model-based clustering via gmix() GMIX <- mset_gmix(K=2:5, erc=c(1, 50 , 100)) # set up Gaussian model-based clustering via library("mclust") # see examples in help('mset_user') require(mclust) mc_wrapper <- function(data, K, ...){ y <- Mclust(data, G = K, ...) y[["params"]] <- list(proportion = y$parameters$pro, mean = y$parameters$mean, cov = y$parameters$variance$sigma) return(y) } MC <- mset_user(fname = "mc_wrapper", K = 2:5, modelNames = c("EEI", "VVV")) # combine tuned methods mlist <- mbind(KM, GMIX, MC) # perform bootstrap # set 'ncores' to the number of available physical cores res <- bqs(dat, mlist, B = 100, type = "both", rankby="lq", ncores=1, oob = TRUE, savescores = TRUE, saveparams = FALSE) res ## End(Not run)
# load data data("banknote") dat <- banknote[-1] ## set up methods ## see also help('mset_user') and related functions KM <- mset_kmeans(K = 3) GMIX <- mset_gmix(K=3, erc=c(1,100)) # combine tuned methods mlist <- mbind(KM, GMIX) # perform bootstrap # change B and ncores to a much larger value in real problems res <- bqs(dat, mlist, B = 3, type = "both", rankby="lq", ncores = 1, oob = TRUE, savescores = TRUE, saveparams = FALSE) res ## Not run: # The following example is more realistic but may take time # ---------------------------------------------------------- # load data data("banknote") dat <- banknote[-1] # set up kmeans, see help('mset_kmeans') KM <- mset_kmeans(K = 2:5) # set up Gaussian model-based clustering via gmix() GMIX <- mset_gmix(K=2:5, erc=c(1, 50 , 100)) # set up Gaussian model-based clustering via library("mclust") # see examples in help('mset_user') require(mclust) mc_wrapper <- function(data, K, ...){ y <- Mclust(data, G = K, ...) y[["params"]] <- list(proportion = y$parameters$pro, mean = y$parameters$mean, cov = y$parameters$variance$sigma) return(y) } MC <- mset_user(fname = "mc_wrapper", K = 2:5, modelNames = c("EEI", "VVV")) # combine tuned methods mlist <- mbind(KM, GMIX, MC) # perform bootstrap # set 'ncores' to the number of available physical cores res <- bqs(dat, mlist, B = 100, type = "both", rankby="lq", ncores=1, oob = TRUE, savescores = TRUE, saveparams = FALSE) res ## End(Not run)
Ranks the scores of clustering methods estimated via bootstrap.
bqs_rank(bqsol, rankby = "lq", boot_na_share = 0.25)
bqsol: an object of class bqs, as returned by the bqs function.

rankby: character string specifying how the scored solutions are ranked. Possible values are "mean", "lq" and "1se".

boot_na_share: a numeric value in (0,1). A clustering solution is excluded from the ranking if its share of unsuccessful bootstrap estimations exceeds boot_na_share.
An S3 object of class bqs. Output components are those of bqs. See Value in bqs.
Coraggio, Luca and Pietro Coretto (2023). Selecting the number of clusters, clustering models, and algorithms. A unifying approach based on the quadratic discriminant score. Journal of Multivariate Analysis, Vol. 196(105181), 1-20. doi:10.1016/j.jmva.2023.105181
# load data data("banknote") dat <- banknote[-1] ## set up methods ## see also help('mset_user') and related functions KM <- mset_kmeans(K = 3) GMIX <- mset_gmix(K=3, erc=c(1,100)) # combine tuned methods mlist <- mbind(KM, GMIX) # perform bootstrap # change B and ncores to a much larger value in real problems res <- bqs(dat, mlist, B = 3, rankby="lq", ncores=1) res # now change ranking criterion res2 <- bqs_rank(res, rankby="mean") res2
# load data data("banknote") dat <- banknote[-1] ## set up methods ## see also help('mset_user') and related functions KM <- mset_kmeans(K = 3) GMIX <- mset_gmix(K=3, erc=c(1,100)) # combine tuned methods mlist <- mbind(KM, GMIX) # perform bootstrap # change B and ncores to a much larger value in real problems res <- bqs(dat, mlist, B = 3, rankby="lq", ncores=1) res # now change ranking criterion res2 <- bqs_rank(res, rankby="mean") res2
Select solutions from a bqs object based on specified rank and type of score.
bqs_select(bqs_sol, rank = 1, type = "smooth", rankby = NA, boot_na_share = 0.25)
bqs_sol: an object of class bqs, as returned by the bqs or bqs_rank functions.

rank: an integer, the rank of the solutions to select (see Value).

type: a character string specifying the type of quadratic score. Possible values are "smooth" and "hard" (and, when available, their out-of-bag counterparts).

rankby: a character string specifying the criterion used to rank solutions in bqs_sol. Possible values are "mean", "lq", "1se", or NA (see Details).

boot_na_share: a numeric value in (0,1). Clustering solutions in bqs_sol are excluded from the ranking if their share of unsuccessful bootstrap estimations exceeds boot_na_share.
Even if the bqs_sol object is not pre-ranked, the user may specify a ranking criterion via the rankby argument to rank clustering solutions dynamically; this does not alter the ranking stored in bqs_sol. In these instances, the user can also specify boot_na_share, as in bqs_rank, to exclude solutions based on the proportion of unsuccessful bootstrap estimations. If rankby=NA, bqs_sol must be pre-ranked.
A named list of all clustering solutions achieving a type score of rank rank when ranked according to the rankby criterion, or NULL if no such solution is available in the bqs_sol object. List names correspond to methods' names, and each named entry contains the corresponding clustering method in bqs_sol$methodset fitted on bqs_sol$data.
# Load data and set seed
set.seed(123)
data("banknote")
dat <- banknote[-1]

# set up kmeans, see help('mset_kmeans')
KM <- mset_kmeans(K = 2:5)

# set up Gaussian model-based clustering via gmix()
GMIX <- mset_gmix(K = 2:5, erc = c(1, 50, 100))

# combine tuned methods
mlist <- mbind(KM, GMIX)

# perform bootstrap
# set 'ncores' to the number of available physical cores
res <- bqs(dat, mlist, B = 20, type = "both", rankby = NA, ncores = 1,
           oob = TRUE, savescores = TRUE, saveparams = FALSE)

# Methods are not ranked; this will raise an error
try(bqs_select(res, rank = 1))

# Rank methods dynamically
ranked_res <- bqs_select(res, rank = 2, rankby = "lq", boot_na_share = 0.25)
names(ranked_res)
Transforms cluster labels into a list of parameters describing cluster size, mean, and dispersion.
clust2params(data, cluster)
data: a numeric vector, matrix, or data frame of observations. Rows correspond to observations and columns correspond to variables/features. Categorical variables and NA values are not allowed.

cluster: a vector of integers representing cluster labels.
A list containing cluster parameters. Let P = number of variables/features and K = number of clusters. The elements of the list are as follows:

prop: a vector of clusters' proportions;

mean: a matrix of dimension (P x K) containing the clusters' mean parameters;

cov: an array of size (P x P x K) containing the clusters' covariance matrices.
# load data data("banknote") # compute the k-means partition set.seed(2024) cl <- kmeans(banknote[-1], centers = 2, nstart = 1)$cluster # convert k-means hard assignment into cluster parameters clpars <- clust2params(banknote[-1], cl) clpars
# load data data("banknote") # compute the k-means partition set.seed(2024) cl <- kmeans(banknote[-1], centers = 2, nstart = 1)$cluster # convert k-means hard assignment into cluster parameters clpars <- clust2params(banknote[-1], cl) clpars
Fast implementation of the EM algorithm for ML estimation and clustering of Gaussian mixture models with covariance matrix regularization based on eigenvalue ratio constraints.
gmix( data, K = NA, erc = 50, iter.max = 1000, tol = 1e-8, init = "kmed", init.nstart = 25, init.iter.max = 30, init.tol = tol, save_cluster = TRUE, save_params = TRUE, save_taus = FALSE)
data: a numeric vector, matrix, or data frame of observations. Rows correspond to observations and columns correspond to variables/features. Let N = number of observations and P = number of variables/features.

K: the number of mixture components or clusters. It can be left NA.

erc: a numeric value >= 1 setting the eigenvalue ratio constraint (see Details).

iter.max: maximum number of iterations for the EM algorithm.

tol: tolerance for the convergence of the EM algorithm.

init: a character in the set {"kmed", "kmeans", "pam"}, or a user-defined initialization given as a vector of labels, a matrix of posterior weights, or a function (see Details).

init.nstart: number of initial partitions (see Details).

init.iter.max: maximum number of iterations for each run of the k-medians initialization.

init.tol: tolerance for the convergence of each run of the k-medians initialization.

save_cluster: logical; if TRUE, the cluster assignment is returned.

save_params: logical; if TRUE, the estimated mixture parameters are returned.

save_taus: logical; if TRUE, the posterior probabilities (taus) are returned.
The function implements the constrained ML estimator studied in Coretto and Hennig (2023). The covariance matrix constraints are computed according to the CM1-step of Algorithm 2 of Coretto and Hennig (2017). This function uses highly optimized C code for fast execution. The constrained M-step extensively uses low-level common linear algebra matrix operations (BLAS/LAPACK routines). Consequently, to maximize computational efficiency, it is recommended that the best available shared libraries, such as OpenBLAS, Intel Math Kernel Library (MKL), etc., be set up.
Initialization. The default method, set with init="kmed", uses a fast C implementation of the k-medians algorithm with random initial centers drawn uniformly over the data rows, repeated init.nstart times. Depending on the computing power available, it is suggested to set init.nstart as large as possible, particularly in cases where the data set dimensionality is large in terms of both sample size and number of features.

Setting init="kmeans" replaces the k-medians with the k-means. With init="pam", initial clusters are determined using the PAM algorithm based on Euclidean distances. The latter does not perform multiple starts.

The user can also set init = x, where x is a vector of integers of length N=nrow(data) representing an initial hard assignment of data points to the mixture components or clusters (see Examples).

Another possibility is to set init = W, where W is a matrix of dimension (N x K) containing the initial posterior probabilities that the ith observation belongs to the kth cluster. The assignment provided via W can be hard (0-1 weights, with the constraint that exactly one 1 occurs in each row of W) or smooth (each row of W must sum up to 1). W can be seen as the initial version of the object taus described in the Value section below.

The last alternative is to set init = f(data). Here f(data) is a function that takes data as an input and returns a matrix with an initial hard/smooth assignment as the W matrix previously described (see the example below).
Eigenvalue ratio constraint (erc). It is the maximum allowed ratio between within-cluster covariance matrix eigenvalues, and it defines the so-called eigenratio constraint. erc=1 enforces spherical clusters with equal covariance matrices. A large erc allows for large between-cluster covariance discrepancies. It is suggested to never set erc arbitrarily large: its main role is to prevent degenerate covariance parameters and the related emergence of spurious clusters (see References below). Finally, in order to facilitate the setting of erc, it is suggested to scale the columns of data whenever measurement units of the different variables are grossly incompatible (see the sketch below).
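A minimal sketch of the scaling suggestion above (dat is assumed to hold raw features measured on grossly incompatible scales):

# standardize columns before tuning 'erc'
sdat <- scale(dat)
fit  <- gmix(sdat, K = 2, erc = 10)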
An S3 object of class 'mbcfit'. Output components are as follows:

info: a list with two components giving information about the convergence of, and possible errors in, the underlying EM algorithm (see Notes in print.mbcfit).

iter: number of iterations performed in the underlying EM algorithm.

N: number of data points.

P: data dimension.

K: number of clusters.

eloglik: sample expected log-likelihood.

size: cluster sizes (counts).

cluster: cluster assignment based on the maximum a posteriori rule (MAP).

taus: a matrix of dimension (N x K) containing the estimated posterior probabilities that the ith observation belongs to the kth cluster.

params: a list containing mixture component parameters, with elements proportion, mean and cov (see clust2params).
Coretto, Pietro and Christian Hennig (2017). Consistency, breakdown robustness, and algorithms for robust improper maximum likelihood clustering. Journal of Machine Learning Research, Vol. 18(142), pp. 1-39. URL: https://jmlr.org/papers/v18/16-382.html
Coretto, Pietro and Christian Hennig (2023) Nonparametric consistency for maximum likelihood estimation and clustering based on mixtures of elliptically-symmetric distributions. arXiv:2311.06108. URL: https://arxiv.org/abs/2311.06108
# --- load data
data("banknote")
dat <- banknote[-1]
n   <- nrow(dat)   # sample size
nc  <- 2           # number of clusters

# fit 2 clusters using the default k-median initialization
# In real applications set 'init.nstart' as large as possible
set.seed(101)
fit1 <- gmix(dat, K = nc, init.nstart = 1)
print(fit1)

# plot partition (default)
plot(x = fit1, data = dat)

# plot partition onto the first 3 principal component coordinates
plot(x = fit1, data = prcomp(dat)$x, margins = c(1,2,3),
     pch_cl = c("A", "B"), col_cl = c("#4285F4", "#0F9D58"),
     main = "Principal Components")

# user-defined random initialization with hard assignment labels
set.seed(102)
i2 <- sample(1:nc, size = n, replace = TRUE)
fit2 <- gmix(dat, K = 2, init = i2)
plot(x = fit2, data = dat)

# user-defined smooth "toy" initialization:
# 50% of the points are assigned to cluster 1 with probability 90% and to
# cluster 2 with probability 10%. The remaining data points are assigned to
# cluster 1 with probability 10% and to cluster 2 with probability 90%
# (note the row-wise fill: each selected row gets the same weight pair)
set.seed(103)
idx <- sample(c(TRUE, FALSE), size = n, replace = TRUE)
i3  <- matrix(0, nrow = n, ncol = nc)
i3[idx, ]  <- matrix(c(0.9, 0.1), nrow = sum(idx),  ncol = nc, byrow = TRUE)
i3[!idx, ] <- matrix(c(0.1, 0.9), nrow = sum(!idx), ncol = nc, byrow = TRUE)
# fit
fit3 <- gmix(dat, K = nc, init = i3)
plot(x = fit3, data = dat)

# user-defined function for initialization
# this one produces a 0-1 hard posterior matrix W based on kmeans
compute_init <- function(data, K){
  cl <- kmeans(data, K, nstart = 1, iter.max = 10)$cluster
  W  <- sapply(seq(K), function(x) as.numeric(cl == x))
  return(W)
}
fit4 <- gmix(dat, K = nc, init = compute_init)
plot(fit4, data = dat)
The function combines clustering method setups built using mset_user and related functions.
mbind(...)
...: one or more objects of class 'qcmethod', as generated by mset_user and related functions.
An S3 object of class 'qcmethod'. Each element of the list represents a competing method containing the following objects:

fullname: a string identifying the setup.

callargs: a list with arguments that are passed to the base function.

fn: the function implementing the specified setting. This function is ready to be called on the data set.
Coraggio, Luca and Pietro Coretto (2023). Selecting the number of clusters, clustering models, and algorithms. A unifying approach based on the quadratic discriminant score. Journal of Multivariate Analysis, Vol. 196(105181), 1-20. doi:10.1016/j.jmva.2023.105181
mset_user, mset_gmix, mset_kmeans, mset_pam
# load data data("banknote") dat <- banknote[-1] # generate kmeans setups A <- mset_kmeans(K=c(2,3)) # generate gmix setups B <- mset_gmix(K=c(2,3)) # combine setups M <- mbind(A, B) # get the PAM setting with K=3 m <- M[[4]] m # cluster data with M[[3]] fit <- m$fn(dat) fit
# load data data("banknote") dat <- banknote[-1] # generate kmeans setups A <- mset_kmeans(K=c(2,3)) # generate gmix setups B <- mset_gmix(K=c(2,3)) # combine setups M <- mbind(A, B) # get the PAM setting with K=3 m <- M[[4]] m # cluster data with M[[3]] fit <- m$fn(dat) fit
The function generates a software abstraction of a list of clustering models implemented through a set of tuned methods and algorithms. In particular, it generates a list of gmix-type functions, each combining model tuning parameters and other algorithmic settings. The generated functions are ready to be called on the data set.
mset_gmix( K = seq(10), init = "kmed", erc = c(1, 50, 1000), iter.max = 1000, tol = 1e-8, init.nstart = 25, init.iter.max = 30, init.tol = tol)
K: a vector/list, specifies the number of clusters.

init: a vector, contains the settings of the init argument of gmix.

erc: a vector/list, contains the settings of the erc argument of gmix.

iter.max: an integer vector, contains the settings of the iter.max argument of gmix.

tol: a vector/list, contains the settings of the tol argument of gmix.

init.nstart: an integer vector, contains the settings of the init.nstart argument of gmix.

init.iter.max: an integer vector, contains the settings of the init.iter.max argument of gmix.

init.tol: a vector/list, contains the settings of the init.tol argument of gmix.
The function produces functions implementing competing clustering methods based on several Gaussian mixture model specifications. This is a specialized version of the more general function mset_user. In particular, it produces a list of gmix functions, each corresponding to a specific setup in terms of both model hyper-parameters (e.g. the number of clusters, the eigenvalue ratio constraint, etc.) and algorithm's control parameters (e.g. the type of initialization, maximum number of iterations, etc.). See gmix for a detailed description of the role of each argument and their data types.
An S3 object of class 'qcmethod'. Each element of the list represents a competing method containing the following objects:

fullname: a string identifying the setup.

callargs: a list with arguments that are passed to the base function.

fn: the function implementing the specified setting. This function is ready to be called on the data set.
Coraggio, Luca, and Pietro Coretto (2023). Selecting the Number of Clusters, Clustering Models, and Algorithms. A Unifying Approach Based on the Quadratic Discriminant Score. Journal of Multivariate Analysis, Vol. 196(105181), pp. 1-20, doi:10.1016/j.jmva.2023.105181
# 'gmix' settings combining number of clusters K={2,3} and eigenvalue
# ratio constraints {1,10}
A <- mset_gmix(K = c(2,3), erc = c(1,10))

# select setup 1: K=2, erc=1, init="kmed"
ma1 <- A[[1]]
print(ma1)

# fit A[[1]] on banknote data
data("banknote")
dat <- banknote[-1]
fit1 <- ma1$fn(dat)
fit1

# if only cluster parameters are needed
fit1b <- ma1$fn(dat, only_params = TRUE)
fit1b

# include a custom initialization, see also help('gmix')
compute_init <- function(data, K){
  cl <- kmeans(data, K, nstart = 1, iter.max = 10)$cluster
  W  <- sapply(seq(K), function(x) as.numeric(cl == x))
  return(W)
}

# generate methods settings
B <- mset_gmix(K = c(2,3), erc = c(1,10), init = c(compute_init, "kmed"))

# select setup 2: K=2, erc=10, init = compute_init
mb2 <- B[[2]]
fit2 <- mb2$fn(dat)
fit2
The function generates a software abstraction of a list of clustering models implemented through a set of tuned methods and algorithms. In particular, it generates a list of kmeans-type functions, each combining tuning parameters and other algorithmic settings. The generated functions are ready to be called on the data set.
mset_kmeans(K = c(1:10), iter.max = 50, nstart = 30, algorithm = "Hartigan-Wong", trace = FALSE)
K: a vector, specifies the number of clusters.

iter.max: a vector, contains the settings of the iter.max argument of kmeans.

nstart: a vector, contains the settings of the nstart argument of kmeans.

algorithm: a vector, contains the settings of the algorithm argument of kmeans.

trace: a vector, contains the settings of the trace argument of kmeans.
The function produces functions implementing competing clustering methods based on the K-means methodology as implemented in kmeans. This is a specialized version of the more general function mset_user. In particular, it produces a list of kmeans functions, each corresponding to a specific setup in terms of hyper-parameters (e.g. the number of clusters) and algorithm's control parameters (e.g. initialization). See kmeans for a detailed description of the role of each argument and their data types.
An S3 object of class 'qcmethod'. Each element of the list represents a competing method containing the following objects:

fullname: a string identifying the setup.

callargs: a list with arguments that are passed to the base function.

fn: the function implementing the specified setting. This function is ready to be called on the data set.
Coraggio, Luca, and Pietro Coretto (2023). Selecting the Number of Clusters, Clustering Models, and Algorithms. A Unifying Approach Based on the Quadratic Discriminant Score. Journal of Multivariate Analysis, Vol. 196(105181), pp. 1-20, doi:10.1016/j.jmva.2023.105181
# 'kmeans' settings combining number of clusters K={2,3} and
# algorithms {Hartigan-Wong, MacQueen}
A <- mset_kmeans(K = c(2,3), algorithm = c("Hartigan-Wong", "MacQueen"))

# select setup 1: K=2, algorithm = "Hartigan-Wong"
m <- A[[1]]
print(m)

# cluster with the selected method
data("banknote")
dat <- banknote[-1]
fit1 <- m$fn(dat)
fit1
class(fit1)

# if only cluster parameters are needed
fit2 <- m$fn(dat, only_params = TRUE)
fit2
The function generates a software abstraction of a list of clustering models implemented through a set of tuned methods and algorithms. In particular, it generates a list of pam-type functions, each combining tuning parameters and other algorithmic settings. The generated functions are ready to be called on the data set.
mset_pam(K = seq(10), metric = "euclidean", medoids = if (is.numeric(nstart)) "random", nstart = if (variant == "faster") 1 else NA, stand = FALSE, do.swap = TRUE, variant = "original", pamonce = FALSE)
K: a vector/list, specifies the number of clusters.

metric: a vector, contains the settings of the metric argument of pam.

medoids: a list, contains the settings of the medoids argument of pam.

nstart: a vector, contains the settings of the nstart argument of pam.

stand: a vector, contains the settings of the stand argument of pam.

do.swap: a vector, contains the settings of the do.swap argument of pam.

variant: a list, contains the settings of the variant argument of pam.

pamonce: a vector, contains the settings of the pamonce argument of pam.
The function produces functions implementing competing clustering methods based on the PAM clustering methodology as implemented in pam. This is a specialized version of the more general function mset_user. In particular, it produces a list of pam functions, each corresponding to a specific setup in terms of hyper-parameters (e.g. the number of clusters) and algorithm's control parameters (e.g. initialization). See pam for a detailed description of the role of each argument and their data types.
An S3 object of class 'qcmethod'. Each element of the list represents a competing method containing the following objects:

fullname: a string identifying the setup.

callargs: a list with arguments that are passed to the base function.

fn: the function implementing the specified setting. This function is ready to be called on the data set.
Coraggio, Luca, and Pietro Coretto (2023). Selecting the Number of Clusters, Clustering Models, and Algorithms. A Unifying Approach Based on the Quadratic Discriminant Score. Journal of Multivariate Analysis, Vol. 196(105181), pp. 1-20, doi:10.1016/j.jmva.2023.105181
# 'pam' settings combining number of clusters K={2,3}, and
# dissimilarities {euclidean, manhattan}
A <- mset_pam(K = c(2,3), metric = c("euclidean", "manhattan"))

# select setup 1: K=2, metric = "euclidean"
m <- A[[1]]
print(m)

# cluster with the method set in 'm'
data("banknote")
dat <- banknote[-1]
fit1 <- m$fn(dat)
fit1
class(fit1)

# if only cluster parameters are needed
fit1b <- m$fn(dat, only_params = TRUE)
fit1b
The function generates a software abstraction of a list of clustering models implemented through a set of tuned methods and algorithms. The base clustering methodology is provided via a user-defined function. The latter prototype is expanded into a list of functions, each combining tuning parameters and other algorithmic settings. The generated functions are ready to be called on the data set.
mset_user(fname, .packages = NULL, .export = NULL, ...)
fname: a function implementing a user-defined clustering method. It clusters a data set and outputs cluster parameters (see Details).

.packages: character vector of packages that the tasks in fname depend on (see Details).

.export: character vector of variables to export that are needed by fname (see Details).

...: parameters passed to fname; each can be given as a vector/list of settings to be combined (see Details and Examples).
The function produces functions implementing competing clustering methods based on a prototype methodology implemented by the user via the input argument fname. In particular, it builds a list of fname-type functions, each corresponding to a specific setup in terms of hyper-parameters (e.g. the number of clusters) and algorithm's control parameters (e.g. initialization).

Requirements for fname. fname is a function implementing the base clustering method of interest. It must have the following input argument:

data: a numeric vector, matrix, or data frame of observations. Rows correspond to observations and columns correspond to variables/features. Categorical variables and NA values are not allowed.

Additionally, fname can have any other input parameter controlling the underlying clustering model/method/algorithm. All these additional parameters are passed to mset_user via ... (see Arguments). The output of fname must contain a list named params with cluster parameters describing size, centrality and scatter. Let P = number of variables/features and K = number of clusters. The elements of params are as follows:

prop: a vector of clusters' proportions;

mean: a matrix of dimension (P x K) containing the clusters' mean parameters;

cov: an array of size (P x P x K) containing the clusters' covariance matrices.

Note that params can be easily obtained from a vector of cluster labels using clust2params. A minimal fname satisfying these requirements is sketched below.
packages and export. The user does not normally need to specify packages and export. These arguments are not needed if the functions generated by mset_user are called from an environment containing all variables and functions needed to execute fname. Functions like bqs will call the functions generated by mset_user within a parallel infrastructure using foreach. If the user specifies packages and export, they will be passed to the .packages and .export arguments of foreach.
Finally, note that the package already contains specialized versions of mset_user generating method settings for some popular algorithms (see mset_gmix, mset_kmeans, mset_pam).
An S3 object of class 'qcmethod'. Each element of the list represents a competing method containing the following objects:

fullname: a string identifying the setup.

callargs: a list with arguments that are passed to the base function.

fn: the function implementing the specified setting. This function is ready to be called on the data set.
Coraggio, Luca, and Pietro Coretto (2023). Selecting the Number of Clusters, Clustering Models, and Algorithms. A Unifying Approach Based on the Quadratic Discriminant Score. Journal of Multivariate Analysis, Vol. 196(105181), pp. 1-20, doi:10.1016/j.jmva.2023.105181
clust2params, mset_gmix, mset_kmeans, mset_pam
# load data data("banknote") dat <- banknote[-1] # EXAMPLE 1: generate Hierarchical Clustering settings # ---------------------------------------------------- # wrapper for the popular stats::hclust() for Hierarchical Clustering # Note the usee: # of the optional arguments '...' passed to the underling clustering function # the use of 'clust2params' to add cluster parameters to the output hc_wrapper <- function(data, K, ...){ dm <- dist(data, method = "euclidean") ## ... = hc parameters hc <- hclust(dm, ...) cl <- cutree(hc, k = K) ## output with params res <- list() res$cluster <- cl res$params <- clust2params(data, cluster = cl) return(res) } # generate settings for Hierarchical Clustering with varying # number of clusters K={3,4}, agglomeration method = {ward.D, median} # see help('stats::hclust') A <- mset_user(fname="hc_wrapper", K = c(2,3), method = c("ward.D", "complete")) # get the setting with K=2 and method = "complete" ma <- A[[4]] ma # cluster data with M[[3]] fit_a1 <- ma$fn(dat) fit_a1 ## if only cluster parameters are needed fit_a2 <- ma$fn(dat, only_params = TRUE) fit_a2 ## Not run: # EXAMPLE 2: generate 'mclust' model settings # ------------------------------------------- # mclust is popular package for performing model based clustering based on # Gaussian mixture. Please visit # https://cran.r-project.org/web/packages/mclust/vignettes/mclust.html require(mclust) # wrapper for the popular stats::hclust() for Hierarchical Clustering # Notes: # * optional arguments '...' are passed to the underling # 'mclust' clustering function # * 'mclust' fits Gaussian Mixture models so cluster parameters are # contained in the mclust object mc_wrapper <- function(data, K, ...){ y <- Mclust(data, G = K, ...) y[["params"]] <- list(proportion = y$parameters$pro, mean = y$parameters$mean, cov = y$parameters$variance$sigma) return(y) } # generate 'mclust' model settings by varying the number of clusters and # covariance matrix models (see help('mclust::mclustModelNames')) B <- mset_user(fname = "mc_wrapper", K = c(2,3), modelNames = c("EEI", "VVV")) # get the setting with K=3 and covariance model "EEI" mb <- B[[2]] mb # cluster data with M[[3]] fit_b <- mb$fn(dat) fit_b ## class(fit_b) = "Mclust" # if needed one can make sure that 'mclust' package is always available # by setting the argument 'packages' B <- mset_user(fname = "mc_wrapper", K = c(2,3), modelNames = c("EEI","VVV"), packages=c("mclust")) ## End(Not run) ## Not run: # EXAMPLE 3: generate 'dbscan' settings # ------------------------------------- # DBSCAN is popular nonparametric method for discovering clusters of # arbitrary shapes with noise. The number of clusters is implicitly # determined via two crucial tunings usually called 'eps' and 'minPts' # See https://en.wikipedia.org/wiki/DBSCAN require(dbscan) # wrapper for dbscan::dbscan db_wrap <- function(data, ...) { cl <- dbscan(data, borderPoints = TRUE, ...)$cluster return(params = clust2params(data, cl)) } D <- mset_user(fname = "db_wrap", eps = c(0.5, 1), minPts=c(5,10)) md <- D[[2]] fit_d <- md$fn(dat) fit_d class(fit_d) ## End(Not run)
# load data data("banknote") dat <- banknote[-1] # EXAMPLE 1: generate Hierarchical Clustering settings # ---------------------------------------------------- # wrapper for the popular stats::hclust() for Hierarchical Clustering # Note the usee: # of the optional arguments '...' passed to the underling clustering function # the use of 'clust2params' to add cluster parameters to the output hc_wrapper <- function(data, K, ...){ dm <- dist(data, method = "euclidean") ## ... = hc parameters hc <- hclust(dm, ...) cl <- cutree(hc, k = K) ## output with params res <- list() res$cluster <- cl res$params <- clust2params(data, cluster = cl) return(res) } # generate settings for Hierarchical Clustering with varying # number of clusters K={3,4}, agglomeration method = {ward.D, median} # see help('stats::hclust') A <- mset_user(fname="hc_wrapper", K = c(2,3), method = c("ward.D", "complete")) # get the setting with K=2 and method = "complete" ma <- A[[4]] ma # cluster data with M[[3]] fit_a1 <- ma$fn(dat) fit_a1 ## if only cluster parameters are needed fit_a2 <- ma$fn(dat, only_params = TRUE) fit_a2 ## Not run: # EXAMPLE 2: generate 'mclust' model settings # ------------------------------------------- # mclust is popular package for performing model based clustering based on # Gaussian mixture. Please visit # https://cran.r-project.org/web/packages/mclust/vignettes/mclust.html require(mclust) # wrapper for the popular stats::hclust() for Hierarchical Clustering # Notes: # * optional arguments '...' are passed to the underling # 'mclust' clustering function # * 'mclust' fits Gaussian Mixture models so cluster parameters are # contained in the mclust object mc_wrapper <- function(data, K, ...){ y <- Mclust(data, G = K, ...) y[["params"]] <- list(proportion = y$parameters$pro, mean = y$parameters$mean, cov = y$parameters$variance$sigma) return(y) } # generate 'mclust' model settings by varying the number of clusters and # covariance matrix models (see help('mclust::mclustModelNames')) B <- mset_user(fname = "mc_wrapper", K = c(2,3), modelNames = c("EEI", "VVV")) # get the setting with K=3 and covariance model "EEI" mb <- B[[2]] mb # cluster data with M[[3]] fit_b <- mb$fn(dat) fit_b ## class(fit_b) = "Mclust" # if needed one can make sure that 'mclust' package is always available # by setting the argument 'packages' B <- mset_user(fname = "mc_wrapper", K = c(2,3), modelNames = c("EEI","VVV"), packages=c("mclust")) ## End(Not run) ## Not run: # EXAMPLE 3: generate 'dbscan' settings # ------------------------------------- # DBSCAN is popular nonparametric method for discovering clusters of # arbitrary shapes with noise. The number of clusters is implicitly # determined via two crucial tunings usually called 'eps' and 'minPts' # See https://en.wikipedia.org/wiki/DBSCAN require(dbscan) # wrapper for dbscan::dbscan db_wrap <- function(data, ...) { cl <- dbscan(data, borderPoints = TRUE, ...)$cluster return(params = clust2params(data, cl)) } D <- mset_user(fname = "db_wrap", eps = c(0.5, 1), minPts=c(5,10)) md <- D[[2]] fit_d <- md$fn(dat) fit_d class(fit_d) ## End(Not run)
This function plots data and optionally adds clustering information such as clustering assignments, contours, or boundaries.
plot_clustering(data, subset = NULL, cluster = NULL, params = NULL, what = c("clustering", "contour", "boundary"), col_cl = NULL, pch_cl = NULL)
data: a numeric vector, matrix, or data frame of observations. Rows correspond to observations and columns correspond to variables/features. Categorical variables and NA values are not allowed.

subset: a numeric vector indexing columns of data to plot. Default is NULL (all variables are plotted).

cluster: a vector of cluster assignments. If provided, the plot can display clustering information as specified in what.

params: a list of clustering parameters, conforming with the return value of clust2params.

what: character vector specifying which elements to plot. Options are "clustering", "contour" and "boundary".

col_cl: a vector of colors to use for clusters (one for each cluster). Default is NULL.

pch_cl: a vector of plotting symbols (one for each cluster) to use for clusters. Default is NULL.
No return value, called for side effects
# Example data
set.seed(123)
data <- rbind(
  matrix(rnorm(100 * 2), ncol = 2),
  matrix(rnorm(100 * 2) + 2, ncol = 2)
)
cluster <- c(rep(1, 100), rep(2, 100))
params  <- clust2params(data, cluster)

# Plot with clustering information
plot_clustering(data, cluster = cluster, what = "clustering")

# Plot with subset of variables
plot_clustering(data, cluster = cluster, subset = 1,
                what = c("clustering", "contour"))

# Plot with customized colors and symbols
plot_clustering(data, cluster = cluster, params = params,
                col_cl = c("magenta", "orange"), pch_cl = c("A", "B"))
Produces a plot of Bootstrap Quadratic Scores (BQS). This function creates plots based on the BQS data computed by the bqs function.
## S3 method for class 'bqs' plot(x, score = NULL, perc_scale = FALSE, top = NULL, annotate = NULL, ...)
x: an S3 object of class bqs, as returned by the bqs function.

score: character vector specifying the score(s) to be plotted. Valid scores are those available in x (e.g. "smooth", "hard" and, if computed, their out-of-bag counterparts). Default NULL plots all available scores.

perc_scale: logical; if TRUE, scores are displayed on a percentage scale relative to the first (best) model.

top: numeric; specifies the number of top models to individually highlight. Must be a single number no larger than the number of models plotted.

annotate: logical; if TRUE, annotations are added to the plot.

...: further arguments passed to or from other methods.
A plot displaying the Bootstrap Quadratic Scores.
# load data data("banknote") dat <- banknote[-1] # set up methods mlist <- mset_gmix(K=1:3, erc=c(1,100)) # perform bootstrap # change B and ncores to a much larger value in real problems res <- bqs(dat, mlist, B = 3, type = "both", rankby="lq", ncores = 1, oob = TRUE, savescores = FALSE, saveparams = FALSE) # Plot with default settings plot(res) # Plot in percentage scale relative to first model plot(res, perc_scale = TRUE)
# load data data("banknote") dat <- banknote[-1] # set up methods mlist <- mset_gmix(K=1:3, erc=c(1,100)) # perform bootstrap # change B and ncores to a much larger value in real problems res <- bqs(dat, mlist, B = 3, type = "both", rankby="lq", ncores = 1, oob = TRUE, savescores = FALSE, saveparams = FALSE) # Plot with default settings plot(res) # Plot in percentage scale relative to first model plot(res, perc_scale = TRUE)
This function provides a plot method for objects of class mbcfit
, returned as output by the gmix
function. It serves as a wrapper around plot_clustering
, allowing easy visualization of clustering results, including clustering assignments, contours, and boundaries.
## S3 method for class 'mbcfit' plot(x, data = NULL, subset = NULL, what = c("clustering", "contour"), col_cl = NULL, pch_cl = NULL, ...)
x: an object of class mbcfit, as returned by the gmix function.

data: a numeric vector, matrix, or data frame of observations to plot (see Details). Default is NULL.

subset: a numeric vector indexing columns of data to plot.

what: character vector specifying which elements to plot. Options are "clustering", "contour" and "boundary".

col_cl: a vector of colors to use for clusters (one for each cluster). Default is NULL.

pch_cl: a vector of plotting symbols (one for each cluster) to use for clusters. Default is NULL.

...: further arguments passed to or from other methods.
The plot.mbcfit function provides a plotting method for objects of class mbcfit. It acts as a wrapper around the plot_clustering function, allowing users to easily generate various plots to analyze the clustering results. A plot is produced only upon a successful mbcfit estimate, i.e., when mbcfit has code equal to either 1 or 2.

When data is NULL (the default), the function plots only contour sets (and optionally clustering boundaries) for the estimated mixture density components, using the params information from the mbcfit object. When data is not NULL, the function additionally plots data points and their hard clustering labels, which are obtained using mbcfit to predict the cluster labels (see predict.mbcfit).
A plot displaying the data with clustering information, contours, and/or boundaries, depending on the specified what
argument.
gmix, plot_clustering, predict.mbcfit
# load data data("banknote") dat <- banknote[-1] # fit 2 clusters set.seed(123) fit <- gmix(dat, K = 2, init.nstart = 1) print(fit) # plot partition (default) plot(x = fit, data = dat) # plot partition onto the first 3 coordinates plot(x = fit, data = dat, subset = c(1:3), pch_cl = c("A", "B"), col_cl = c("#4285F4", "#0F9D58"), what = "clustering") # additionally plot clustering boundary and contour sets plot(x = fit, data = dat, subset = c(1:3), pch_cl = c("A", "B"), col_cl = c("#4285F4", "#0F9D58"), what = c("clustering", "boundary", "contour"))
# load data data("banknote") dat <- banknote[-1] # fit 2 clusters set.seed(123) fit <- gmix(dat, K = 2, init.nstart = 1) print(fit) # plot partition (default) plot(x = fit, data = dat) # plot partition onto the first 3 coordinates plot(x = fit, data = dat, subset = c(1:3), pch_cl = c("A", "B"), col_cl = c("#4285F4", "#0F9D58"), what = "clustering") # additionally plot clustering boundary and contour sets plot(x = fit, data = dat, subset = c(1:3), pch_cl = c("A", "B"), col_cl = c("#4285F4", "#0F9D58"), what = c("clustering", "boundary", "contour"))
This function predicts cluster assignments for new data based on an existing model of class mbcfit
. The prediction leverages information from the fitted model to categorize new observations into clusters.
## S3 method for class 'mbcfit' predict(object, newdata, ...)
object: an object of class mbcfit, as returned by the gmix function.

newdata: a numeric vector, matrix, or data frame of observations. Rows correspond to observations and columns correspond to variables/features. Categorical variables and NA values are not allowed.

...: further arguments passed to or from other methods.
The predict.mbcfit function utilizes the parameters of a previously fitted mbcfit model to allocate new data points to the estimated clusters. The function performs the necessary checks to ensure that the mbcfit model returns valid estimates and that the dimensionality of the new data aligns with the model. The mbcfit object must contain a component named params, which is itself a list containing the following necessary elements, for a mixture model with K components:

proportion: a numeric vector of length K, with elements summing to 1, representing cluster proportions.

mean: a numeric matrix of dimensions c(P, K), representing cluster centers.

cov: a numeric array of dimensions c(P, P, K), representing cluster covariance matrices.

Data dimensionality is P, and the new data dimensionality must match (ncol(newdata) must be equal to P); otherwise the function terminates with an error message.
The predicted clustering is obtained as the MAP (maximum a posteriori) assignment under the posterior weights of a Gaussian mixture model parametrized at params. Denoting by $\hat{k}(x_i)$ the predicted cluster label for point $x_i$, and by $\phi(\cdot;\mu,\Sigma)$ the (multivariate) Gaussian density:

$$\hat{k}(x_i) = \underset{k=1,\dots,K}{\arg\max}\; \frac{\pi_k\,\phi(x_i;\mu_k,\Sigma_k)}{\sum_{j=1}^{K}\pi_j\,\phi(x_i;\mu_j,\Sigma_j)}$$

where $\pi_k$, $\mu_k$ and $\Sigma_k$ are the proportion, mean and covariance matrix of the kth mixture component.
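A hand-rolled sketch of the MAP rule above (illustrative only; predict.mbcfit computes this internally). It assumes a params list structured as described, a numeric data matrix X with observations in rows, and borrows the Gaussian density from the external mvtnorm package:

map_predict <- function(X, params) {
  K <- length(params$proportion)
  # log posterior weights, up to a common normalizing constant
  logw <- sapply(seq_len(K), function(k) {
    log(params$proportion[k]) +
      mvtnorm::dmvnorm(X, mean = params$mean[, k],
                       sigma = params$cov[, , k], log = TRUE)
  })
  # MAP rule: pick the component with the largest posterior weight
  max.col(logw)
}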
A vector of length nrow(newdata) containing the estimated cluster labels for each observation in newdata.
Coraggio, Luca and Pietro Coretto (2023). Selecting the number of clusters, clustering models, and algorithms. A unifying approach based on the quadratic discriminant score. Journal of Multivariate Analysis, Vol. 196(105181), 1-20. doi:10.1016/j.jmva.2023.105181
# load data
data(banknote)
dat <- banknote[,-1]

# Estimate a 3-component Gaussian mixture model
set.seed(123)
res <- gmix(dat, K = 3)

# Cluster in output from gmix
print(res$cluster)

# Predict cluster on a single point (keep table dimension)
predict(res, dat[1, , drop = FALSE])

# Predict cluster on a subset
predict(res, dat[1:10, ])

# Predicted clusters on the original dataset equal the clustering
# from the gmix model
all(predict(res, dat) == res$cluster)
This function provides a print method for objects of class bqs
, which are produced by the bqs
function. It prints a summary of the bootstrapped quadratic score results for the clustering solutions considered.
## S3 method for class 'bqs' print(x, ...)
x: an object of class bqs, as returned by the bqs function.

...: additional arguments passed to or from other methods.
The print.bqs function provides a print method for objects of class bqs. If clustering solutions in bqs are not ranked, the printing method displays a message to the user signalling it. Otherwise, the printing method shows a summary of the top-6 ranked solutions, in decreasing order, for any available scoring method (this is determined by the oob argument given in input to the bqs function; see Details in bqs).
The summary tables for ranked methods have row.names set to the method's codename, and show the following information along the columns:

id: method's index in the methodset list (see Details in bqs).

rank: method's rank according to the ranking criterion.

mean: method's mean (bootstrap) quadratic score.

sterr: method's standard error for the (bootstrap) quadratic score.

lower_qnt: (only shown for "mean" and "lq" ranking) method's lower alpha/2-level quantile of the bootstrap distribution of the quadratic score (alpha is given in input to the bqs function).

upper_qnt: (only shown for "mean" and "lq" ranking) method's upper alpha/2-level quantile of the bootstrap distribution of the quadratic score (alpha is given in input to the bqs function).

-1se: (only shown for "1se" ranking) method's mean (bootstrap) quadratic score minus 1 standard error.

+1se: (only shown for "1se" ranking) method's mean (bootstrap) quadratic score plus 1 standard error.
No return value, called for side effects
# Load data and set seed
set.seed(123)
data("banknote")
dat <- banknote[-1]

# set up kmeans, see help('mset_kmeans')
KM <- mset_kmeans(K = 2:5)

# set up Gaussian model-based clustering via gmix()
GMIX <- mset_gmix(K = 2:5, erc = c(1, 50, 100))

# combine tuned methods
mlist <- mbind(KM, GMIX)

# perform bootstrap
# set 'ncores' to the number of available physical cores
res <- bqs(dat, mlist, B = 100, type = "both", rankby = NA, ncores = 1,
           oob = TRUE, savescores = TRUE, saveparams = FALSE)

# Methods are not ranked; only available components are shown
res

# Rank methods and show summaries
ranked_res <- bqs_rank(res, rankby = "lq", boot_na_share = 0.25)
ranked_res
This function provides a print method for objects of class mbcfit
, returned in output by the gmix
function.
## S3 method for class 'mbcfit' print(x, ...)
x: an object of class mbcfit, as returned by the gmix function.

...: further arguments passed to or from other methods.
The print.mbcfit function gives a summary of a model-based clustering fit estimated using the gmix function. The function handles different code values from the object's info field, each representing a specific status or error condition:

-2: 'Lapack DSYEV failed'. This error occurs whenever any of the cluster-covariance matrices becomes singular during estimation using the EM algorithm.

-1: 'Memory allocation error'. This error occurs when there is insufficient available memory to allocate the quantities required to execute the EM algorithm.

1: Success.

2: 'gmix' did not converge (iterations reached the maximum limit).

3: EM algorithm failed; no better than the initial solution. This error occurs whenever the EM algorithm fails for other reasons (e.g., degenerate posterior weights could not be prevented), and it was not possible to find a solution.

The printed output also lists the available components of the mbcfit object and summarizes the number of clusters found and their sizes, whenever this information is available.
No return value, called for side effects
set.seed(123)

# Estimate a simple 3-cluster Gaussian mixture model, using iris data as example
res <- gmix(iris[,-5], K = 3, erc = 10)

# Print the 'gmix' output
print(res)
Computes both the hard and the smooth quadratic score of a clustering.
qscore(data, params, type = "both")
data: a numeric vector, matrix, or data frame of observations. Rows correspond to observations and columns correspond to variables/features. Let N = number of observations and P = number of variables/features.

params: a list containing cluster parameters (size, mean, cov), conforming with the return value of clust2params. Let K = number of clusters.

type: the type of score, a character in the set {"both", "hard", "smooth"}. Default is "both".
The function calculates quadratic scores as defined in equation (22) in Coraggio and Coretto (2023).
A numeric vector with both the hard and the smooth score, or only one of them
depending on the argument type
.
Coraggio, Luca and Pietro Coretto (2023). Selecting the number of clusters, clustering models, and algorithms. A unifying approach based on the quadratic discriminant score. Journal of Multivariate Analysis, Vol. 196(105181), 1-20. doi:10.1016/j.jmva.2023.105181
# --- load and split data
data("banknote")
set.seed(345)
idx   <- sample(1:nrow(banknote), size = 25, replace = FALSE)
dat_f <- banknote[-idx, -1]  ## training data set
dat_v <- banknote[ idx, -1]  ## validation data set

# --- Gaussian model-based clustering, K=3
# fit clusters
fit1 <- gmix(dat_f, K = 3)

## compute quadratic scores using fitted mixture parameters
s1 <- qscore(dat_v, params = fit1$params)
s1

# --- k-means clustering, K=3
# obtain the k-means partition
cl_km <- kmeans(dat_f, centers = 3, nstart = 1)$cluster

## convert k-means hard assignment into cluster parameters
par_km <- clust2params(dat_f, cl_km)

# compute quadratic scores
s2 <- qscore(dat_v, params = par_km)
s2