Title: | Robust Model-Based Clustering |
---|---|
Description: | Performs robust cluster analysis allowing for outliers and noise that cannot be fitted by any cluster. The data are modelled by a mixture of Gaussian distributions and a noise component, which is an improper uniform distribution covering the whole Euclidean space. Parameters are estimated by (pseudo) maximum likelihood. This is fitted by a EM-type algorithm. See Coretto and Hennig (2016) <doi:10.1080/01621459.2015.1100996>, and Coretto and Hennig (2017) <https://jmlr.org/papers/v18/16-382.html>. |
Authors: | Pietro Coretto [aut, cre] (Homepage: <https://pietro-coretto.github.io>), Christian Hennig [aut] (Homepage: <https://www.unibo.it/sitoweb/christian.hennig/en>) |
Maintainer: | Pietro Coretto <[email protected]> |
License: | GPL (>= 2) |
Version: | 2.0 |
Built: | 2024-12-08 07:04:52 UTC |
Source: | CRAN |
Data from Tables 1.1 and 1.2 (pp. 5-8) of Flury and Riedwyl (1988). There are six measurements made on 200 Swiss banknotes (the old-Swiss 1000-franc). The banknotes belong to two classes of equal size: genuine and counterfeit.
data(banknote)
data(banknote)
A data.frame
of dimension 200x7
with the following variables:
a factor
with classes: genuine
, counterfeit
Length of bill (mm)
Width of left edge (mm)
Width of right edge (mm)
Bottom margin width (mm)
Top margin width (mm)
Length of diagonal (mm)
Flury, B. and Riedwyl, H. (1988). Multivariate Statistics: A practical approach. London: Chapman & Hall.
This uses data and the output of otrimle
or
rimle
to generate a new artificial dataset of the size
of the original data using noise and cluster proportions from the
clustering output. The clusters are then generated from multivariate
normal distributions with the parameters estimated by
otrimle
, the noise is generated resampling from what is
estimated as moise component with weights given by posterior
probabilities of all observations to be noise. See Hennig and Coretto
(2021).
generator.otrimle(data, fit)
generator.otrimle(data, fit)
data |
something that can be coerced into a matrix. Dataset. |
fit |
A list with components data, clustering
.
data |
matrix of generated data. |
cs |
vector of integers. Clustering indicator. |
Christian Hennig [email protected] https://www.unibo.it/sitoweb/christian.hennig/en/
Hennig, C. and P.Coretto (2021). An adequacy approach for deciding the number of clusters for OTRIMLE robust Gaussian mixture based clustering. To appear in Australian and New Zealand Journal of Statistics, https://arxiv.org/abs/2009.00921.
kerndensp
, kerndensmeasure
,
otrimle
, rimle
data(banknote) selectdata <- c(1:30,101:110,117:136,160:161) set.seed(555566) x <- banknote[selectdata,5:7] ox <- otrimle(x, G=2 , ncores = 1) str(generator.otrimle(x, ox))
data(banknote) selectdata <- c(1:30,101:110,117:136,160:161) set.seed(555566) x <- banknote[selectdata,5:7] ox <- otrimle(x, G=2 , ncores = 1) str(generator.otrimle(x, ox))
Computes the initial cluster assignment based on a combination of nearest neighbor based noise detection, and agglomerative hierarchical clustering based on maximum likelihood criteria for Gaussian mixture models.
InitClust(data , G , k = 3 , knnd.trim = 0.5 , modelName='VVV')
InitClust(data , G , k = 3 , knnd.trim = 0.5 , modelName='VVV')
data |
A numeric vector, matrix, or data frame of observations. Rows correspond
to observations and columns correspond to variables. Categorical
variables and |
G |
An integer specifying the number of clusters. |
k |
An integer specifying the number of considered nearest neighbors per point used for the denoising step (see Details). |
knnd.trim |
A number in [0,1) which defines the proportion of points
initialized as noise. Tipically |
modelName |
A character string indicating the covariance model to be used. Possible models are: |
The initialization is based on Coretto and Hennig (2017). First, wwo
steps are performed:
Step 1 (denoising step): for each data point compute its
k
th-
nearest neighbors
distance (k-
NND). All points with k-
NND larger
than the (1-knnd.trim
)-
quantile of the k-
NND
are initialized as noise. Intepretaion of
k
is that: (k-1)
, but not k
, points close
together may still be interpreted as noise or outliers
Step 2 (clustering step): perform the model-based hierarchical
clustering (MBHC) proposed in Fraley (1998). This step is performed using
hc
. The input argument modelName
is passed
to hc
. See Details of
hc
for more details.
If the previous Step 2 fails to provide G
clusters each
containing at least 2 distinct data points, it is replaced with
classical hirararchical clustering implemented in
hclust
. Finally, if
hclust
fails to provide a valid partition, up
to ten random partitions are tried.
An integer vector specifying the initial cluster
assignment with 0
denoting noise/outliers.
Fraley, C. (1998). Algorithms for model-based Gaussian hierarchical clustering. SIAM Journal on Scientific Computing 20:270-281.
P. Coretto and C. Hennig (2017). Consistency, breakdown robustness, and algorithms for robust improper maximum likelihood clustering. Journal of Machine Learning Research, Vol. 18(142), pp. 1-39. https://jmlr.org/papers/v18/16-382.html
Pietro Coretto [email protected] https://pietro-coretto.github.io
## Load Swiss banknotes data data(banknote) x <- banknote[,-1] ## Initial clusters with default arguments init <- InitClust(data = x, G = 2) print(init) ## Perform otrimle a <- otrimle(data = x, G = 2, initial = init, logicd = c(-Inf, -50, -10), ncores = 1) plot(a, what="clustering", data=x)
## Load Swiss banknotes data data(banknote) x <- banknote[,-1] ## Initial clusters with default arguments init <- InitClust(data = x, G = 2) print(init) ## Perform otrimle a <- otrimle(data = x, G = 2, initial = init, logicd = c(-Inf, -50, -10), ncores = 1) plot(a, what="clustering", data=x)
This calls kerndensp
for computing and aggregating
density- and
principal components-based distances between
multivariate data and a unimodal
elliptical distribution about the data mean for all clusters in a
mixture-based clustering as generated by otrimle
or
rimle
. For use in otrimleg
.
kerndenscluster(x,fit,maxq=qnorm(0.9995),kernn=100)
kerndenscluster(x,fit,maxq=qnorm(0.9995),kernn=100)
x |
something that can be coerced into a matrix. Dataset. |
fit |
|
maxq |
positive numeric. One-dimensional densities are evaluated
between |
kernn |
integer. Number of points at which the one-dimensional
density is evaluated, input parameter |
See Hennig and Coretto (2021), Sec. 4.2. kerndenscluster
calls
kerndensp
for all clusters and aggregates the resulting
measures as root sum of squares.
A list with components ddpi, ddpm, measure
.
ddpi |
list of outputs of |
ddpm |
vector of |
measure |
Final aggregation result. |
Christian Hennig [email protected] https://www.unibo.it/sitoweb/christian.hennig/en/
Hennig, C. and P.Coretto (2021). An adequacy approach for deciding the number of clusters for OTRIMLE robust Gaussian mixture based clustering. To appear in Australian and New Zealand Journal of Statistics, https://arxiv.org/abs/2009.00921.
kerndensp
, kerndensmeasure
,
otrimle
, rimle
data(banknote) selectdata <- c(1:30,101:110,117:136,160:161) set.seed(555566) x <- banknote[selectdata,5:7] ox <- otrimle(x, G=2, ncores=1) kerndenscluster(x,ox)$measure
data(banknote) selectdata <- c(1:30,101:110,117:136,160:161) set.seed(555566) x <- banknote[selectdata,5:7] ox <- otrimle(x, G=2, ncores=1) kerndenscluster(x,ox)$measure
Density-based distance between one-dimensional data and a unimodal symmetric distribution about the data mean based on Pons (2013, p.79), adapted by Hennig and Coretto (2021), see details.
kerndensmeasure(x,weights=rep(1,nrow(as.matrix(x))),maxq=qnorm(0.9995), kernn=100)
kerndensmeasure(x,weights=rep(1,nrow(as.matrix(x))),maxq=qnorm(0.9995), kernn=100)
x |
vector. One-dimensional dataset. |
weights |
non-negative vector. Relative weights of observations (will be standardised to sup up to one internally). |
maxq |
densities are evaluated between |
kernn |
integer. Number of points at which the density is
evaluated, input parameter |
Function density
is used in order to compute a kernel
density estimator from the data. The kernn
values of the
density are then ordered from the pargest to the smallest. Beginning
from the largest to the smallest, pairs of two values are formed
(largest and largest biggest, third and fourth largest, and so
on). Each pair is replaced by two copies of the average of the two
values. Then on each side of the mean one of each copy is placed from
the biggest to the smallest, and this produces a symmetric density
about the mean. The the root mean squared difference between this and
the original density is computed.
A list with components cp, cpx, measure
.
cp |
vector of generated symmetric density values from largest to
smallest (just one copy, sp |
cpx |
|
measure |
root mean squared difference between the densities. |
Christian Hennig [email protected] https://www.unibo.it/sitoweb/christian.hennig/en/
Hennig, C. and P.Coretto (2021). An adequacy approach for deciding the number of clusters for OTRIMLE robust Gaussian mixture based clustering. To appear in Australian and New Zealand Journal of Statistics, https://arxiv.org/abs/2009.00921.
Pons, O. (2013). Statistical Tests of Nonparametric Hypotheses: Asymptotic Theory. World Scientific, Singapore.
set.seed(124578) x <- runif(20) str(kerndensmeasure(x))
set.seed(124578) x <- runif(20) str(kerndensmeasure(x))
Density- and and principal components-based distance between
multivariate data and a unimodal
elliptical distribution about the data mean, see Hennig and Coretto
(2021). For use in kerndenscluster
.
kerndensp(x,weights=rep(1,nrow(as.matrix(x))), siglist,maxq=qnorm(0.9995), kernn=100)
kerndensp(x,weights=rep(1,nrow(as.matrix(x))), siglist,maxq=qnorm(0.9995), kernn=100)
x |
something that can be coerced into a matrix. Dataset. |
weights |
non-negative vector. Relative weights of observations (will be standardised to sup up to one internally). |
siglist |
list with components |
maxq |
positive numeric. One-dimensional densities are evaluated
between |
kernn |
integer. Number of points at which the one-dimensional
density is evaluated, input parameter |
See Hennig and Coretto (2021), Sec. 4.2. kerndensmeasure
is run on the principal components of x
. The resulting measures
are standardised by kmeanfun
and ksdfun
and then aggregated as mean square of the positive values, see
Hennig and Coretto (2021). The PCS is computed by
princomp
and will always use siglist
rather than
statistics computed from x
.
A list with components cml, cm, pca, stanmeasure, measure
.
cml |
List of outputs of |
cm |
vector of |
stanmeasure |
vector of standardised |
pca |
output of |
measure |
Final aggregation result. |
Christian Hennig [email protected] https://www.unibo.it/sitoweb/christian.hennig/en/
Hennig, C. and P.Coretto (2021). An adequacy approach for deciding the number of clusters for OTRIMLE robust Gaussian mixture based clustering. To appear in Australian and New Zealand Journal of Statistics, https://arxiv.org/abs/2009.00921.
kerndensmeasure
, kerndenscluster
set.seed(124578) x <- cbind(runif(20),runif(20)) siglist <- list(cov=cov(x),center=colMeans(x),n.obs=20) kerndensp(x,siglist=siglist)$measure
set.seed(124578) x <- cbind(runif(20),runif(20)) siglist <- list(cov=cov(x),center=colMeans(x),n.obs=20) kerndensp(x,siglist=siglist)$measure
These functions approximate the mean and standard deviation of the
unimodality statistic computed by kerndensmeasure
assuming standard Gaussian data dependent on the number of
observationsn
. They have been chosen based on a simulation involving 74
different values of n
. Used for standardisation in
kerndensp
.
kmeanfun(n) ksdfun(n)
kmeanfun(n) ksdfun(n)
n |
integer. Number of observations. |
The resulting mean (kmeanfun
) or standard deviation
(ksdfun
).
Christian Hennig [email protected] https://www.unibo.it/sitoweb/christian.hennig/en/
kmeanfun(50) ksdfun(50)
kmeanfun(50) ksdfun(50)
otrimle
searches for G
approximately Gaussian-shaped
clusters with/without noise/outliers. The method's tuning controlling the noise
level is adaptively chosen based on the data.
otrimle(data, G, initial = NULL, logicd = NULL, npr.max = 0.5, erc = 20, beta = 0, iter.max = 500, tol = 1e-06, ncores = NULL, monitor = TRUE)
otrimle(data, G, initial = NULL, logicd = NULL, npr.max = 0.5, erc = 20, beta = 0, iter.max = 500, tol = 1e-06, ncores = NULL, monitor = TRUE)
data |
A numeric vector, matrix, or data frame of observations. Rows correspond
to observations and columns correspond to variables. Categorical
variables and |
G |
An integer specifying the number of clusters. |
initial |
An integer vector specifying the initial cluster
assignment with |
logicd |
A vector defining a grid of log(icd) values, where icd
denotes the improper constant density. If |
npr.max |
A number in |
erc |
A number |
beta |
A non-negative constant. This is the beta penalty coefficient introduced in Coretto and Hennig (2016). |
iter.max |
An integer value specifying the maximum number of iterations allowed in the underlying ECM-algorithm. |
tol |
Stopping criterion for the underlying ECM-algorithm. An ECM iteration
stops if two successive improper log-likelihood values are within
|
ncores |
an integer value defining the number of cores used for parallel
computing. When |
monitor |
logical. If |
The otrimle
function computes the OTRIMLE solution based on the
ECM-algorithm (expectation conditional maximization algorithm)
proposed in Coretto and Hennig (2017).
The otrimle criterion is minimized over the logicd
grid of
log(icd)
values using parallel computing based on the
foreach
.
Note that, depending on the BLAS/LAPACK setting, increasing ncores
may not
produce the desired reduction in computing time. The latter
happens when optimized linear algebra routines are in use (e.g.
OpenBLAS, Intel Math Kernel Library (MKL), etc.). These optimized shared
libraries already implement multithreading. Therefore, in this case
increasing ncores
may only reduce the computing time marginally.
Occasionally, there may be datasets for which the function does not provide a
solution based on default arguments. This corresponds to
code=0
and flag=1
or flag=2
in the output (see
Value-section below). This usually happens when some (or all) of the
following circumstances occur: (i) erc
is too
large; (ii) npr.max
is too large; (iii) choice of the initial
partition. Regarding (i) and (ii) it is not possible to give numeric
references because whether these numbers are too large/small
strongly depends on the sample size and the dimensionality of the
data. References given below explain the relationship between
these quantities.
It is suggested to try the following whenever a code=0
non-solution occurs. Set the logicd
range wide enough
(e.g. logicd=seq(-500,-5, length=50)
), choose erc=1
,
and a low choice of npr.max
(e.g. npr.max=0.02
).
Monitor the solution with the criterion profiling plot
(plot.otrimle
). According to the criterion profiling
plot change logicd
, and increase erc
and npr.max
up to the point when a "clear" minimum in the criterion profiling plot
is obtained. If this strategy does not work it is suggested to
experiment with a different initial partitions (see initial
above).
TBA: Christian may add something about the beta
here.
The pi
object returned by the rimle
function (see
Value) corresponds to the vector of pi
parameters in
the underlying pseudo-model (1) defined in Coretto and Hennig (2017).
With logicd = -Inf
the rimle
function approximates the
MLE for the plain Gaussian mixture model with eigenratio
covariance regularization, in this case the the first element of the
pi
vector is set to zero because the noise component is not
considered. In general, for iid sampling from finite mixture models
context, these pi parameters define expected clusters'
proportions. Because of the noise proportion constraint in the RIMLE,
there are situations where this connection may not happen in
practice. The latter is likely to happen when both logicd
and
npr.max
are large. Therefore, estimated expected clusters' proportions
are reported in the exproportion
object of the rimle
output, and these are computed based on the improper posterior
probabilities given in tau
. See Coretto and Hennig (2017) for
more discussion on this.
An earlier approximate version of the algorithm was originally proposed in Coretto and Hennig (2016). Software for the original version of the algorithm can be found in the supplementary materials of Coretto and Hennig (2016).
An S3 object of class 'otrimle'
providing the optimal (according to
the OTRIMLE criterion) clustering. Output components are as follows:
code |
An integer indicator for the convergence.
|
flag |
A character string containing one or more flags related to
the EM iteration at the optimal icd.
|
iter |
Number of iterations performed in the underlying EM-algorithm at the
optimal |
logicd |
Resulting value of the optimal |
iloglik |
Resulting value of the improper likelihood. |
criterion |
Resulting value of the OTRIMLE criterion. |
pi |
Estimated vector of the |
mean |
A matrix of dimension |
cov |
An array of size |
tau |
A matrix of dimension |
smd |
A matrix of dimension |
cluster |
A vector of integers denoting cluster assignments for each
observation. It's |
size |
A vector of integers with sizes (counts) of each cluster. |
exproportion |
A vector of estimated expected clusters' proportions (see Details). |
optimization |
A data.frame with the OTRIMLE optimization profiling. For each
value of |
Coretto, P. and C. Hennig (2016). Robust improper maximum likelihood: tuning, computation, and a comparison with other methods for robust Gaussian clustering. Journal of the American Statistical Association, Vol. 111(516), pp. 1648-1659. doi:10.1080/01621459.2015.1100996
P. Coretto and C. Hennig (2017). Consistency, breakdown robustness, and algorithms for robust improper maximum likelihood clustering. Journal of Machine Learning Research, Vol. 18(142), pp. 1-39. https://jmlr.org/papers/v18/16-382.html
Pietro Coretto [email protected] https://pietro-coretto.github.io
plot.otrimle
,
InitClust
,
rimle
,
## Load Swiss banknotes data data(banknote) x <- banknote[,-1] ## Perform otrimle clustering with default arguments set.seed(1) a <- otrimle(data=x, G=2, logicd=c(-Inf, -50, -10), ncores=1) ## Plot clustering plot(a, data=x, what="clustering") ## Plot OTRIMLE criterion profiling plot(a, what="criterion") ## Plot Improper log-likelihood profiling plot(a, what="iloglik") ## P-P plot of the clusterwise empirical weighted squared Mahalanobis ## distances against the target distribution pchisq(, df=ncol(data)) plot(a, what="fit") plot(a, what="fit", cluster=1) ## Perform the same otrimle as before with non-zero penalty set.seed(1) b <- otrimle(data=x, G=2, beta = 0.5, logicd=c(-Inf, -50, -10), ncores=1) ## Plot clustering plot(b, data=x, what="clustering") ## Plot OTRIMLE criterion profiling plot(b, what="criterion") ## Plot Improper log-likelihood profiling plot(b, what="iloglik") ## P-P plot of the clusterwise empirical weighted squared Mahalanobis ## distances against the target distribution pchisq(, df=ncol(data)) plot(b, what="fit") plot(b, what="fit", cluster=1) ## Not run: ## Perform the same example using the finer default grid of logicd ## values using multiple cores ## a <- otrimle(data = x, G = 2) ## Inspect the otrimle criterion-vs-logicd plot(a, what = 'criterion') ## The minimum occurs at a$logicd=-9, and exploring a$optimization it ## cane be seen that the interval [-12.5, -4] brackets the optimal ## solution. We search with a finer grid located around the minimum ## b <- otrimle(data = x, G = 2, logicd = seq(-12.5, -4, length.out = 25)) ## Inspect the otrimle criterion-vs-logicd plot(b, what = 'criterion') ## Check the difference between the two clusterings table(A = a$cluster, B = b$cluster) ## Check differences in estimated parameters ## colSums(abs(a$mean - b$mean)) ## L1 distance for mean vectors apply({a$cov-b$cov}, 3, norm, type = "F") ## Frobenius distance for covariances c(Noise=abs(a$npr-b$npr), abs(a$cpr-b$cpr)) ## Absolute difference for proportions ## End(Not run)
## Load Swiss banknotes data data(banknote) x <- banknote[,-1] ## Perform otrimle clustering with default arguments set.seed(1) a <- otrimle(data=x, G=2, logicd=c(-Inf, -50, -10), ncores=1) ## Plot clustering plot(a, data=x, what="clustering") ## Plot OTRIMLE criterion profiling plot(a, what="criterion") ## Plot Improper log-likelihood profiling plot(a, what="iloglik") ## P-P plot of the clusterwise empirical weighted squared Mahalanobis ## distances against the target distribution pchisq(, df=ncol(data)) plot(a, what="fit") plot(a, what="fit", cluster=1) ## Perform the same otrimle as before with non-zero penalty set.seed(1) b <- otrimle(data=x, G=2, beta = 0.5, logicd=c(-Inf, -50, -10), ncores=1) ## Plot clustering plot(b, data=x, what="clustering") ## Plot OTRIMLE criterion profiling plot(b, what="criterion") ## Plot Improper log-likelihood profiling plot(b, what="iloglik") ## P-P plot of the clusterwise empirical weighted squared Mahalanobis ## distances against the target distribution pchisq(, df=ncol(data)) plot(b, what="fit") plot(b, what="fit", cluster=1) ## Not run: ## Perform the same example using the finer default grid of logicd ## values using multiple cores ## a <- otrimle(data = x, G = 2) ## Inspect the otrimle criterion-vs-logicd plot(a, what = 'criterion') ## The minimum occurs at a$logicd=-9, and exploring a$optimization it ## cane be seen that the interval [-12.5, -4] brackets the optimal ## solution. We search with a finer grid located around the minimum ## b <- otrimle(data = x, G = 2, logicd = seq(-12.5, -4, length.out = 25)) ## Inspect the otrimle criterion-vs-logicd plot(b, what = 'criterion') ## Check the difference between the two clusterings table(A = a$cluster, B = b$cluster) ## Check differences in estimated parameters ## colSums(abs(a$mean - b$mean)) ## L1 distance for mean vectors apply({a$cov-b$cov}, 3, norm, type = "F") ## Frobenius distance for covariances c(Noise=abs(a$npr-b$npr), abs(a$cpr-b$cpr)) ## Absolute difference for proportions ## End(Not run)
Computes Optimally Tuned Robust Improper Maximum Likelihood Clustering
(OTRIMLE), see otrimle
,
together with the
density-based cluster quality statistics Q (Hennig and Coretto 2021)
for a range of values of the number of clusters.
otrimleg(dataset, G=1:6, multicore=TRUE, ncores=detectCores(logical=FALSE)-1, erc=20, beta0=0, fixlogicd=NULL, monitor=1, dmaxq=qnorm(0.9995))
otrimleg(dataset, G=1:6, multicore=TRUE, ncores=detectCores(logical=FALSE)-1, erc=20, beta0=0, fixlogicd=NULL, monitor=1, dmaxq=qnorm(0.9995))
dataset |
something that can be coerced into an observations times variables matrix. The dataset. |
G |
vector of integers (normally starting from 1). Numbers of clusters to be considered. |
multicore |
logical. If |
ncores |
integer. Number of cores for parallelisation. |
erc |
A number larger or equal than one specifying the maximum
allowed ratio between within-cluster covariance matrix
eigenvalues. See |
beta0 |
A non-negative constant, penalty term for noise, to be
passed as |
fixlogicd |
numeric of |
monitor |
0 or 1. If 1, progress messages are printed on screen. |
dmaxq |
numeric. Passed as |
For estimating the number of clusters this is meant to be called by
otrimlesimg
. The output of otrimleg
is not
meant to be used directly for estimating the number of clusters, see
Hennig and Coretto (2021).
otrimleg
returns a list
containing the components solution, iloglik, ibic, criterion,
logicd, noiseprob, denscrit, ddpm
. All of these are lists or
vectors of which the component number is the number of clusters.
solution |
|
iloglik |
vector of improper likelihood values from
|
ibic |
vector of improper BIC-values (small is good) computed
from |
criterion |
vector of values of OTRIMLE criterion, see
|
noiseprob |
vector of estimated noise proportions,
|
denscrit |
vector of density-based cluster quality statistics Q
(Hennig and Coretto 2021) as provided by the
|
ddpm |
list of the vector of cluster-wise density-based cluster
quality measures as provided by the
|
Christian Hennig [email protected] https://www.unibo.it/sitoweb/christian.hennig/en/
Coretto, P. and C. Hennig (2016). Robust improper maximum likelihood: tuning, computation, and a comparison with other methods for robust Gaussian clustering. Journal of the American Statistical Association, Vol. 111(516), pp. 1648-1659. doi: 10.1080/01621459.2015.1100996
P. Coretto and C. Hennig (2017). Consistency, breakdown robustness, and algorithms for robust improper maximum likelihood clustering. Journal of Machine Learning Research, Vol. 18(142), pp. 1-39. https://jmlr.org/papers/v18/16-382.html
Hennig, C. and P.Coretto (2021). An adequacy approach for deciding the number of clusters for OTRIMLE robust Gaussian mixture based clustering. To appear in Australian and New Zealand Journal of Statistics, https://arxiv.org/abs/2009.00921.
otrimle
, rimle
, otrimlesimg
,
kerndensmeasure
data(banknote) selectdata <- c(1:30,101:110,117:136,160:161) x <- banknote[selectdata,5:7] obanknote <- otrimleg(x,G=1:2,multicore=FALSE)
data(banknote) selectdata <- c(1:30,101:110,117:136,160:161) x <- banknote[selectdata,5:7] obanknote <- otrimleg(x,G=1:2,multicore=FALSE)
otrimlesimg
computes Optimally Tuned Robust Improper Maximum
Likelihood Clustering
(OTRIMLE), see otrimle
for a range of values of the
number of clusters, and also for artificial datasets simulated from
the model parameters estimated on the original data. The
summary
-methods present and evaluate the results so that a
smallest adequate number of clusters can be found as the smallest one
for which the value of the density-based cluster quality statistics Q
on the original data
is compatible with its distribution on the artificial datasets with
the same number of clusters, see Hennig and Coretto 2021 for details.
otrimlesimg(dataset, G=1:6, multicore=TRUE, ncores=detectCores(logical=FALSE)-1, erc=20, beta0=0, simruns=20, sim.est.logicd=FALSE, monitor=1) ## S3 method for class 'otrimlesimgdens' summary(object, noisepenalty=0.05 , sdcutoff=2 , ...) ## S3 method for class 'summary.otrimlesimgdens' print(x, ...) ## S3 method for class 'summary.otrimlesimgdens' plot(x , plot="criterion", penx=NULL, peny=NULL, pencex=1, cutoff=TRUE, ylim=NULL, ...)
otrimlesimg(dataset, G=1:6, multicore=TRUE, ncores=detectCores(logical=FALSE)-1, erc=20, beta0=0, simruns=20, sim.est.logicd=FALSE, monitor=1) ## S3 method for class 'otrimlesimgdens' summary(object, noisepenalty=0.05 , sdcutoff=2 , ...) ## S3 method for class 'summary.otrimlesimgdens' print(x, ...) ## S3 method for class 'summary.otrimlesimgdens' plot(x , plot="criterion", penx=NULL, peny=NULL, pencex=1, cutoff=TRUE, ylim=NULL, ...)
dataset |
something that can be coerced into an observations times variables matrix. The dataset. |
G |
vector of integers (normally starting from 1). Numbers of clusters to be considered. |
multicore |
logical. If |
ncores |
integer. Number of cores for parallelisation. |
erc |
A number larger or equal than one specifying the maximum
allowed ratio between within-cluster covariance matrix
eigenvalues. See |
beta0 |
A non-negative constant, penalty term for noise, to be
passed as |
simruns |
integer. Number of replicate artificial datasets drawn from each model. |
sim.est.logicd |
logical. If |
monitor |
0 or 1. If 1, progress messages are printed on screen. |
noisepenalty |
number between 0 and 1. |
sdcutoff |
numerical. |
plot |
|
penx |
|
peny |
|
pencex |
numeric. Magnification factor (parameter |
cutoff |
logical. If |
ylim |
vector of two numericals, range of the y-axis to be passed
on to |
object |
an object of class |
x |
an object of class |
... |
optional parameters to be passed on to |
The method is fully described in Hennig and Coretto
(2021). The required tuning constants for choosing an optimal number
of clusters, the smallest percentage of additional noise that the user
is willing to trade in for adding another cluster (p_0
in the
paper, noisepenalty
here) and the critical value (c
in
the paper, sdcutoff
here) for adequacy of the standardised
density based quality measure Q are provided to the summary function,
which is required to choose the best (simplest adequate) number of
clusters.
The plot function plot.summary.otrimlesimgdens
can produce two
plots. If plot="criterion"
, the standardised density-based
cluster quality
measure Q is plotted against the number of clusters. The values for
the simulated artificial datasets are points, the values for the
original dataset are given as line type. If cutoff="TRUE"
, the
critical values (see above) are added as red crosses; a number of
clusters is adequate if the value of the original data is below the
critical value, i.e., Q is not significantly larger than for the
artificial datasets generated from the fitted model. Using
penx
, the ordered numbers of clusters from the simplest to the
least simple can also be indicated in the plot, where simplicitly is
defined as the number of clusters plus the estimated noise proportion
divided by noisepenalty
, see above. The chosen number of
clusters is the simplest adequate one, meaning that a low number of
clusters and a low noise proportion are preferred.
If plot="noise"
, the noise proportion (black) and the
simplicity (red) are plotted against the numnber of clusters.
otrimlesimg
returns a list of type "otrimlesimgdens"
containing the components result, simresult, simruns
.
result |
output object of |
simresult |
list of length |
simruns |
input parameter |
summary.otrimlesimgdens
returns a list of type
"summary.otrimlesimgdens"
with components G, simeval,
ssimruns, npr, nprdiff, logicd, denscrit, peng,
penorder, bestG, sdcutoff, bestresult,
cluster
. simruns
G |
|
simeval |
list with components |
ssimruns |
|
npr |
vector of estimated noise proportions on the original data
for all numbers of clusters, |
nprdiff |
vector for all numbers of clusters of differences between estimated smallest cluster proportion and noise proportion on the original data. |
logicd |
vector of logs of improper constant density values on the original data for all numbers of clusters. |
denscrit |
vector over all numbers of clusters of density-based
cluster quality statistics Q
on original data as provided by the |
peng |
vector of simplicity values (see Details) over all numbers of clusters. |
penorder |
simplicity order of number of clusters. |
bestG |
best (i.e., most simple adequate) number of clusters. |
sdcutoff |
input parameter |
result |
output of |
cluster |
clustering vector for the best number of
clusters |
Components of summary.otrimlesimgdens
output component
simeval
:
denscritmatrix |
maximum number of clusters times |
meandens |
vector over numbers of clusters of robust estimator of
the mean of |
sddens |
vector over numbers of clusters of robust estimator of
the standard deviation of |
standens |
vector over numbers of clusters of |
errors |
vector over numbers of clusters of numbers of times that
|
Christian Hennig [email protected] https://www.unibo.it/sitoweb/christian.hennig/en/
Coretto, P. and C. Hennig (2016). Robust improper maximum likelihood: tuning, computation, and a comparison with other methods for robust Gaussian clustering. Journal of the American Statistical Association, Vol. 111(516), pp. 1648-1659. doi:10.1080/01621459.2015.1100996
P. Coretto and C. Hennig (2017). Consistency, breakdown robustness, and algorithms for robust improper maximum likelihood clustering. Journal of Machine Learning Research, Vol. 18(142), pp. 1-39. https://jmlr.org/papers/v18/16-382.html
Hennig, C. and P.Coretto (2021). An adequacy approach for deciding the number of clusters for OTRIMLE robust Gaussian mixture based clustering. To appear in Australian and New Zealand Journal of Statistics, https://arxiv.org/abs/2009.00921.
otrimle
, rimle
, otrimleg
,
kerndensmeasure
## otrimlesimg is computer intensive, so only a small data subset ## is used for speed. data(banknote) selectdata <- c(1:30,101:110,117:136,160:161) set.seed(555566) x <- banknote[selectdata,5:7] ## simruns=2 chosen for speed. This is not recommended in practice. obanknote <- otrimlesimg(x,G=1:2,multicore=FALSE,simruns=2,monitor=0) sobanknote <- summary(obanknote) print(sobanknote) plot(sobanknote,plot="criterion",penx=1.4) plot(sobanknote,plot="noise",penx=1.4) plot(x,col=sobanknote$cluster+1,pch=c("N","1","2")[sobanknote$cluster+1])
## otrimlesimg is computer intensive, so only a small data subset ## is used for speed. data(banknote) selectdata <- c(1:30,101:110,117:136,160:161) set.seed(555566) x <- banknote[selectdata,5:7] ## simruns=2 chosen for speed. This is not recommended in practice. obanknote <- otrimlesimg(x,G=1:2,multicore=FALSE,simruns=2,monitor=0) sobanknote <- summary(obanknote) print(sobanknote) plot(sobanknote,plot="criterion",penx=1.4) plot(sobanknote,plot="noise",penx=1.4) plot(x,col=sobanknote$cluster+1,pch=c("N","1","2")[sobanknote$cluster+1])
Plot robust model-based clustering results: scatter plot with clustering information, optimization profiling, and cluster fit.
## S3 method for class 'otrimle' plot(x, what=c("criterion","iloglik", "fit", "clustering"), data=NULL, margins=NULL, cluster=NULL, ...)
## S3 method for class 'otrimle' plot(x, what=c("criterion","iloglik", "fit", "clustering"), data=NULL, margins=NULL, cluster=NULL, ...)
x |
Output from |
what |
The type of graph. It can be one of the following:
|
data |
The data vector, matrix or data.frame (or some transformation of them), used for obtaining the
|
margins |
A vector of integers denoting the variables (numbers of columns of
|
cluster |
An integer denoting the cluster for which the fit
plot is returned. This is only relevant if |
... |
further arguments passed to or from other methods. |
what="criterion"
A plot with the profiling of the OTRIMLE criterion
optimization. Criterion at log(icd)=-Inf
is always represented.
what="iloglik"
A plot with the profiling of the improper log-
likelihood function
along the search path for the OTRIMLE optimization.
what="fit"
The P-
P plot (probability-
probability plot) of the weighted empirical
distribution function of the Mahalanobis distances of observations
from clusters' centers against the target distribution. The target
distribution is the Chi-square distribution with degrees of
freedom equal to
ncol(data)
. The weights are given by the improper posterior
probabilities. If cluster=NULL
P-
P plots are produced for
all clusters, otherwise cluster
selects a single P-
P
plot at times.
what="clustering"
A pairwise scatterplot with cluster memberships. Points
assigned to the noise/outliers component are denoted by
'+'
.
Coretto, P. and C. Hennig (2016). Robust improper maximum likelihood: tuning, computation, and a comparison with other methods for robust Gaussian clustering. Journal of the American Statistical Association, Vol. 111(516), pp. 1648-1659. doi:10.1080/01621459.2015.1100996
P. Coretto and C. Hennig (2017). Consistency, breakdown robustness, and algorithms for robust improper maximum likelihood clustering. Journal of Machine Learning Research, Vol. 18(142), pp. 1-39. https://jmlr.org/papers/v18/16-382.html
Pietro Coretto [email protected] https://pietro-coretto.github.io
## Load Swiss banknotes data data(banknote) x <- banknote[,-1] ## Perform otrimle clustering on a small grid of logicd values a <- otrimle(data = x, G = 2, logicd = c(-Inf, -50, -10), ncores = 1) print(a) ## Plot clustering plot(a, data = x, what = "clustering") ## Plot clustering on selected margins plot(a, data = x, what = "clustering", margins = 4:6) ## Plot clustering on the first two principal components z <- scale(x) %*% eigen(cor(x), symmetric = TRUE)$vectors colnames(z) <- paste("PC", 1:ncol(z), sep = "") plot(a, data = z, what = "clustering", margins = 1:2) ## Plot OTRIMLE criterion profiling plot(a, what = "criterion") ## Plot Improper log-likelihood profiling plot(a, what = "iloglik") ## Fit plot for all clusters plot(a, what = "fit") ## Fit plot for cluster 1 plot(a, what = "fit", cluster = 1) ## Not run: ## Perform the same example using the finer default grid of logicd ## values using multiple cores ## a <- otrimle(data = x, G = 2) ## Inspect the otrimle criterion-vs-logicd plot(a, what = 'criterion') ## The minimum occurs at a$logicd=-9, and exploring a$optimization it ## cane be seen that the interval [-12.5, -4] brackets the optimal ## solution. We search with a finer grid located around the minimum ## b <- otrimle(data = x, G = 2, logicd = seq(-12.5, -4, length.out = 25)) ## Inspect the otrimle criterion-vs-logicd plot(b, what = 'criterion') ## Check the difference between the two clusterings table(A = a$cluster, B = b$cluster) ## Check differences in estimated parameters ## colSums(abs(a$mean - b$mean)) ## L1 distance for mean vectors apply({a$cov-b$cov}, 3, norm, type = "F") ## Frobenius distance for covariances c(Noise=abs(a$npr-b$npr), abs(a$cpr-b$cpr)) ## Absolute difference for proportions ## End(Not run)
## Load Swiss banknotes data data(banknote) x <- banknote[,-1] ## Perform otrimle clustering on a small grid of logicd values a <- otrimle(data = x, G = 2, logicd = c(-Inf, -50, -10), ncores = 1) print(a) ## Plot clustering plot(a, data = x, what = "clustering") ## Plot clustering on selected margins plot(a, data = x, what = "clustering", margins = 4:6) ## Plot clustering on the first two principal components z <- scale(x) %*% eigen(cor(x), symmetric = TRUE)$vectors colnames(z) <- paste("PC", 1:ncol(z), sep = "") plot(a, data = z, what = "clustering", margins = 1:2) ## Plot OTRIMLE criterion profiling plot(a, what = "criterion") ## Plot Improper log-likelihood profiling plot(a, what = "iloglik") ## Fit plot for all clusters plot(a, what = "fit") ## Fit plot for cluster 1 plot(a, what = "fit", cluster = 1) ## Not run: ## Perform the same example using the finer default grid of logicd ## values using multiple cores ## a <- otrimle(data = x, G = 2) ## Inspect the otrimle criterion-vs-logicd plot(a, what = 'criterion') ## The minimum occurs at a$logicd=-9, and exploring a$optimization it ## cane be seen that the interval [-12.5, -4] brackets the optimal ## solution. We search with a finer grid located around the minimum ## b <- otrimle(data = x, G = 2, logicd = seq(-12.5, -4, length.out = 25)) ## Inspect the otrimle criterion-vs-logicd plot(b, what = 'criterion') ## Check the difference between the two clusterings table(A = a$cluster, B = b$cluster) ## Check differences in estimated parameters ## colSums(abs(a$mean - b$mean)) ## L1 distance for mean vectors apply({a$cov-b$cov}, 3, norm, type = "F") ## Frobenius distance for covariances c(Noise=abs(a$npr-b$npr), abs(a$cpr-b$cpr)) ## Absolute difference for proportions ## End(Not run)
Plot robust model-based clustering results: scatter plot with clustering information and cluster fit.
## S3 method for class 'rimle' plot(x, what=c("fit", "clustering"), data=NULL, margins=NULL, cluster=NULL, ...)
## S3 method for class 'rimle' plot(x, what=c("fit", "clustering"), data=NULL, margins=NULL, cluster=NULL, ...)
x |
Output from |
what |
The type of graph. It can be one of the following:
|
data |
The data vector, matrix or data.frame (or some
transformation of them), used for obtaining the
|
margins |
A vector of integers denoting the variables (numbers of columns of
|
cluster |
An integer denoting the cluster for which the fit
plot is returned. This is only relevant if |
... |
further arguments passed to or from other methods. |
what="fit"
The P-
P plot (probability-
probability plot) of the weighted empirical
distribution function of the Mahalanobis distances of observations
from clusters' centers against the target distribution. The target
distribution is the Chi-square distribution with degrees of
freedom equal to ncol(data)
. The weights are given by the improper posterior
probabilities. If cluster=NULL
P-
P plots are produced for
all clusters, otherwise cluster
selects a single P-
P
plot at times.
what="clustering"
A pairwise scatterplot with cluster memberships. Points
assigned to the noise/outliers component are denoted by
'+'
.
Coretto, P. and C. Hennig (2016). Robust improper maximum likelihood: tuning, computation, and a comparison with other methods for robust Gaussian clustering. Journal of the American Statistical Association, Vol. 111(516), pp. 1648-1659. doi:10.1080/01621459.2015.1100996
P. Coretto and C. Hennig (2017). Consistency, breakdown robustness, and algorithms for robust improper maximum likelihood clustering. Journal of Machine Learning Research, Vol. 18(142), pp. 1-39. https://jmlr.org/papers/v18/16-382.html
Pietro Coretto [email protected] https://pietro-coretto.github.io
## Load Swiss banknotes data data(banknote) x <- banknote[,-1] ## Perform rimle clustering with default arguments set.seed(1) a <- rimle(data = x, G = 2) print(a) ## Plot clustering plot(a, data = x, what = "clustering") ## Plot clustering on selected margins plot(a, data = x, what = "clustering", margins = 4:6) ## Plot clustering on the first two principal components z <- scale(x) %*% eigen(cor(x), symmetric = TRUE)$vectors colnames(z) <- paste("PC", 1:ncol(z), sep = "") plot(a, data = z, what = "clustering", margins = 1:2) ## Fit plot for all clusters plot(a, what = "fit") ## Fit plot for cluster 1 plot(a, what = "fit", cluster = 1)
## Load Swiss banknotes data data(banknote) x <- banknote[,-1] ## Perform rimle clustering with default arguments set.seed(1) a <- rimle(data = x, G = 2) print(a) ## Plot clustering plot(a, data = x, what = "clustering") ## Plot clustering on selected margins plot(a, data = x, what = "clustering", margins = 4:6) ## Plot clustering on the first two principal components z <- scale(x) %*% eigen(cor(x), symmetric = TRUE)$vectors colnames(z) <- paste("PC", 1:ncol(z), sep = "") plot(a, data = z, what = "clustering", margins = 1:2) ## Fit plot for all clusters plot(a, what = "fit") ## Fit plot for cluster 1 plot(a, what = "fit", cluster = 1)
rimle
searches for G
approximately Gaussian-shaped
clusters with/without noise/outliers. The method's tuning controlling
the noise level is fixed and is to be provided by the user or will be guessed by
the function in a rather quick and dirty way (otrimle
performs a more sophisticated data-driven choice).
rimle(data, G, initial=NULL, logicd=NULL, npr.max=0.5, erc=20, iter.max=500, tol=1e-6)
rimle(data, G, initial=NULL, logicd=NULL, npr.max=0.5, erc=20, iter.max=500, tol=1e-6)
data |
A numeric vector, matrix, or data frame of observations. Rows correspond
to observations and columns correspond to variables. Categorical
variables and |
G |
An integer specifying the number of clusters. |
initial |
An integer vector specifying the initial cluster
assignment with |
logicd |
A number |
npr.max |
A number in |
erc |
A number |
iter.max |
An integer value specifying the maximum number of iterations allowed in the ECM-algorithm (see Details). |
tol |
Stopping criterion for the underlying ECM-algorithm. An ECM iteration
stops if two successive improper log-likelihood values are within
|
The rimle
function computes the RIMLE solution using the
ECM-algorithm proposed in Coretto and Hennig (2017).
There may be datasets for which the function does not provide a
solution based on default arguments. This corresponds to
code=0
and flag=1
or flag=2
in the output (see
Value-section below). This usually happens when some (or all) of the
following circumstances occur: (i) log(icd)
is too
large; (ii) erc
is too large; (iii) npr.max is too large;
(iv) choice of the initial partition. In these cases it is suggested
to find a suitable interval of icd
values by using the
otrimle
function. The Details section of
otrimle
suggests several actions to take
whenever a code=0
non-solution occurs.
The pi
object returned by the rimle
function (see
Value) corresponds to the vector of pi
parameters in
the underlying pseudo-model (1) defined in Coretto and Hennig (2017).
With logicd = -Inf
the rimle
function approximates the
MLE for the plain Gaussian mixture model with eigenratio
covariance regularization, in this case the the first element of the
pi
vector is set to zero because the noise component is not
considered. In general, for iid sampling from finite mixture models
context, these pi parameters define expected clusters'
proportions. Because of the noise proportion constraint in the RIMLE,
there are situations where this connection may not happen in
practice. The latter is likely to happen when both logicd
and
npr.max
are large. Therefore, estimated expected clusters'
proportions are reported in the exproportion
object of the
rimle
output, and these are computed based on the
improper posterior probabilities given in tau
.
See Coretto and Hennig (2017) for more discussion on this.
An earlier approximate version of the algorithm was originally proposed in Coretto and Hennig (2016). Software for the original version of the algorithm can be found in the supplementary materials of Coretto and Hennig (2016).
An S3 object of class 'rimle'
. Output components are as follows:
code |
An integer indicator for the convergence.
|
flag |
A character string containing one or more flags related to
the EM iteration at the optimal icd.
|
iter |
Number of iterations performed in the underlying EM-algorithm. |
logicd |
Value of the |
iloglik |
Value of the improper likelihood. |
criterion |
Value of the OTRIMLE criterion. |
pi |
Estimated vector of the |
mean |
A matrix of dimension |
cov |
An array of size |
tau |
A matrix of dimension |
smd |
A matrix of dimension |
cluster |
A vector of integers denoting cluster assignments for each
observation. It's |
size |
A vector of integers with sizes (counts) of each cluster. |
exproportion |
A vector of estimated expected clusters' proportions (see Details). |
Pietro Coretto [email protected] https://pietro-coretto.github.io
Coretto, P. and C. Hennig (2016). Robust improper maximum likelihood: tuning, computation, and a comparison with other methods for robust Gaussian clustering. Journal of the American Statistical Association, Vol. 111(516), pp. 1648-1659. doi:10.1080/01621459.2015.1100996
P. Coretto and C. Hennig (2017). Consistency, breakdown robustness, and algorithms for robust improper maximum likelihood clustering. Journal of Machine Learning Research, Vol. 18(142), pp. 1-39. https://jmlr.org/papers/v18/16-382.html
plot.rimle
,
InitClust
,
otrimle
,
## Load Swiss banknotes data data(banknote) x <- banknote[,-1] ## ----------------------------------------------------------------------------- ## EXAMPLE 1: ## Perform RIMLE with default inputs ## ----------------------------------------------------------------------------- set.seed(1) a <- rimle(data = x, G = 2) print(a) ## Plot clustering plot(a, data = x, what = "clustering") ## P-P plot of the clusterwise empirical weighted squared Mahalanobis ## distances against the target distribution pchisq(, df=ncol(data)) plot(a, what = "fit") plot(a, what = "fit", cluster = 1) ## ----------------------------------------------------------------------------- ## EXAMPLE 2: ## Compare solutions for different choices of logicd ## ----------------------------------------------------------------------------- set.seed(1) ## Case 1: noiseless solution, that is fit a pure Gaussian Mixture Model b1 <- rimle(data = x, G = 2, logicd = -Inf) plot(b1, data=x, what="clustering") plot(b1, what="fit") ## Case 2: low noise level b2 <- rimle(data = x, G = 2, logicd = -100) plot(b2, data=x, what="clustering") plot(b2, what="fit") ## Case 3: medium noise level b3 <- rimle(data = x, G = 2, logicd = -10) plot(b3, data=x, what="clustering") plot(b3, what="fit") ## Case 3: large noise level b3 <- rimle(data = x, G = 2, logicd = 5) plot(b3, data=x, what="clustering") plot(b3, what="fit")
## Load Swiss banknotes data data(banknote) x <- banknote[,-1] ## ----------------------------------------------------------------------------- ## EXAMPLE 1: ## Perform RIMLE with default inputs ## ----------------------------------------------------------------------------- set.seed(1) a <- rimle(data = x, G = 2) print(a) ## Plot clustering plot(a, data = x, what = "clustering") ## P-P plot of the clusterwise empirical weighted squared Mahalanobis ## distances against the target distribution pchisq(, df=ncol(data)) plot(a, what = "fit") plot(a, what = "fit", cluster = 1) ## ----------------------------------------------------------------------------- ## EXAMPLE 2: ## Compare solutions for different choices of logicd ## ----------------------------------------------------------------------------- set.seed(1) ## Case 1: noiseless solution, that is fit a pure Gaussian Mixture Model b1 <- rimle(data = x, G = 2, logicd = -Inf) plot(b1, data=x, what="clustering") plot(b1, what="fit") ## Case 2: low noise level b2 <- rimle(data = x, G = 2, logicd = -100) plot(b2, data=x, what="clustering") plot(b2, what="fit") ## Case 3: medium noise level b3 <- rimle(data = x, G = 2, logicd = -10) plot(b3, data=x, what="clustering") plot(b3, what="fit") ## Case 3: large noise level b3 <- rimle(data = x, G = 2, logicd = 5) plot(b3, data=x, what="clustering") plot(b3, what="fit")