Package 'otrimle' reference manual

Title:	Robust Model-Based Clustering
Description:	Performs robust cluster analysis allowing for outliers and noise that cannot be fitted by any cluster. The data are modelled by a mixture of Gaussian distributions and a noise component, which is an improper uniform distribution covering the whole Euclidean space. Parameters are estimated by (pseudo) maximum likelihood. This is fitted by a EM-type algorithm. See Coretto and Hennig (2016) <doi:10.1080/01621459.2015.1100996>, and Coretto and Hennig (2017) <https://jmlr.org/papers/v18/16-382.html>.
Authors:	Pietro Coretto [aut, cre] (Homepage: <https://pietro-coretto.github.io>), Christian Hennig [aut] (Homepage: <https://www.unibo.it/sitoweb/christian.hennig/en>)
Maintainer:	Pietro Coretto <[email protected]>
License:	GPL (>= 2)
Version:	2.0
Built:	2025-03-08 06:21:16 UTC
Source:	CRAN

Swiss Banknotes Data

Description

Data from Tables 1.1 and 1.2 (pp. 5-8) of Flury and Riedwyl (1988). There are six measurements made on 200 Swiss banknotes (the old-Swiss 1000-franc). The banknotes belong to two classes of equal size: genuine and counterfeit.

Usage

data(banknote)data(banknote)

Format

A data.frame of dimension 200x7 with the following variables:

Class: a factor with classes: genuine, counterfeit
Length: Length of bill (mm)
Left: Width of left edge (mm)
Right: Width of right edge (mm)
Bottom: Bottom margin width (mm)
Top: Top margin width (mm)
Diagonal: Length of diagonal (mm)

Source

Flury, B. and Riedwyl, H. (1988). Multivariate Statistics: A practical approach. London: Chapman & Hall.

Generates random data from OTRIMLE output model

Description

This uses data and the output of otrimle or rimle to generate a new artificial dataset of the size of the original data using noise and cluster proportions from the clustering output. The clusters are then generated from multivariate normal distributions with the parameters estimated by otrimle, the noise is generated resampling from what is estimated as moise component with weights given by posterior probabilities of all observations to be noise. See Hennig and Coretto (2021).

Usage

  generator.otrimle(data, fit)
generator.otrimle(data, fit)

Arguments

`data`	something that can be coerced into a matrix. Dataset.
`fit`	output object of `otrimle` or `rimle`.

Value

A list with components data, clustering.

`data`	matrix of generated data.
`cs`	vector of integers. Clustering indicator.

Author(s)

Christian Hennig [email protected] https://www.unibo.it/sitoweb/christian.hennig/en/

References

Hennig, C. and P.Coretto (2021). An adequacy approach for deciding the number of clusters for OTRIMLE robust Gaussian mixture based clustering. To appear in Australian and New Zealand Journal of Statistics, https://arxiv.org/abs/2009.00921.

Examples

   data(banknote)
   selectdata <- c(1:30,101:110,117:136,160:161)
   set.seed(555566)
   x <- banknote[selectdata,5:7]
   ox <- otrimle(x, G=2 , ncores = 1)
   str(generator.otrimle(x, ox))
data(banknote)
   selectdata <- c(1:30,101:110,117:136,160:161)
   set.seed(555566)
   x <- banknote[selectdata,5:7]
   ox <- otrimle(x, G=2 , ncores = 1)
   str(generator.otrimle(x, ox))

Robust Initialization for Model-based Clustering Methods

Description

Computes the initial cluster assignment based on a combination of nearest neighbor based noise detection, and agglomerative hierarchical clustering based on maximum likelihood criteria for Gaussian mixture models.

Usage

 InitClust(data , G , k = 3 , knnd.trim = 0.5 , modelName='VVV')
 InitClust(data , G , k = 3 , knnd.trim = 0.5 , modelName='VVV')

Arguments

`data`	A numeric vector, matrix, or data frame of observations. Rows correspond to observations and columns correspond to variables. Categorical variables and `NA` values are not allowed.
`G`	An integer specifying the number of clusters.
`k`	An integer specifying the number of considered nearest neighbors per point used for the denoising step (see Details).
`knnd.trim`	A number in [0,1) which defines the proportion of points initialized as noise. Tipically `knnd.trim<=0.5` (see Details).
`modelName`	A character string indicating the covariance model to be used. Possible models are: `"E"`: equal variance (one-dimensional) `"V"` : spherical, variable variance (one-dimensional) `"EII"`: spherical, equal volume `"VII"`: spherical, unequal volume `"EEE"`: ellipsoidal, equal volume, shape, and orientation `"VVV"`: ellipsoidal, varying volume, shape, and orientation (default). See Details.

Details

The initialization is based on Coretto and Hennig (2017). First, wwo steps are performed:

Step 1 (denoising step): for each data point compute its kth-nearest neighbors distance (k-NND). All points with k-NND larger than the (1-knnd.trim)-quantile of the k-NND are initialized as noise. Intepretaion of k is that: (k-1), but not k, points close together may still be interpreted as noise or outliers

Step 2 (clustering step): perform the model-based hierarchical clustering (MBHC) proposed in Fraley (1998). This step is performed using hc. The input argument modelName is passed to hc. See Details of hc for more details.

If the previous Step 2 fails to provide G clusters each containing at least 2 distinct data points, it is replaced with classical hirararchical clustering implemented in hclust. Finally, if hclust fails to provide a valid partition, up to ten random partitions are tried.

Value

An integer vector specifying the initial cluster assignment with 0 denoting noise/outliers.

References

Fraley, C. (1998). Algorithms for model-based Gaussian hierarchical clustering. SIAM Journal on Scientific Computing 20:270-281.

P. Coretto and C. Hennig (2017). Consistency, breakdown robustness, and algorithms for robust improper maximum likelihood clustering. Journal of Machine Learning Research, Vol. 18(142), pp. 1-39. https://jmlr.org/papers/v18/16-382.html

Author(s)

Pietro Coretto [email protected] https://pietro-coretto.github.io

Examples

 ## Load  Swiss banknotes data
 data(banknote)
 x <- banknote[,-1]

 ## Initial clusters with default arguments
 init <- InitClust(data = x, G = 2)
 print(init)

 ## Perform otrimle
 a <- otrimle(data = x, G = 2, initial = init,
              logicd = c(-Inf, -50, -10), ncores = 1)
 plot(a, what="clustering", data=x)
 ## Load  Swiss banknotes data
 data(banknote)
 x <- banknote[,-1]

 ## Initial clusters with default arguments
 init <- InitClust(data = x, G = 2)
 print(init)

 ## Perform otrimle
 a <- otrimle(data = x, G = 2, initial = init,
              logicd = c(-Inf, -50, -10), ncores = 1)
 plot(a, what="clustering", data=x)

Aggregated distance to elliptical unimodal density over clusters

Description

This calls kerndensp for computing and aggregating density- and principal components-based distances between multivariate data and a unimodal elliptical distribution about the data mean for all clusters in a mixture-based clustering as generated by otrimle or rimle. For use in otrimleg.

Usage

  kerndenscluster(x,fit,maxq=qnorm(0.9995),kernn=100)
kerndenscluster(x,fit,maxq=qnorm(0.9995),kernn=100)

Arguments

`x`	something that can be coerced into a matrix. Dataset.
`fit`	output object of `otrimle` or `rimle`.
`maxq`	positive numeric. One-dimensional densities are evaluated between `mean(x)-maxq` and `mean(x)+maxq`.
`kernn`	integer. Number of points at which the one-dimensional density is evaluated, input parameter `n` of `density`. This should be even.

Details

See Hennig and Coretto (2021), Sec. 4.2. kerndenscluster calls kerndensp for all clusters and aggregates the resulting measures as root sum of squares.

Value

A list with components ddpi, ddpm, measure.

`ddpi`	list of outputs of `kerndensp` for all clusters.
`ddpm`	vector of `measure`-components of `ddpi`.
`measure`	Final aggregation result.

Author(s)

Christian Hennig [email protected] https://www.unibo.it/sitoweb/christian.hennig/en/

References

Examples

   data(banknote)
   selectdata <- c(1:30,101:110,117:136,160:161)
   set.seed(555566)
   x <- banknote[selectdata,5:7]
   ox <- otrimle(x, G=2, ncores=1)
   kerndenscluster(x,ox)$measure
data(banknote)
   selectdata <- c(1:30,101:110,117:136,160:161)
   set.seed(555566)
   x <- banknote[selectdata,5:7]
   ox <- otrimle(x, G=2, ncores=1)
   kerndenscluster(x,ox)$measure

Statistic measuring closeness to symmetric unimodal distribution

Description

Density-based distance between one-dimensional data and a unimodal symmetric distribution about the data mean based on Pons (2013, p.79), adapted by Hennig and Coretto (2021), see details.

Usage

  kerndensmeasure(x,weights=rep(1,nrow(as.matrix(x))),maxq=qnorm(0.9995),
                  kernn=100)
kerndensmeasure(x,weights=rep(1,nrow(as.matrix(x))),maxq=qnorm(0.9995),
                  kernn=100)

Arguments

`x`	vector. One-dimensional dataset.
`weights`	non-negative vector. Relative weights of observations (will be standardised to sup up to one internally).
`maxq`	densities are evaluated between `mean(x)-maxq` and `mean(x)+maxq`.
`kernn`	integer. Number of points at which the density is evaluated, input parameter `n` of `density`. This should be even.

Details

Function density is used in order to compute a kernel density estimator from the data. The ⁠kernn⁠ values of the density are then ordered from the pargest to the smallest. Beginning from the largest to the smallest, pairs of two values are formed (largest and largest biggest, third and fourth largest, and so on). Each pair is replaced by two copies of the average of the two values. Then on each side of the mean one of each copy is placed from the biggest to the smallest, and this produces a symmetric density about the mean. The the root mean squared difference between this and the original density is computed.

Value

A list with components cp, cpx, measure.

`cp`	vector of generated symmetric density values from largest to smallest (just one copy, sp `kernn/2` values).
`cpx`	`y`-component of `density`-output.
`measure`	root mean squared difference between the densities.

Author(s)

Christian Hennig [email protected] https://www.unibo.it/sitoweb/christian.hennig/en/

References

Pons, O. (2013). Statistical Tests of Nonparametric Hypotheses: Asymptotic Theory. World Scientific, Singapore.

Examples

  set.seed(124578)
  x <- runif(20)
  str(kerndensmeasure(x))
set.seed(124578)
  x <- runif(20)
  str(kerndensmeasure(x))

Closeness of multivariate distribution to elliptical unimodal distribution

Description

Density- and and principal components-based distance between multivariate data and a unimodal elliptical distribution about the data mean, see Hennig and Coretto (2021). For use in kerndenscluster.

Usage

  kerndensp(x,weights=rep(1,nrow(as.matrix(x))), siglist,maxq=qnorm(0.9995),
            kernn=100)
kerndensp(x,weights=rep(1,nrow(as.matrix(x))), siglist,maxq=qnorm(0.9995),
            kernn=100)

Arguments

`x`	something that can be coerced into a matrix. Dataset.
`weights`	non-negative vector. Relative weights of observations (will be standardised to sup up to one internally).
`siglist`	list with components `cov` (covariance matrix), `center` (mean) and `n.obs` (number of observations).
`maxq`	positive numeric. One-dimensional densities are evaluated between `mean(x)-maxq` and `mean(x)+maxq`.
`kernn`	integer. Number of points at which the one-dimensional density is evaluated, input parameter `n` of `density`. This should be even.

Details

See Hennig and Coretto (2021), Sec. 4.2. kerndensmeasure is run on the principal components of x. The resulting measures are standardised by kmeanfun and ksdfun and then aggregated as mean square of the positive values, see Hennig and Coretto (2021). The PCS is computed by princomp and will always use siglist rather than statistics computed from x.

Value

A list with components cml, cm, pca, stanmeasure, measure.

`cml`	List of outputs of `kerndensmeasure` for the principal components.
`cm`	vector of `measure` components of `kerndensmeasure` for the principal components.
`stanmeasure`	vector of standardised `measure` components of `kerndensmeasure` for the principal components.
`pca`	output of `princomp`.
`measure`	Final aggregation result.

Author(s)

Christian Hennig [email protected] https://www.unibo.it/sitoweb/christian.hennig/en/

References

Examples

  set.seed(124578)
  x <- cbind(runif(20),runif(20))
  siglist <- list(cov=cov(x),center=colMeans(x),n.obs=20) 
  kerndensp(x,siglist=siglist)$measure
set.seed(124578)
  x <- cbind(runif(20),runif(20))
  siglist <- list(cov=cov(x),center=colMeans(x),n.obs=20) 
  kerndensp(x,siglist=siglist)$measure

Mean and standard deviation of unimodality statistic

Description

These functions approximate the mean and standard deviation of the unimodality statistic computed by kerndensmeasure assuming standard Gaussian data dependent on the number of observationsn. They have been chosen based on a simulation involving 74 different values of n. Used for standardisation in kerndensp.

Usage

  kmeanfun(n)
  ksdfun(n)
kmeanfun(n)
  ksdfun(n)

Arguments

`n`	integer. Number of observations.

Value

The resulting mean (kmeanfun) or standard deviation (ksdfun).

Author(s)

Christian Hennig [email protected] https://www.unibo.it/sitoweb/christian.hennig/en/

Examples

   kmeanfun(50)
   ksdfun(50)
kmeanfun(50)
   ksdfun(50)

Optimally Tuned Robust Improper Maximum Likelihood Clustering

Description

otrimle searches for G approximately Gaussian-shaped clusters with/without noise/outliers. The method's tuning controlling the noise level is adaptively chosen based on the data.

Usage

otrimle(data, G, initial = NULL, logicd = NULL, npr.max = 0.5, erc = 20,
beta = 0, iter.max = 500, tol = 1e-06, ncores = NULL, monitor = TRUE)
otrimle(data, G, initial = NULL, logicd = NULL, npr.max = 0.5, erc = 20,
beta = 0, iter.max = 500, tol = 1e-06, ncores = NULL, monitor = TRUE)

Arguments

`data`	A numeric vector, matrix, or data frame of observations. Rows correspond to observations and columns correspond to variables. Categorical variables and `NA` values are not allowed.
`G`	An integer specifying the number of clusters.
`initial`	An integer vector specifying the initial cluster assignment with `0` denoting noise/outliers. If `NULL` (default) initialization is performed using `InitClust`.
`logicd`	A vector defining a grid of log(icd) values, where icd denotes the improper constant density. If `logicd=NULL` a default grid is considered. A pure Gaussian Mixture Model fit (obtained when `log(icd)=-Inf`) is included in the default search path.
`npr.max`	A number in `[0,1)` specifying the maximum proportion of noise/outliers. This defines the noise proportion constraint. If `npr.max=0` a single solution without noise component is computed (corresponding to `logicd = -Inf`.
`erc`	A number `>=1` specifying the maximum allowed ratio between within-cluster covariance matrix eigenvalues. This defines the eigenratio constraint. `erc=1` enforces spherical clusters with equal covariance matrices. A large `erc` allows for large between-cluster covariance discrepancies. In order to facilitate the setting of `erc`, it is suggested to scale the columns of `data` (see `scale`) whenever measurement units of the different variables are grossly incompatible.
`beta`	A non-negative constant. This is the beta penalty coefficient introduced in Coretto and Hennig (2016).
`iter.max`	An integer value specifying the maximum number of iterations allowed in the underlying ECM-algorithm.
`tol`	Stopping criterion for the underlying ECM-algorithm. An ECM iteration stops if two successive improper log-likelihood values are within `tol`.
`ncores`	an integer value defining the number of cores used for parallel computing. When `ncores=NULL` (default), the number `r` of available cores is detected, and `(r-1)` of them are used (See Details). If `ncores=1` no parallel backend is started.
`monitor`	logical. If `TRUE` progress messages are printed on screen.

Details

The otrimle function computes the OTRIMLE solution based on the ECM-algorithm (expectation conditional maximization algorithm) proposed in Coretto and Hennig (2017).

The otrimle criterion is minimized over the logicd grid of log(icd) values using parallel computing based on the foreach. Note that, depending on the BLAS/LAPACK setting, increasing ncores may not produce the desired reduction in computing time. The latter happens when optimized linear algebra routines are in use (e.g. OpenBLAS, Intel Math Kernel Library (MKL), etc.). These optimized shared libraries already implement multithreading. Therefore, in this case increasing ncores may only reduce the computing time marginally.

Occasionally, there may be datasets for which the function does not provide a solution based on default arguments. This corresponds to code=0 and flag=1 or flag=2 in the output (see Value-section below). This usually happens when some (or all) of the following circumstances occur: (i) erc is too large; (ii) npr.max is too large; (iii) choice of the initial partition. Regarding (i) and (ii) it is not possible to give numeric references because whether these numbers are too large/small strongly depends on the sample size and the dimensionality of the data. References given below explain the relationship between these quantities.

It is suggested to try the following whenever a code=0 non-solution occurs. Set the logicd range wide enough (e.g. logicd=seq(-500,-5, length=50)), choose erc=1, and a low choice of npr.max (e.g. npr.max=0.02). Monitor the solution with the criterion profiling plot (plot.otrimle). According to the criterion profiling plot change logicd, and increase erc and npr.max up to the point when a "clear" minimum in the criterion profiling plot is obtained. If this strategy does not work it is suggested to experiment with a different initial partitions (see initial above).

TBA: Christian may add something about the beta here.

The pi object returned by the rimle function (see Value) corresponds to the vector of pi parameters in the underlying pseudo-model (1) defined in Coretto and Hennig (2017). With logicd = -Inf the rimle function approximates the MLE for the plain Gaussian mixture model with eigenratio covariance regularization, in this case the the first element of the pi vector is set to zero because the noise component is not considered. In general, for iid sampling from finite mixture models context, these pi parameters define expected clusters' proportions. Because of the noise proportion constraint in the RIMLE, there are situations where this connection may not happen in practice. The latter is likely to happen when both logicd and npr.max are large. Therefore, estimated expected clusters' proportions are reported in the exproportion object of the rimle output, and these are computed based on the improper posterior probabilities given in tau. See Coretto and Hennig (2017) for more discussion on this.

An earlier approximate version of the algorithm was originally proposed in Coretto and Hennig (2016). Software for the original version of the algorithm can be found in the supplementary materials of Coretto and Hennig (2016).

Value

An S3 object of class 'otrimle' providing the optimal (according to the OTRIMLE criterion) clustering. Output components are as follows:

`code`	An integer indicator for the convergence. `code=0` if no solution is found (see Details); `code=1` if at the optimal icd value the corresponding EM-algorithm did not converge within `em.iter.max`; `code=2` convergence is fully achieved.
`flag`	A character string containing one or more flags related to the EM iteration at the optimal icd. `flag=1` if it was not possible to prevent the numerical degeneracy of improper posterior probabilities (`tau` value below). `flag=2` if enforcement of the noise proportion constraint failed for numerical reasons. `flag=3` if the noise proportion constraint has been successfully applied at least once. `flag=4` if the eigenratio constraint has been successfully applied at least once.
`iter`	Number of iterations performed in the underlying EM-algorithm at the optimal `icd`.
`logicd`	Resulting value of the optimal `log(icd)`.
`iloglik`	Resulting value of the improper likelihood.
`criterion`	Resulting value of the OTRIMLE criterion.
`pi`	Estimated vector of the `pi` parameters of the underlying pseudo-model (see Details).
`mean`	A matrix of dimension `ncol(data) x G` containing the mean parameters of each cluster (column-wise).
`cov`	An array of size `ncol(data) x ncol(data) x G` containing the covariance matrices of each cluster.
`tau`	A matrix of dimension `nrow(data) x {1+G}` where `tau[i, 1+j]` is the estimated (improper) posterior probability that the ith observation belongs to the jth cluster. `tau[i,1]` is the estimated (improper) posterior probability that ith observation belongs to the noise component.
`smd`	A matrix of dimension `nrow(data) x G` where `smd[i,j]` is the squared Mahalanobis distance of `data[i,]` from `mean[,j]` according to `cov[,,j]`.
`cluster`	A vector of integers denoting cluster assignments for each observation. It's `0` for observations assigned to noise/outliers.
`size`	A vector of integers with sizes (counts) of each cluster.
`exproportion`	A vector of estimated expected clusters' proportions (see Details).
`optimization`	A data.frame with the OTRIMLE optimization profiling. For each value of `log(icd)` explored by the algorithm the data.frame stores `logicd, criterion, iloglik, code, flag` (defined above), and `enpr` that denotes the expected noise proportion.

References

Coretto, P. and C. Hennig (2016). Robust improper maximum likelihood: tuning, computation, and a comparison with other methods for robust Gaussian clustering. Journal of the American Statistical Association, Vol. 111(516), pp. 1648-1659. doi:10.1080/01621459.2015.1100996

Author(s)

Pietro Coretto [email protected] https://pietro-coretto.github.io

Examples

## Load  Swiss banknotes data
data(banknote)
x <- banknote[,-1]

## Perform otrimle clustering with default arguments
set.seed(1)
a <- otrimle(data=x, G=2, logicd=c(-Inf, -50, -10), ncores=1)

## Plot clustering
plot(a, data=x, what="clustering")

## Plot OTRIMLE criterion profiling
plot(a, what="criterion")

## Plot Improper log-likelihood profiling
plot(a, what="iloglik")

## P-P plot of the clusterwise empirical weighted squared Mahalanobis
## distances against the target distribution pchisq(, df=ncol(data))
plot(a, what="fit")
plot(a, what="fit", cluster=1)



## Perform the same otrimle as before with non-zero penalty
set.seed(1)
b <- otrimle(data=x, G=2, beta = 0.5, logicd=c(-Inf, -50, -10), ncores=1)

## Plot clustering
plot(b, data=x, what="clustering")

## Plot OTRIMLE criterion profiling
plot(b, what="criterion")

## Plot Improper log-likelihood profiling
plot(b, what="iloglik")

## P-P plot of the clusterwise empirical weighted squared Mahalanobis
## distances against the target distribution pchisq(, df=ncol(data))
plot(b, what="fit")
plot(b, what="fit", cluster=1)





   
## Not run: 
## Perform the same example using the finer default grid of logicd
## values using multiple cores
##
a <- otrimle(data = x, G = 2)

## Inspect the otrimle criterion-vs-logicd
plot(a, what = 'criterion')

## The minimum occurs at  a$logicd=-9, and exploring a$optimization it
## cane be seen that the interval [-12.5, -4] brackets the optimal
## solution. We search with a finer grid located around the minimum
##
b <- otrimle(data = x, G = 2, logicd = seq(-12.5, -4, length.out = 25))

## Inspect the otrimle criterion-vs-logicd
plot(b, what = 'criterion')

## Check the difference between the two clusterings
table(A = a$cluster, B = b$cluster)

## Check differences in estimated parameters
##
colSums(abs(a$mean - b$mean))               ## L1 distance for mean vectors
apply({a$cov-b$cov}, 3, norm, type = "F")   ## Frobenius distance for covariances
c(Noise=abs(a$npr-b$npr), abs(a$cpr-b$cpr)) ## Absolute difference for proportions

## End(Not run)
## Load  Swiss banknotes data
data(banknote)
x <- banknote[,-1]

## Perform otrimle clustering with default arguments
set.seed(1)
a <- otrimle(data=x, G=2, logicd=c(-Inf, -50, -10), ncores=1)

## Plot clustering
plot(a, data=x, what="clustering")

## Plot OTRIMLE criterion profiling
plot(a, what="criterion")

## Plot Improper log-likelihood profiling
plot(a, what="iloglik")

## P-P plot of the clusterwise empirical weighted squared Mahalanobis
## distances against the target distribution pchisq(, df=ncol(data))
plot(a, what="fit")
plot(a, what="fit", cluster=1)



## Perform the same otrimle as before with non-zero penalty
set.seed(1)
b <- otrimle(data=x, G=2, beta = 0.5, logicd=c(-Inf, -50, -10), ncores=1)

## Plot clustering
plot(b, data=x, what="clustering")

## Plot OTRIMLE criterion profiling
plot(b, what="criterion")

## Plot Improper log-likelihood profiling
plot(b, what="iloglik")

## P-P plot of the clusterwise empirical weighted squared Mahalanobis
## distances against the target distribution pchisq(, df=ncol(data))
plot(b, what="fit")
plot(b, what="fit", cluster=1)





   
## Not run: 
## Perform the same example using the finer default grid of logicd
## values using multiple cores
##
a <- otrimle(data = x, G = 2)

## Inspect the otrimle criterion-vs-logicd
plot(a, what = 'criterion')

## The minimum occurs at  a$logicd=-9, and exploring a$optimization it
## cane be seen that the interval [-12.5, -4] brackets the optimal
## solution. We search with a finer grid located around the minimum
##
b <- otrimle(data = x, G = 2, logicd = seq(-12.5, -4, length.out = 25))

## Inspect the otrimle criterion-vs-logicd
plot(b, what = 'criterion')

## Check the difference between the two clusterings
table(A = a$cluster, B = b$cluster)

## Check differences in estimated parameters
##
colSums(abs(a$mean - b$mean))               ## L1 distance for mean vectors
apply({a$cov-b$cov}, 3, norm, type = "F")   ## Frobenius distance for covariances
c(Noise=abs(a$npr-b$npr), abs(a$cpr-b$cpr)) ## Absolute difference for proportions

## End(Not run)

OTRIMLE for a range of numbers of clusters with density-based cluster quality statistic

Description

Computes Optimally Tuned Robust Improper Maximum Likelihood Clustering (OTRIMLE), see otrimle, together with the density-based cluster quality statistics Q (Hennig and Coretto 2021) for a range of values of the number of clusters.

Usage

otrimleg(dataset, G=1:6, multicore=TRUE, ncores=detectCores(logical=FALSE)-1,
   erc=20, beta0=0, fixlogicd=NULL, monitor=1, dmaxq=qnorm(0.9995))
otrimleg(dataset, G=1:6, multicore=TRUE, ncores=detectCores(logical=FALSE)-1,
   erc=20, beta0=0, fixlogicd=NULL, monitor=1, dmaxq=qnorm(0.9995))

Arguments

`dataset`	something that can be coerced into an observations times variables matrix. The dataset.
`G`	vector of integers (normally starting from 1). Numbers of clusters to be considered.
`multicore`	logical. If `TRUE`, parallel computing is used through the function `mclapply` from package `parallel`; read warnings there if you intend to use this; it won't work on Windows.
`ncores`	integer. Number of cores for parallelisation.
`erc`	A number larger or equal than one specifying the maximum allowed ratio between within-cluster covariance matrix eigenvalues. See `otrimle`.
`beta0`	A non-negative constant, penalty term for noise, to be passed as `beta` to `otrimle`, see documentation there.
`fixlogicd`	numeric of `NULL`. Value for the logarithm of the improper constant density `logicd`, see `rimle`, which is run instead of `otrimle` if this is not `NULL`. `NULL` means that `otrimle` determines it from the data.
`monitor`	0 or 1. If 1, progress messages are printed on screen.
`dmaxq`	numeric. Passed as `maxq` to `kerndensmeasure`. The interval considered for the one-dimensional density estimator is `(-maxq,maxq)`.

Details

For estimating the number of clusters this is meant to be called by otrimlesimg. The output of otrimleg is not meant to be used directly for estimating the number of clusters, see Hennig and Coretto (2021).

Value

otrimleg returns a list containing the components solution, iloglik, ibic, criterion, logicd, noiseprob, denscrit, ddpm. All of these are lists or vectors of which the component number is the number of clusters.

`solution`	list of output objects of `otrimle` or `rimle`.
`iloglik`	vector of improper likelihood values from `otrimle`.
`ibic`	vector of improper BIC-values (small is good) computed from `iloglik` and the numbers of parameters. Note that the behaviour of the improper likelihood is not compatible with the standard use of the BIC, so this is experimental and should not be trusted for choosing the number of clusters.
`criterion`	vector of values of OTRIMLE criterion, see `otrimle`.
`noiseprob`	vector of estimated noise proportions, `exproportion[1]` from `otrimle`.
`denscrit`	vector of density-based cluster quality statistics Q (Hennig and Coretto 2021) as provided by the `measure`-component of `kerndensmeasure`.
`ddpm`	list of the vector of cluster-wise density-based cluster quality measures as provided by the `ddpm`-component of `kerndensmeasure`.

Author(s)

Christian Hennig [email protected] https://www.unibo.it/sitoweb/christian.hennig/en/

References

Examples

   data(banknote)
   selectdata <- c(1:30,101:110,117:136,160:161)
   x <- banknote[selectdata,5:7]
   obanknote <- otrimleg(x,G=1:2,multicore=FALSE)
data(banknote)
   selectdata <- c(1:30,101:110,117:136,160:161)
   x <- banknote[selectdata,5:7]
   obanknote <- otrimleg(x,G=1:2,multicore=FALSE)

Adequacy approach for number of clusters for OTRIMLE

Description

otrimlesimg computes Optimally Tuned Robust Improper Maximum Likelihood Clustering (OTRIMLE), see otrimle for a range of values of the number of clusters, and also for artificial datasets simulated from the model parameters estimated on the original data. The summary-methods present and evaluate the results so that a smallest adequate number of clusters can be found as the smallest one for which the value of the density-based cluster quality statistics Q on the original data is compatible with its distribution on the artificial datasets with the same number of clusters, see Hennig and Coretto 2021 for details.

Usage

otrimlesimg(dataset, G=1:6, multicore=TRUE,
ncores=detectCores(logical=FALSE)-1, erc=20, beta0=0, simruns=20,
sim.est.logicd=FALSE, 
monitor=1)

## S3 method for class 'otrimlesimgdens'
summary(object, noisepenalty=0.05 , sdcutoff=2
, ...)

## S3 method for class 'summary.otrimlesimgdens'
print(x, ...)

## S3 method for class 'summary.otrimlesimgdens'
plot(x , plot="criterion", penx=NULL,
peny=NULL, pencex=1, cutoff=TRUE, ylim=NULL, ...)

otrimlesimg(dataset, G=1:6, multicore=TRUE,
ncores=detectCores(logical=FALSE)-1, erc=20, beta0=0, simruns=20,
sim.est.logicd=FALSE, 
monitor=1)

## S3 method for class 'otrimlesimgdens'
summary(object, noisepenalty=0.05 , sdcutoff=2
, ...)

## S3 method for class 'summary.otrimlesimgdens'
print(x, ...)

## S3 method for class 'summary.otrimlesimgdens'
plot(x , plot="criterion", penx=NULL,
peny=NULL, pencex=1, cutoff=TRUE, ylim=NULL, ...)

Arguments

`dataset`	something that can be coerced into an observations times variables matrix. The dataset.
`G`	vector of integers (normally starting from 1). Numbers of clusters to be considered.
`multicore`	logical. If `TRUE`, parallel computing is used through the function `mclapply` from package `parallel`; read warnings there if you intend to use this; it won't work on Windows.
`ncores`	integer. Number of cores for parallelisation.
`erc`	A number larger or equal than one specifying the maximum allowed ratio between within-cluster covariance matrix eigenvalues. See `otrimle`.
`beta0`	A non-negative constant, penalty term for noise, to be passed as `beta` to `otrimle`, see documentation there.
`simruns`	integer. Number of replicate artificial datasets drawn from each model.
`sim.est.logicd`	logical. If `TRUE`, the logarithm of the improper constant density `logicd`, see `otrimle`, is re-estimated when running `otrimle` on the artificial datasets. Otherwise the value estimated on the original data is taken as fixed. `TRUE` requires much longer computation time, but can be seen as generating more realistic variation.
`monitor`	0 or 1. If 1, progress messages are printed on screen.
`noisepenalty`	number between 0 and 1. `p_0` in Hennig and Coretto (2021); normally small. The method prefers to treat a proportion of `<=noisepenalty` of points as outliers to adding a cluster.
`sdcutoff`	numerical. `c` in formula (7) in Hennig and Coretto (2021). A clustering is treated as adequate if its value of the density-based cluster quality measure Q calibrated (i.e., mean/sd-standardised) by the values on the artificial datasets generated from the estimated model is `<=sdcutoff`.
`plot`	`"criterion"` or `"noise"`, see details.
`penx`	`FALSE, NULL`, or numerical. x-coordinate from where the simplicity ordering of clustering is given (as test in the plot). If `FALSE`, this is not added to the plot. If `NULL` a default guess is made for a good position (which doesn't always work well).
`peny`	`NULL`, or numerical. x-coordinate from where the simplicity ordering of clustering is given (as test in the plot). If `NULL`, a default guess is made for a good position (which doesn't always work well).
`pencex`	numeric. Magnification factor (parameter `cex` to be passed on to `legend`) for simplicity ordering, see parameter `penx`.
`cutoff`	logical. If `TRUE`, the `"criterion"`-plot shows the cutoff value below which numbers of clusters are adequate, see details.
`ylim`	vector of two numericals, range of the y-axis to be passed on to `plot`. If `NULL`, the range is chosen automatically (but can be different from the `plot` default).
`object`	an object of class `'otrimlesimgdens'` obtained from calling `otrimlesimg`
`x`	an object of class `'summary.otrimlesimgdens'` obtained from calling `summary` function over an object of class `'otrimlesimgdens'` obtained from calling `otrimlesimg`.
`...`	optional parameters to be passed on to `plot`.

Details

The method is fully described in Hennig and Coretto (2021). The required tuning constants for choosing an optimal number of clusters, the smallest percentage of additional noise that the user is willing to trade in for adding another cluster (p_0 in the paper, noisepenalty here) and the critical value (c in the paper, sdcutoff here) for adequacy of the standardised density based quality measure Q are provided to the summary function, which is required to choose the best (simplest adequate) number of clusters.

The plot function plot.summary.otrimlesimgdens can produce two plots. If plot="criterion", the standardised density-based cluster quality measure Q is plotted against the number of clusters. The values for the simulated artificial datasets are points, the values for the original dataset are given as line type. If cutoff="TRUE", the critical values (see above) are added as red crosses; a number of clusters is adequate if the value of the original data is below the critical value, i.e., Q is not significantly larger than for the artificial datasets generated from the fitted model. Using penx, the ordered numbers of clusters from the simplest to the least simple can also be indicated in the plot, where simplicitly is defined as the number of clusters plus the estimated noise proportion divided by noisepenalty, see above. The chosen number of clusters is the simplest adequate one, meaning that a low number of clusters and a low noise proportion are preferred.

If plot="noise", the noise proportion (black) and the simplicity (red) are plotted against the numnber of clusters.

Value

otrimlesimg returns a list of type "otrimlesimgdens" containing the components result, simresult, simruns.

`result`	output object of `otrimleg` (list of results on original data) run with the parameters provided to `otrimlesimg`.
`simresult`	list of length `simruns` of output objects of `otrimleg` for all the simulated artificial datasets.
`simruns`	input parameter `simruns`.

summary.otrimlesimgdens returns a list of type "summary.otrimlesimgdens" with components G, simeval, ssimruns, npr, nprdiff, logicd, denscrit, peng, penorder, bestG, sdcutoff, bestresult, cluster. simruns

`G`	`otrmlesimg` input parameter `G` (numbers of clusters).
`simeval`	list with components `denscrit, meandens, sddens, standens, errors`, defined below.
`ssimruns`	`otrmlesimg` input parameter `simruns`.
`npr`	vector of estimated noise proportions on the original data for all numbers of clusters, `exproportion[1]` from `otrimle`.
`nprdiff`	vector for all numbers of clusters of differences between estimated smallest cluster proportion and noise proportion on the original data.
`logicd`	vector of logs of improper constant density values on the original data for all numbers of clusters.
`denscrit`	vector over all numbers of clusters of density-based cluster quality statistics Q on original data as provided by the `measure`-component of `kerndensmeasure`.
`peng`	vector of simplicity values (see Details) over all numbers of clusters.
`penorder`	simplicity order of number of clusters.
`bestG`	best (i.e., most simple adequate) number of clusters.
`sdcutoff`	input parameter `sdcutoff`.
`result`	output of `otrimle` for the best number of clusters `bestG`.
`cluster`	clustering vector for the best number of clusters `bestG`. `0` corresponds to noise/outliers.

Components of summary.otrimlesimgdens output component simeval:

`denscritmatrix`	maximum number of clusters times `simruns` matrix of `denscrit`-vectors for all clusterings on simulated data.
`meandens`	vector over numbers of clusters of robust estimator of the mean of `denscrit` over simulated datasets, computed by `scaleTau2`.
`sddens`	vector over numbers of clusters of robust estimator of the standard deviation of `denscrit` over simulated datasets, computed by `scaleTau2`.
`standens`	vector over numbers of clusters of `denscrit` of the original data standardised by `meandens` and `sddens`.
`errors`	vector over numbers of clusters of numbers of times that `otrimle led to an error.` `plot.summary.otrimlesimgdens` will return the output of `par()` before anything was changed by the plot function.

Author(s)

Christian Hennig [email protected] https://www.unibo.it/sitoweb/christian.hennig/en/

References

Examples

## otrimlesimg is computer intensive, so only a small data subset
## is used for speed.
data(banknote)
selectdata <- c(1:30,101:110,117:136,160:161)
set.seed(555566)
x <- banknote[selectdata,5:7]
   
## simruns=2 chosen for speed. This is not recommended in practice. 
obanknote <- otrimlesimg(x,G=1:2,multicore=FALSE,simruns=2,monitor=0)
sobanknote <- summary(obanknote)
print(sobanknote)
plot(sobanknote,plot="criterion",penx=1.4)
plot(sobanknote,plot="noise",penx=1.4)
plot(x,col=sobanknote$cluster+1,pch=c("N","1","2")[sobanknote$cluster+1])
## otrimlesimg is computer intensive, so only a small data subset
## is used for speed.
data(banknote)
selectdata <- c(1:30,101:110,117:136,160:161)
set.seed(555566)
x <- banknote[selectdata,5:7]
   
## simruns=2 chosen for speed. This is not recommended in practice. 
obanknote <- otrimlesimg(x,G=1:2,multicore=FALSE,simruns=2,monitor=0)
sobanknote <- summary(obanknote)
print(sobanknote)
plot(sobanknote,plot="criterion",penx=1.4)
plot(sobanknote,plot="noise",penx=1.4)
plot(x,col=sobanknote$cluster+1,pch=c("N","1","2")[sobanknote$cluster+1])

Plot Methods for OTRIMLE Objects

Description

Plot robust model-based clustering results: scatter plot with clustering information, optimization profiling, and cluster fit.

Usage

## S3 method for class 'otrimle'
plot(x, what=c("criterion","iloglik", "fit", "clustering"),
     data=NULL, margins=NULL, cluster=NULL, ...)
## S3 method for class 'otrimle'
plot(x, what=c("criterion","iloglik", "fit", "clustering"),
     data=NULL, margins=NULL, cluster=NULL, ...)

Arguments

`x`	Output from `otrimle`
`what`	The type of graph. It can be one of the following: `"criterion"` (default), `"iloglik"`, `"fit"`, `"clustering"`. See Details.
`data`	The data vector, matrix or data.frame (or some transformation of them), used for obtaining the `'otrimle'` object. This is only relevant if `what="clustering"`.
`margins`	A vector of integers denoting the variables (numbers of columns of `data`) to be used for a `pairs`-plot if `what="clustering"`. When `margins=NULL` it is set to `1:ncol(data)` (default).
`cluster`	An integer denoting the cluster for which the fit plot is returned. This is only relevant if `what="fit"`.
`...`	further arguments passed to or from other methods.

Value

If what="criterion": A plot with the profiling of the OTRIMLE criterion optimization. Criterion at log(icd)=-Inf is always represented.
If what="iloglik": A plot with the profiling of the improper log-likelihood function along the search path for the OTRIMLE optimization.
If what="fit": The P-P plot (probability-probability plot) of the weighted empirical distribution function of the Mahalanobis distances of observations from clusters' centers against the target distribution. The target distribution is the Chi-square distribution with degrees of freedom equal to ncol(data). The weights are given by the improper posterior probabilities. If cluster=NULL P-P plots are produced for all clusters, otherwise cluster selects a single P-P plot at times.
If what="clustering": A pairwise scatterplot with cluster memberships. Points assigned to the noise/outliers component are denoted by '+'.

References

Author(s)

Pietro Coretto [email protected] https://pietro-coretto.github.io

Examples

## Load  Swiss banknotes data
data(banknote)
x <- banknote[,-1]

## Perform otrimle clustering on a small grid of logicd values
a <- otrimle(data = x, G = 2, logicd = c(-Inf, -50, -10), ncores = 1)
print(a)

## Plot clustering
plot(a, data = x, what = "clustering")

## Plot clustering on selected margins
plot(a, data = x, what = "clustering", margins = 4:6)

## Plot clustering on the first two principal components
z <- scale(x) %*%   eigen(cor(x), symmetric = TRUE)$vectors
colnames(z) <- paste("PC", 1:ncol(z), sep = "")
plot(a, data = z, what = "clustering", margins = 1:2)

## Plot OTRIMLE criterion profiling
plot(a, what = "criterion")

## Plot Improper log-likelihood profiling
plot(a, what = "iloglik")

## Fit plot for all clusters
plot(a, what = "fit")

## Fit plot for cluster 1
plot(a, what = "fit", cluster = 1)



## Not run: 
## Perform the same example using the finer default grid of logicd
## values using multiple cores
##
a <- otrimle(data = x, G = 2)

## Inspect the otrimle criterion-vs-logicd
plot(a, what = 'criterion')

## The minimum occurs at  a$logicd=-9, and exploring a$optimization it
## cane be seen that the interval [-12.5, -4] brackets the optimal
## solution. We search with a finer grid located around the minimum
##
b <- otrimle(data = x, G = 2, logicd = seq(-12.5, -4, length.out = 25))

## Inspect the otrimle criterion-vs-logicd
plot(b, what = 'criterion')

## Check the difference between the two clusterings
table(A = a$cluster, B = b$cluster)

## Check differences in estimated parameters
##
colSums(abs(a$mean - b$mean))               ## L1 distance for mean vectors
apply({a$cov-b$cov}, 3, norm, type = "F")   ## Frobenius distance for covariances
c(Noise=abs(a$npr-b$npr), abs(a$cpr-b$cpr)) ## Absolute difference for proportions

## End(Not run)

## Load  Swiss banknotes data
data(banknote)
x <- banknote[,-1]

## Perform otrimle clustering on a small grid of logicd values
a <- otrimle(data = x, G = 2, logicd = c(-Inf, -50, -10), ncores = 1)
print(a)

## Plot clustering
plot(a, data = x, what = "clustering")

## Plot clustering on selected margins
plot(a, data = x, what = "clustering", margins = 4:6)

## Plot clustering on the first two principal components
z <- scale(x) %*%   eigen(cor(x), symmetric = TRUE)$vectors
colnames(z) <- paste("PC", 1:ncol(z), sep = "")
plot(a, data = z, what = "clustering", margins = 1:2)

## Plot OTRIMLE criterion profiling
plot(a, what = "criterion")

## Plot Improper log-likelihood profiling
plot(a, what = "iloglik")

## Fit plot for all clusters
plot(a, what = "fit")

## Fit plot for cluster 1
plot(a, what = "fit", cluster = 1)



## Not run: 
## Perform the same example using the finer default grid of logicd
## values using multiple cores
##
a <- otrimle(data = x, G = 2)

## Inspect the otrimle criterion-vs-logicd
plot(a, what = 'criterion')

## The minimum occurs at  a$logicd=-9, and exploring a$optimization it
## cane be seen that the interval [-12.5, -4] brackets the optimal
## solution. We search with a finer grid located around the minimum
##
b <- otrimle(data = x, G = 2, logicd = seq(-12.5, -4, length.out = 25))

## Inspect the otrimle criterion-vs-logicd
plot(b, what = 'criterion')

## Check the difference between the two clusterings
table(A = a$cluster, B = b$cluster)

## Check differences in estimated parameters
##
colSums(abs(a$mean - b$mean))               ## L1 distance for mean vectors
apply({a$cov-b$cov}, 3, norm, type = "F")   ## Frobenius distance for covariances
c(Noise=abs(a$npr-b$npr), abs(a$cpr-b$cpr)) ## Absolute difference for proportions

## End(Not run)

Plot Methods for RIMLE Objects

Description

Plot robust model-based clustering results: scatter plot with clustering information and cluster fit.

Usage

## S3 method for class 'rimle'
plot(x, what=c("fit", "clustering"),
     data=NULL, margins=NULL, cluster=NULL, ...)
## S3 method for class 'rimle'
plot(x, what=c("fit", "clustering"),
     data=NULL, margins=NULL, cluster=NULL, ...)

Arguments

`x`	Output from `rimle`
`what`	The type of graph. It can be one of the following: `"fit"` (default), `"clustering"`. See Details.
`data`	The data vector, matrix or data.frame (or some transformation of them), used for obtaining the `'rimle'` object. This is only relevant if `what="clustering"`.
`margins`	A vector of integers denoting the variables (numbers of columns of `data`) to be used for a `pairs`-plot if `what="clustering"`. When `margins=NULL` it is set to `1:ncol(data)` (default).
`cluster`	An integer denoting the cluster for which the fit plot is returned. This is only relevant if `what="fit"`.
`...`	further arguments passed to or from other methods.

Value

If what="fit": The P-P plot (probability-probability plot) of the weighted empirical distribution function of the Mahalanobis distances of observations from clusters' centers against the target distribution. The target distribution is the Chi-square distribution with degrees of freedom equal to ncol(data). The weights are given by the improper posterior probabilities. If cluster=NULL P-P plots are produced for all clusters, otherwise cluster selects a single P-P plot at times.
If what="clustering": A pairwise scatterplot with cluster memberships. Points assigned to the noise/outliers component are denoted by '+'.

References

Author(s)

Pietro Coretto [email protected] https://pietro-coretto.github.io

Examples

## Load  Swiss banknotes data
data(banknote)
x <- banknote[,-1]

## Perform rimle clustering with default arguments
set.seed(1)
a <- rimle(data = x, G = 2)
print(a)

## Plot clustering
plot(a, data = x, what = "clustering")

## Plot clustering on selected margins
plot(a, data = x, what = "clustering", margins = 4:6)

## Plot clustering on the first two principal components
z <- scale(x) %*%   eigen(cor(x), symmetric = TRUE)$vectors
colnames(z) <- paste("PC", 1:ncol(z), sep = "")
plot(a, data = z, what = "clustering", margins = 1:2)

## Fit plot for all clusters
plot(a, what = "fit")

## Fit plot for cluster 1
plot(a, what = "fit", cluster = 1)
## Load  Swiss banknotes data
data(banknote)
x <- banknote[,-1]

## Perform rimle clustering with default arguments
set.seed(1)
a <- rimle(data = x, G = 2)
print(a)

## Plot clustering
plot(a, data = x, what = "clustering")

## Plot clustering on selected margins
plot(a, data = x, what = "clustering", margins = 4:6)

## Plot clustering on the first two principal components
z <- scale(x) %*%   eigen(cor(x), symmetric = TRUE)$vectors
colnames(z) <- paste("PC", 1:ncol(z), sep = "")
plot(a, data = z, what = "clustering", margins = 1:2)

## Fit plot for all clusters
plot(a, what = "fit")

## Fit plot for cluster 1
plot(a, what = "fit", cluster = 1)

Robust Improper Maximum Likelihood Clustering

Description

rimle searches for G approximately Gaussian-shaped clusters with/without noise/outliers. The method's tuning controlling the noise level is fixed and is to be provided by the user or will be guessed by the function in a rather quick and dirty way (otrimle performs a more sophisticated data-driven choice).

Usage

rimle(data, G, initial=NULL, logicd=NULL, npr.max=0.5, erc=20, iter.max=500, tol=1e-6)
rimle(data, G, initial=NULL, logicd=NULL, npr.max=0.5, erc=20, iter.max=500, tol=1e-6)

Arguments

`data`	A numeric vector, matrix, or data frame of observations. Rows correspond to observations and columns correspond to variables. Categorical variables and `NA` values are not allowed.
`G`	An integer specifying the number of clusters.
`initial`	An integer vector specifying the initial cluster assignment with `0` denoting noise/outliers. If `NULL` (default) initialization is performed using `InitClust`.
`logicd`	A number `log(icd)`, where `0 <= icd < Inf` is the value of the improper constant density (`icd`). This is the RIMLE's tuning for controlling the size of the noise. If `logicd=NULL` (default), an `icd` value is guessed based on the data. A pure Gaussian Mixture Model fit is obtained with `logicd = -Inf`.
`npr.max`	A number in `[0,1)` specifying the maximum proportion of noise/outliers. This defines the noise proportion constraint. If `npr.max=0` a solution without noise component is computed (corresponding to `logicd = -Inf`.
`erc`	A number `>=1` specifying the maximum allowed ratio between within-cluster covariance matrix eigenvalues. This defines the eigenratio constraint. `erc=1` enforces spherical clusters with equal covariance matrices. A large `erc` allows for large between-cluster covariance discrepancies. In order to facilitate the setting of `erc`, it is suggested to scale the columns of `data` (see `scale`) whenever measurement units of the different variables are grossly incompatible.
`iter.max`	An integer value specifying the maximum number of iterations allowed in the ECM-algorithm (see Details).
`tol`	Stopping criterion for the underlying ECM-algorithm. An ECM iteration stops if two successive improper log-likelihood values are within `tol`.

Details

The rimle function computes the RIMLE solution using the ECM-algorithm proposed in Coretto and Hennig (2017).

There may be datasets for which the function does not provide a solution based on default arguments. This corresponds to code=0 and flag=1 or flag=2 in the output (see Value-section below). This usually happens when some (or all) of the following circumstances occur: (i) log(icd) is too large; (ii) erc is too large; (iii) npr.max is too large; (iv) choice of the initial partition. In these cases it is suggested to find a suitable interval of icd values by using the otrimle function. The Details section of otrimle suggests several actions to take whenever a code=0 non-solution occurs.

Value

An S3 object of class 'rimle'. Output components are as follows:

`code`	An integer indicator for the convergence. `code=0` if no solution is found (see Details); `code=1` if the EM-algorithm did not converge within `em.iter.max`; `code=2` convergence is fully achieved.
`flag`	A character string containing one or more flags related to the EM iteration at the optimal icd. `flag=1` if it was not possible to prevent the numerical degeneracy of improper posterior probabilities (`tau` value below). `flag=2` if enforcement of the noise proportion constraint failed for numerical reasons. `flag=3` if enforcement of the eigenratio constraint failed for numerical reasons. `flag=4` if the noise proportion constraint has been successfully applied at least once. `flag=5` if the eigenratio constraint has been successfully applied at least once.
`iter`	Number of iterations performed in the underlying EM-algorithm.
`logicd`	Value of the `log(icd)`.
`iloglik`	Value of the improper likelihood.
`criterion`	Value of the OTRIMLE criterion.
`pi`	Estimated vector of the `pi` parameters of the underlying pseudo-model (see Details).
`mean`	A matrix of dimension `ncol(data) x G` containing the mean parameters of each cluster (column-wise).
`cov`	An array of size `ncol(data) x ncol(data) x G` containing the covariance matrices of each cluster.
`tau`	A matrix of dimension `nrow(data) x {1+G}` where `tau[i, 1+j]` is the estimated (improper) posterior probability that the ith observation belongs to the jth cluster. `tau[i,1]` is the estimated (improper) posterior probability that ith observation belongs to the noise component.
`smd`	A matrix of dimension `nrow(data) x G` where `smd[i,j]` is the squared Mahalanobis distance of `data[i,]` from `mean[,j]` according to `cov[,,j]`.
`cluster`	A vector of integers denoting cluster assignments for each observation. It's `0` for observations assigned to noise/outliers.
`size`	A vector of integers with sizes (counts) of each cluster.
`exproportion`	A vector of estimated expected clusters' proportions (see Details).

Author(s)

Pietro Coretto [email protected] https://pietro-coretto.github.io

References

Examples

## Load  Swiss banknotes data
data(banknote)
x <- banknote[,-1]

## -----------------------------------------------------------------------------
## EXAMPLE 1:
## Perform RIMLE with default inputs
## -----------------------------------------------------------------------------
set.seed(1)
a <- rimle(data = x, G = 2)
print(a)

## Plot clustering
plot(a, data = x, what = "clustering")

## P-P plot of the clusterwise empirical weighted squared Mahalanobis
## distances against the target distribution pchisq(, df=ncol(data))
plot(a, what = "fit")
plot(a, what = "fit", cluster = 1)



## -----------------------------------------------------------------------------
## EXAMPLE 2:
## Compare solutions for different choices of logicd
## -----------------------------------------------------------------------------
set.seed(1)

## Case 1: noiseless solution, that is fit a pure Gaussian Mixture Model
b1 <- rimle(data = x, G = 2, logicd = -Inf)
plot(b1, data=x, what="clustering")
plot(b1, what="fit")

## Case 2: low noise level
b2 <- rimle(data = x, G = 2, logicd = -100)
plot(b2, data=x, what="clustering")
plot(b2, what="fit")

## Case 3: medium noise level
b3 <- rimle(data = x, G = 2, logicd = -10)
plot(b3, data=x, what="clustering")
plot(b3, what="fit")

## Case 3: large noise level
b3 <- rimle(data = x, G = 2, logicd = 5)
plot(b3, data=x, what="clustering")
plot(b3, what="fit")

## Load  Swiss banknotes data
data(banknote)
x <- banknote[,-1]

## -----------------------------------------------------------------------------
## EXAMPLE 1:
## Perform RIMLE with default inputs
## -----------------------------------------------------------------------------
set.seed(1)
a <- rimle(data = x, G = 2)
print(a)

## Plot clustering
plot(a, data = x, what = "clustering")

## P-P plot of the clusterwise empirical weighted squared Mahalanobis
## distances against the target distribution pchisq(, df=ncol(data))
plot(a, what = "fit")
plot(a, what = "fit", cluster = 1)



## -----------------------------------------------------------------------------
## EXAMPLE 2:
## Compare solutions for different choices of logicd
## -----------------------------------------------------------------------------
set.seed(1)

## Case 1: noiseless solution, that is fit a pure Gaussian Mixture Model
b1 <- rimle(data = x, G = 2, logicd = -Inf)
plot(b1, data=x, what="clustering")
plot(b1, what="fit")

## Case 2: low noise level
b2 <- rimle(data = x, G = 2, logicd = -100)
plot(b2, data=x, what="clustering")
plot(b2, what="fit")

## Case 3: medium noise level
b3 <- rimle(data = x, G = 2, logicd = -10)
plot(b3, data=x, what="clustering")
plot(b3, what="fit")

## Case 3: large noise level
b3 <- rimle(data = x, G = 2, logicd = 5)
plot(b3, data=x, what="clustering")
plot(b3, what="fit")

Package 'otrimle'

Help Index

Swiss Banknotes Data

Description

Usage

Format

Source

Generates random data from OTRIMLE output model

Description

Usage

Arguments

Value

Author(s)

References

See Also

Examples

Robust Initialization for Model-based Clustering Methods

Description

Usage

Arguments

Details

Value

References

Author(s)

See Also

Examples

Aggregated distance to elliptical unimodal density over clusters

Description

Usage

Arguments

Details

Value

Author(s)

References

See Also

Examples

Statistic measuring closeness to symmetric unimodal distribution

Description

Usage

Arguments

Details

Value

Author(s)

References

See Also

Examples

Closeness of multivariate distribution to elliptical unimodal distribution

Description

Usage

Arguments

Details

Value

Author(s)

References

See Also

Examples

Mean and standard deviation of unimodality statistic

Description

Usage

Arguments

Value

Author(s)

See Also

Examples

Optimally Tuned Robust Improper Maximum Likelihood Clustering

Description

Usage

Arguments

Details

Value

References

Author(s)

See Also

Examples

OTRIMLE for a range of numbers of clusters with density-based cluster quality statistic

Description

Usage

Arguments

Details

Value