Title: | Flexible Procedures for Clustering |
Description: | Various methods for clustering and cluster validation. Fixed point clustering. Linear regression clustering. Clustering by merging Gaussian mixture components. Symmetric and asymmetric discriminant projections for visualisation of the separation of groupings. Cluster validation statistics for distance based clustering including corrected Rand index. Standardisation of cluster validation statistics by random clusterings and comparison between many clustering methods and numbers of clusters based on this. Cluster-wise cluster stability assessment. Methods for estimation of the number of clusters: Calinski-Harabasz, Tibshirani and Walther's prediction strength, Fang and Wang's bootstrap stability. Gaussian/multinomial mixture fitting for mixed continuous/categorical variables. Variable-wise statistics for cluster interpretation. DBSCAN clustering. Interface functions for many clustering methods implemented in R, including estimating the number of clusters with kmeans, pam and clara. Modality diagnosis for Gaussian mixtures. For an overview see package?fpc. |
Authors: | Christian Hennig [aut, cre] |
Maintainer: | Christian Hennig <[email protected]> |
License: | GPL |
Version: | 2.2-13 |
Built: | 2024-10-25 06:47:21 UTC |
Source: | CRAN |
Here is a list of the main functions in package fpc. Most other functions are auxiliary functions for these.
Computes DBSCAN density based clustering as introduced in Ester et al. (1996).
Mahalanobis Fixed Point Clustering, Hennig and Christlieb (2002), Hennig (2005).
Regression Fixed Point Clustering, Hennig (2003).
This fits a latent class model to data with mixed type continuous/nominal variables. Actually it calls a method for flexmix.
Clustering by merging components of a Gaussian mixture, see Hennig (2010).
ML-fit of a mixture of linear regression models, see DeSarbo and Cron (1988).
This computes several cluster validity statistics from a clustering and a dissimilarity matrix, including the Calinski-Harabasz index, the adjusted Rand index and other statistics explained in Gordon (1999), as well as several characterising measures such as average between-cluster and within-cluster dissimilarity and separation. See also calinhara, dudahart2 for specific indexes, and a new version cqcluster.stats that computes some more indexes and statistics used for computing them. There's also distrsimilarity, which computes within-cluster dissimilarity to the Gaussian and uniform distribution.
Estimates the number of clusters by computing the prediction strength of a clustering of a dataset into different numbers of components for various clustering methods, see Tibshirani and Walther (2005). In fact, this is more flexible than what is in the original paper, because it can use point classification schemes that work better with clustering methods other than k-means.
Estimates the number of clusters by bootstrap stability selection, see Fang and Wang (2012). This is quite flexible regarding clustering methods and point classification schemes and also allows for dissimilarity data.
This runs many clustering methods (to be specified by the user) with many numbers of clusters on a dataset and produces standardised and comparable versions of many cluster validity indexes (see Hennig 2019, Akhanli and Hennig 2020). This is done by producing random clusterings on the given data, see stupidkcentroids and stupidknn. It allows comparison of many clusterings based on many different potentially desirable features of a clustering. print.valstat allows the computation of an aggregated index with user-specified weights.
Sets of colours and symbols useful for cluster plotting.
Cluster-wise stability assessment of a clustering. Clusterings are performed on resampled data to see for every cluster of the original dataset how well this is reproduced. See Hennig (2007) for details.
Extracts variable-wise information for every cluster in order to help with cluster interpretation.
Visualisation of a clustering or grouping in data by various linear projection methods that optimise the separation between clusters, or between a single cluster and the rest of the data according to Hennig (2004), including classical methods such as discriminant coordinates. This calls the function discrproj, which is a bit more flexible but doesn't produce a plot itself.
Plots and diagnostics for assessing modality of Gaussian mixtures, see Ray and Lindsay (2005).
Plots to diagnose component separation in Gaussian mixtures, see Hennig (2010).
Local shape matrix, which can be used for finding clusters in connection with function ics in package ICS; see Hennig's discussion and rejoinder of Tyler et al. (2009).
This and other "CBI"-functions (see the
kmeansCBI
-help page) are unified wrappers for
various clustering methods in R that may be useful because they do
in one step for what you normally may need to do a bit more in R
(for example fitting a Gaussian mixture with noise component in
package mclust).
This calls kmeans for the k-means clustering method and includes estimation of the number of clusters and finding an optimal solution from several starting points.
This calls pam and clara for the partitioning around medoids clustering method (Kaufman and Rousseeuw, 1990) and includes two different ways of estimating the number of clusters.
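A brief illustration (hedged sketch, not part of this overview page): the two interface functions just described are kmeansruns and pamk in this package; the calls below assume their documented arguments krange and criterion.
set.seed(123)
data(iris)
x <- iris[,-5]
kr <- kmeansruns(x,krange=2:6,criterion="ch") # k-means; number of clusters by Calinski-Harabasz
pk <- pamk(x,krange=2:6)                      # pam; number of clusters by average silhouette width
kr$bestk
pk$nc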
Christian Hennig [email protected] https://www.unibo.it/sitoweb/christian.hennig/en/
Akhanli, S. and Hennig, C. (2020) Calibrating and aggregating cluster validity indexes for context-adapted comparison of clusterings. Statistics and Computing, 30, 1523-1544, https://link.springer.com/article/10.1007/s11222-020-09958-2, https://arxiv.org/abs/2002.01822
DeSarbo, W. S. and Cron, W. L. (1988) A maximum likelihood methodology for clusterwise linear regression, Journal of Classification 5, 249-282.
Ester, M., Kriegel, H.-P., Sander, J. and Xu, X. (1996). A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. Proceedings of 2nd International Conference on Knowledge Discovery and Data Mining (KDD-96).
Fang, Y. and Wang, J. (2012) Selection of the number of clusters via the bootstrap method. Computational Statistics and Data Analysis, 56, 468-477.
Gordon, A. D. (1999) Classification, 2nd ed. Chapman and Hall.
Hennig, C. (2003) Clusters, outliers and regression: fixed point clusters, Journal of Multivariate Analysis 86, 183-212.
Hennig, C. (2004) Asymmetric linear dimension reduction for classification. Journal of Computational and Graphical Statistics, 13, 930-945.
Hennig, C. (2005) Fuzzy and Crisp Mahalanobis Fixed Point Clusters, in Baier, D., Decker, R., and Schmidt-Thieme, L. (eds.): Data Analysis and Decision Support. Springer, Heidelberg, 47-56.
Hennig, C. (2007) Cluster-wise assessment of cluster stability. Computational Statistics and Data Analysis, 52, 258-271.
Hennig, C. (2010) Methods for merging Gaussian mixture components, Advances in Data Analysis and Classification, 4, 3-34.
Hennig, C. (2019) Cluster validation by measurement of clustering characteristics relevant to the user. In C. H. Skiadas (ed.) Data Analysis and Applications 1: Clustering and Regression, Modeling-estimating, Forecasting and Data Mining, Volume 2, Wiley, New York 1-24, https://arxiv.org/abs/1703.09282
Hennig, C. and Christlieb, N. (2002) Validating visual clusters in large datasets: Fixed point clusters of spectral features, Computational Statistics and Data Analysis 40, 723-739.
Kaufman, L. and Rousseeuw, P.J. (1990). "Finding Groups in Data: An Introduction to Cluster Analysis". Wiley, New York.
Ray, S. and Lindsay, B. G. (2005) The Topography of Multivariate Normal Mixtures, Annals of Statistics, 33, 2042-2065.
Tibshirani, R. and Walther, G. (2005) Cluster Validation by Prediction Strength, Journal of Computational and Graphical Statistics, 14, 511-528.
Asymmetric discriminant coordinates as defined in Hennig (2004). Asymmetric discriminant projection means that there are two classes, one of which is treated as the homogeneous class (i.e., it should appear homogeneous and separated in the resulting projection) while the other may be heterogeneous. The principle is to maximize the ratio between the projection of a between classes separation matrix and the projection of the covariance matrix within the homogeneous class.
adcoord(xd, clvecd, clnum=1)
xd |
the data matrix; a numerical object which can be coerced to a matrix. |
clvecd |
integer vector of class numbers; length must equal
|
clnum |
integer. Number of the homogeneous class. |
The square root of the homogeneous class's covariance matrix is inverted by use of tdecomp, which can be expected to give reasonable results for singular within-class covariance matrices.
List with the following components
ev |
eigenvalues in descending order. |
units |
columns are coordinates of projection basis vectors.
New points |
proj |
projections of |
Christian Hennig [email protected] https://www.unibo.it/sitoweb/christian.hennig/en/
Hennig, C. (2004) Asymmetric linear dimension reduction for classification. Journal of Computational and Graphical Statistics 13, 930-945 .
Hennig, C. (2005) A method for visual cluster validation. In: Weihs, C. and Gaul, W. (eds.): Classification - The Ubiquitous Challenge. Springer, Heidelberg 2005, 153-160.
plotcluster for straightforward discriminant plots. discrproj for alternatives. rFace for generation of the example data used below.
set.seed(4634)
face <- rFace(600,dMoNo=2,dNoEy=0)
grface <- as.integer(attr(face,"grouping"))
adcf <- adcoord(face,grface==2)
adcf2 <- adcoord(face,grface==4)
plot(adcf$proj,col=1+(grface==2))
plot(adcf2$proj,col=1+(grface==4))
# ...done in one step by function plotcluster.
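The underlying principle can be sketched as a generalised eigenproblem: find directions a maximising t(a) %*% B %*% a / (t(a) %*% W %*% a), with W the covariance matrix of the homogeneous class and B a between-class separation matrix. The helper below is a hypothetical illustration only; it is not fpc's implementation and, unlike tdecomp, it assumes a nonsingular W.
ratio.directions <- function(B,W) {
  eW <- eigen(W,symmetric=TRUE)
  Wisqrt <- eW$vectors %*% diag(1/sqrt(eW$values)) %*% t(eW$vectors) # inverse square root of W
  eB <- eigen(Wisqrt %*% B %*% Wisqrt,symmetric=TRUE)
  list(ev=eB$values,units=Wisqrt %*% eB$vectors) # eigenvalues and projection basis vectors
}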
Asymmetric neighborhood based discriminant coordinates as defined in Hennig (2004). Asymmetric discriminant projection means that there are two classes, one of which is treated as the homogeneous class (i.e., it should appear homogeneous and separated in the resulting projection) while the other may be heterogeneous. The principle is to maximize the ratio between the projection of a between classes covariance matrix, which is defined by averaging the between classes covariance matrices in the neighborhoods of the points of the homogeneous class, and the projection of the covariance matrix within the homogeneous class.
ancoord(xd, clvecd, clnum=1, nn=50, method="mcd", countmode=1000, ...)
xd |
the data matrix; a numerical object which can be coerced to a matrix. |
clvecd |
integer vector of class numbers; length must equal
|
clnum |
integer. Number of the homogeneous class. |
nn |
integer. Number of points which belong to the neighborhood of each point (including the point itself). |
method |
one of
"mve", "mcd" or "classical". Covariance matrix used within the
homogeneous class.
"mcd" and "mve" are robust covariance matrices as implemented
in |
countmode |
optional positive integer. Every |
... |
no effect |
The square root of the homogeneous class's covariance matrix is inverted by use of tdecomp, which can be expected to give reasonable results for singular within-class covariance matrices.
List with the following components
ev |
eigenvalues in descending order. |
units |
columns are coordinates of projection basis vectors.
New points |
proj |
projections of |
Christian Hennig [email protected] https://www.unibo.it/sitoweb/christian.hennig/en/
Hennig, C. (2004) Asymmetric linear dimension reduction for classification. Journal of Computational and Graphical Statistics 13, 930-945.
Hennig, C. (2005) A method for visual cluster validation. In: Weihs, C. and Gaul, W. (eds.): Classification - The Ubiquitous Challenge. Springer, Heidelberg 2005, 153-160.
plotcluster for straightforward discriminant plots. discrproj for alternatives. rFace for generation of the example data used below.
set.seed(4634)
face <- rFace(600,dMoNo=2,dNoEy=0)
grface <- as.integer(attr(face,"grouping"))
ancf2 <- ancoord(face,grface==4)
plot(ancf2$proj,col=1+(grface==4))
# ...done in one step by function plotcluster.
Asymmetric weighted discriminant coordinates as defined in Hennig (2004). Asymmetric discriminant projection means that there are two classes, one of which is treated as the homogeneous class (i.e., it should appear homogeneous and separated in the resulting projection) while the other may be heterogeneous. The principle is to maximize the ratio between the projection of a between classes separation matrix and the projection of the covariance matrix within the homogeneous class. Points are weighted according to their (robust) Mahalanobis distance to the homogeneous class.
awcoord(xd, clvecd, clnum=1, mahal="square", method="classical", clweight=switch(method,classical=FALSE,TRUE), alpha=0.99, subsample=0, countmode=1000, ...)
xd |
the data matrix; a numerical object which can be coerced to a matrix. |
clvecd |
integer vector of class numbers; length must equal
|
clnum |
integer. Number of the homogeneous class. |
mahal |
"md" or "square". If "md", the points are weighted by the
square root of the |
method |
one of
"mve", "mcd" or "classical". Covariance matrix used within the
homogeneous class and for the computation of the Mahalanobis distances.
"mcd" and "mve" are robust covariance matrices as implemented
in |
clweight |
logical. If |
alpha |
numeric between 0 and 1. The corresponding quantile of the chi squared distribution is used for the downweighting of points. Points with a smaller Mahalanobis distance to the homogeneous class get full weight. |
subsample |
integer. If 0, all points are used. Else, only a
subsample of |
countmode |
optional positive integer. Every |
... |
no effect |
The square root of the homogeneous class's covariance matrix is inverted by use of tdecomp, which can be expected to give reasonable results for singular within-class covariance matrices.
List with the following components
ev |
eigenvalues in descending order. |
units |
columns are coordinates of projection basis vectors.
New points |
proj |
projections of |
Christian Hennig [email protected] https://www.unibo.it/sitoweb/christian.hennig/en/
Hennig, C. (2004) Asymmetric linear dimension reduction for classification. Journal of Computational and Graphical Statistics 13, 930-945.
Hennig, C. (2005) A method for visual cluster validation. In: Weihs, C. and Gaul, W. (eds.): Classification - The Ubiquitous Challenge. Springer, Heidelberg 2005, 153-160.
plotcluster for straightforward discriminant plots. discrproj for alternatives. rFace for generation of the example data used below.
set.seed(4634)
face <- rFace(600,dMoNo=2,dNoEy=0)
grface <- as.integer(attr(face,"grouping"))
awcf <- awcoord(face,grface==1)
# awcf2 <- ancoord(face,grface==1, method="mcd")
plot(awcf$proj,col=1+(grface==1))
# plot(awcf2$proj,col=1+(grface==1))
# ...done in one step by function plotcluster.
Computes Bhattacharyya discriminant projection coordinates as described in Fukunaga (1990), p. 455 ff.
batcoord(xd, clvecd, clnum=1, dom="mean") batvarcoord(xd, clvecd, clnum=1)
xd |
the data matrix; a numerical object which can be coerced to a matrix. |
clvecd |
integer or logical vector of class numbers; length must equal
|
clnum |
integer, one of the values of |
dom |
string. |
batvarcoord computes the optimal projection coordinates with respect to the difference in variances. batcoord combines the differences in mean and variance as explained for the argument dom.
batcoord returns a list with the components ev, rev, units, proj. batvarcoord returns a list with the components ev, rev, units, proj, W, S1, S2.
ev |
vector of eigenvalues. If |
rev |
for |
units |
columns are coordinates of projection basis vectors.
New points |
proj |
projections of |
W |
matrix |
S1 |
covariance matrix of the first class. |
S2 |
covariance matrix of the second class. |
Christian Hennig [email protected] https://www.unibo.it/sitoweb/christian.hennig/en/
Fukunaga, K. (1990). Introduction to Statistical Pattern Recognition (2nd ed.). Boston: Academic Press.
plotcluster for straightforward discriminant plots. discrcoord for discriminant coordinates. rFace for generation of the example data used below.
set.seed(4634)
face <- rFace(600,dMoNo=2,dNoEy=0)
grface <- as.integer(attr(face,"grouping"))
bcf2 <- batcoord(face,grface==2)
plot(bcf2$proj,col=1+(grface==2))
bcfv2 <- batcoord(face,grface==2,dom="variance")
plot(bcfv2$proj,col=1+(grface==2))
bcfvv2 <- batvarcoord(face,grface==2)
plot(bcfvv2$proj,col=1+(grface==2))
Computes Bhattacharyya distance between two multivariate Gaussian distributions. See Fukunaga (1990).
bhattacharyya.dist(mu1, mu2, Sigma1, Sigma2)
mu1 |
mean vector of component 1. |
mu2 |
mean vector of component 2. |
Sigma1 |
covariance matrix of component 1. |
Sigma2 |
covariance matrix of component 2. |
The Bhattacharyya distance between the two Gaussian distributions.
Thanks to David Pinto for improving this function.
Christian Hennig [email protected] https://www.unibo.it/sitoweb/christian.hennig/en/
Fukunaga, K. (1990) Introduction to Statistical Pattern Recognition, 2nd edition, Academic Press, New York.
Hennig, C. (2010) Methods for merging Gaussian mixture components, Advances in Data Analysis and Classification, 4, 3-34.
round(bhattacharyya.dist(c(1,1),c(2,5),diag(2),diag(2)),digits=2)
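For illustration, the standard closed-form expression for the Bhattacharyya distance between two Gaussian distributions can be checked by hand (bd.manual is a hypothetical helper, not part of fpc):
bd.manual <- function(mu1,mu2,Sigma1,Sigma2) {
  Sigma <- (Sigma1+Sigma2)/2
  d <- mu1-mu2
  as.numeric(t(d) %*% solve(Sigma) %*% d)/8 +
    log(det(Sigma)/sqrt(det(Sigma1)*det(Sigma2)))/2
}
bd.manual(c(1,1),c(2,5),diag(2),diag(2)) # 17/8 = 2.125, consistent with the example above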
Computes Bhattacharyya distances for pairs of components given the parameters of a Gaussian mixture.
bhattacharyya.matrix(muarray,Sigmaarray,ipairs="all", misclassification.bound=TRUE)
muarray |
matrix of component means (different components are in different columns). |
Sigmaarray |
three dimensional array with component covariance matrices (the third dimension refers to components). |
ipairs |
|
misclassification.bound |
logical. If |
A matrix with Bhattacharyya distances (or derived misclassification bounds, see above) between pairs of Gaussian distributions with the provided parameters. If ipairs!="all", the Bhattacharyya distance and the misclassification bound are given as NA for pairs not included in ipairs.
Christian Hennig [email protected] https://www.unibo.it/sitoweb/christian.hennig/en/
Fukunaga, K. (1990) Introduction to Statistical Pattern Recognition, 2nd edition, Academic Press, New York.
Hennig, C. (2010) Methods for merging Gaussian mixture components, Advances in Data Analysis and Classification, 4, 3-34.
muarray <- cbind(c(0,0),c(0,0.1),c(10,10))
sigmaarray <- array(c(diag(2),diag(2),diag(2)),dim=c(2,2,3))
bhattacharyya.matrix(muarray,sigmaarray,ipairs=list(c(1,2),c(2,3)))
Calinski-Harabasz index for estimating the number of clusters, based on an observations/variables-matrix here. A distance based version is available through cluster.stats.
calinhara(x,clustering,cn=max(clustering))
x |
data matrix or data frame. |
clustering |
vector of integers. Clustering. |
cn |
integer. Number of clusters. |
Calinski-Harabasz statistic, which is (n-cn)*sum(diag(B))/((cn-1)*sum(diag(W))), B being the between-cluster dispersion matrix computed from the cluster means, and W being the pooled within-cluster dispersion matrix.
Christian Hennig [email protected] https://www.unibo.it/sitoweb/christian.hennig/en
Calinski, T., and Harabasz, J. (1974) A Dendrite Method for Cluster Analysis, Communications in Statistics, 3, 1-27.
set.seed(98765)
iriss <- iris[sample(150,20),-5]
km <- kmeans(iriss,3)
round(calinhara(iriss,km$cluster),digits=2)
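As a minimal sketch of the formula given above (ch.manual is a hypothetical helper, not fpc's code), the index can be recomputed from the between- and within-cluster dispersion matrices:
ch.manual <- function(x,cl) {
  x <- as.matrix(x); n <- nrow(x); k <- max(cl)
  overall <- colMeans(x)
  W <- B <- matrix(0,ncol(x),ncol(x))
  for (j in 1:k) {
    xj <- x[cl==j,,drop=FALSE]
    mj <- colMeans(xj)
    W <- W + crossprod(sweep(xj,2,mj))        # within-cluster dispersion
    B <- B + nrow(xj)*tcrossprod(mj-overall)  # between-cluster dispersion
  }
  (n-k)*sum(diag(B))/((k-1)*sum(diag(W)))
}
round(ch.manual(iriss,km$cluster),digits=2) # should agree with calinhara above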
Generates tuning constants ca for fixreg dependent on the number of points and variables of the dataset. Intended only for use within fixreg.
can(n, p)
n |
positive integer. Number of points. |
p |
positive integer. Number of independent variables. |
For the formula and its justification cf. Hennig (2002).
A number.
Christian Hennig [email protected] https://www.unibo.it/sitoweb/christian.hennig/en/
Hennig, C. (2002) Fixed point clusters for linear regression: computation and comparison, Journal of Classification 19, 249-276.
can(429,3)
Recodes a dataset with nominal variables so that the nominal variables are replaced by binary variables for the categories.
cat2bin(x,categorical=NULL)
x |
data matrix or data frame. The data need to be organised case-wise, i.e., if there are categorical variables only, and 15 cases with values c(1,1,2) on the 3 variables, the data matrix needs 15 rows with values 1 1 2. (Categorical variables could take numbers or strings or anything that can be coerced to factor levels as values.) |
categorical |
vector of numbers of variables to be recoded. |
A list with components
data |
data matrix with variables specified in |
variableinfo |
list of lists. One list for every variable in the
original dataset, with four components each, namely |
Christian Hennig [email protected] https://www.unibo.it/sitoweb/christian.hennig/en
set.seed(776655)
v1 <- rnorm(20)
v2 <- rnorm(20)
d1 <- sample(1:5,20,replace=TRUE)
d2 <- sample(1:4,20,replace=TRUE)
ldata <- cbind(v1,v2,d1,d2)
lc <- cat2bin(ldata,categorical=3:4)
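A quick check of the recoding (illustration only, not part of the help page example): the continuous columns are passed through unchanged and each categorical variable is replaced by one 0/1 column per category occurring in the data, so
ncol(lc$data) # 2 continuous columns plus one binary column for every observed category of d1 and d2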
CDbw-index for cluster validation, as defined in Halkidi and Vazirgiannis (2008), Halkidi et al. (2015).
cdbw(x,clustering,r=10,s=seq(0.1,0.8,by=0.1), clusterstdev=TRUE,trace=FALSE)
x |
something that can be coerced into a numerical matrix. Euclidean dataset. |
clustering |
vector of integers with length |
r |
integer. Number of cluster border representatives. |
s |
numerical vector of shrinking factors (between 0 and 1). |
clusterstdev |
logical. If |
trace |
logical. If |
List with components (see Halkidi and Vazirgiannis (2008), Halkidi et al. (2015) for details)
cdbw |
value of CDbw index (the higher the better). |
cohesion |
cohesion. |
compactness |
compactness. |
sep |
separation. |
Christian Hennig [email protected] https://www.unibo.it/sitoweb/christian.hennig/en/
Halkidi, M. and Vazirgiannis, M. (2008) A density-based cluster validity approach using multi-representatives. Pattern Recognition Letters 29, 773-786.
Halkidi, M., Vazirgiannis, M. and Hennig, C. (2015) Method-independent indices for cluster validation. In C. Hennig, M. Meila, F. Murtagh, R. Rocci (eds.) Handbook of Cluster Analysis, CRC Press/Taylor & Francis, Boca Raton.
options(digits=3)
iriss <- as.matrix(iris[c(1:5,51:55,101:105),-5])
irisc <- as.numeric(iris[c(1:5,51:55,101:105),5])
cdbw(iriss,irisc)
Standardises cluster validity statistics as produced by clustatsum relative to results that were achieved by random clusterings on the same data by randomclustersim. The aim is to make differences between values comparable between indexes, see Hennig (2019), Akhanli and Hennig (2020). This is mainly for use within clusterbenchstats.
cgrestandard(clusum,clusim,G,percentage=FALSE, useallmethods=FALSE, useallg=FALSE, othernc=list())
clusum |
object of class "valstat", see |
clusim |
list; output object of |
G |
vector of integers. Numbers of clusters to consider. |
percentage |
logical. If |
useallmethods |
logical. If |
useallg |
logical. If |
othernc |
list of integer vectors of length 2. This allows the
incorporation of methods that bring forth other numbers of clusters
than those in |
cgrestandard will add a statistic named dmode to the input set of validation statistics, which is defined as 0.75*dindex+0.25*highdgap, aggregating these two closely related statistics, see clustatsum.
List of class "valstat"
, see
valstat.object
, with standardised results as
explained above.
Christian Hennig [email protected] https://www.unibo.it/sitoweb/christian.hennig/en/
Hennig, C. (2019) Cluster validation by measurement of clustering characteristics relevant to the user. In C. H. Skiadas (ed.) Data Analysis and Applications 1: Clustering and Regression, Modeling-estimating, Forecasting and Data Mining, Volume 2, Wiley, New York 1-24, https://arxiv.org/abs/1703.09282
Akhanli, S. and Hennig, C. (2020) Calibrating and aggregating cluster validity indexes for context-adapted comparison of clusterings. Statistics and Computing, 30, 1523-1544, https://link.springer.com/article/10.1007/s11222-020-09958-2, https://arxiv.org/abs/2002.01822
valstat.object, clusterbenchstats, stupidkcentroids, stupidknn, stupidkfn, stupidkaven, clustatsum
set.seed(20000)
options(digits=3)
face <- rFace(10,dMoNo=2,dNoEy=0,p=2)
dif <- dist(face)
clusum <- list()
clusum[[2]] <- list()
cl12 <- kmeansCBI(face,2)
cl13 <- kmeansCBI(face,3)
cl22 <- claraCBI(face,2)
cl23 <- claraCBI(face,2)
ccl12 <- clustatsum(dif,cl12$partition)
ccl13 <- clustatsum(dif,cl13$partition)
ccl22 <- clustatsum(dif,cl22$partition)
ccl23 <- clustatsum(dif,cl23$partition)
clusum[[1]] <- list()
clusum[[1]][[2]] <- ccl12
clusum[[1]][[3]] <- ccl13
clusum[[2]][[2]] <- ccl22
clusum[[2]][[3]] <- ccl23
clusum$maxG <- 3
clusum$minG <- 2
clusum$method <- c("kmeansCBI","claraCBI")
clusum$name <- c("kmeansCBI","claraCBI")
clusim <- randomclustersim(dist(face),G=2:3,nnruns=1,kmruns=1,
  fnruns=1,avenruns=1,monitor=FALSE)
cgr <- cgrestandard(clusum,clusim,2:3)
cgr2 <- cgrestandard(clusum,clusim,2:3,useallg=TRUE)
cgr3 <- cgrestandard(clusum,clusim,2:3,percentage=TRUE)
print(str(cgr))
print(str(cgr2))
print(cgr3[[1]][[2]])
Various methods for classification of unclustered points from clustered points for use within functions nselectboot and prediction.strength.
classifdist(cdist, clustering, method="averagedist", centroids=NULL, nnk=1)
classifnp(data, clustering, method="centroid", cdist=NULL, centroids=NULL, nnk=1)
cdist |
dissimilarity matrix or |
data |
something that can be coerced into an
|
clustering |
integer vector. Gives the cluster number (between 1 and k for k clusters) for clustered points and should be -1 for points to be classified. |
method |
one of |
centroids |
for |
nnk |
number of nearest neighbours if |
classifdist is for data given as a dissimilarity matrix, classifnp is for data given as an n times p data matrix. The following methods are supported:
"centroid" assigns observations to the cluster with the closest cluster centroid as specified in argument centroids (this is associated with k-means and pam/clara clustering).
"qda" (only in classifnp) classifies by quadratic discriminant analysis (this is associated with Gaussian clusters with flexible covariance matrices), calling qda with default settings. If qda gives an error (usually because a class was too small), lda is used.
"lda" (only in classifnp) classifies by linear discriminant analysis (this is associated with Gaussian clusters with equal covariance matrices), calling lda with default settings.
"averagedist" assigns to the cluster to which an observation has the minimum average dissimilarity to all points in the cluster (this is associated with average linkage clustering).
"knn" classifies by nnk nearest neighbours (for nnk=1, this is associated with single linkage clustering). Calls knn in classifnp.
"fn" classifies by the minimum distance to the farthest neighbour (this is associated with complete linkage clustering).
An integer vector giving cluster numbers for all observations; those for the observations already clustered in the input are the same as in the input.
Christian Hennig [email protected] https://www.unibo.it/sitoweb/christian.hennig/en/
prediction.strength, nselectboot
set.seed(20000)
x1 <- rnorm(50)
y <- rnorm(100)
x2 <- rnorm(40,mean=20)
x3 <- rnorm(10,mean=25,sd=100)
x <- cbind(c(x1,x2,x3),y)
truec <- c(rep(1,50),rep(2,40),rep(3,10))
topredict <- c(1,2,51,52,91)
clumin <- truec
clumin[topredict] <- -1
classifnp(x,clumin, method="averagedist")
classifnp(x,clumin, method="qda")
classifdist(dist(x),clumin, centroids=c(3,53,93),method="centroid")
classifdist(dist(x),clumin,method="knn")
clucols gives out a vector of different random colours. clugrey gives out a vector of equidistant grey scales. clusym is a vector of different symbols starting from "1", "2",...
clucols(i, seed=NULL)
clugrey(i, max=0.9)
clusym
i |
integer. Length of output vector (number of clusters). |
seed |
integer. Random seed. |
max |
between 0 and 1. Maximum grey scale value, see
|
clucols gives out a vector of different random colours. clugrey gives out a vector of equidistant grey scales. clusym is a vector of different characters starting from "1", "2",...
Christian Hennig [email protected] https://www.unibo.it/sitoweb/christian.hennig/en
set.seed(112233)
require(MASS)
require(flexmix)
data(Cars93)
Cars934 <- Cars93[,c(3,5,8,10)]
cc <- discrete.recode(Cars934,xvarsorted=FALSE,continuous=c(2,3),discrete=c(1,4))
fcc <- flexmix(cc$data~1,k=3,
  model=lcmixed(continuous=2,discrete=2,ppdim=c(6,3),diagonal=TRUE))
plot(Cars934[,c(2,3)],col=clucols(3)[fcc@cluster],pch=clusym[fcc@cluster])
Jaccard similarity between logical or 0-1 vectors: sum(c1 & c2)/sum(c1 | c2).
clujaccard(c1,c2,zerobyzero=NA)
c1 |
logical or 0-1-vector. |
c2 |
logical or 0-1-vector (same length). |
zerobyzero |
result if |
Numeric between 0 and 1.
Christian Hennig [email protected] https://www.unibo.it/sitoweb/christian.hennig/en/
c1 <- rep(TRUE,10)
c2 <- c(FALSE,rep(TRUE,9))
clujaccard(c1,c2)
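The value can also be computed directly from the definition above:
sum(c1 & c2)/sum(c1 | c2) # 9/10 = 0.9, the same value as clujaccard(c1,c2)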
A rough approximation of the expectation of the number of times a well separated fixed point cluster (FPC) of size cn is found in ir fixed point iterations of fixreg.
clusexpect(n, p, cn, ir)
n |
positive integer. Total number of points. |
p |
positive integer. Number of independent variables. |
cn |
positive integer smaller or equal to |
ir |
positive integer. Number of fixed point iterations. |
The approximation is based on the assumption that a well separated FPC is found iff all p+2 points of the initial configuration come from the FPC. The value is ir times the probability for this. For a discussion of this assumption cf. Hennig (2002).
A number.
Christian Hennig [email protected] https://www.unibo.it/sitoweb/christian.hennig/en/
Hennig, C. (2002) Fixed point clusters for linear regression: computation and comparison, Journal of Classification 19, 249-276.
round(clusexpect(500,4,150,2000),digits=2)
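A hedged reading of the approximation described in the Details section (the helper name is hypothetical and the exact computation in clusexpect may differ in detail): ir times the probability that all p+2 points of an initial configuration, drawn without replacement, stem from the FPC of size cn.
clusexpect.sketch <- function(n,p,cn,ir)
  ir*prod((cn-0:(p+1))/(n-0:(p+1))) # probability that all p+2 drawn points lie in the FPC, times ir
round(clusexpect.sketch(500,4,150,2000),digits=2)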
clustatsum computes cluster validation statistics by running cqcluster.stats, and potentially distrsimilarity, and collecting some key statistic values with a somewhat different nomenclature. This was implemented as a helper function for use inside clusterbenchstats and cgrestandard.
clustatsum(datadist=NULL, clustering, noisecluster=FALSE,
           datanp=NULL, npstats=FALSE, useboot=FALSE,
           bootclassif=NULL, bootmethod="nselectboot", bootruns=25,
           cbmethod=NULL, methodpars=NULL, distmethod=NULL, dnnk=2,
           pamcrit=TRUE, ...)
datadist |
distances on which validation-measures are based, |
clustering |
an integer vector of length of the number of cases, which indicates a clustering. The clusters have to be numbered from 1 to the number of clusters. |
noisecluster |
logical. If |
datanp |
optional observations times variables data matrix, see
|
npstats |
logical. If |
useboot |
logical. If |
bootclassif |
If |
bootmethod |
either |
bootruns |
integer. Number of resampling runs. If
|
cbmethod |
CBI-function (see |
methodpars |
parameters to be passed on to |
distmethod |
logical. In case of |
dnnk |
|
pamcrit |
|
... |
further arguments to be passed on to
|
clustatsum returns a list. The components, as listed below, are outputs of summary.cquality with default parameters, which means that they are partly transformed versions of those given out by cqcluster.stats, i.e., their range is between 0 and 1 and large values are good. Those from distrsimilarity are computed with largeisgood=TRUE, correspondingly.
avewithin |
average distance within clusters (reweighted so that every observation, rather than every distance, has the same weight). |
mnnd |
average distance to |
cvnnd |
coefficient of variation of dissimilarities to
|
maxdiameter |
maximum cluster diameter. |
widestgap |
widest within-cluster gap or average of cluster-wise
widest within-cluster gap, depending on parameter |
sindex |
separation index, see argument |
minsep |
minimum cluster separation. |
asw |
average silhouette
width. See |
dindex |
this index measures to what extent the density decreases from the cluster mode to the outskirts; I-densdec in Sec. 3.6 of Hennig (2019). |
denscut |
this index measures whether cluster boundaries run through density valleys; I-densbound in Sec. 3.6 of Hennig (2019). |
highdgap |
this measures whether there is a large within-cluster gap with high density on both sides; I-highdgap in Sec. 3.6 of Hennig (2019). |
pearsongamma |
correlation between distances and a 0-1-vector where 0 means same cluster, 1 means different clusters. "Normalized gamma" in Halkidi et al. (2001). |
withinss |
a generalisation of the within clusters sum
of squares (k-means objective function), which is obtained if
|
entropy |
entropy of the distribution of cluster memberships, see Meila(2007). |
pamc |
average distance to cluster centroid. |
kdnorm |
Kolmogorov distance between distribution of within-cluster Mahalanobis distances and appropriate chi-squared distribution, aggregated over clusters (I am grateful to Agustin Mayo-Iscar for the idea). |
kdunif |
Kolmogorov distance between distribution of distances to
|
boot |
if |
Christian Hennig [email protected] https://www.unibo.it/sitoweb/christian.hennig/en/
Akhanli, S. and Hennig, C. (2020) Calibrating and aggregating cluster validity indexes for context-adapted comparison of clusterings. Statistics and Computing, 30, 1523-1544, https://link.springer.com/article/10.1007/s11222-020-09958-2, https://arxiv.org/abs/2002.01822
Halkidi, M., Batistakis, Y., Vazirgiannis, M. (2001) On Clustering Validation Techniques, Journal of Intelligent Information Systems, 17, 107-145.
Hennig, C. (2019) Cluster validation by measurement of clustering characteristics relevant to the user. In C. H. Skiadas (ed.) Data Analysis and Applications 1: Clustering and Regression, Modeling-estimating, Forecasting and Data Mining, Volume 2, Wiley, New York 1-24, https://arxiv.org/abs/1703.09282
Kaufman, L. and Rousseeuw, P.J. (1990). "Finding Groups in Data: An Introduction to Cluster Analysis". Wiley, New York.
Meila, M. (2007) Comparing clusterings - an information based distance, Journal of Multivariate Analysis, 98, 873-895.
cqcluster.stats, distrsimilarity
set.seed(20000)
options(digits=3)
face <- rFace(20,dMoNo=2,dNoEy=0,p=2)
dface <- dist(face)
complete3 <- cutree(hclust(dface),3)
clustatsum(dface,complete3)
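The following lines (illustration only, not part of the help page example) show how the output differs from cqcluster.stats: the reported values are already transformed to the range [0,1] with larger values being better.
cs <- clustatsum(dface,complete3)
unlist(cs[c("avewithin","asw","sindex","pearsongamma")]) # all within [0,1], larger is better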
Runs a user-specified set of clustering methods (CBI-functions, see kmeansCBI) with several numbers of clusters on a dataset, with unified output.
cluster.magazine(data, G, diss = inherits(data, "dist"), scaling=TRUE,
                 clustermethod, distmethod=rep(TRUE,length(clustermethod)),
                 ncinput=rep(TRUE,length(clustermethod)),
                 clustermethodpars, trace=TRUE)
data |
data matrix or |
G |
vector of integers. Numbers of clusters to consider. |
diss |
logical. If |
scaling |
either a logical or a numeric vector of length equal to
the number of columns of |
clustermethod |
vector of strings specifying names of
CBI-functions (see |
distmethod |
vector of logicals, of the same length as
|
ncinput |
vector of logicals, of the same length as
|
clustermethodpars |
list of the same length as
|
trace |
logical. If |
List of lists comprising
output |
Two-dimensional list. The first list index i is the number
of the clustering method (ordering as specified in
|
clustering |
Two-dimensional list. The first list index i is the number
of the clustering method (ordering as specified in
|
noise |
Two-dimensional list. The first list index i is the number
of the clustering method (ordering as specified in
|
othernc |
list of integer vectors of length 2. The first number is
the number of the clustering method (the order is determined by
argument |
Christian Hennig [email protected] https://www.unibo.it/sitoweb/christian.hennig/en/
Hennig, C. (2017) Cluster validation by measurement of clustering characteristics relevant to the user. In C. H. Skiadas (ed.) Proceedings of ASMDA 2017, 501-520, https://arxiv.org/abs/1703.09282
set.seed(20000)
options(digits=3)
face <- rFace(10,dMoNo=2,dNoEy=0,p=2)
clustermethod=c("kmeansCBI","hclustCBI","hclustCBI")
# A clustering method can be used more than once, with different
# parameters
clustermethodpars <- list()
clustermethodpars[[2]] <- clustermethodpars[[3]] <- list()
clustermethodpars[[2]]$method <- "complete"
clustermethodpars[[3]]$method <- "average"
cmf <- cluster.magazine(face,G=2:3,clustermethod=clustermethod,
  distmethod=rep(FALSE,3),clustermethodpars=clustermethodpars)
print(str(cmf))
Computes a number of distance based statistics, which can be used for cluster validation, comparison between clusterings and decision about the number of clusters: cluster sizes, cluster diameters, average distances within and between clusters, cluster separation, biggest within cluster gap, average silhouette widths, the Calinski and Harabasz index, a Pearson version of Hubert's gamma coefficient, the Dunn index and two indexes to assess the similarity of two clusterings, namely the corrected Rand index and Meila's VI.
cluster.stats(d = NULL, clustering, alt.clustering = NULL,
              noisecluster=FALSE, silhouette = TRUE, G2 = FALSE, G3 = FALSE,
              wgap=TRUE, sepindex=TRUE, sepprob=0.1, sepwithnoise=TRUE,
              compareonly = FALSE, aggregateonly = FALSE)
d |
a distance object (as generated by |
clustering |
an integer vector of length of the number of cases, which indicates a clustering. The clusters have to be numbered from 1 to the number of clusters. |
alt.clustering |
an integer vector such as for
|
noisecluster |
logical. If |
silhouette |
logical. If |
G2 |
logical. If |
G3 |
logical. If |
wgap |
logical. If |
sepindex |
logical. If |
sepprob |
numerical between 0 and 1, see |
sepwithnoise |
logical. If |
compareonly |
logical. If |
aggregateonly |
logical. If |
cluster.stats returns a list containing the components n, cluster.number, cluster.size, min.cluster.size, noisen, diameter, average.distance, median.distance, separation, average.toother, separation.matrix, average.between, average.within, n.between, n.within, within.cluster.ss, clus.avg.silwidths, avg.silwidth, g2, g3, pearsongamma, dunn, entropy, wb.ratio, ch, cwidegap, widestgap, sindex, corrected.rand, vi, except if compareonly=TRUE, in which case only the last two components are computed.
n |
number of cases. |
cluster.number |
number of clusters. |
cluster.size |
vector of cluster sizes (number of points). |
min.cluster.size |
size of smallest cluster. |
noisen |
number of noise points, see argument |
diameter |
vector of cluster diameters (maximum within cluster distances). |
average.distance |
vector of clusterwise within cluster average distances. |
median.distance |
vector of clusterwise within cluster distance medians. |
separation |
vector of clusterwise minimum distances of a point in the cluster to a point of another cluster. |
average.toother |
vector of clusterwise average distances of a point in the cluster to the points of other clusters. |
separation.matrix |
matrix of separation values between all pairs of clusters. |
ave.between.matrix |
matrix of mean dissimilarities between points of every pair of clusters. |
average.between |
average distance between clusters. |
average.within |
average distance within clusters (reweighted so that every observation, rather than every distance, has the same weight). |
n.between |
number of distances between clusters. |
n.within |
number of distances within clusters. |
max.diameter |
maximum cluster diameter. |
min.separation |
minimum cluster separation. |
within.cluster.ss |
a generalisation of the within clusters sum
of squares (k-means objective function), which is obtained if
|
clus.avg.silwidths |
vector of cluster average silhouette
widths. See
|
avg.silwidth |
average silhouette
width. See
|
g2 |
Goodman and Kruskal's Gamma coefficient. See Milligan and Cooper (1985), Gordon (1999, p. 62). |
g3 |
G3 coefficient. See Gordon (1999, p. 62). |
pearsongamma |
correlation between distances and a 0-1-vector where 0 means same cluster, 1 means different clusters. "Normalized gamma" in Halkidi et al. (2001). |
dunn |
minimum separation / maximum diameter. Dunn index, see Halkidi et al. (2002). |
dunn2 |
minimum average dissimilarity between two clusters / maximum average within-cluster dissimilarity, another version of the family of Dunn indexes. |
entropy |
entropy of the distribution of cluster memberships, see Meila(2007). |
wb.ratio |
|
ch |
Calinski and Harabasz index (Calinski and Harabasz 1974, optimal in Milligan and Cooper 1985; generalised for dissimilarities in Hennig and Liao 2013). |
cwidegap |
vector of widest within-cluster gaps. |
widestgap |
widest within-cluster gap. |
sindex |
separation index, see argument |
corrected.rand |
corrected Rand index (if |
vi |
variation of information (VI) index (if |
Because cluster.stats processes a full dissimilarity matrix, it isn't suitable for large data sets. You may consider distcritmulti in that case.
Christian Hennig [email protected] https://www.unibo.it/sitoweb/christian.hennig/en/
Calinski, T., and Harabasz, J. (1974) A Dendrite Method for Cluster Analysis, Communications in Statistics, 3, 1-27.
Gordon, A. D. (1999) Classification, 2nd ed. Chapman and Hall.
Halkidi, M., Batistakis, Y., Vazirgiannis, M. (2001) On Clustering Validation Techniques, Journal of Intelligent Information Systems, 17, 107-145.
Hennig, C. and Liao, T. (2013) How to find an appropriate clustering for mixed-type variables with application to socio-economic stratification, Journal of the Royal Statistical Society, Series C Applied Statistics, 62, 309-369.
Hennig, C. (2013) How many bee species? A case study in determining the number of clusters. In: Spiliopoulou, L. Schmidt-Thieme, R. Janning (eds.): "Data Analysis, Machine Learning and Knowledge Discovery", Springer, Berlin, 41-49.
Kaufman, L. and Rousseeuw, P.J. (1990). "Finding Groups in Data: An Introduction to Cluster Analysis". Wiley, New York.
Meila, M. (2007) Comparing clusterings - an information based distance, Journal of Multivariate Analysis, 98, 873-895.
Milligan, G. W. and Cooper, M. C. (1985) An examination of procedures for determining the number of clusters. Psychometrika, 50, 159-179.
cqcluster.stats is a more sophisticated version of cluster.stats with more options. silhouette, dist, calinhara, distcritmulti. clusterboot computes clusterwise stability statistics by resampling.
set.seed(20000)
options(digits=3)
face <- rFace(200,dMoNo=2,dNoEy=0,p=2)
dface <- dist(face)
complete3 <- cutree(hclust(dface),3)
cluster.stats(dface,complete3,
  alt.clustering=as.integer(attr(face,"grouping")))
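A small consistency check of the relations stated in the Value section (illustration only, not additional functionality):
cs <- cluster.stats(dface,complete3)
all.equal(cs$dunn,cs$min.separation/cs$max.diameter) # Dunn index = minimum separation / maximum diameter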
This function gives some helpful variable-wise information for cluster interpretation, given a clustering and a data set. The output object contains some tables. For categorical variables, tables compare clusterwise distributions with overall distributions. Continuous variables are categorised for this.
If desired, tables, histograms, some standard statistics of continuous variables and validation plots as available through discrproj (Hennig 2004) are given out on the fly.
cluster.varstats(clustering, vardata, contdata=vardata, clusterwise=TRUE,
                 tablevar=NULL, catvar=NULL, quantvar=NULL, catvarcats=10,
                 proportions=FALSE, projmethod="none", minsize=ncol(contdata)+2,
                 ask=TRUE, rangefactor=1)
## S3 method for class 'varwisetables'
print(x, digits=3, ...)
clustering |
vector of integers. Clustering (needs to be in standard coding, 1,2,...). |
vardata |
data matrix or data frame of which variables are summarised. |
contdata |
variable matrix or data frame, normally all or some
variables from |
clusterwise |
logical. If |
tablevar |
vector of integers. Numbers of variables treated as
categorical (i.e., no histograms and statistics, just tables) if
|
catvar |
vector of integers. Numbers of variables to be categorised by proportional quantiles for table computation. Recommended for all continuous variables. |
quantvar |
vector of integers. Variables for which means,
standard deviations and quantiles should be given out if
|
catvarcats |
integer. Number of categories used for
categorisation of variables specified in |
proportions |
logical. If |
projmethod |
one of |
minsize |
integer. Projection is not carried out for clusters with fewer points than this. (If this is chosen smaller, it may lead to errors with some projection methods.) |
ask |
logical. If |
rangefactor |
numeric. Factor by which to multiply the range for projection plot ranges. |
x |
an object of class |
digits |
integer. Number of digits after the decimal point to print out. |
... |
not used. |
An object of class "varwisetables"
, which is a
list with a table for each variable, giving (categorised) marginal
distributions by cluster.
Christian Hennig [email protected] https://www.unibo.it/sitoweb/christian.hennig/en
Hennig, C. (2004) Asymmetric linear dimension reduction for classification. Journal of Computational and Graphical Statistics 13, 930-945.
set.seed(112233)
options(digits=3)
require(MASS)
require(flexmix)
data(Cars93)
Cars934 <- Cars93[,c(3,5,8,10)]
cc <- discrete.recode(Cars934,xvarsorted=FALSE,continuous=c(2,3),discrete=c(1,4))
fcc <- flexmix(cc$data~1,k=2,
  model=lcmixed(continuous=2,discrete=2,ppdim=c(6,3),diagonal=TRUE))
cv <- cluster.varstats(fcc@cluster,Cars934,
  contdata=Cars934[,c(2,3)],
  tablevar=c(1,4),catvar=c(2,3),quantvar=c(2,3),projmethod="awc",
  ask=FALSE)
print(cv)
This runs the methodology explained in Hennig (2019), Akhanli and Hennig (2020). It runs a user-specified set of clustering methods (CBI-functions, see kmeansCBI) with several numbers of clusters on a dataset, and computes many cluster validation indexes. In order to explore the variation of these indexes, random clusterings on the data are generated, and validation indexes are standardised by use of the random clusterings in order to make them comparable and differences between values interpretable.
The function print.valstat can be used to provide weights for the cluster validation statistics, and will then compute a weighted validation index that can be used to compare all clusterings.
See the examples for how to get the indexes A1 and A2 from Akhanli and Hennig (2020).
clusterbenchstats(data, G, diss = inherits(data, "dist"), scaling=TRUE,
                  clustermethod, methodnames=clustermethod,
                  distmethod=rep(TRUE,length(clustermethod)),
                  ncinput=rep(TRUE,length(clustermethod)), clustermethodpars,
                  npstats=FALSE, useboot=FALSE, bootclassif=NULL,
                  bootmethod="nselectboot", bootruns=25, trace=TRUE,
                  pamcrit=TRUE, snnk=2, dnnk=2,
                  nnruns=100, kmruns=100, fnruns=100, avenruns=100,
                  multicore=FALSE, cores=detectCores()-1,
                  useallmethods=TRUE, useallg=FALSE, ...)
## S3 method for class 'clusterbenchstats'
print(x, ...)
data |
data matrix or |
G |
vector of integers. Numbers of clusters to consider. |
diss |
logical. If |
scaling |
either a logical or a numeric vector of length equal to
the number of columns of |
clustermethod |
vector of strings specifying names of
CBI-functions (see |
methodnames |
vector of strings with user-chosen names for
clustering methods, one for every method in
|
distmethod |
vector of logicals, of the same length as
|
ncinput |
vector of logicals, of the same length as
|
clustermethodpars |
list of the same length as
|
npstats |
logical. If |
useboot |
logical. If |
bootclassif |
If |
bootmethod |
either |
bootruns |
integer. Number of resampling runs. If
|
trace |
logical. If |
pamcrit |
logical. If |
snnk |
integer. Number of neighbours used in coefficient of
variation of distance to nearest within cluster neighbour, the
|
dnnk |
integer. Number of nearest neighbors to use for
dissimilarity to the uniform in case that |
nnruns |
integer. Number of runs of |
kmruns |
integer. Number of runs of
|
fnruns |
integer. Number of runs of |
avenruns |
integer. Number of runs of |
multicore |
logical. If |
cores |
integer. Number of cores for parallelisation. |
useallmethods |
logical, to be passed on to
|
useallg |
logical to be passed on to
|
... |
further arguments to be passed on to
|
x |
object of class |
The output of clusterbenchstats is a big list of lists comprising lists cm, stat, sim, qstat, sstat.
cm |
output object of |
stat |
object of class |
sim |
output object of |
qstat |
object of class |
sstat |
object of class |
This may require a lot of computing time and also memory for datasets that are not small, as most indexes require computation and storage of distances.
Christian Hennig [email protected] https://www.unibo.it/sitoweb/christian.hennig/en/
Hennig, C. (2019) Cluster validation by measurement of clustering characteristics relevant to the user. In C. H. Skiadas (ed.) Data Analysis and Applications 1: Clustering and Regression, Modeling-estimating, Forecasting and Data Mining, Volume 2, Wiley, New York 1-24, https://arxiv.org/abs/1703.09282
Akhanli, S. and Hennig, C. (2020) Calibrating and aggregating cluster validity indexes for context-adapted comparison of clusterings. Statistics and Computing, 30, 1523-1544, https://link.springer.com/article/10.1007/s11222-020-09958-2, https://arxiv.org/abs/2002.01822
valstat.object, cluster.magazine, kmeansCBI, cqcluster.stats, clustatsum, cgrestandard
set.seed(20000)
options(digits=3)
face <- rFace(10,dMoNo=2,dNoEy=0,p=2)
clustermethod=c("kmeansCBI","hclustCBI")
# A clustering method can be used more than once, with different
# parameters
clustermethodpars <- list()
clustermethodpars[[2]] <- list()
clustermethodpars[[2]]$method <- "average"
# Last element of clustermethodpars needs to have an entry!
methodname <- c("kmeans","average")
cbs <- clusterbenchstats(face,G=2:3,clustermethod=clustermethod,
  methodname=methodname,distmethod=rep(FALSE,2),
  clustermethodpars=clustermethodpars,nnruns=1,kmruns=1,fnruns=1,avenruns=1)
print(cbs)
print(cbs$qstat,aggregate=TRUE,weights=c(1,0,0,0,0,1,0,1,0,1,0,1,0,0,1,1))
# The weights are weights for the validation statistics ordered as in
# cbs$qstat$statistics for computation of an aggregated index, see
# ?print.valstat.
# Now using bootstrap stability assessment as in Akhanli and Hennig (2020):
bootclassif <- c("centroid","averagedist")
cbsboot <- clusterbenchstats(face,G=2:3,clustermethod=clustermethod,
  methodname=methodname,distmethod=rep(FALSE,2),
  clustermethodpars=clustermethodpars,
  useboot=TRUE,bootclassif=bootclassif,bootmethod="nselectboot",
  bootruns=2,nnruns=1,kmruns=1,fnruns=1,avenruns=1,useallg=TRUE)
print(cbsboot)
## Not run:
# Index A1 in Akhanli and Hennig (2020) (need these weights choices):
print(cbsboot$sstat,aggregate=TRUE,weights=c(1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0))
# Index A2 in Akhanli and Hennig (2020) (need these weights choices):
print(cbsboot$sstat,aggregate=TRUE,weights=c(0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,1,0))
## End(Not run)
# Results from nselectboot:
plot(cbsboot$stat,cbsboot$sim,statistic="boot")
Assessment of the clusterwise stability of a clustering of data, which can be cases*variables or dissimilarity data. The data is resampled using several schemes (bootstrap, subsetting, jittering, replacement of points by noise) and the Jaccard similarities of the original clusters to the most similar clusters in the resampled data are computed. The mean over these similarities is used as an index of the stability of a cluster (other statistics can be computed as well). The methods are described in Hennig (2007).
clusterboot
is an integrated function that computes the
clustering as well, using interface functions for various
clustering methods implemented in R (several interface functions are
provided, but you can
implement further ones for your favourite clustering method). See the
documentation of the input parameter clustermethod
below.
Quite general clustering methods are possible, i.e. methods estimating or fixing the number of clusters, methods producing overlapping clusters or not assigning all cases to clusters (but declaring them as "noise"). Fuzzy clusterings cannot be processed and have to be transformed to crisp clusterings by the interface function.
clusterboot(data,B=100, distances=(inherits(data, "dist")),
            bootmethod="boot",
            bscompare=TRUE, multipleboot=FALSE,
            jittertuning=0.05, noisetuning=c(0.05,4),
            subtuning=floor(nrow(data)/2),
            clustermethod,noisemethod=FALSE,count=TRUE,
            showplots=FALSE,dissolution=0.5,
            recover=0.75,seed=NULL,datatomatrix=TRUE,...)

## S3 method for class 'clboot'
print(x,statistics=c("mean","dissolution","recovery"),...)

## S3 method for class 'clboot'
plot(x,xlim=c(0,1),breaks=seq(0,1,by=0.05),...)
data |
by default something that can be coerced into a
(numerical) matrix (data frames with non-numerical data are allowed
when using |
B |
integer. Number of resampling runs for each scheme, see
|
distances |
logical. If |
bootmethod |
vector of strings, defining the methods used for resampling. Possible methods:
Important: only the methods The results in Hennig (2007) indicate that |
bscompare |
logical. If |
multipleboot |
logical. If |
jittertuning |
positive numeric. Tuning for the
|
noisetuning |
A vector of two positive numerics. Tuning for the
|
subtuning |
integer. Size of subsets for |
clustermethod |
an interface function (the function name, not a string containing the name, has to be provided!). This defines the clustering method. See the "Details"-section for a list of available interface functions and guidelines how to write your own ones. |
noisemethod |
logical. If |
count |
logical. If |
showplots |
logical. If |
dissolution |
numeric between 0 and 1. If the Jaccard similarity between the resampling version of the original cluster and the most similar cluster on the resampled data is smaller or equal to this value, the cluster is considered as "dissolved". Numbers of dissolved clusters are recorded. |
recover |
numeric between 0 and 1. If the Jaccard similarity between the resampling version of the original cluster and the most similar cluster on the resampled data is larger than this value, the cluster is considered as "successfully recovered". Numbers of recovered clusters are recorded. |
seed |
integer. Seed for random generator (fed into
|
datatomatrix |
logical. If |
... |
additional parameters for the clustermethods called by
|
x |
object of class |
statistics |
specifies in |
xlim |
transferred to |
breaks |
transferred to |
Here are some guidelines for interpretation. There is some theoretical justification to consider a Jaccard similarity value smaller or equal to 0.5 as an indication of a "dissolved cluster", see Hennig (2008). Generally, a valid, stable cluster should yield a mean Jaccard similarity value of 0.75 or more. Between 0.6 and 0.75, clusters may be considered as indicating patterns in the data, but which points exactly should belong to these clusters is highly doubtful. With average Jaccard values below 0.6, clusters should not be trusted. "Highly stable" clusters should yield average Jaccard similarities of 0.85 and above. All of this refers to the bootstrap; for the other resampling schemes it depends on the tuning constants, though their default values should grant similar interpretations in most cases.
While B=100
is recommended, smaller run numbers could give
quite informative results as well, if computation times become too high.
Note that the stability of a cluster is assessed, but
stability is not the only important validity criterion - clusters
obtained by very inflexible clustering methods may be stable but not
valid, as discussed in Hennig (2007).
See plotcluster
for graphical cluster validation.
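As a rough illustration of these interpretation guidelines (a minimal sketch, not part of the package; cb is assumed to be the output of a clusterboot call such as the one in the Examples section below), the clusterwise bootstrap means can be labelled according to the thresholds above:

# cb$bootmean holds the clusterwise mean Jaccard similarities
stability <- cut(cb$bootmean,
                 breaks=c(-Inf, 0.5, 0.6, 0.75, 0.85, Inf),
                 labels=c("dissolved", "not trustworthy", "pattern only",
                          "stable", "highly stable"))
data.frame(cluster=seq_along(cb$bootmean),
           bootmean=round(cb$bootmean, 2),
           stability)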
Information about interface functions for clustering methods:
The following interface functions are currently
implemented (in the present package; note that almost all of these
functions require the specification of some control parameters, so
if you use one of them, look up their common help page
kmeansCBI
) first:
an interface to the function
kmeans
for k-means clustering. This assumes a
cases*variables matrix as input.
an interface to the function
hclust
for agglomerative hierarchical clustering with
optional noise cluster. This
function produces a partition and assumes a cases*variables
matrix as input.
an interface to the function
hclust
for agglomerative hierarchical clustering. This
function produces a tree (not only a partition; therefore the
number of clusters can be huge!) and assumes a cases*variables
matrix as input.
an interface to the function
hclust
for agglomerative hierarchical clustering with
optional noise cluster. This
function produces a partition and assumes a dissimilarity
matrix as input.
an interface to the function
mclustBIC
for normal mixture model based
clustering. This assumes a cases*variables matrix as
input. Warning: mclustBIC
sometimes has
problems with multiple
points. It is recommended to use this only together with
multipleboot=FALSE
.
an interface to the function
mclustBIC
for normal mixture model based
clustering. This assumes a dissimilarity matrix as input and
generates a data matrix by multidimensional scaling first.
Warning: mclustBIC
sometimes has
problems with multiple
points. It is recommended to use this only together with
multipleboot=FALSE
.
an interface to the functions
pam
and clara
for partitioning around medoids. This can be used with
cases*variables as well as dissimilarity matrices as input.
an interface to the function
pamk
for partitioning around medoids. The number
of clusters is estimated by the average silhouette width.
This can be used with
cases*variables as well as dissimilarity matrices as input.
an interface to the function
tclust
in the tclust library for trimmed Gaussian
clustering. This assumes a cases*variables matrix as input. Note
that this function is not currently provided because the tclust
package is only available in the CRAN archives, but the code is
in the Examples-section of the kmeansCBI
-help page.
an interface to the function
dbscan
for density based
clustering. This can be used with
cases*variables as well as dissimilarity matrices as input.
an interface to the function
fixmahal
for fixed point
clustering. This assumes a cases*variables matrix as input.
an interface to the function
mergenormals
for clustering by merging Gaussian
mixture components.
an interface to the function
specc
for spectral clustering.
You can write your own interface function. The first argument of an
interface function should preferably be a data matrix (of class
"matrix", but it may be a symmetrical dissimilarity matrix). It can
be a data frame, but this restricts some of the functionality of
clusterboot, see above. Further
arguments can be tuning constants for the clustering method. The
output of an interface function should be a list containing (at
least) the following components (a minimal sketch of such a function
is given after this list):
clustering result, usually a list with the full output of the clustering method (the precise format doesn't matter); whatever you want to use later.
number of clusters. If some points don't belong to any
cluster but are declared as "noise", nc
includes the
noise cluster, and there should be another component
nccl
, being the number of clusters not including the
noise cluster (note that it is not mandatory to define a noise
component if not all points are assigned to clusters, but if you
do it, the stability of the noise cluster is assessed as
well.)
this is a list consisting of logical vectors of length equal to the
number of data points (n), one for each cluster,
indicating whether a point is a member of this cluster
(TRUE) or not. If a noise cluster is included, it
should always be the last vector in this list.
an integer vector of length n
,
partitioning the data. If the method produces a partition, it
should be the clustering. This component is only used for plots,
so you could do something like rep(1,n)
for
non-partitioning methods. If a noise cluster is included,
nc=nccl+1
and the noise cluster is cluster no. nc
.
a string indicating the clustering method.
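To make the required output format concrete, here is a minimal sketch of a user-written interface function wrapping kmeans (the name mykmeansCBI and the fixed argument list are made up for illustration; the kmeansCBI function shipped with the package is more flexible):

mykmeansCBI <- function(data, k, ...) {
  data <- as.matrix(data)
  km <- kmeans(data, centers = k, ...)
  # one logical membership vector of length nrow(data) per cluster
  cl <- lapply(seq_len(k), function(i) km$cluster == i)
  list(result = km,               # full output of the clustering method
       nc = k,                    # number of clusters
       clusterlist = cl,          # list of logical membership vectors
       partition = km$cluster,    # integer vector partitioning the data
       clustermethod = "kmeans")  # string identifying the method
}
# It could then be used like a built-in interface function, e.g.
# clusterboot(face, B=20, clustermethod=mykmeansCBI, k=5)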
clusterboot
returns an object of class "clboot"
, which
is a list with components
result, partition, nc, clustermethod, B, noisemethod, bootmethod,
multipleboot, dissolution, recover, bootresult, bootmean, bootbrd,
bootrecover, jitterresult, jittermean, jitterbrd, jitterrecover,
subsetresult, subsetmean, subsetbrd, subsetrecover, bojitresult,
bojitmean, bojitbrd, bojitrecover, noiseresult, noisemean,
noisebrd, noiserecover
.
result |
clustering result; full output of the selected
|
partition |
partition parameter of the selected |
nc |
number of clusters in original data (including noise
component if |
nccl |
number of clusters in original data (not including noise
component if |
clustermethod , B , noisemethod , bootmethod , multipleboot , dissolution , recover
|
input parameters, see above. |
bootresult |
matrix of Jaccard similarities for
|
bootmean |
clusterwise means of the |
bootbrd |
clusterwise number of times a cluster has been dissolved. |
bootrecover |
clusterwise number of times a cluster has been successfully recovered. |
subsetresult , subsetmean , etc.
|
same as |
Christian Hennig [email protected] https://www.unibo.it/sitoweb/christian.hennig/en/
Hennig, C. (2007) Cluster-wise assessment of cluster stability. Computational Statistics and Data Analysis, 52, 258-271.
Hennig, C. (2008) Dissolution point and isolation robustness: robustness criteria for general cluster analysis methods. Journal of Multivariate Analysis 99, 1154-1176.
dist
,
interface functions:
kmeansCBI
, hclustCBI
,
hclusttreeCBI
, disthclustCBI
,
noisemclustCBI
, distnoisemclustCBI
,
claraCBI
, pamkCBI
,
dbscanCBI
, mahalCBI
options(digits=3)
set.seed(20000)
face <- rFace(50,dMoNo=2,dNoEy=0,p=2)
cf1 <- clusterboot(face,B=3,bootmethod=c("boot","noise","jitter"),
                   clustermethod=kmeansCBI,krange=5,seed=15555)
print(cf1)
plot(cf1)
cf2 <- clusterboot(dist(face),B=3,bootmethod="subset",
                   clustermethod=disthclustCBI,k=5,cut="number",
                   method="average",showplots=TRUE,seed=15555)
print(cf2)
d1 <- c("a","b","a","c")
d2 <- c("a","a","a","b")
dx <- as.data.frame(cbind(d1,d2))
cpx <- clusterboot(dx,k=2,B=10,clustermethod=claraCBI,
                   multipleboot=TRUE,usepam=TRUE,datatomatrix=FALSE)
print(cpx)
Generates tuning constants ca
for fixmahal
dependent on
the number of points and variables of the current fixed point cluster
(FPC).
This is experimental and only thought for use in fixmahal
.
cmahal(n, p, nmin, cmin, nc1, c1 = cmin, q = 1)
n |
positive integer. Number of points. |
p |
positive integer. Number of variables. |
nmin |
integer larger than 1. Smallest number of points for which
|
cmin |
positive number. Minimum value for |
nc1 |
positive integer. Number of points at which |
c1 |
positive numeric. Tuning constant for |
q |
numeric between 0 and 1. 1 for steepest possible descent of
|
Some experiments suggest that the tuning constant ca
should
decrease with increasing FPC size and increase with increasing
p
in fixmahal
. This is to prevent too small
meaningless FPCs while maintaining the significant larger
ones. cmahal
with q=1
computes ca
in such a way
that as long as ca>cmin
, the decrease in n
is as steep
as possible in order to maintain the validity of the convergence
theorem in Hennig and Christlieb (2002).
A numeric vector of length n
, giving the values for ca
for all FPC sizes smaller or equal to n
.
Christian Hennig [email protected] https://www.unibo.it/sitoweb/christian.hennig/en/
Hennig, C. and Christlieb, N. (2002) Validating visual clusters in large datasets: Fixed point clusters of spectral features, Computational Statistics and Data Analysis 40, 723-739.
plot(1:100,cmahal(100,3,nmin=5,cmin=qchisq(0.99,3),nc1=90), xlab="FPC size", ylab="cmahal")
Computes the connectivity components of an undirected graph from a matrix giving the edges.
con.comp(comat)
comat |
a symmetric logical or 0-1 matrix, where |
The "depth-first search" algorithm of Cormen, Leiserson and Rivest (1990, p. 477) is used.
An integer vector, giving the number of the connectivity component for each vertex.
Christian Hennig [email protected] https://www.unibo.it/sitoweb/christian.hennig/en/
Cormen, T. H., Leiserson, C. E. and Rivest, R. L. (1990), Introduction to Algorithms, Cambridge: MIT Press.
hclust
, cutree
for cut single linkage
trees (often equivalent).
set.seed(1000)
x <- rnorm(20)
m <- matrix(0,nrow=20,ncol=20)
for(i in 1:20) for(j in 1:20) m[i,j] <- abs(x[i]-x[j])
d <- m<0.2
cc <- con.comp(d)
max(cc) # number of connectivity components
plot(x,cc)
# The same should be produced by
# cutree(hclust(as.dist(m),method="single"),h=0.2).
Estimates a misclassification probability in a mixture distribution between two mixture components from estimated posterior probabilities regardless of component parameters, see Hennig (2010).
confusion(z,pro,i,j,adjustprobs=FALSE)
z |
matrix of posterior probabilities for observations (rows) to belong to mixture components (columns), so entries need to sum up to 1 for each row. |
pro |
vector of component proportions, need to sum up to 1. |
i |
integer. Component number. |
j |
integer. Component number. |
adjustprobs |
logical. If |
Estimated probability that an observation generated by component
j
is classified to component i
by maximum a posteriori rule.
Christian Hennig [email protected] https://www.unibo.it/sitoweb/christian.hennig/en/
Hennig, C. (2010) Methods for merging Gaussian mixture components, Advances in Data Analysis and Classification, 4, 3-34.
set.seed(12345)
m <- rpois(20,lambda=5)
dim(m) <- c(5,4)
pro <- apply(m,2,sum)
pro <- pro/sum(pro)
m <- m/apply(m,1,sum)
round(confusion(m,pro,1,2),digits=2)
Returns a list containing estimates of the weighted covariance
matrix and the mean of the data, and optionally of the (weighted)
correlation matrix. The
covariance matrix is divided by the sum of the weights,
corresponding to n
and the ML-estimator in the case of equal
weights, as opposed to n-1
for cov.wt
.
cov.wml(x, wt = rep(1/nrow(x), nrow(x)), cor = FALSE, center = TRUE)
x |
a matrix or data frame. As usual, rows are observations and columns are variables. |
wt |
a non-negative and non-zero vector of weights for each
observation. Its length must equal the number of rows of
|
cor |
A logical indicating whether the estimated correlation weighted matrix will be returned as well. |
center |
Either a logical or a numeric vector specifying the centers
to be used when computing covariances. If |
A list containing the following named components:
cov |
the estimated (weighted) covariance matrix. |
center |
an estimate for the center (mean) of the data. |
n.obs |
the number of observations (rows) in |
wt |
the weights used in the estimation. Only returned if given as an argument. |
cor |
the estimated correlation matrix. Only returned if ‘cor’ is ‘TRUE’. |
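The difference in the divisor can be checked directly; a small sketch (assuming the default equal weights, not part of the package examples):

x <- matrix(rnorm(40), ncol = 2)
n <- nrow(x)
# cov.wml divides by the sum of the weights (n for equal weights, ML estimator),
# while cov.wt divides by n-1; the two therefore differ by the factor (n-1)/n:
all.equal(cov.wml(x)$cov, cov.wt(x)$cov * (n - 1) / n)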
Christian Hennig [email protected] https://www.unibo.it/sitoweb/christian.hennig/en/
x <- c(1,2,3,4,5,6,7,8,9,10)
y <- c(1,2,3,8,7,6,5,8,9,10)
cov.wml(cbind(x,y),wt=c(0,0,0,1,1,1,1,1,0,0))
cov.wt(cbind(x,y),wt=c(0,0,0,1,1,1,1,1,0,0))
This is a more sophisticated version of cluster.stats
for use with clusterbenchstats
, see Hennig (2017).
Computes a number of distance-based statistics, which can be used for cluster
validation, comparison between clusterings and decision about
the number of clusters: cluster sizes, cluster diameters,
average distances within and between clusters, cluster separation,
biggest within cluster gap,
average silhouette widths, the Calinski and Harabasz index,
a Pearson version of
Hubert's gamma coefficient, the Dunn index, further statistics
introduced
in Hennig (2017) and two indexes
to assess the similarity of two clusterings, namely the corrected Rand
index and Meila's VI.
cqcluster.stats(d = NULL, clustering, alt.clustering = NULL,
    noisecluster = FALSE, silhouette = TRUE, G2 = FALSE, G3 = FALSE,
    wgap = TRUE, sepindex = TRUE, sepprob = 0.1, sepwithnoise = TRUE,
    compareonly = FALSE, aggregateonly = FALSE, averagegap=FALSE,
    pamcrit=TRUE, dquantile=0.1, nndist=TRUE, nnk=2,
    standardisation="max", sepall=TRUE, maxk=10,
    cvstan=sqrt(length(clustering)))

## S3 method for class 'cquality'
summary(object,stanbound=TRUE,largeisgood=TRUE, ...)

## S3 method for class 'summary.cquality'
print(x, ...)
d |
a distance object (as generated by |
clustering |
an integer vector of length of the number of cases, which indicates a clustering. The clusters have to be numbered from 1 to the number of clusters. |
alt.clustering |
an integer vector such as for
|
noisecluster |
logical. If |
silhouette |
logical. If |
G2 |
logical. If |
G3 |
logical. If |
wgap |
logical. If |
sepindex |
logical. If |
sepprob |
numerical between 0 and 1, see |
sepwithnoise |
logical. If |
compareonly |
logical. If |
aggregateonly |
logical. If |
averagegap |
logical. If |
pamcrit |
logical. If |
dquantile |
numerical between 0 and 1; quantile used for kernel density estimator for density indexes, see Hennig (2019), Sec. 3.6. |
nndist |
logical. If |
nnk |
integer. Number of neighbours used in average and
coefficient of
variation of distance to nearest within cluster neighbour (clusters
with |
standardisation |
|
sepall |
logical. If |
maxk |
numeric. Parsimony is defined as the number of clusters
divided by |
cvstan |
numeric. |
object |
object of class |
x |
object of class |
stanbound |
logical. If |
largeisgood |
logical. If |
... |
no effect. |
The standardisation
-parameter governs the standardisation of
the index values.
standardisation="none"
means that unstandardised
raw values of indexes are given out. Otherwise, entropy
will be
standardised by the
maximum possible value for the given number of clusters;
within.cluster.ss
and between.cluster.ss
will be
standardised by the overall sum of squares; mnnd
will be
standardised by the maximum distance to the nnk
th nearest
neighbour within cluster; pearsongamma
will be standardised
by adding 1 and dividing by 2; cvnn
will be standardised by
cvstan
(the default is the possible maximum).
standardisation
allows options for the standardisation of
average.within, sindex, wgap, pamcrit, max.diameter,
min.separation
and can be "max"
(maximum distance),
"ave"
(average distance), q90
(0.9-quantile of
distances), or a positive number. "max"
is the default and
standardises all the listed indexes into the range [0,1].
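For illustration, a minimal sketch (using the same example data as in the Examples section below) of requesting different standardisations and applying the summary method:

set.seed(20000)
face <- rFace(50, dMoNo=2, dNoEy=0, p=2)
dface <- dist(face)
complete3 <- cutree(hclust(dface), 3)
craw <- cqcluster.stats(dface, complete3, standardisation="none")  # raw index values
cmax <- cqcluster.stats(dface, complete3)                          # default: "max"
summary(cmax)                     # standardised, bounded, large values are good
summary(cmax, largeisgood=FALSE)  # keep the original orientation of the indexes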
cqcluster.stats
with compareonly=FALSE
and
aggregateonly=FALSE
returns a list of type
cquality
containing the components
n, cluster.number, cluster.size, min.cluster.size, noisen,
diameter,
average.distance, median.distance, separation, average.toother,
separation.matrix, ave.between.matrix, average.between, average.within,
n.between, n.within, max.diameter, min.separation,
within.cluster.ss, clus.avg.silwidths, avg.silwidth,
g2, g3, pearsongamma, dunn, dunn2, entropy, wb.ratio, ch, cwidegap,
widestgap, corrected.rand, vi, sindex, svec, psep, stan, nnk, mnnd,
pamc, pamcentroids, dindex, denscut, highdgap, npenalty, dpenalty,
withindensp, densoc, pdistto, pclosetomode, distto, percwdens,
percdensoc, parsimony, cvnnd, cvnndc
. Some of these are
standardised, see Details. If
compareonly=TRUE
, only corrected.rand, vi
are given
out. If aggregateonly=TRUE
, only n, cluster.number,
min.cluster.size, noisen, diameter,
average.between, average.within,
max.diameter, min.separation,
within.cluster.ss, avg.silwidth,
g2, g3, pearsongamma, dunn, dunn2, entropy, wb.ratio, ch,
widestgap, corrected.rand, vi, sindex, svec, psep, stan, nnk, mnnd,
pamc, pamcentroids, dindex, denscut, highdgap, parsimony, cvnnd,
cvnndc
are given out.
summary.cquality
returns a list of type summary.cquality
with components average.within,nnk,mnnd,
avg.silwidth,
widestgap,sindex,
pearsongamma,entropy,pamc,
within.cluster.ss,
dindex,denscut,highdgap,
parsimony,max.diameter,
min.separation,cvnnd
. These are as documented below for
cqcluster.stats
, but after transformation by stanbound
and largeisgood
, see arguments.
n |
number of points. |
cluster.number |
number of clusters. |
cluster.size |
vector of cluster sizes (number of points). |
min.cluster.size |
size of smallest cluster. |
noisen |
number of noise points, see argument |
diameter |
vector of cluster diameters (maximum within cluster distances). |
average.distance |
vector of clusterwise within cluster average distances. |
median.distance |
vector of clusterwise within cluster distance medians. |
separation |
vector of clusterwise minimum distances of a point in the cluster to a point of another cluster. |
average.toother |
vector of clusterwise average distances of a point in the cluster to the points of other clusters. |
separation.matrix |
matrix of separation values between all pairs of clusters. |
ave.between.matrix |
matrix of mean dissimilarities between points of every pair of clusters. |
avebetween |
average distance between clusters. |
avewithin |
average distance within clusters (reweighted so that every observation, rather than every distance, has the same weight). |
n.between |
number of distances between clusters. |
n.within |
number of distances within clusters. |
maxdiameter |
maximum cluster diameter. |
minsep |
minimum cluster separation. |
withinss |
a generalisation of the within clusters sum
of squares (k-means objective function), which is obtained if
|
clus.avg.silwidths |
vector of cluster average silhouette
widths. See
|
asw |
average silhouette
width. See |
g2 |
Goodman and Kruskal's Gamma coefficient. See Milligan and Cooper (1985), Gordon (1999, p. 62). |
g3 |
G3 coefficient. See Gordon (1999, p. 62). |
pearsongamma |
correlation between distances and a 0-1-vector where 0 means same cluster, 1 means different clusters. "Normalized gamma" in Halkidi et al. (2001). |
dunn |
minimum separation / maximum diameter. Dunn index, see Halkidi et al. (2002). |
dunn2 |
minimum average dissimilarity between two cluster / maximum average within cluster dissimilarity, another version of the family of Dunn indexes. |
entropy |
entropy of the distribution of cluster memberships, see Meila(2007). |
wb.ratio |
|
ch |
Calinski and Harabasz index (Calinski and Harabasz 1974, optimal in Milligan and Cooper 1985; generalised for dissimilarites in Hennig and Liao 2013). |
cwidegap |
vector of widest within-cluster gaps. |
widestgap |
widest within-cluster gap or average of cluster-wise
widest within-cluster gap, depending on parameter |
corrected.rand |
corrected Rand index (if |
vi |
variation of information (VI) index (if |
sindex |
separation index, see argument |
svec |
vector of smallest closest distances of points to next
cluster that are used in the computation of |
psep |
vector of all closest distances of points to next cluster. |
stan |
value by which some statistics were standardised, see Details. |
nnk |
value of input parameter |
mnnd |
average distance to |
pamc |
average distance to cluster centroid. |
pamcentroids |
index numbers of cluster centroids. |
dindex |
this index measures to what extent the density decreases from the cluster mode to the outskirts; I-densdec in Sec. 3.6 of Hennig (2019); low values are good. |
denscut |
this index measures whether cluster boundaries run through density valleys; I-densbound in Sec. 3.6 of Hennig (2019); low values are good. |
highdgap |
this measures whether there is a large within-cluster gap with high density on both sides; I-highdgap in Sec. 3.6 of Hennig (2019); low values are good. |
npenalty |
vector of penalties for all clusters that are used
in the computation of |
depenalty |
vector of penalties for all clusters that are used in
the computation of |
withindensp |
distance-based kernel density values for all points as computed in Sec. 3.6 of Hennig (2019). |
densoc |
contribution of points from other clusters than the one
to which a point is assigned to the density, for all points; called
|
pdistto |
list that for all clusters has a sequence of point numbers. These are the points already incorporated in the sequence of points constructed in the algorithm in Sec. 3.6 of Hennig (2019) to which the next point to be joined is connected. |
pclosetomode |
list that for all clusters has a sequence of point numbers. Sequence of points to be incorporated in the sequence of points constructed in the algorithm in Sec. 3.6 of Hennig (2019). |
distto |
list that for all clusters has a sequence of differences
between the standardised densities (see |
percwdens |
this is |
percdensoc |
this is |
parsimony |
number of clusters divided by |
cvnnd |
coefficient of variation of dissimilarities to
|
cvnndc |
vector of cluster-wise coefficients of variation of
dissimilarities to |
Because cqcluster.stats
processes a full dissimilarity matrix, it
isn't suitable for large data sets. You may consider
distcritmulti
in that case.
Christian Hennig [email protected] https://www.unibo.it/sitoweb/christian.hennig/en/
Akhanli, S. and Hennig, C. (2020) Calibrating and aggregating cluster validity indexes for context-adapted comparison of clusterings. Statistics and Computing, 30, 1523-1544, https://link.springer.com/article/10.1007/s11222-020-09958-2, https://arxiv.org/abs/2002.01822
Calinski, T., and Harabasz, J. (1974) A Dendrite Method for Cluster Analysis, Communications in Statistics, 3, 1-27.
Gordon, A. D. (1999) Classification, 2nd ed. Chapman and Hall.
Halkidi, M., Batistakis, Y., Vazirgiannis, M. (2001) On Clustering Validation Techniques, Journal of Intelligent Information Systems, 17, 107-145.
Hennig, C. and Liao, T. (2013) How to find an appropriate clustering for mixed-type variables with application to socio-economic stratification, Journal of the Royal Statistical Society, Series C Applied Statistics, 62, 309-369.
Hennig, C. (2013) How many bee species? A case study in determining the number of clusters. In: Spiliopoulou, L. Schmidt-Thieme, R. Janning (eds.): "Data Analysis, Machine Learning and Knowledge Discovery", Springer, Berlin, 41-49.
Hennig, C. (2019) Cluster validation by measurement of clustering characteristics relevant to the user. In C. H. Skiadas (ed.) Data Analysis and Applications 1: Clustering and Regression, Modeling-estimating, Forecasting and Data Mining, Volume 2, Wiley, New York 1-24, https://arxiv.org/abs/1703.09282
Kaufman, L. and Rousseeuw, P.J. (1990). "Finding Groups in Data: An Introduction to Cluster Analysis". Wiley, New York.
Meila, M. (2007) Comparing clusterings - an information based distance, Journal of Multivariate Analysis, 98, 873-895.
Milligan, G. W. and Cooper, M. C. (1985) An examination of procedures for determining the number of clusters. Psychometrika, 50, 159-179.
cluster.stats
,
silhouette
, dist
, calinhara
,
distcritmulti
.
clusterboot
computes clusterwise stability statistics by
resampling.
set.seed(20000)
options(digits=3)
face <- rFace(200,dMoNo=2,dNoEy=0,p=2)
dface <- dist(face)
complete3 <- cutree(hclust(dface),3)
cqcluster.stats(dface,complete3,
                alt.clustering=as.integer(attr(face,"grouping")))
Cluster validity index based on nearest neighbours as defined in Liu et al. (2013) with a correction explained in Halkidi et al. (2015).
cvnn(d=NULL,clusterings,k=5)
d |
dissimilarity matrix or |
clusterings |
list of vectors of integers with length |
k |
integer. Number of nearest neighbours. |
List with components (see Liu et al. (2013), Halkidi et al. (2015) for details)
cvnnindex |
vector of index values for the various clusterings, see Liu et al. (2013), the lower the better. |
sep |
vector of separation values. |
comp |
vector of compactness values. |
Christian Hennig [email protected] https://www.unibo.it/sitoweb/christian.hennig/en/
Halkidi, M., Vazirgiannis, M. and Hennig, C. (2015) Method-independent
indices for cluster validation. In C. Hennig, M. Meila, F. Murtagh,
R. Rocci (eds.) Handbook of Cluster Analysis, CRC
Press/Taylor &
Francis, Boca Raton.
Liu, Y, Li, Z., Xiong, H., Gao, X., Wu, J. and Wu, S. (2013) Understanding and enhancement of internal clustering validation measures. IEEE Transactions on Cybernetics 43, 982-994.
options(digits=3)
iriss <- as.matrix(iris[c(1:10,51:55,101:105),-5])
irisc <- as.numeric(iris[c(1:10,51:55,101:105),5])
print(cvnn(dist(iriss),list(irisc,rep(1:4,5))))
For use in awcoord
only.
cweight(x, ca)
x |
numerical. |
ca |
numerical. |
ca/x
if smaller than 1, else 1.
Christian Hennig [email protected] https://www.unibo.it/sitoweb/christian.hennig/en/
cweight(4,1)
Generates a density based clustering of arbitrary shape as introduced in Ester et al. (1996).
dbscan(data, eps, MinPts = 5, scale = FALSE,
       method = c("hybrid", "raw", "dist"),
       seeds = TRUE, showplot = FALSE, countmode = NULL)

## S3 method for class 'dbscan'
print(x, ...)

## S3 method for class 'dbscan'
plot(x, data, ...)

## S3 method for class 'dbscan'
predict(object, data, newdata = NULL, predict.max=1000, ...)
data |
data matrix, data.frame, dissimilarity matrix or
|
eps |
Reachability distance, see Ester et al. (1996). |
MinPts |
Reachability minimum no. of points, see Ester et al. (1996). |
scale |
scale the data if |
method |
"dist" treats data as distance matrix (relatively fast but memory expensive), "raw" treats data as raw data and avoids calculating a distance matrix (saves memory but may be slow), "hybrid" expects also raw data, but calculates partial distance matrices (very fast with moderate memory requirements). |
seeds |
FALSE to not include the |
showplot |
0 = no plot, 1 = plot per iteration, 2 = plot per subiteration. |
countmode |
NULL or vector of point numbers at which to report progress. |
x |
object of class |
object |
object of class |
newdata |
matrix or data.frame with raw data to predict. |
predict.max |
max. batch size for predictions. |
... |
Further arguments transferred to plot methods. |
Clusters require a minimum number of points (MinPts) within a maximum distance (eps) around one of their members (the seed). Any point within eps around any point which satisfies the seed condition is a cluster member (recursively). Some points may not belong to any cluster (noise).
We have clustered a 100,000 x 2 dataset in 40 minutes on a Pentium M 1600 MHz.
print.dbscan
shows a statistic of the number of points
belonging to the clusters that are seeds and border points.
plot.dbscan
distinguishes between seed and border points by
plot symbol.
predict.dbscan
gives out a vector of predicted clusters for the
points in newdata
.
dbscan
gives out
an object of class 'dbscan', which is a list with components
cluster |
integer vector coding cluster membership with noise observations (singletons) coded as 0 |
isseed |
logical vector indicating whether a point is a seed (not border, not noise) |
eps |
parameter eps |
MinPts |
parameter MinPts |
this is a simplified version of the original algorithm (no K-D-trees
used), thus the runtime is O(n^2) instead of O(n log n).
Jens Oehlschlaegel, based on a draft by Christian Hennig.
Martin Ester, Hans-Peter Kriegel, Joerg Sander, Xiaowei Xu (1996). A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. Institute for Computer Science, University of Munich. Proceedings of 2nd International Conference on Knowledge Discovery and Data Mining (KDD-96).
set.seed(665544)
n <- 600
x <- cbind(runif(10, 0, 10)+rnorm(n, sd=0.2),
           runif(10, 0, 10)+rnorm(n, sd=0.2))
par(bg="grey40")
ds <- dbscan(x, 0.2)
# run with showplot=1 to see how dbscan works.
ds
plot(ds, x)
x2 <- matrix(0,nrow=4,ncol=2)
x2[1,] <- c(5,2)
x2[2,] <- c(8,3)
x2[3,] <- c(4,4)
x2[4,] <- c(9,9)
predict(ds, x, x2)
n <- 600
x <- cbind((1:3)+rnorm(n, sd=0.2), (1:3)+rnorm(n, sd=0.2))
# Not run, but results from my machine are 0.105 - 0.068 - 0.255:
# system.time(ds <- dbscan(x, 0.3, countmode=NULL, method="raw"))[3]
# system.time(dsb <- dbscan(x, 0.3, countmode=NULL, method="hybrid"))[3]
# system.time(dsc <- dbscan(dist(x), 0.3, countmode=NULL,
#                           method="dist"))[3]
Simulates p-value for dip test (see dip
)
in the way suggested by Tantrum, Murua and Stuetzle (2003) from the
closest unimodal distribution determined by kernel density estimation
with bandwidth chosen so that the density just becomes unimodal. This is
less conservative (and in fact sometimes anti-conservative) than the
values from dip.test
.
dipp.tantrum(xdata,d,M=100)
xdata |
numeric vector. One-dimensional dataset. |
d |
numeric. Value of dip statistic. |
M |
integer. Number of artificial datasets generated in order to estimate the p-value. |
List with components
p.value |
approximated p-value. |
bw |
borderline unimodality bandwidth in |
dv |
vector of dip statistic values from simulated artificial data. |
Christian Hennig [email protected] https://www.unibo.it/sitoweb/christian.hennig/en/
J. A. Hartigan and P. M. Hartigan (1985) The Dip Test of Unimodality, Annals of Statistics, 13, 70-84.
Tantrum, J., Murua, A. and Stuetzle, W. (2003) Assessment and Pruning of Hierarchical Model Based Clustering, Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, Washington, D.C., 197-205.
# not run, requires package diptest
# x <- runif(100)
# d <- dip(x)
# dt <- dipp.tantrum(x,d,M=10)
Diptest (Hartigan and Hartigan, 1985, see dip
)
for data projected in discriminant coordinate separating optimally two
class means (see discrcoord
) as suggested by Tantrum, Murua and
Stuetzle (2003).
diptest.multi(xdata,class,pvalue="uniform",M=100)
xdata |
matrix. Potentially multidimensional dataset. |
class |
vector of integers giving class numbers for observations. |
pvalue |
|
M |
integer. Number of artificial datasets generated in order to
estimate the p-value if |
The resulting p-value.
Christian Hennig [email protected] https://www.unibo.it/sitoweb/christian.hennig/en/
J. A. Hartigan and P. M. Hartigan (1985) The Dip Test of Unimodality, Annals of Statistics, 13, 70-84.
Tantrum, J., Murua, A. and Stuetzle, W. (2003) Assessment and Pruning of Hierarchical Model Based Clustering, Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, Washington, D.C., 197-205.
require(diptest)
x <- cbind(runif(100),runif(100))
partition <- 1+(x[,1]<0.5)
d1 <- diptest.multi(x,partition)
d2 <- diptest.multi(x,partition,pvalue="tantrum",M=10)
Computes discriminant coordinates, sometimes referred to as "canonical variates" as described in Seber (1984).
discrcoord(xd, clvecd, pool = "n", ...)
xd |
the data matrix; a numerical object which can be coerced to a matrix. |
clvecd |
integer vector of class numbers; length must equal
|
pool |
string. Determines how the within classes covariance is pooled. "n" means that the class covariances are weighted corresponding to the number of points in each class (default). "equal" means that all classes get equal weight. |
... |
no effect |
The matrix T (see Seber (1984), p. 270) is inverted by use of
tdecomp
, which can be expected to give
reasonable results for singular within-class covariance matrices.
List with the following components
ev |
eigenvalues in descending order. |
units |
columns are coordinates of projection basis vectors.
New points |
proj |
projections of |
Christian Hennig [email protected] https://www.unibo.it/sitoweb/christian.hennig/en/
Seber, G. A. F. (1984). Multivariate Observations. New York: Wiley.
plotcluster
for straightforward discriminant plots.
batcoord
for discriminating projections for two classes,
so that also the differences in variance are shown (discrcoord
is
based only on differences in mean).
rFace
for generation of the example data used below.
set.seed(4634)
face <- rFace(600,dMoNo=2,dNoEy=0)
grface <- as.integer(attr(face,"grouping"))
dcf <- discrcoord(face,grface)
plot(dcf$proj,col=grface)
# ...done in one step by function plotcluster.
Recodes a dataset with mixed continuous and categorical variables so that the continuous variables come first and the categorical variables have standard coding 1, 2, 3,... (in lexicographical ordering of values coerced to strings).
discrete.recode(x,xvarsorted=TRUE,continuous=0,discrete)
x |
data matrix or data frame (not a tibble). The data need to be organised case-wise, i.e., if there are categorical variables only, and 15 cases with values c(1,1,2) on the 3 variables, the data matrix needs 15 rows with values 1 1 2. (Categorical variables could take numbers or strings or anything that can be coerced to factor levels as values.) |
xvarsorted |
logical. If |
continuous |
vector of integers giving positions of the
continuous variables. If |
discrete |
vector of integers giving positions of the
categorical variables (the variables need to be coded in such a way that
|
A list with components
data |
data matrix with continuous variables first and categorical variables in standard coding behind them. |
ppdim |
vector of categorical variable-wise numbers of categories. |
discretelevels |
list of levels of the categorical variables
belonging to what is treated by |
continuous |
number of continuous variables. |
discrete |
number of categorical variables. |
Christian Hennig [email protected] https://www.unibo.it/sitoweb/christian.hennig/en
set.seed(776655)
v1 <- rnorm(20)
v2 <- rnorm(20)
d1 <- sample(c(2,4,6,8),20,replace=TRUE)
d2 <- sample(1:4,20,replace=TRUE)
ldata <- cbind(v1,d1,v2,d2)
lc <- discrete.recode(ldata,xvarsorted=FALSE,continuous=c(1,3),discrete=c(2,4))
require(MASS)
data(Cars93)
Cars934 <- Cars93[,c(3,5,8,10)]
cc <- discrete.recode(Cars934,xvarsorted=FALSE,continuous=c(2,3),discrete=c(1,4))
An interface for ten methods of linear dimension reduction in order to separate the groups optimally in the projected data. Includes classical discriminant coordinates, methods to project differences in mean and covariance structure, asymmetric methods (separation of a homogeneous class from a heterogeneous one), local neighborhood-based methods and methods based on robust covariance matrices.
discrproj(x, clvecd, method="dc", clnum=NULL, ignorepoints=FALSE, ignorenum=0, ...)
x |
the data matrix; a numerical object which can be coerced to a matrix. |
clvecd |
vector of class numbers which can be coerced into
integers; length must equal
|
method |
one of
Note that "bc", "vbc", "adc", "awc", "arc" and "anc" assume that there are only two classes. |
clnum |
integer. Number of the class which is attempted to plot homogeneously by "asymmetric methods", which are the methods assuming that there are only two classes, as indicated above. |
ignorepoints |
logical. If |
ignorenum |
one of the potential values of the components of
|
... |
additional parameters passed to the projection methods. |
discrproj
returns the output of the chosen projection method,
which is a list with at least the components ev, units, proj
.
For detailed information, see the help pages of the projection methods.
ev |
eigenvalues in descending order, usually indicating portion of information in the corresponding direction. |
units |
columns are coordinates of projection basis vectors.
New points |
proj |
projections of |
Christian Hennig [email protected] https://www.unibo.it/sitoweb/christian.hennig/en/
Hennig, C. (2004) Asymmetric linear dimension reduction for classification. Journal of Computational and Graphical Statistics 13, 930-945.
Hennig, C. (2005) A method for visual cluster validation. In: Weihs, C. and Gaul, W. (eds.): Classification - The Ubiquitous Challenge. Springer, Heidelberg 2005, 153-160.
Seber, G. A. F. (1984). Multivariate Observations. New York: Wiley.
Fukunaga (1990). Introduction to Statistical Pattern Recognition (2nd ed.). Boston: Academic Press.
discrcoord
, batcoord
,
mvdcoord
, adcoord
,
awcoord
, ncoord
,
ancoord
.
rFace
for generation of the example data used below.
set.seed(4634)
face <- rFace(300,dMoNo=2,dNoEy=0,p=3)
grface <- as.integer(attr(face,"grouping"))
# The abs in the following is there to unify the output,
# because eigenvectors are defined only up to their sign.
# Statistically it doesn't make sense to compute absolute values.
round(abs(discrproj(face,grface, method="nc")$units),digits=2)
round(abs(discrproj(face,grface, method="wnc")$units),digits=2)
round(abs(discrproj(face,grface, clnum=1, method="arc")$units),digits=2)
Computes a factor that can be used to standardise ordinal categorical variables and binary dummy variables coding categories of nominal scaled variables for Euclidean dissimilarity computation in mixed type data. See Hennig and Liao (2013).
distancefactor(cat,n=NULL, catsizes=NULL,type="categorical", normfactor=2,qfactor=ifelse(type=="categorical",1/2, 1/(1+1/(cat-1))))
cat |
integer. Number of categories of the variable to be standardised.
Note that for |
n |
integer. Number of data points. |
catsizes |
vector of integers giving numbers of observations per
category. One of |
type |
|
normfactor |
numeric. Factor on which standardisation is based.
As a default, this is |
qfactor |
numeric. Factor q in Hennig and Liao (2013) to adjust for clumping effects due to discreteness. |
A factor by which to multiply the variable in order to make it
comparable to a unit variance continuous variable when aggregated in
Euclidean fashion for dissimilarity computation, so that expected
effective difference between two realisations of the variable equals
qfactor*normfactor
.
Christian Hennig [email protected] https://www.unibo.it/sitoweb/christian.hennig/en
Hennig, C. and Liao, T. (2013) How to find an appropriate clustering for mixed-type variables with application to socio-economic stratification, Journal of the Royal Statistical Society, Series C Applied Statistics, 62, 309-369.
set.seed(776655)
d1 <- sample(1:5,20,replace=TRUE)
d2 <- sample(1:4,20,replace=TRUE)
ldata <- cbind(d1,d2)
lc <- cat2bin(ldata,categorical=1)$data
lc[,1:5] <- lc[,1:5]*distancefactor(5,20,type="categorical")
lc[,6] <- lc[,6]*distancefactor(4,20,type="ordinal")
Approximates average silhouette width or the Pearson version of Hubert's gamma criterion by hacking the dataset into pieces and averaging the subset-wise values, see Hennig and Liao (2013).
distcritmulti(x,clustering,part=NULL,ns=10,criterion="asw", fun="dist",metric="euclidean", count=FALSE,seed=NULL,...)
x |
cases times variables data matrix. |
clustering |
vector of integers indicating the clustering. |
part |
vector of integer subset sizes; sum should be smaller or
equal to the number of cases of |
ns |
integer. Number of subsets, only used if |
criterion |
|
fun |
|
metric |
passed on to |
count |
logical. if |
seed |
integer, random seed. (If |
... |
A list with components crit.overall,crit.sub,crit.sd,part
.
crit.overall |
value of criterion. |
crit.sub |
vector of subset-wise criterion values. |
crit.sd |
standard deviation of |
subsets |
list of case indexes in subsets. |
Christian Hennig [email protected] https://www.unibo.it/sitoweb/christian.hennig/en
Halkidi, M., Batistakis, Y., Vazirgiannis, M. (2001) On Clustering Validation Techniques, Journal of Intelligent Information Systems, 17, 107-145.
Hennig, C. and Liao, T. (2013) How to find an appropriate clustering for mixed-type variables with application to socio-economic stratification, Journal of the Royal Statistical Society, Series C Applied Statistics, 62, 309-369.
Kaufman, L. and Rousseeuw, P.J. (1990). "Finding Groups in Data: An Introduction to Cluster Analysis". Wiley, New York.
set.seed(20000)
options(digits=3)
face <- rFace(50,dMoNo=2,dNoEy=0,p=2)
clustering <- as.integer(attr(face,"grouping"))
distcritmulti(face,clustering,ns=3,seed=100000,criterion="pearsongamma")
Two measures of dissimilarity between the within-cluster distributions of a dataset and normal or uniform distribution. For the normal it's the Kolmogorov dissimilarity between the Mahalanobis distances to the center and a chi-squared distribution. For the uniform it is the Kolmogorov distance between the distance to the kth nearest neighbour and a Gamma distribution (this is based on Byers and Raftery (1998)). The clusterwise values are aggregated by weighting with the cluster sizes.
distrsimilarity(x,clustering,noisecluster = FALSE, distribution=c("normal","uniform"),nnk=2, largeisgood=FALSE,messages=FALSE)
x |
the data matrix; a numerical object which can be coerced to a matrix. |
clustering |
integer vector of class numbers; length must equal
|
noisecluster |
logical. If |
distribution |
vector of |
nnk |
integer. Number of nearest neighbors to use for dissimilarity to the uniform. |
largeisgood |
logical. If |
messages |
logical. If |
List with the following components
kdnorm |
Kolmogorov distance between distribution of within-cluster Mahalanobis distances and appropriate chi-squared distribution, aggregated over clusters (I am grateful to Agustin Mayo-Iscar for the idea). |
kdunif |
Kolmogorov distance between distribution of distances to
|
kdnormc |
vector of cluster-wise Kolmogorov distances between distribution of within-cluster Mahalanobis distances and appropriate chi-squared distribution. |
kdunifc |
vector of cluster-wise Kolmogorov distances between
distribution of distances to |
xmahal |
vector of Mahalanobis distances to the respective cluster center. |
xdknn |
vector of distance to |
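To make the kdnorm idea concrete, here is a small hand-computed sketch for a single artificial cluster; it mirrors the description above and is not the package's internal code:

set.seed(1)
xc <- matrix(rnorm(200), ncol = 2)            # one "cluster" of 100 points in 2 dimensions
md <- mahalanobis(xc, colMeans(xc), cov(xc))  # squared Mahalanobis distances to the center
# Kolmogorov distance between their distribution and the chi-squared distribution
# with df equal to the dimension:
ks.test(md, "pchisq", df = ncol(xc))$statistic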
It is very hard to capture similarity to a multivariate normal or uniform in a single value, and both measures used here have their shortcomings. In particular, the dissimilarity to the uniform can still indicate a good fit if there are holes, or if the distribution is uniform but concentrated on several disconnected sets.
Christian Hennig [email protected] https://www.unibo.it/sitoweb/christian.hennig/en/
Byers, S. and Raftery, A. E. (1998) Nearest-Neighbor Clutter Removal for Estimating Features in Spatial Point Processes, Journal of the American Statistical Association, 93, 577-584.
Hennig, C. (2017) Cluster validation by measurement of clustering characteristics relevant to the user. In C. H. Skiadas (ed.) Proceedings of ASMDA 2017, 501-520, https://arxiv.org/abs/1703.09282
cqcluster.stats
,cluster.stats
for more cluster validity statistics.
set.seed(20000)
options(digits=3)
face <- rFace(200,dMoNo=2,dNoEy=0,p=2)
km3 <- kmeans(face,3)
distrsimilarity(face,km3$cluster)
Computes the density of a two-component Gaussian mixture along the ridgeline (Ray and Lindsay, 2005), along which all its density extrema are located.
dridgeline(alpha=seq(0,1,0.001), prop, mu1, mu2, Sigma1, Sigma2, showplot=FALSE, ...)
alpha |
sequence of values between 0 and 1 for which the density is computed. |
prop |
mixture proportion of first component. |
mu1 |
mean vector of component 1. |
mu2 |
mean vector of component 2. |
Sigma1 |
covariance matrix of component 1. |
Sigma2 |
covariance matrix of component 2. |
showplot |
logical. If |
... |
further arguments to be passed on to plot. |
Vector of density values for values of alpha
.
Christian Hennig [email protected] https://www.unibo.it/sitoweb/christian.hennig/en/
Ray, S. and Lindsay, B. G. (2005) The Topography of Multivariate Normal Mixtures, Annals of Statistics, 33, 2042-2065.
q <- dridgeline(seq(0,1,0.1),0.5,c(1,1),c(2,5),diag(2),diag(2))
Duda-Hart test for whether a data set should be split into two clusters.
dudahart2(x,clustering,alpha=0.001)
x |
data matrix or data frame. |
clustering |
vector of integers. Clustering into two clusters. |
alpha |
numeric between 0 and 1. Significance level (recommended to be small if this is used for estimating the number of clusters). |
A list with components
p.value |
p-value against the null hypothesis of homogeneity. |
dh |
ratio of within-cluster sum of squares for two clusters and overall sum of squares. |
compare |
critical value for |
cluster1 |
|
alpha |
see above. |
z |
|
Christian Hennig [email protected] https://www.unibo.it/sitoweb/christian.hennig/en
Duda, R. O. and Hart, P. E. (1973) Pattern Classification and Scene Analysis. Wiley, New York.
options(digits=2) set.seed(98765) iriss <- iris[sample(150,20),-5] km <- kmeans(iriss,2) dudahart2(iriss,km$cluster)
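Continuing the example, a minimal sketch of turning the test output into a decision between one and two clusters (assuming here that cluster1 is TRUE whenever a single cluster is preferred at level alpha):
dh <- dudahart2(iriss,km$cluster)
# assumption: cluster1==TRUE means the homogeneity hypothesis is not rejected
if (dh$cluster1) 1 else 2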
Extracts parameters of certain mixture components from the output of
summary.mclustBIC
and updates proportions so that
they sum up to 1.
extract.mixturepars(mclustsum,compnumbers,noise=FALSE)
mclustsum |
output object of |
compnumbers |
vector of integers. Numbers of mixture components. |
noise |
logical. Should be |
Object as component parameters
of
summary.mclustBIC
-output, but for specified
components only. (Orientation information from all components is kept.)
Christian Hennig [email protected] https://www.unibo.it/sitoweb/christian.hennig/en/
set.seed(98765)
require(mclust)
iriss <- iris[sample(150,20),-5]
irisBIC <- mclustBIC(iriss,G=5,modelNames="VEV")
siris <- summary(irisBIC,iriss)
emp <- extract.mixturepars(siris,2)
emp$pro
round(emp$mean,digits=1)
emp$variance$modelName
round(emp$variance$scale,digits=2)
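Continuing the example, the rescaling of the proportions can be checked directly (illustration only; the component numbers 2 and 3 are arbitrary choices here):
emp2 <- extract.mixturepars(siris,c(2,3))
sum(emp2$pro)  # proportions of the extracted components are rescaled to sum to 1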
Finds representative objects for the border of a cluster and the
within-cluster variance as defined in the framework of the cdbw
cluster validation index (and meant to be used in that context).
findrep(x,xcen,clustering,cluster,r,p=ncol(x),n=nrow(x), nc=sum(clustering==cluster))
x |
matrix. Euclidean dataset. |
xcen |
mean vector of cluster. |
clustering |
vector of integers with length |
cluster |
integer. Number of cluster to be treated. |
r |
integer. Number of representatives. |
p |
integer. Number of dimensions. |
n |
integer. Number of observations. |
nc |
integer. Number of observations in |
List with components
repc |
vector of index of representatives (out of all observations). |
repx |
vector of index of representatives (out of only the
observations in |
maxr |
number of representatives (this can be smaller than
|
wvar |
estimated average within-cluster squared distance to mean. |
Christian Hennig [email protected] https://www.unibo.it/sitoweb/christian.hennig/en/
Halkidi, M. and Vazirgiannis, M. (2008) A density-based cluster validity approach using multi-representatives. Pattern Recognition Letters 29, 773-786.
Halkidi, M., Vazirgiannis, M. and Hennig, C. (2015) Method-independent
indices for cluster validation. In C. Hennig, M. Meila, F. Murtagh,
R. Rocci (eds.) Handbook of Cluster Analysis, CRC
Press/Taylor &
Francis, Boca Raton.
options(digits=3)
iriss <- as.matrix(iris[c(1:5,51:55,101:105),-5])
irisc <- as.numeric(iris[c(1:5,51:55,101:105),5])
findrep(iriss,colMeans(iriss),irisc,cluster=1,r=2)
Computes Mahalanobis fixed point clusters (FPCs), i.e., subsets of the data, which consist exactly of the non-outliers w.r.t. themselves, and may be interpreted as generated from a homogeneous normal population. FPCs may overlap, are not necessarily exhausting and do not need a specification of the number of clusters.
Note that while fixmahal
has lots of parameters, usually only one (or a few) of them
need to be specified, cf. the examples. The
philosophy is to allow much flexibility, but to always provide
sensible defaults.
fixmahal(dat, n = nrow(as.matrix(dat)), p = ncol(as.matrix(dat)),
         method = "fuzzy", cgen = "fixed", ca = NA, ca2 = NA,
         calpha = ifelse(method=="fuzzy",0.95,0.99), calpha2 = 0.995,
         pointit = TRUE, subset = n, nc1 = 100+20*p, startn = 18+p,
         mnc = floor(startn/2), mer = ifelse(pointit,0.1,0), distcut = 0.85,
         maxit = 5*n, iter = n*1e-5, init.group = list(),
         ind.storage = TRUE, countmode = 100, plot = "none")
## S3 method for class 'mfpc'
summary(object, ...)
## S3 method for class 'summary.mfpc'
print(x, maxnc=30, ...)
## S3 method for class 'mfpc'
plot(x, dat, no, bw=FALSE, main=c("Representative FPC No. ",no),
     xlab=NULL, ylab=NULL, pch=NULL, col=NULL, ...)
## S3 method for class 'mfpc'
fpclusters(object, dat=NA, ca=object$ca, p=object$p, ...)
fpmi(dat, n = nrow(as.matrix(dat)), p = ncol(as.matrix(dat)),
     gv, ca, ca2, method = "ml", plot, maxit = 5*n, iter = n*1e-6)
dat |
something that can be coerced to a
numerical matrix or vector. Data matrix, rows are points, columns
are variables.
|
n |
optional positive integer. Number of cases. |
p |
optional positive integer. Number of independent variables. |
method |
a string. |
cgen |
optional string. |
ca |
optional positive number. Tuning constant, specifying
required cluster
separation. By default determined as |
ca2 |
optional positive number. Second tuning constant needed if
|
calpha |
number between 0 and 1. See |
calpha2 |
number between 0 and 1, larger than |
pointit |
optional logical. If |
subset |
optional positive integer smaller than or equal to |
nc1 |
optional positive integer. Tuning constant needed by
|
startn |
optional positive integer. Size of the initial configurations. The default value is chosen to prevent small, meaningless FPCs from being found, but it should be decreased if clusters of size smaller than the default value are of interest. |
mnc |
optional positive integer. Minimum size of clusters to be reported. |
mer |
optional nonnegative number. FPCs (groups of them,
respectively, see details)
are only reported as stable if the ratio
of the number of their
findings to their number of points exceeds |
distcut |
optional value between 0 and 1. A similarity
measure between FPCs, given in Hennig (2002), and the corresponding
Single Linkage groups of FPCs with similarity larger
than |
maxit |
optional integer. Maximum number of iterations per algorithm run (usually an FPC is found much earlier). |
iter |
positive number. Algorithm stops when difference between
subsequent weight vectors is smaller than |
init.group |
optional list of logical vectors of length
|
ind.storage |
optional logical. If |
countmode |
optional positive integer. Every |
plot |
optional string. If |
object |
object of class |
x |
object of class |
maxnc |
positive integer. Maximum number of FPCs to be reported. |
no |
positive integer. Number of the representative FPC to be plotted. |
bw |
optional logical. If |
main |
plot title. |
xlab |
label for x-axis. If |
ylab |
label for y-axis. If |
pch |
plotting symbol, see |
col |
plotting color, see |
gv |
logical vector (or, with |
... |
additional parameters to be passed to |
A (crisp) Mahalanobis FPC is a data subset
that reproduces itself under the following operation:
Compute mean and covariance matrix estimator for the data
subset, and compute all points of the dataset for which the squared
Mahalanobis distance is smaller than ca
.
Fixed points of this operation can be considered as clusters,
because they contain only
non-outliers (as defined by the above mentioned procedure) and all other
points are outliers w.r.t. the subset.
The current default is to compute fuzzy Mahalanobis FPCs, where the
points in the subset have a membership weight between 0 and 1 and give
rise to weighted means and covariance matrices.
The new weights are then obtained by computing the weight function
wfu
of the squared Mahalanobis distances, i.e.,
full weight for squared distances smaller than ca
, zero
weight for squared distances larger than ca2
and
decreasing weights (linear function of squared distances)
in between.
A fixed point algorithm is started from the whole dataset,
algorithms are started from the subsets specified in
init.group
, and further algorithms are started from further
initial configurations as explained under subset
and in the
function mahalconf
.
Usually some of the FPCs are unstable, and more than one FPC may
correspond to the same significant pattern in the data. Therefore the
number of FPCs is reduced: A similarity matrix is computed
between FPCs. Similarity between sets is defined as the ratio between
2 times size of
intersection and the sum of sizes of both sets. The Single Linkage
clusters (groups)
of level distcut
are computed, i.e. the connectivity
components of the graph where edges are drawn between FPCs with
similarity larger than distcut
. Groups of FPCs whose members
are found often enough (cf. parameter mer
) are considered as
stable enough. A representative FPC is
chosen for every Single Linkage cluster of FPCs according to the
maximum expectation ratio ser
. ser
is the ratio between
the number of findings of an FPC and the number of points
of an FPC, adjusted suitably if subset<n
.
Usually only the representative FPCs of stable groups
are of interest.
Default tuning constants are taken from Hennig (2005).
Generally, the default settings are recommended for
fixmahal
. For large datasets, the use of
init.group
together with pointit=FALSE
is useful. Occasionally, mnc
and startn
may be chosen
smaller than the default,
if smaller clusters are of interest, but this may lead to too many
clusters. Decrease of
ca
will often lead to too many clusters, even for homogeneous
data. Increase of ca
will produce only very strongly
separated clusters. Both may be of interest occasionally.
Singular covariance matrices during the iterations are handled by
solvecov
.
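The following is a minimal sketch of the crisp fixed point operation described above (it is not fixmahal itself; fuzzy weights, initial configurations and singular covariance matrices are not handled here):
crispfp <- function(x, subset, ca, maxit=100) {
  for (i in 1:maxit) {
    mg <- colMeans(x[subset, , drop=FALSE])
    covg <- cov(x[subset, , drop=FALSE])
    newsubset <- mahalanobis(x, mg, covg) < ca  # squared distances below ca
    if (all(newsubset == subset)) break         # fixed point reached
    subset <- newsubset
  }
  subset
}
set.seed(1)
x <- rbind(matrix(rnorm(100),ncol=2), matrix(rnorm(20,mean=8),ncol=2))
sum(crispfp(x, rep(TRUE,nrow(x)), ca=qchisq(0.99,2)))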
summary.mfpc
gives a summary about the representative FPCs of
stable groups.
plot.mfpc
is a plot method for the representative FPC of stable
group no. no
. It produces a scatterplot, where
the points belonging to the FPC are highlighted, the mean is shown, and for p<3 also the region of the FPC is shown. For p>=3
,
the optimal separating projection computed by batcoord
is shown.
fpclusters.mfpc
produces a list of indicator vectors for the
representative FPCs of stable groups.
fpmi
is called by fixmahal
for a single fixed point
algorithm and will usually not be executed alone.
fixmahal
returns an object of class mfpc
. This is a list
containing the components nc, g, means, covs, nfound, er, tsc,
ncoll, skc, grto, imatrix, smatrix, stn, stfound, ser, sfpc, ssig,
sto, struc, n, p, method, cgen, ca, ca2, cvec, calpha, pointit,
subset, mnc, startn, mer, distcut
.
summary.mfpc
returns an object of class summary.mfpc
.
This is a list containing the components means, covs, stn,
stfound, sn, ser, tskip, skc, tsc, sim, ca, ca2, calpha, mer, method,
cgen, pointit
.
fpclusters.mfpc
returns a list of indicator vectors for the
representative FPCs of stable groups.
fpmi
returns a list with the components mg, covg, md,
gv, coll, method, ca
.
nc |
integer. Number of FPCs. |
g |
list of logical vectors. Indicator vectors of FPCs. |
means |
list of numerical vectors. Means of FPCs. In
|
covs |
list of numerical matrices. Covariance matrices of FPCs. In
|
nfound |
vector of integers. Number of findings for the FPCs. |
er |
numerical vector. Ratio of number of findings of FPCs to their
size. Under |
tsc |
integer. Number of algorithm runs that led to FPCs that were too small or found too seldom. |
ncoll |
integer. Number of algorithm runs where collinear covariance matrices occurred. |
skc |
integer. Number of skipped clusters. |
grto |
vector of integers. Numbers of FPCs to which algorithm
runs led, which were started by |
imatrix |
vector of integers. Size of intersection between
FPCs. See |
smatrix |
numerical vector. Similarities between
FPCs. See |
stn |
integer. Number of representative FPCs of stable groups.
In |
stfound |
vector of integers. Number of findings of members of
all groups of FPCs. In |
ser |
numerical vector. Ratio of number of findings of groups of
FPCs to their size. Under |
sfpc |
vector of integers. Numbers of representative FPCs of all groups. |
ssig |
vector of integers of length |
sto |
vector of integers. Numbers of groups ordered
according to largest |
struc |
vector of integers. Number of group an FPC belongs to. |
n |
see arguments. |
p |
see arguments. |
method |
see arguments. |
cgen |
see arguments. |
ca |
see arguments, if |
ca2 |
see arguments. |
cvec |
numerical vector of length |
calpha |
see arguments. |
pointit |
see arguments. |
subset |
see arguments. |
mnc |
see arguments. |
startn |
see arguments. |
mer |
see arguments. |
distcut |
see arguments. |
sn |
vector of integers. Number of points of representative FPCs. |
tskip |
integer. Number of algorithm runs leading to skipped FPCs. |
sim |
vector of integers. Size of intersections between
representative FPCs of stable groups. See |
mg |
mean vector. |
covg |
covariance matrix. |
md |
Mahalanobis distances. |
gv |
logical (numerical, respectively, if |
coll |
logical. |
Christian Hennig [email protected] https://www.unibo.it/sitoweb/christian.hennig/en/
Hennig, C. (2002) Fixed point clusters for linear regression: computation and comparison, Journal of Classification 19, 249-276.
Hennig, C. (2005) Fuzzy and Crisp Mahalanobis Fixed Point Clusters, in Baier, D., Decker, R., and Schmidt-Thieme, L. (eds.): Data Analysis and Decision Support. Springer, Heidelberg, 47-56.
Hennig, C. and Christlieb, N. (2002) Validating visual clusters in large datasets: Fixed point clusters of spectral features, Computational Statistics and Data Analysis 40, 723-739.
fixreg
for linear regression fixed point clusters.
mahalconf
, wfu
, cmahal
for computation of initial configurations, weights, tuning constants.
sseg
for indexing the similarity/intersection vectors
computed by fixmahal
.
batcoord
, cov.rob
, solvecov
,
cov.wml
, plotcluster
for computation of projections, (inverted)
covariance matrices, plotting.
rFace
for generation of example data, see below.
options(digits=2)
set.seed(20000)
face <- rFace(400,dMoNo=2,dNoEy=0, p=3)
# The first example uses grouping information via init.group.
initg <- list()
grface <- as.integer(attr(face,"grouping"))
for (i in 1:5) initg[[i]] <- (grface==i)
ff0 <- fixmahal(face, pointit=FALSE, init.group=initg)
summary(ff0)
cff0 <- fpclusters(ff0)
plot(face, col=1+cff0[[1]])
plot(face, col=1+cff0[[4]]) # Why does this come out as a cluster?
plot(ff0, face, 4) # A bit clearer...
# Without grouping information, examples need more time:
# ff1 <- fixmahal(face)
# summary(ff1)
# cff1 <- fpclusters(ff1)
# plot(face, col=1+cff1[[1]])
# plot(face, col=1+cff1[[6]]) # Why does this come out as a cluster?
# plot(ff1, face, 6) # A bit clearer...
# ff2 <- fixmahal(face,method="ml")
# summary(ff2)
# ff3 <- fixmahal(face,method="ml",calpha=0.95,subset=50)
# summary(ff3)
## ...fast, but lots of clusters. mer=0.3 may be useful here.
# set.seed(3000)
# face2 <- rFace(400,dMoNo=2,dNoEy=0)
# ff5 <- fixmahal(face2)
# summary(ff5)
## misses right eye of face data; with p=6,
## initial configurations are too large for 40 point clusters
# ff6 <- fixmahal(face2, startn=30)
# summary(ff6)
# cff6 <- fpclusters(ff6)
# plot(face2, col=1+cff6[[3]])
# plot(ff6, face2, 3)
# x <- c(1,2,3,6,6,7,8,120)
# ff8 <- fixmahal(x)
# summary(ff8)
# ...dataset a bit too small for the defaults...
# ff9 <- fixmahal(x, mnc=3, startn=3)
# summary(ff9)
Computes linear regression fixed point clusters (FPCs), i.e., subsets of the data, which consist exactly of the non-outliers w.r.t. themselves, and may be interpreted as generated from a homogeneous linear regression relation between independent and dependent variable. FPCs may overlap, are not necessarily exhausting and do not need a specification of the number of clusters.
Note that while fixreg
has lots of parameters, usually only one (or a few) of them
need to be specified, cf. the examples. The
philosophy is to allow much flexibility, but to always provide
sensible defaults.
fixreg(indep=rep(1,n), dep, n=length(dep), p=ncol(as.matrix(indep)),
       ca=NA, mnc=NA, mtf=3, ir=NA, irnc=NA, irprob=0.95, mncprob=0.5,
       maxir=20000, maxit=5*n, distcut=0.85, init.group=list(),
       ind.storage=FALSE, countmode=100, plot=FALSE)
## S3 method for class 'rfpc'
summary(object, ...)
## S3 method for class 'summary.rfpc'
print(x, maxnc=30, ...)
## S3 method for class 'rfpc'
plot(x, indep=rep(1,n), dep, no, bw=TRUE,
     main=c("Representative FPC No. ",no),
     xlab="Linear combination of independents",
     ylab=deparse(substitute(indep)),
     xlim=NULL, ylim=range(dep), pch=NULL, col=NULL,...)
## S3 method for class 'rfpc'
fpclusters(object, indep=NA, dep=NA, ca=object$ca, ...)
rfpi(indep, dep, p, gv, ca, maxit, plot)
indep |
numerical matrix or vector. Independent
variables.
Leave out for clustering one-dimensional data.
|
dep |
numerical vector. Dependent variable.
|
n |
optional positive integer. Number of cases. |
p |
optional positive integer. Number of independent variables. |
ca |
optional positive number. Tuning constant, specifying
required cluster
separation. By default determined automatically as a
function of |
mnc |
optional positive integer. Minimum size of clusters
to be reported.
By default determined automatically as a function of
|
mtf |
optional positive integer. FPCs must be found at
least |
ir |
optional positive integer. Number of algorithm runs.
By default determined
automatically as a function of |
irnc |
optional positive integer. Size of the smallest
cluster to be found with
approximated probability |
irprob |
optional value between 0 and 1. Approximated
probability for a cluster of size |
mncprob |
optional value between 0 and 1. Approximated
probability for a cluster of size |
maxir |
optional integer. Maximum number of algorithm runs. |
maxit |
optional integer. Maximum number of iterations per algorithm run (usually an FPC is found much earlier). |
distcut |
optional value between 0 and 1. A similarity
measure between FPCs, given in Hennig (2002a), and the corresponding
Single Linkage groups of FPCs with similarity larger
than |
init.group |
optional list of logical vectors of length
|
ind.storage |
optional logical. If |
countmode |
optional positive integer. Every |
plot |
optional logical. If |
object |
object of class |
x |
object of class |
maxnc |
positive integer. Maximum number of FPCs to be reported. |
no |
positive integer. Number of the representative FPC to be plotted. |
bw |
optional logical. If |
main |
plot title. |
xlab |
label for x-axis. |
ylab |
label for y-axis. |
xlim |
plotted range of x-axis. If |
ylim |
plotted range of y-axis. |
pch |
plotting symbol, see |
col |
plotting color, see |
gv |
logical vector of length |
... |
additional parameters to be passed to |
A linear regression FPC is a data subset
that reproduces itself under the following operation:
Compute linear regression and error variance estimator for the data
subset, and compute all points of the dataset for which the squared
residual is smaller than ca
times the error variance.
Fixed points of this operation can be considered as clusters,
because they contain only
non-outliers (as defined by the above mentioned procedure) and all other
points are outliers w.r.t. the subset. fixreg
performs ir
fixed point algorithms started from
random subsets of size p+2
to look for
FPCs. Additionally an algorithm is started from the whole dataset,
and algorithms are started from the subsets specified in
init.group
.
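As a minimal sketch (not fixreg itself; only one independent variable, and the collinearity handling and stability assessment described below are omitted), the regression fixed point operation reads:
regfp <- function(indep, dep, inset, ca, maxit=100) {
  for (i in 1:maxit) {
    fit <- lm(dep[inset] ~ indep[inset])
    res <- dep - coef(fit)[1] - coef(fit)[2]*indep
    s2 <- mean(res[inset]^2)          # ML-type error variance on the current subset
    newset <- as.vector(res^2 < ca*s2)
    if (all(newset == inset)) break   # fixed point reached
    inset <- newset
  }
  inset
}
set.seed(2)
indep <- runif(100)
dep <- c(2*indep[1:70]+rnorm(70,sd=0.1), 10-3*indep[71:100]+rnorm(30,sd=0.1))
sum(regfp(indep, dep, rep(TRUE,100), ca=10))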
Usually some of the FPCs are unstable, and more than one FPC may
correspond to the same significant pattern in the data. Therefore the
number of FPCs is reduced: FPCs with less than mnc
points are
ignored. Then a similarity matrix is computed between the remaining
FPCs. Similarity between sets is defined as the ratio between
2 times size of
intersection and the sum of sizes of both sets. The Single Linkage
clusters (groups)
of level distcut
are computed, i.e. the connectivity
components of the graph where edges are drawn between FPCs with
similarity larger than distcut
. Groups of FPCs whose members
are found mtf
times or more are considered as stable enough.
A representative FPC is
chosen for every Single Linkage cluster of FPCs according to the
maximum expectation ratio ser
. ser
is the ratio between
the number of findings of an FPC and the estimated
expectation of the number of findings of an FPC of this size,
called expectation ratio and
computed by clusexpect
.
Usually only the representative FPCs of stable groups
are of interest.
The choice of the involved tuning constants such as ca
and
ir
is discussed in detail in Hennig (2002a). Statistical theory
is presented in Hennig (2003).
Generally, the default settings are recommended for
fixreg
. In cases where they lead to a too large number of
algorithm runs (e.g., always for p>4
), the use of
init.group
together with mtf=1
and ir=0
is useful. Occasionally, irnc
may be chosen
smaller than the default,
if smaller clusters are of interest, but this may lead to too many
clusters and too many algorithm runs. Decrease of
ca
will often lead to too many clusters, even for homogeneous
data. Increase of ca
will produce only very strongly
separated clusters. Both may be of interest occasionally.
rfpi
is called by fixreg
for a single fixed point
algorithm and will usually not be executed alone.
summary.rfpc
gives a summary about the representative FPCs of
stable groups.
plot.rfpc
is a plot method for the representative FPC of stable
group
no. no
. It produces a scatterplot of the linear combination of
independent variables determined by the regression coefficients of the
FPC vs. the dependent variable. The regression line and the region
of non-outliers determined by ca
are plotted as well.
fpclusters.rfpc
produces a list of indicator vectors for the
representative FPCs of stable groups.
fixreg
returns an object of class rfpc
. This is a list
containing the components nc, g, coefs, vars, nfound, er, tsc,
ncoll, grto, imatrix, smatrix, stn, stfound, sfpc, ssig, sto, struc,
n, p, ca, ir, mnc, mtf, distcut
.
summary.rfpc
returns an object of class summary.rfpc
.
This is a list containing the components coefs, vars, stfound,
stn, sn, ser, tsc, sim, ca, ir, mnc, mtf
.
fpclusters.rfpc
returns a list of indicator vectors for the
representative FPCs of stable groups.
rfpi
returns a list with the components coef, var, g,
coll, ca
.
nc |
integer. Number of FPCs. |
g |
list of logical vectors. Indicator vectors of FPCs. |
coefs |
list of numerical vectors. Regression coefficients of
FPCs. In |
vars |
list of numbers. Error variances of FPCs. In
|
nfound |
vector of integers. Number of findings for the FPCs. |
er |
numerical vector. Expectation ratios of FPCs. Can be taken as a stability measure. |
tsc |
integer. Number of algorithm runs that led to FPCs that were too small or found too seldom. |
ncoll |
integer. Number of algorithm runs where collinear regressor matrices occurred. |
grto |
vector of integers. Numbers of FPCs to which algorithm
runs led, which were started by |
imatrix |
vector of integers. Size of intersection between
FPCs. See |
smatrix |
numerical vector. Similarities between
FPCs. See |
stn |
integer. Number of representative FPCs of stable groups. In
|
stfound |
vector of integers. Number of findings of members of
all groups of FPCs. In
|
sfpc |
vector of integers. Numbers of representative FPCs. |
ssig |
vector of integers. As |
sto |
vector of integers. Number of representative FPC of most, 2nd most, ..., often found group of FPCs. |
struc |
vector of integers. Number of group an FPC belongs to. |
n |
see arguments. |
p |
see arguments. |
ca |
see arguments. |
ir |
see arguments. |
mnc |
see arguments. |
mtf |
see arguments. |
distcut |
see arguments. |
sn |
vector of integers. Number of points of representative FPCs. |
ser |
numerical vector. Expectation ratio for stable groups. |
sim |
vector of integers. Size of intersections between
representative FPCs of stable groups. See |
coef |
vector of regression coefficients. |
var |
error variance. |
g |
logical indicator vector of iterated FPC. |
coll |
logical. |
Christian Hennig [email protected] https://www.unibo.it/sitoweb/christian.hennig/en/
Hennig, C. (2002) Fixed point clusters for linear regression: computation and comparison, Journal of Classification 19, 249-276.
Hennig, C. (2003) Clusters, outliers and regression: fixed point clusters, Journal of Multivariate Analysis 86, 183-212.
fixmahal
for fixed point clusters in the usual setup
(non-regression).
regmix
for clusterwise linear regression by mixture
modeling ML.
can
, itnumber
for computation of the default
settings.
clusexpect
for estimation of the expected number of
findings of an FPC of given size.
itnumber
for the generation of the number of fixed point
algorithms.
minsize
for the smallest FPC size to be found with a given
probability.
sseg
for indexing the similarity/intersection vectors
computed by fixreg
.
set.seed(190000)
options(digits=3)
data(tonedata)
attach(tonedata)
tonefix <- fixreg(stretchratio,tuned,mtf=1,ir=20)
summary(tonefix)
# This is designed to have a fast example; default setting would be better.
# If you want to see more (and you have a bit more time),
# try out the following:
## Not run:
set.seed(1000)
tonefix <- fixreg(stretchratio,tuned)
# Default - good for these data
summary(tonefix)
plot(tonefix,stretchratio,tuned,1)
plot(tonefix,stretchratio,tuned,2)
plot(tonefix,stretchratio,tuned,3,bw=FALSE,pch=5)
toneclus <- fpclusters(tonefix,stretchratio,tuned)
plot(stretchratio,tuned,col=1+toneclus[[2]])
tonefix2 <- fixreg(stretchratio,tuned,distcut=1,mtf=1,countmode=50)
# Every found fixed point cluster is reported,
# no matter how instable it may be.
summary(tonefix2)
tonefix3 <- fixreg(stretchratio,tuned,ca=7)
# ca defaults to 10.07 for these data.
summary(tonefix3)
subset <- c(rep(FALSE,5),rep(TRUE,24),rep(FALSE,121))
tonefix4 <- fixreg(stretchratio,tuned, mtf=1,ir=0,init.group=list(subset))
summary(tonefix4)
## End(Not run)
flexmixedruns
fits a latent class
mixture (clustering) model where some variables are continuous
and modelled within the mixture components by Gaussian distributions
and some variables are categorical and modelled within components by
independent multinomial distributions. The fit is by maximum
likelihood estimation computed with the EM-algorithm. The number of
components can be estimated by the BIC.
Note that at least one categorical variable is needed, but it is possible to use data without any continuous variable.
flexmixedruns(x,diagonal=TRUE,xvarsorted=TRUE, continuous,discrete,ppdim=NULL,initial.cluster=NULL, simruns=20,n.cluster=1:20,verbose=TRUE,recode=TRUE, allout=TRUE,control=list(minprior=0.001),silent=TRUE)
x |
data matrix or data frame. The data need to be organised case-wise, i.e., if there are categorical variables only, and 15 cases with values c(1,1,2) on the 3 variables, the data matrix needs 15 rows with values 1 1 2. (Categorical variables could take numbers or strings or anything that can be coerced to factor levels as values.) |
diagonal |
logical. If |
xvarsorted |
logical. If |
continuous |
vector of integers giving positions of the
continuous variables. If |
discrete |
vector of integers giving positions of the
categorical variables. If |
ppdim |
vector of integers specifying the number of (in the data)
existing categories for each categorical variable. If
|
initial.cluster |
this corresponds to the |
simruns |
integer. Number of starts of the EM algorithm with random initialisation in order to find a good global optimum. |
n.cluster |
vector of integers, numbers of components (the optimum one is found by minimising the BIC). |
verbose |
logical. If |
recode |
logical. If |
allout |
logical. If |
control |
list of control parameters for |
silent |
logical. This is passed on to the
|
Sometimes flexmix produces errors because of degenerate covariance
matrices, clusters that are too small, etc. flexmixedruns
tolerates these
and treats them as non-optimal runs. (Higher simruns
or
different control
may be required to get a valid solution.)
General documentation on flexmix can be found in Friedrich Leisch's "FlexMix: A General Framework for Finite Mixture Models and Latent Class Regression in R", https://CRAN.R-project.org/package=flexmix
A list with components
optsummary |
summary object for |
optimalk |
optimal number of components. |
errcount |
vector with numbers of EM runs for each number of components that led to flexmix errors. |
flexout |
if
If |
bicvals |
vector of values of the BIC for each number of components. |
ppdim |
vector of categorical variable-wise numbers of categories. |
discretelevels |
list of levels of the categorical variables
belonging to what is treated by |
Christian Hennig [email protected] https://www.unibo.it/sitoweb/christian.hennig/en
Hennig, C. and Liao, T. (2013) How to find an appropriate clustering for mixed-type variables with application to socio-economic stratification, Journal of the Royal Statistical Society, Series C Applied Statistics, 62, 309-369.
lcmixed
, flexmix
,
FLXcontrol-class
,
flexmix-class
,
discrete.recode
.
options(digits=3)
set.seed(776655)
v1 <- rnorm(100)
v2 <- rnorm(100)
d1 <- sample(1:5,100,replace=TRUE)
d2 <- sample(1:4,100,replace=TRUE)
ldata <- cbind(v1,v2,d1,d2)
fr <- flexmixedruns(ldata,
  continuous=2,discrete=2,simruns=2,n.cluster=2:3,allout=FALSE)
print(fr$optimalk)
print(fr$optsummary)
print(fr$flexout@cluster)
print(fr$flexout@components)
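Continuing the example, the BIC values and the number of failed EM runs per number of components can be inspected before relying on optimalk (illustration only):
fr$bicvals   # BIC for each number of components tried
fr$errcount  # EM runs that ended in a flexmix error, per number of components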
fpclusters
is a generic function which extracts the
representative fixed point clusters (FPCs)
from FPC objects generated by fixmahal
and
fixreg
. For documentation and examples see
fixmahal
and fixreg
.
fpclusters(object, ...)
object |
object of class |
... |
further arguments depending on the method. |
a list of logical or numerical vectors indicating or giving the weights of the cluster memberships.
Christian Hennig [email protected] https://www.unibo.it/sitoweb/christian.hennig/en/
Computes the number of fixed point iterations needed by
fixreg
to find mtf
times
a fixed point cluster (FPC) of size
cn
with an approximated probability of prob
.
Thought for use within fixreg
.
itnumber(n, p, cn, mtf, prob = 0.95, maxir = 20000)
n |
positive integer. Total number of points. |
p |
positive integer. Number of independent variables. |
cn |
positive integer smaller or equal to |
mtf |
positive integer. |
prob |
number between 0 and 1. |
maxir |
positive integer. |
The computation is based on the binomial distribution with probability
given by clusexpect
with ir=1
.
An integer.
Christian Hennig [email protected] https://www.unibo.it/sitoweb/christian.hennig/en/
Hennig, C. (2002) Fixed point clusters for linear regression: computation and comparison, Journal of Classification 19, 249-276.
itnumber(500,4,150,2)
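The following sketch mimics this reasoning under the simplifying assumption that a run "finds" the FPC exactly if its random initial subset of size p+2 falls entirely inside it; the probability actually used by itnumber comes from clusexpect and may differ, so this is an illustration only.
n <- 500; p <- 4; cn <- 150; mtf <- 2; prob <- 0.95
# assumed success probability of a single run (illustrative approximation)
psuccess <- prod((cn - 0:(p+1)) / (n - 0:(p+1)))
ir <- 1
while (pbinom(mtf-1, ir, psuccess) > 1-prob) ir <- ir+1
ir  # smallest number of runs giving at least mtf findings with probability prob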
Jitters some variables in a data matrix.
jittervar(x,jitterv=NULL,factor=1)
x |
data matrix or data frame. |
jitterv |
vector of numbers of variables to be jittered. |
factor |
numeric. Passed on to |
data matrix or data frame with jittered variables.
Christian Hennig [email protected] https://www.unibo.it/sitoweb/christian.hennig/en
set.seed(776655)
v1 <- rnorm(20)
v2 <- rnorm(20)
d1 <- sample(1:5,20,replace=TRUE)
d2 <- sample(1:4,20,replace=TRUE)
ldata <- cbind(v1,v2,d1,d2)
jv <- jittervar(ldata,jitterv=3:4)
These functions provide an interface to several clustering methods
implemented in R, for use together with the cluster stability
assessment in clusterboot
(as parameter
clustermethod
; "CBI" stands for "clusterboot interface").
In some situations it could make sense to use them to compute a
clustering even if you don't want to run clusterboot
, because
some of the functions contain some additional features (e.g., normal
mixture model based clustering of dissimilarity matrices projected
into the Euclidean space by MDS or partitioning around medoids with
estimated number of clusters, noise/outlier identification in
hierarchical clustering).
kmeansCBI(data,krange,k,scaling=FALSE,runs=1,criterion="ch",...)
hclustCBI(data,k,cut="number",method,scaling=TRUE,noisecut=0,...)
hclusttreeCBI(data,minlevel=2,method,scaling=TRUE,...)
disthclustCBI(dmatrix,k,cut="number",method,noisecut=0,...)
noisemclustCBI(data,G,k,modelNames,nnk,hcmodel=NULL,Vinv=NULL,
               summary.out=FALSE,...)
distnoisemclustCBI(dmatrix,G,k,modelNames,nnk,
                   hcmodel=NULL,Vinv=NULL,mdsmethod="classical",
                   mdsdim=4, summary.out=FALSE, points.out=FALSE,...)
claraCBI(data,k,usepam=TRUE,diss=inherits(data,"dist"),...)
pamkCBI(data,krange=2:10,k=NULL,criterion="asw", usepam=TRUE,
        scaling=FALSE,diss=inherits(data,"dist"),...)
tclustCBI(data,k,trim=0.05,...)
dbscanCBI(data,eps,MinPts,diss=inherits(data,"dist"),...)
mahalCBI(data,clustercut=0.5,...)
mergenormCBI(data, G=NULL, k=NULL, modelNames=NULL, nnk=0,
             hcmodel = NULL, Vinv = NULL, mergemethod="bhat",
             cutoff=0.1,...)
speccCBI(data,k,...)
pdfclustCBI(data,...)
stupidkcentroidsCBI(dmatrix,k,distances=TRUE)
stupidknnCBI(dmatrix,k)
stupidkfnCBI(dmatrix,k)
stupidkavenCBI(dmatrix,k)
data |
a numeric matrix. The data
matrix - usually a cases*variables-data matrix. |
dmatrix |
a squared numerical dissimilarity matrix or a
|
k |
numeric, usually integer. In most cases, this is the number
of clusters for methods where this is fixed. For |
scaling |
either a logical value or a numeric vector of length
equal to the number of variables. If |
runs |
integer. Number of random initializations from which the k-means algorithm is started. |
criterion |
|
cut |
either "level" or "number". This determines how
|
method |
method for hierarchical clustering, see the
documentation of |
noisecut |
numeric. All clusters of size |
minlevel |
integer. |
G |
vector of integers. Number of clusters or numbers of clusters
used by
|
modelNames |
vector of string. Models for covariance matrices,
see documentation of
|
nnk |
integer. Tuning constant for
|
hcmodel |
string or |
Vinv |
numeric. See documentation of
|
summary.out |
logical. If |
mdsmethod |
"classical", "kruskal" or "sammon". Determines the
multidimensional scaling method to compute Euclidean data from a
dissimilarity matrix. See |
mdsdim |
integer. Dimensionality of MDS solution. |
points.out |
logical. If |
usepam |
logical. If |
diss |
logical. If |
krange |
vector of integers. Numbers of clusters to be compared. |
trim |
numeric between 0 and 1. Proportion of data points
trimmed, i.e., assigned to noise. See |
eps |
numeric. The radius of the neighborhoods to be considered
by |
MinPts |
integer. How many points have to be in a neighborhood so
that a point is considered to be a cluster seed? See documentation
of |
clustercut |
numeric between 0 and 1. If |
mergemethod |
method for merging Gaussians, passed on as
|
cutoff |
numeric between 0 and 1, tuning constant for
|
distances |
logical (only for |
... |
further parameters to be transferred to the original clustering functions (not required). |
All these functions call clustering methods implemented in R to
cluster data and to provide output in the format required by
clusterboot
. Here is a brief overview. For further
details see the help pages of the involved clustering methods.
an interface to the function
kmeansruns
calling kmeans
for k-means clustering. (kmeansruns
allows the
specification of several random initializations of the
k-means algorithm and estimation of k by the Calinski-Harabasz
index or the average silhouette width.)
an interface to the function
hclust
for agglomerative hierarchical clustering with
noise component (see parameter noisecut
above). This
function produces a partition and assumes a cases*variables
matrix as input.
an interface to the function
hclust
for agglomerative hierarchical clustering. This
function gives out all clusters belonging to the hierarchy
(upward from a certain level, see parameter minlevel
above).
an interface to the function
hclust
for agglomerative hierarchical clustering with
noise component (see parameter noisecut
above). This
function produces a partition and assumes a dissimilarity
matrix as input.
an interface to the function
mclustBIC
, for normal mixture model based
clustering. Warning: mclustBIC
often
has problems with multiple
points. In clusterboot
, it is recommended to use
this together with multipleboot=FALSE
.
an interface to the function
mclustBIC
for normal mixture model based
clustering. This assumes a dissimilarity matrix as input and
generates a data matrix by multidimensional scaling first.
Warning: mclustBIC
often has
problems with multiple
points. In clusterboot
, it is recommended to use
this together with multipleboot=FALSE
.
an interface to the functions
pam
and clara
for partitioning around medoids.
an interface to the function
pamk
calling pam
for
partitioning around medoids. The number
of clusters is estimated by the Calinski-Harabasz index or by the
average silhouette width.
an interface to the function
tclust
in the tclust package for trimmed Gaussian
clustering. This assumes a cases*variables matrix as input.
an interface to the function
dbscan
for density based
clustering.
an interface to the function
fixmahal
for fixed point
clustering. This assumes a cases*variables matrix as input.
an interface to the function
mergenormals
for clustering by merging Gaussian
mixture components. Unlike mergenormals
, mergenormCBI
includes the computation of the initial Gaussian mixture.
This assumes a cases*variables matrix as input.
an interface to the function
specc
for spectral clustering. See
the specc
help page for additional tuning
parameters. This assumes a cases*variables matrix as input.
an interface to the function
pdfCluster
for density-based clustering. See
the pdfCluster
help page for additional tuning
parameters. This assumes a cases*variables matrix as input.
an interface to the function
stupidkcentroids
for random centroid-based clustering. See
the stupidkcentroids
help page. This can have a
distance matrix as well as a cases*variables matrix as input, see
parameter distances
.
an interface to the function
stupidknn
for random nearest neighbour clustering. See
the stupidknn
help page. This assumes a
distance matrix as input.
an interface to the function
stupidkfn
for random farthest neighbour clustering. See
the stupidkfn
help page. This assumes a
distance matrix as input.
an interface to the function
stupidkaven
for random average dissimilarity clustering. See
the stupidkaven
help page. This assumes a
distance matrix as input.
All interface functions return a list with the following components
(there may be some more, see summary.out
and points.out
above):
result |
clustering result, usually a list with the full output of the clustering method (the precise format doesn't matter); whatever you want to use later. |
nc |
number of clusters. If some points don't belong to any
cluster, these are declared "noise". |
clusterlist |
this is a list consisting of logical vectors
of length equal to the number of data points ( |
partition |
an integer vector of length |
clustermethod |
a string indicating the clustering method. |
The output of some of the functions has further components:
nccl |
see |
nnk |
by |
initnoise |
logical vector, indicating initially estimated noise by
|
noise |
logical. |
Christian Hennig [email protected] https://www.unibo.it/sitoweb/christian.hennig/en/
clusterboot
, dist
,
kmeans
, kmeansruns
, hclust
,
mclustBIC
,
pam
, pamk
,
clara
,
dbscan
,
fixmahal
,
tclust
, pdfCluster
options(digits=3)
set.seed(20000)
face <- rFace(50,dMoNo=2,dNoEy=0,p=2)
dbs <- dbscanCBI(face,eps=1.5,MinPts=4)
dhc <- disthclustCBI(dist(face),method="average",k=1.5,noisecut=2)
table(dbs$partition,dhc$partition)
dm <- mergenormCBI(face,G=10,modelNames="EEE",nnk=2)
dtc <- tclustCBI(face,6,trim=0.1,restr.fact=500)
table(dm$partition,dtc$partition)
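As an illustration of the common output format described above, a user-defined CBI wrapper around kmeans could look as follows (a sketch, not one of the interface functions of the package):
mykmeansCBI <- function(data, k, ...) {
  cl <- kmeans(data, k, ...)
  list(result=cl,
       nc=k,
       clusterlist=lapply(1:k, function(i) cl$cluster==i),
       partition=cl$cluster,
       clustermethod="kmeans (user-defined CBI sketch)")
}
str(mykmeansCBI(face,3), max.level=1)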
This calls the function kmeans
to perform a k-means
clustering, but initializes the k-means algorithm several times with
random points from the data set as means. Furthermore, it is more
robust against the occurrence of empty clusters in the algorithm and
it estimates the number of clusters by either the Calinski-Harabasz
index (calinhara
) or average silhouette width (see
pam.object
). The Duda-Hart test
(dudahart2
) is applied to decide whether there should be
more than one cluster (unless 1 is excluded as number of clusters).
kmeansruns(data,krange=2:10,criterion="ch", iter.max=100,runs=100, scaledata=FALSE,alpha=0.001, critout=FALSE,plot=FALSE,...)
data |
A numeric matrix of data, or an object that can be coerced to such a matrix (such as a numeric vector or a data frame with all numeric columns). |
krange |
integer vector. Numbers of clusters which are to be
compared by the average silhouette width criterion. Note: average
silhouette width and Calinski-Harabasz can't estimate number of
clusters |
criterion |
one of |
iter.max |
integer. The maximum number of iterations allowed. |
runs |
integer. Number of starts of the k-means algorithm. |
scaledata |
logical. If |
alpha |
numeric between 0 and 1, tuning constant for
|
critout |
logical. If |
plot |
logical. If |
... |
further arguments to be passed on to |
The output of the optimal run of the kmeans
-function
with added components bestk
and crit
.
A list with components
cluster |
A vector of integers indicating the cluster to which each point is allocated. |
centers |
A matrix of cluster centers. |
withinss |
The within-cluster sum of squares for each cluster. |
size |
The number of points in each cluster. |
bestk |
The optimal number of clusters. |
crit |
Vector with values of the |
Christian Hennig [email protected] https://www.unibo.it/sitoweb/christian.hennig/en/
Calinski, T., and Harabasz, J. (1974) A Dendrite Method for Cluster Analysis, Communications in Statistics, 3, 1-27.
Duda, R. O. and Hart, P. E. (1973) Pattern Classification and Scene Analysis. Wiley, New York.
Hartigan, J. A. and Wong, M. A. (1979). A K-means clustering algorithm. Applied Statistics, 28, 100-108.
Kaufman, L. and Rousseeuw, P.J. (1990). "Finding Groups in Data: An Introduction to Cluster Analysis". Wiley, New York.
kmeans
, pamk
,
calinhara
, dudahart2
.
options(digits=3)
set.seed(20000)
face <- rFace(50,dMoNo=2,dNoEy=0,p=2)
pka <- kmeansruns(face,krange=1:5,critout=TRUE,runs=2,criterion="asw")
pkc <- kmeansruns(face,krange=1:5,critout=TRUE,runs=2,criterion="ch")
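As an illustration of what the criterion-based choice of the number of clusters amounts to (kmeansruns additionally uses multiple random starts and, if 1 is in krange, the Duda-Hart test), the Calinski-Harabasz index can be computed directly for a range of k:
ch <- sapply(2:5, function(k) calinhara(face, kmeans(face,k)$cluster))
(2:5)[which.max(ch)]  # k maximising the Calinski-Harabasz index; compare pkc$bestk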
lcmixed
is a method for the
flexmix
-function in package
flexmix
. It provides the necessary information to run an
EM-algorithm for maximum likelihood estimation for a latent class
mixture (clustering) model where some variables are continuous
and modelled within the mixture components by Gaussian distributions
and some variables are categorical and modelled within components by
independent multinomial distributions. lcmixed
can be called
within flexmix
. The function flexmixedruns
is a wrapper
function that can be run to apply lcmixed
.
Note that at least one categorical variable is needed, but it is possible to use data without any continuous variable.
There are further format restrictions to the data (see below in the
documentation of continuous
and discrete
), which
can be ignored when running lcmixed
through
flexmixedruns
.
lcmixed( formula = .~. , continuous, discrete, ppdim, diagonal = TRUE, pred.ordinal=FALSE, printlik=FALSE )
formula |
a formula to specify response and explanatory
variables. For |
continuous |
number of continuous variables. Note that the continuous variables always need to be the first variables in the matrix or data frame. |
discrete |
number of categorical variables. Always the last variables in the matrix or data frame. Note that categorical variables always must be coded as integers 1,2,3, etc. without interruption. |
ppdim |
vector of integers specifying the number of (in the data) existing categories for each categorical variable. |
diagonal |
logical. If |
pred.ordinal |
logical. If |
printlik |
logical. If |
The data need to be organised case-wise, i.e., if there are categorical variables only, and 15 cases with values c(1,1,2) on the 3 variables, the data matrix needs 15 rows with values 1 1 2.
General documentation on flexmix methods can be found in Chapter 4 of Friedrich Leisch's "FlexMix: A General Framework for Finite Mixture Models and Latent Class Regression in R", https://CRAN.R-project.org/package=flexmix
An object of class FLXMC
(not documented; only used
internally by flexmix
).
Christian Hennig [email protected] https://www.unibo.it/sitoweb/christian.hennig/en
Hennig, C. and Liao, T. (2013) How to find an appropriate clustering for mixed-type variables with application to socio-economic stratification, Journal of the Royal Statistical Society, Series C Applied Statistics, 62, 309-369.
flexmixedruns
, flexmix
,
flexmix-class
,
discrete.recode
, which recodes a dataset into the format
required by lcmixed
set.seed(112233)
options(digits=3)
require(MASS)
require(flexmix)
data(Cars93)
Cars934 <- Cars93[,c(3,5,8,10)]
cc <- discrete.recode(Cars934,xvarsorted=FALSE,continuous=c(2,3),discrete=c(1,4))
fcc <- flexmix(cc$data~1,k=2,
  model=lcmixed(continuous=2,discrete=2,ppdim=c(6,3),diagonal=TRUE))
summary(fcc)
This computes a matrix formalising 'local shape', i.e., aggregated
standardised variance/covariance in a Mahalanobis neighbourhood of the data
points. This can be used for finding clusters when used as one of the
covariance matrices in
Invariant Coordinate Selection (function ics
in package
ICS
), see Hennig's
discussion and rejoinder of Tyler et al. (2009).
localshape(xdata,proportion=0.1,mscatter="mcd",mcdalpha=0.8, covstandard="det")
xdata |
objects times variables data matrix. |
proportion |
proportion of points to be considered as neighbourhood. |
mscatter |
"mcd" or "cov"; specified minimum covariance determinant or classical covariance matrix to be used for Mahalanobis distance computation. |
mcdalpha |
if |
covstandard |
one of "trace", "det" or "none", determining by what constant the pointwise neighbourhood covariance matrices are standardised. "det" makes the affine equivariant, as noted in the discussion rejoinder of Tyler et al. (2009). |
The local shape matrix.
Christian Hennig [email protected] https://www.unibo.it/sitoweb/christian.hennig/en
Tyler, D. E., Critchley, F., Duembgen, L., Oja, H. (2009) Invariant coordinate selection (with discussion). Journal of the Royal Statistical Society, Series B, 549-592.
options(digits=3)
data(iris)
localshape(iris[,-5],mscatter="cov")
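As a rough sketch of how the local shape matrix enters Invariant Coordinate Selection (simplified here to an eigendecomposition of one scatter matrix relative to another; the ICS package should be used for a proper analysis):
X <- as.matrix(iris[,-5])
lsm <- localshape(X, mscatter="cov")
# invariant coordinates are obtained from eigenvectors of solve(S1) %*% S2,
# here with S1 the sample covariance and S2 the local shape matrix
ev <- eigen(solve(cov(X)) %*% lsm)
ev$values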
Vector of squared Mahalanobis distances or their square roots. For use in awcoord
only.
mahalanodisc(x2, mg, covg, modus="square")
x2 |
numerical data matrix. |
mg |
mean vector. |
covg |
covariance matrix. |
modus |
"md" (roots of Mahalanobis distances) or "square" (original squared form of Mahalanobis distances). |
The covariance matrix
is inverted by use of
solvecov
, which can be expected to give
reasonable results for singular within-class covariance matrices.
vector of (rooted) Mahalanobis distances.
Christian Hennig [email protected] https://www.unibo.it/sitoweb/christian.hennig/en/
options(digits=3)
x <- cbind(rnorm(50),rnorm(50))
mahalanodisc(x,c(0,0),cov(x))
mahalanodisc(x,c(0,0),matrix(0,ncol=2,nrow=2))
Computes the vector of (classical or robust)
Mahalanobis distances of all points of x
to the center of the points indexed (or weighted)
by gv
. The latter also determine
the covariance matrix.
Thought for use within fixmahal
.
mahalanofix(x, n = nrow(as.matrix(x)), p = ncol(as.matrix(x)),
            gv = rep(1, times = n), cmax = 1e+10, method = "ml")
mahalanofuz(x, n = nrow(as.matrix(x)), p = ncol(as.matrix(x)),
            gv = rep(1, times=n), cmax = 1e+10)
x |
a numerical data matrix, rows are points, columns are variables. |
n |
positive integer. Number of points. |
p |
positive integer. Number of variables. |
gv |
for |
cmax |
positive number. used in |
method |
|
solvecov
is used to invert the covariance matrix. The methods
"mcd"
and "mve"
in mahalanofix
do not work properly
with point constellations with singular covariance matrices!
A list of the following components:
md |
vector of Mahalanobis distances. |
mg |
mean of the points indexed by |
covg |
covariance matrix of the points indexed by |
covinv |
|
coll |
logical. If |
Methods "mcd"
and "mve"
require library lqs
.
Christian Hennig [email protected] https://www.unibo.it/sitoweb/christian.hennig/en/
x <- c(1,2,3,4,5,6,7,8,9,10)
y <- c(1,2,3,8,7,6,5,8,9,10)
mahalanofix(cbind(x,y),gv=c(0,0,0,1,1,1,1,1,0,0))
mahalanofix(cbind(x,y),gv=c(0,0,0,1,1,1,1,0,0,0))
mahalanofix(cbind(x,y),gv=c(0,0,0,1,1,1,1,1,0,0),method="mcd")
mahalanofuz(cbind(x,y),gv=c(0,0,0.5,0.5,1,1,1,0.5,0.5,0))
Generates an initial configuration of startn
points from
dataset x
for the fixmahal
fixed point iteration.
Thought only for use within fixmahal
.
mahalconf(x, no, startn, covall, plot)
x |
numerical matrix. Rows are points, columns are variables. |
no |
integer between 1 and |
startn |
integer between 1 and |
covall |
covariance matrix for the computation of the first Mahalanobis distances. |
plot |
a string. If equal to |
mahalconf
first chooses the p (number of variables)
nearest points to point no.
no
in terms of the Mahalanobis
distance w.r.t. covall
, so that there are p+1 points.
In every further step, the covariance
matrix of the current configuration is computed and the nearest point
in terms of the new Mahalanobis distance is
added.
solvecov
is used to invert singular covariance
matrices.
A logical vector of length nrow(x).
Christian Hennig [email protected] https://www.unibo.it/sitoweb/christian.hennig/en/
set.seed(4634)
face <- rFace(600,dMoNo=2,dNoEy=0,p=2)
mahalconf(face,no=200,startn=20,covall=cov(face),plot="start")
Clustering by merging Gaussian mixture components; computes all methods introduced in Hennig (2010) from an initial mclust clustering. See the details section for the different merging criteria.
mergenormals(xdata, mclustsummary=NULL, clustering, probs, muarray, Sigmaarray, z,
             method=NULL, cutoff=NULL, by=0.005, numberstop=NULL, renumber=TRUE, M=50, ...)
## S3 method for class 'mergenorm'
summary(object, ...)
## S3 method for class 'summary.mergenorm'
print(x, ...)
xdata |
data (something that can be coerced into a matrix). |
mclustsummary |
output object from
|
clustering |
vector of integers. Initial assignment of data to mixture components. |
probs |
vector of component proportions (for all components; should sum up to one). |
muarray |
matrix of component means (rows). |
Sigmaarray |
array of component covariance matrices (third dimension refers to component number). |
z |
matrix of observation- (row-)wise posterior probabilities of belonging to the components (columns). |
method |
one of |
cutoff |
numeric between 0 and 1. Tuning constant, see details and Hennig (2010). If not specified, the default values given in (9) in Hennig (2010) are used. |
by |
real between 0 and 1. Interval width for density computation
along the ridgeline, used for methods |
numberstop |
integer. If specified, |
renumber |
logical. If |
M |
integer. Number of times the dataset is divided into two
halves. Used if |
... |
additional optional parameters to pass on to
|
object |
object of class |
x |
object of class |
Mixture components are merged in a hierarchical fashion. The merging
criterion is computed for all pairs of current clusters and the two
clusters with the highest criterion value (lowest, respectively, for
method="predictive"
) are merged. Then criterion values are
recomputed for the merged cluster. Merging is continued until the
criterion value to merge is below (or above, for
method="predictive"
) the cutoff value. Details are given in
Hennig (2010). The following criteria are offered, specified by the method argument.
components are only merged if their mixture is
unimodal according to Ray and Lindsay's (2005) ridgeline theory,
see ridgeline.diagnosis
. This ignores argument
cutoff
.
ratio between density minimum between
components and minimum of density maxima according to Ray and
Lindsay's (2005) ridgeline theory, see
ridgeline.diagnosis
.
Bhattacharyya upper bound on misclassification
probability between two components, see
bhattacharyya.matrix
.
direct estimation of misclassification probability between components, see Hennig (2010).
this uses method="ridge.ratio"
to decide
which clusters to merge but stops merging according to the p-value of
the dip test computed as in Hartigan and Hartigan (1985), see
dip.test
.
as "dipuni"
, but p-value of dip test
computed as in Tantrum, Murua and Stuetzle (2003), see
dipp.tantrum
.
this uses method="demp"
to decide which
clusters to merge but stops merging according to the value of
prediction strength (Tibshirani and Walther, 2005) as computed in
mixpredictive
.
mergenormals gives out an object of class mergenorm, which is a list with components
clustering |
integer vector. Final clustering. |
clusternumbers |
vector of numbers of remaining clusters. These
are given in terms of the original clusters even of
|
defunct.components |
vector of numbers of components that were "merged away". |
valuemerged |
vector of values of the merging criterion (see details) at which components were merged. |
mergedtonumbers |
vector of numbers of clusters to which the original components were merged. |
parameters |
a list, if |
predvalues |
vector of prediction strength values for
clusternumbers from 1 to the number of components in the original
mixture, if |
orig.decisionmatrix |
square matrix with entries giving the original values of the merging criterion (see details) for every pair of original mixture components. |
new.decisionmatrix |
square matrix as |
probs |
final cluster values of |
muarray |
final cluster means, analogous to |
Sigmaarray |
final cluster covariance matrices, analogous to
|
z |
final matrix of posterior probabilities of observations
belonging to the clusters, analogous to |
noise |
logical. If |
method |
as above. |
cutoff |
as above. |
summary.mergenorm
gives out a list with components
clustering, clusternumbers, defunct.components, valuemerged,
mergedtonumbers, predvalues, probs, muarray, Sigmaarray, z, noise,
method, cutoff
as above, plus onc
(original number of
components) and mnc
(number of clusters after merging).
Christian Hennig [email protected] https://www.unibo.it/sitoweb/christian.hennig/en/
J. A. Hartigan and P. M. Hartigan (1985) The Dip Test of Unimodality, Annals of Statistics, 13, 70-84.
Hennig, C. (2010) Methods for merging Gaussian mixture components, Advances in Data Analysis and Classification, 4, 3-34.
Ray, S. and Lindsay, B. G. (2005) The Topography of Multivariate Normal Mixtures, Annals of Statistics, 33, 2042-2065.
Tantrum, J., Murua, A. and Stuetzle, W. (2003) Assessment and Pruning of Hierarchical Model Based Clustering, Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, Washington, D.C., 197-205.
Tibshirani, R. and Walther, G. (2005) Cluster Validation by Prediction Strength, Journal of Computational and Graphical Statistics, 14, 511-528.
require(mclust)
require(MASS)
options(digits=3)
data(crabs)
dc <- crabs[,4:8]
cm <- mclustBIC(crabs[,4:8],G=9,modelNames="EEE")
scm <- summary(cm,crabs[,4:8])
cmnbhat <- mergenormals(crabs[,4:8],scm,method="bhat")
summary(cmnbhat)
cmndemp <- mergenormals(crabs[,4:8],scm,method="demp")
summary(cmndemp)
# Other methods take a bit longer, but try them!
# The values of by and M below are still chosen for reasonably fast execution.
# cmnrr <- mergenormals(crabs[,4:8],scm,method="ridge.ratio",by=0.05)
# cmd <- mergenormals(crabs[,4:8],scm,method="dip.tantrum",by=0.05)
# cmp <- mergenormals(crabs[,4:8],scm,method="predictive",M=3)
Re-computes pointwise posterior probabilities, mean and covariance matrix for a mixture component obtained by merging two mixture components in a Gaussian mixture.
mergeparameters(xdata, j1, j2, probs, muarray,Sigmaarray, z)
xdata |
data (something that can be coerced into a matrix). |
j1 |
integer. Number of first mixture component to be merged. |
j2 |
integer. Number of second mixture component to be merged. |
probs |
vector of component proportions (for all components; should sum up to one). |
muarray |
matrix of component means (rows). |
Sigmaarray |
array of component covariance matrices (third dimension refers to component number). |
z |
matrix of observation- (row-)wise posterior probabilities of belonging to the components (columns). |
List with components
probs |
see above; sum of probabilities for original components
|
muarray |
see above; weighted mean of means of component
|
Sigmaarray |
see above; weighted covariance matrix handled as above. |
z |
see above; original entries for columns |
Christian Hennig [email protected] https://www.unibo.it/sitoweb/christian.hennig/en/
Hennig, C. (2010) Methods for merging Gaussian mixture components, Advances in Data Analysis and Classification, 4, 3-34.
options(digits=3)
set.seed(98765)
require(mclust)
iriss <- iris[sample(150,20),-5]
irisBIC <- mclustBIC(iriss)
siris <- summary(irisBIC,iriss)
probs <- siris$parameters$pro
muarray <- siris$parameters$mean
Sigmaarray <- siris$parameters$variance$sigma
z <- siris$z
mpi <- mergeparameters(iriss,1,2,probs,muarray,Sigmaarray,z)
mpi$probs
mpi$muarray
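As a rough illustration of the merged proportion and mean described above (a sketch using the objects from the example, not the full mergeparameters logic, which also updates the covariance matrix):

# merged proportion: sum of the two component proportions
p12 <- probs[1] + probs[2]
# merged mean: proportion-weighted mean of the two component means
mu12 <- (probs[1]*muarray[,1] + probs[2]*muarray[,2]) / p12
# posterior probabilities of the merged component: sum of the two columns
z12 <- z[,1] + z[,2]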
Computes the minimum size of a fixed point cluster (FPC) which is found at least mtf times with approximate probability prob by ir fixed point iterations of fixreg. Intended for use within fixreg.
minsize(n, p, ir, mtf, prob = 0.5)
n |
positive integer. Total number of points. |
p |
positive integer. Number of independent variables. |
ir |
positive integer. Number of fixed point iterations. |
mtf |
positive integer. |
prob |
numerical between 0 and 1. |
The computation is based on the binomial distribution with probability given by clusexpect with ir=1.
An integer.
Christian Hennig [email protected] https://www.unibo.it/sitoweb/christian.hennig/en/
Hennig, C. (2002) Fixed point clusters for linear regression: computation and comparison, Journal of Classification 19, 249-276.
minsize(500,4,7000,2)
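The binomial logic can be sketched as follows (hypothetical helper, not the package code; pr1 stands for the single-iteration probability delivered by clusexpect with ir=1). minsize is then the smallest cluster size for which this returns TRUE.

# TRUE if an FPC with single-run finding probability pr1 is found at least
# mtf times in ir iterations with probability >= prob
findable <- function(pr1, ir, mtf, prob = 0.5) {
  (1 - pbinom(mtf - 1, ir, pr1)) >= prob
}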
Computes density values for data from a mixture of multivariate Gaussian distributions with parameters based on the way models are specified and parameters are stored in package mclust.
mixdens(modelName,data,parameters)
modelName |
an mclust model name.
See |
data |
data matrix; density values are computed for every observation (row). |
parameters |
parameters of Gaussian mixture in the format used in
the output of |
Vector of density values for the observations.
Christian Hennig [email protected] https://www.unibo.it/sitoweb/christian.hennig/en/
set.seed(98765)
require(mclust)
iriss <- iris[sample(150,20),-5]
irisBIC <- mclustBIC(iriss)
siris <- summary(irisBIC,iriss)
round(mixdens(siris$modelName,iriss,siris$parameters),digits=2)
Computes the prediction strength of clustering by merging Gaussian mixture components, see mergenormals. The prediction strength is defined according to Tibshirani and Walther (2005) and carried out as described in Hennig (2010), see details.
mixpredictive(xdata, Gcomp, Gmix, M=50, ...)
xdata |
data (something that can be coerced into a matrix). |
Gcomp |
integer. Number of components of the underlying Gaussian mixture. |
Gmix |
integer. Number of clusters after merging Gaussian components. |
M |
integer. Number of times the dataset is divided into two halves. |
... |
further arguments that can potentially arrive in calls but are currently not used. |
The prediction strength for a certain number of clusters Gmix
under a
random partition of the dataset in halves A and B is defined as
follows. Both halves are clustered with Gmix
clusters. Then the points of
A are classified to the clusters of B. This is done by use of the
maximum a posteriori rule for mixtures as in Hennig (2010),
differently from Tibshirani and Walther (2005). A pair of points of A in
the same A-cluster is defined to be correctly predicted if both points
are classified into the same cluster on B. The same is done with the
points of B relative to the clustering on A. The prediction strength
for each of the clusterings is the minimum (taken over all clusters)
relative frequency of correctly predicted pairs of points of that
cluster. The final mean prediction strength statistic is the mean over
all 2M clusterings.
List with components
predcorr |
vector of length |
mean.pred |
mean of |
Christian Hennig [email protected] https://www.unibo.it/sitoweb/christian.hennig/en/
Hennig, C. (2010) Methods for merging Gaussian mixture components, Advances in Data Analysis and Classification, 4, 3-34.
Tibshirani, R. and Walther, G. (2005) Cluster Validation by Prediction Strength, Journal of Computational and Graphical Statistics, 14, 511-528.
prediction.strength
for Tibshirani and Walther's
original method.
mergenormals
for the clustering method applied here.
set.seed(98765)
iriss <- iris[sample(150,20),-5]
mp <- mixpredictive(iriss,2,2,M=2)
Discriminant projections as defined in Young, Marco and Odell (1987). The principle is to maximize the projection of a matrix consisting of the differences between the means of all classes and the first mean, and the differences between the covariance matrices of all classes and the first covariance matrix.
mvdcoord(xd, clvecd, clnum=1, sphere="mcd", ...)
xd |
the data matrix; a numerical object which can be coerced to a matrix. |
clvecd |
integer vector of class numbers; length must equal
|
clnum |
integer. Number of the class to which all differences are computed. |
sphere |
a covariance matrix or one of
"mve", "mcd", "classical", "none". The matrix used for sphering the
data. "mcd" and "mve" are robust covariance matrices as implemented
in |
... |
no effect |
List with the following components
ev |
eigenvalues in descending order. |
units |
columns are coordinates of projection basis vectors.
New points |
proj |
projections of |
Christian Hennig [email protected] https://www.unibo.it/sitoweb/christian.hennig/en/
Young, D. M., Marco, V. R. and Odell, P. L. (1987). Quadratic discrimination: some results on optimal low-dimensional representation, Journal of Statistical Planning and Inference, 17, 307-319.
plotcluster
for straight forward discriminant plots.
discrproj
for alternatives.
rFace
for generation of the example data used below.
set.seed(4634)
face <- rFace(300,dMoNo=2,dNoEy=0,p=3)
grface <- as.integer(attr(face,"grouping"))
mcf <- mvdcoord(face,grface)
plot(mcf$proj,col=grface)
# ...done in one step by function plotcluster.
Neighborhood based discriminant coordinates as defined in Hastie and Tibshirani (1996) and a robustified version as defined in Hennig (2003). The principle is to maximize the projection of a between classes covariance matrix, which is defined by averaging the between classes covariance matrices in the neighborhoods of all points.
ncoord(xd, clvecd, nn=50, weighted=FALSE, sphere="mcd", orderall=TRUE, countmode=1000, ...)
xd |
the data matrix; a numerical object which can be coerced to a matrix. |
clvecd |
integer vector of class numbers; length must equal
|
nn |
integer. Number of points which belong to the neighborhood of each point (including the point itself). |
weighted |
logical. |
sphere |
a covariance matrix or one of
"mve", "mcd", "classical", "none". The matrix used for sphering the
data. "mcd" and "mve" are robust covariance matrices as implemented
in |
orderall |
logical. By default, the neighborhoods are computed by
ordering all points each time. If |
countmode |
optional positive integer. Every |
... |
no effect |
List with the following components
ev |
eigenvalues in descending order. |
units |
columns are coordinates of projection basis vectors.
New points |
proj |
projections of |
Christian Hennig [email protected] https://www.unibo.it/sitoweb/christian.hennig/en/
Hastie, T. and Tibshirani, R. (1996). Discriminant adaptive nearest neighbor classification. IEEE Transactions on Pattern Analysis and Machine Intelligence 18, 607-616.
Hennig, C. (2004) Asymmetric linear dimension reduction for classification. Journal of Computational and Graphical Statistics 13, 930-945.
Hennig, C. (2005) A method for visual cluster validation. In: Weihs, C. and Gaul, W. (eds.): Classification - The Ubiquitous Challenge. Springer, Heidelberg 2005, 153-160.
plotcluster
for straight forward discriminant plots.
discrproj
for alternatives.
rFace
for generation of the example data used below.
set.seed(4634)
face <- rFace(600,dMoNo=2,dNoEy=0)
grface <- as.integer(attr(face,"grouping"))
ncf <- ncoord(face,grface)
plot(ncf$proj,col=grface)
ncf2 <- ncoord(face,grface,weighted=TRUE)
plot(ncf2$proj,col=grface)
# ...done in one step by function plotcluster.
Cluster validity index based on the neg-entropy distances of within-cluster distributions to normal distribution, see Lago-Fernandez and Corbacho (2010).
neginc(x,clustering)
x |
something that can be coerced into a numerical matrix. Euclidean dataset. |
clustering |
vector of integers with length |
Index value, see Lago-Fernandez and Corbacho (2010). The lower (i.e., the more negative) the better.
Christian Hennig [email protected] https://www.unibo.it/sitoweb/christian.hennig/en/
Lago-Fernandez, L. F. and Corbacho, F. (2010) Normality-based validation for crisp clustering. Pattern Recognition 43, 782-795.
options(digits=3)
iriss <- as.matrix(iris[c(1:10,51:55,101:105),-5])
irisc <- as.numeric(iris[c(1:10,51:55,101:105),5])
neginc(iriss,irisc)
Selection of the number of clusters via bootstrap as explained in Fang and Wang (2012). Several times 2 bootstrap samples are drawn from the data and the number of clusters is chosen by optimising an instability estimation from these pairs.
In principle all clustering methods can be used that have a
CBI-wrapper, see clusterboot
,
kmeansCBI
. However, the currently implemented
classification methods are not necessarily suitable for all of them,
see argument classification
.
nselectboot(data,B=50,distances=inherits(data,"dist"), clustermethod=NULL, classification="averagedist",centroidname = NULL, krange=2:10, count=FALSE,nnk=1, largeisgood=FALSE,...)
data |
something that can be coerced into a matrix. The data
matrix - either an |
B |
integer. Number of resampling runs. |
distances |
logical. If |
clustermethod |
an interface function (the function name, not a
string containing the name, has to be provided!). This defines the
clustering method. See the "Details"-section of |
classification |
string.
This determines how non-clustered points are classified to given
clusters. Options are explained in |
centroidname |
string. Indicates the name of the component of
|
krange |
integer vector; numbers of clusters to be tried. |
count |
logical. If |
nnk |
number of nearest neighbours if
|
largeisgood |
logical. If |
... |
arguments to be passed on to the clustering method. |
nselectboot
returns a list with components
kopt,stabk,stab
.
kopt |
optimal number of clusters. |
stabk |
mean instability values for numbers of clusters (or one
minus this if |
stab |
matrix of instability values for all bootstrap runs and numbers of clusters. |
Christian Hennig [email protected] https://www.unibo.it/sitoweb/christian.hennig/en/
Fang, Y. and Wang, J. (2012) Selection of the number of clusters via the bootstrap method. Computational Statistics and Data Analysis, 56, 468-477.
classifdist
, classifnp
,
clusterboot
,kmeansCBI
set.seed(20000)
face <- rFace(50,dMoNo=2,dNoEy=0,p=2)
nselectboot(dist(face),B=2,clustermethod=disthclustCBI,
            method="average",krange=5:7)
nselectboot(dist(face),B=2,clustermethod=claraCBI,
            classification="centroid",krange=5:7)
nselectboot(face,B=2,clustermethod=kmeansCBI,
            classification="centroid",krange=5:7)
# Of course use larger B in a real application.
This calls the function pam or clara to perform a partitioning around medoids clustering with the number of clusters estimated by optimum average silhouette width (see pam.object) or Calinski-Harabasz index (calinhara). The Duda-Hart test (dudahart2) is applied to decide whether there should be more than one cluster (unless 1 is excluded as number of clusters or data are dissimilarities).
pamk(data,krange=2:10,criterion="asw", usepam=TRUE, scaling=FALSE, alpha=0.001, diss=inherits(data, "dist"), critout=FALSE, ns=10, seed=NULL, ...)
data |
a data matrix or data frame or something that can be
coerced into a matrix, or dissimilarity matrix or
object. See |
krange |
integer vector. Numbers of clusters which are to be
compared by the average silhouette width criterion. Note: average
silhouette width and Calinski-Harabasz can't estimate number of
clusters |
criterion |
one of |
usepam |
logical. If |
scaling |
either a logical value or a numeric vector of length
equal to the number of variables. If |
alpha |
numeric between 0 and 1, tuning constant for
|
diss |
logical flag: if |
critout |
logical. If |
ns |
passed on to |
seed |
passed on to |
... |
A list with components
pamobject |
The output of the optimal run of the
|
nc |
the optimal number of clusters. |
crit |
vector of criterion values for numbers of
clusters. |
clara and pam can handle NA-entries (see their documentation) but dudahart2 cannot. Therefore NA should not occur if 1 is in krange.
Christian Hennig [email protected] https://www.unibo.it/sitoweb/christian.hennig/en/
Calinski, R. B., and Harabasz, J. (1974) A Dendrite Method for Cluster Analysis, Communications in Statistics, 3, 1-27.
Duda, R. O. and Hart, P. E. (1973) Pattern Classification and Scene Analysis. Wiley, New York.
Hennig, C. and Liao, T. (2013) How to find an appropriate clustering for mixed-type variables with application to socio-economic stratification, Journal of the Royal Statistical Society, Series C Applied Statistics, 62, 309-369.
Kaufman, L. and Rousseeuw, P.J. (1990). "Finding Groups in Data: An Introduction to Cluster Analysis". Wiley, New York.
options(digits=3)
set.seed(20000)
face <- rFace(50,dMoNo=2,dNoEy=0,p=2)
pk1 <- pamk(face,krange=1:5,criterion="asw",critout=TRUE)
pk2 <- pamk(face,krange=1:5,criterion="multiasw",ns=2,critout=TRUE)
# "multiasw" is better for larger data sets, use larger ns then.
pk3 <- pamk(face,krange=1:5,criterion="ch",critout=TRUE)
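For illustration, the core of the default "asw" criterion can be reproduced directly with the cluster package (a sketch under the assumption that pam's silinfo$avg.width is the average silhouette width used; pamk does more, e.g., the Duda-Hart test if 1 is in krange):

library(cluster)
asw <- sapply(2:5, function(k) pam(face, k)$silinfo$avg.width)
kbest <- (2:5)[which.max(asw)]  # number of clusters with best average silhouette width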
The Pi-function is given in (6) in Ray and Lindsay, 2005. Equating it to the mixture proportion yields locations of two-component Gaussian mixture density extrema.
piridge(alpha, mu1, mu2, Sigma1, Sigma2, showplot=FALSE)
alpha |
sequence of values between 0 and 1 for which the Pi-function is computed. |
mu1 |
mean vector of component 1. |
mu2 |
mean vector of component 2. |
Sigma1 |
covariance matrix of component 1. |
Sigma2 |
covariance matrix of component 2. |
showplot |
logical. If |
Vector of values of the Pi-function for the values of alpha.
Christian Hennig [email protected] https://www.unibo.it/sitoweb/christian.hennig/en/
Ray, S. and Lindsay, B. G. (2005) The Topography of Multivariate Normal Mixtures, Annals of Statistics, 33, 2042-2065.
q <- piridge(seq(0,1,0.1),c(1,1),c(2,5),diag(2),diag(2))
By use of the Pi-function in Ray and Lindsay, 2005, locations of two-component Gaussian mixture density extrema or saddlepoints are computed.
piridge.zeroes(prop, mu1, mu2, Sigma1, Sigma2, alphamin=0, alphamax=1,by=0.001)
prop |
proportion of mixture component 1. |
mu1 |
mean vector of component 1. |
mu2 |
mean vector of component 2. |
Sigma1 |
covariance matrix of component 1. |
Sigma2 |
covariance matrix of component 2. |
alphamin |
minimum alpha value. |
alphamax |
maximum alpha value. |
by |
interval between alpha-values where to look for extrema. |
list with components
number.zeroes |
number of zeroes of Pi-function, i.e., extrema or saddlepoints of density. |
estimated.roots |
estimated |
Christian Hennig [email protected] https://www.unibo.it/sitoweb/christian.hennig/en/
Ray, S. and Lindsay, B. G. (2005) The Topography of Multivariate Normal Mixtures, Annals of Statistics, 33, 2042-2065.
q <- piridge.zeroes(0.2,c(1,1),c(2,5),diag(2),diag(2),by=0.1)
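The idea can also be sketched directly with piridge (rough illustration, not the exact implementation): zeroes correspond to crossings of the Pi-function with the mixture proportion along a grid of alpha values.

alpha <- seq(0,1,0.1)
pv <- piridge(alpha,c(1,1),c(2,5),diag(2),diag(2))
crossings <- which(diff(sign(pv - 0.2)) != 0)
# approximate roots lie between alpha[crossings] and alpha[crossings+1]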
Visualisation and print function for cluster validation output compared to results on simulated random clusterings. The print method can also be used to compute and print an aggregated cluster validation index.
Unlike for many other plot methods, the additional arguments of plot.valstat are essential. print.valstat should make good sense with the defaults, but for computing the aggregate index the weights need to be set.
## S3 method for class 'valstat'
plot(x,simobject=NULL,statistic="sindex",
     xlim=NULL,ylim=c(0,1),
     nmethods=length(x)-5,
     col=1:nmethods,cex=1,pch=c("c","f","a","n"),
     simcol=rep(grey(0.7),4),
     shift=c(-0.1,-1/3,1/3,0.1),include.othernc=NULL,...)
## S3 method for class 'valstat'
print(x,statistics=x$statistics,
      nmethods=length(x)-5,aggregate=FALSE,
      weights=NULL,digits=2,
      include.othernc=NULL,...)
x |
object of class |
simobject |
list of simulation results as produced by
|
statistic |
one of |
xlim |
passed on to |
ylim |
passed on to |
nmethods |
integer. Number of clustering methods to involve
(these are those from number 1 to |
col |
colours used for the different clustering methods. |
cex |
passed on to |
pch |
vector of symbols for random clustering results from
|
simcol |
vector of colours used for random clustering results in
order |
shift |
numeric vector. Indicates the amount to which the results
from |
include.othernc |
this indicates whether methods should be
included that estimated their number of clusters themselves and gave
a result outside the standard range as given by |
statistics |
vector of character strings specifying the validation statistics that will be included in the output (unless you want to restrict the output for some reason, the default should be fine. |
aggregate |
logical. If |
weights |
vector of numericals. Weights for computation of the
aggregate statistic in case that |
digits |
minimal number of significant digits, passed on to
|
... |
no effect. |
Whereas print.valstat, at least with aggregate=TRUE, makes more sense for the qstat or sstat component of the clusterbenchstats output rather than the stat component, plot.valstat should be run with the stat component if simobject is specified, because the simulated cluster validity statistics are unstandardised and need to be compared with unstandardised values on the dataset of interest.
print.valstat will print all values of all validation indexes; the aggregated index (in case of aggregate=TRUE and set weights) will be printed last. print.valstat returns the results table as an invisible object.
returns the results table as invisible object.
Christian Hennig [email protected] https://www.unibo.it/sitoweb/christian.hennig/en/
Hennig, C. (2019) Cluster validation by measurement of clustering characteristics relevant to the user. In C. H. Skiadas (ed.) Data Analysis and Applications 1: Clustering and Regression, Modeling-estimating, Forecasting and Data Mining, Volume 2, Wiley, New York 1-24, https://arxiv.org/abs/1703.09282
Akhanli, S. and Hennig, C. (2020) Calibrating and aggregating cluster validity indexes for context-adapted comparison of clusterings. Statistics and Computing, 30, 1523-1544, https://link.springer.com/article/10.1007/s11222-020-09958-2, https://arxiv.org/abs/2002.01822
clusterbenchstats
, valstat.object
,
cluster.magazine
set.seed(20000)
options(digits=3)
face <- rFace(10,dMoNo=2,dNoEy=0,p=2)
clustermethod=c("kmeansCBI","hclustCBI","hclustCBI")
clustermethodpars <- list()
clustermethodpars[[2]] <- clustermethodpars[[3]] <- list()
clustermethodpars[[2]]$method <- "ward.D2"
clustermethodpars[[3]]$method <- "single"
methodname <- c("kmeans","ward","single")
cbs <- clusterbenchstats(face,G=2:3,clustermethod=clustermethod,
  methodname=methodname,distmethod=rep(FALSE,3),
  clustermethodpars=clustermethodpars,nnruns=2,kmruns=2,fnruns=2,avenruns=2)
plot(cbs$stat,cbs$sim)
plot(cbs$stat,cbs$sim,statistic="dindex")
plot(cbs$stat,cbs$sim,statistic="avewithin")
pcbs <- print(cbs$sstat,aggregate=TRUE,weights=c(1,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0))
# Some of the values are "NaN" because due to the low number of runs of
# the stupid clustering methods there is no variation. If this happens
# in a real application, nnruns etc. should be chosen higher than 2.
# Also useallg=TRUE in clusterbenchstats may help.
#
# Finding the best aggregated value:
mpcbs <- as.matrix(pcbs[[17]][,-1])
which(mpcbs==max(mpcbs),arr.ind=TRUE)
# row=1 refers to the first clustering method kmeansCBI,
# col=2 refers to the second number of clusters, which is 3 in g=2:3.
Plots to distinguish given classes by ten available projection methods. Includes classical discriminant coordinates, methods to project differences in mean and covariance structure, asymmetric methods (separation of a homogeneous class from a heterogeneous one), local neighborhood-based methods and methods based on robust covariance matrices. One-dimensional data is plotted against the cluster number.
plotcluster(x, clvecd, clnum=NULL, method=ifelse(is.null(clnum),"dc","awc"), bw=FALSE, ignorepoints=FALSE, ignorenum=0, pointsbyclvecd=TRUE, xlab=NULL, ylab=NULL, pch=NULL, col=NULL, ...)
x |
the data matrix; a numerical object which can be coerced to a matrix. |
clvecd |
vector of class numbers which can be coerced into
integers; length must equal
|
method |
one of
Note that "bc", "vbc", "adc", "awc", "arc" and "anc" assume that there are only two classes. |
clnum |
integer. Number of the class which is attempted to plot
homogeneously by "asymmetric methods", which are the methods
assuming that there are only two classes, as indicated above.
|
bw |
logical. If |
ignorepoints |
logical. If |
ignorenum |
one of the potential values of the components of
|
pointsbyclvecd |
logical. If |
xlab |
label for x-axis. If |
ylab |
label for y-axis. If |
pch |
plotting symbol, see |
col |
plotting color, see |
... |
additional parameters passed to |
For some of the asymmetric methods, the area in the plot
occupied by the "homogeneous class" (see clnum
above) may be
very small, and it may make sense to run plotcluster
a second
time specifying plot parameters xlim
and ylim
in a
suitable way. It often makes sense to magnify the plot region
containing the homogeneous class in this way
so that its separation from the rest can be
seen more clearly.
Christian Hennig [email protected] https://www.unibo.it/sitoweb/christian.hennig/en/
Hennig, C. (2004) Asymmetric linear dimension reduction for classification. Journal of Computational and Graphical Statistics 13, 930-945.
Hennig, C. (2005) A method for visual cluster validation. In: Weihs, C. and Gaul, W. (eds.): Classification - The Ubiquitous Challenge. Springer, Heidelberg 2005, 153-160.
Seber, G. A. F. (1984). Multivariate Observations. New York: Wiley.
Fukunaga (1990). Introduction to Statistical Pattern Recognition (2nd ed.). Boston: Academic Press.
discrcoord
, batcoord
,
mvdcoord
, adcoord
,
awcoord
, ncoord
,
ancoord
.
discrproj
is an interface to all these projection methods.
rFace
for generation of the example data used below.
set.seed(4634)
face <- rFace(300,dMoNo=2,dNoEy=0)
grface <- as.integer(attr(face,"grouping"))
plotcluster(face,grface)
plotcluster(face,grface==1)
plotcluster(face,grface, clnum=1, method="vbc")
Computes the prediction strength of a clustering of a dataset into different numbers of components. The prediction strength is defined according to Tibshirani and Walther (2005), who recommend choosing as the optimal number of clusters the largest number of clusters that leads to a prediction strength above 0.8 or 0.9. See details.
Various clustering methods can be used, see argument
clustermethod
. In Tibshirani and Walther (2005), only
classification to the nearest centroid is discussed, but more methods
are offered here, see argument classification
.
prediction.strength(xdata, Gmin=2, Gmax=10, M=50, clustermethod=kmeansCBI,
                    classification="centroid", centroidname = NULL,
                    cutoff=0.8,nnk=1,
                    distances=inherits(xdata,"dist"),count=FALSE,...)
## S3 method for class 'predstr'
print(x, ...)
xdata |
data (something that can be coerced into a matrix). |
Gmin |
integer. Minimum number of clusters. Note that the
prediction strength for 1 cluster is trivially 1, which is
automatically included if |
Gmax |
integer. Maximum number of clusters. |
M |
integer. Number of times the dataset is divided into two halves. |
clustermethod |
an interface function (the function name, not a
string containing the name, has to be provided!). This defines the
clustering method. See the "Details"-section of |
classification |
string.
This determines how non-clustered points are classified to given
clusters. Options are explained in |
centroidname |
string. Indicates the name of the component of
|
cutoff |
numeric between 0 and 1. The optimal number of clusters
is the maximum one with prediction strength above |
nnk |
number of nearest neighbours if
|
distances |
logical. If |
count |
logical. |
x |
object of class |
... |
arguments to be passed on to the clustering method. |
The prediction strength for a certain number of clusters k under a
random partition of the dataset in halves A and B is defined as
follows. Both halves are clustered with k clusters. Then the points of
A are classified to the clusters of B. In the original paper
this is done by assigning every
observation in A to the closest cluster centroid in B (corresponding
to classification="centroid"
), but other methods are possible,
see classifnp
. A pair of points of A in
the same A-cluster is defined to be correctly predicted if both points
are classified into the same cluster on B. The same is done with the
points of B relative to the clustering on A. The prediction strength
for each of the clusterings is the minimum (taken over all clusters)
relative frequency of correctly predicted pairs of points of that
cluster. The final mean prediction strength statistic is the mean over
all 2M clusterings.
prediction.strength
gives out an object of class
predstr
, which is a
list with components
predcorr |
list of vectors of length |
mean.pred |
means of |
optimalk |
optimal number of clusters. |
cutoff |
see above. |
method |
a string identifying the clustering method. |
Gmax |
see above. |
M |
see above. |
Christian Hennig [email protected] https://www.unibo.it/sitoweb/christian.hennig/en/
Tibshirani, R. and Walther, G. (2005) Cluster Validation by Prediction Strength, Journal of Computational and Graphical Statistics, 14, 511-528.
options(digits=3)
set.seed(98765)
iriss <- iris[sample(150,20),-5]
prediction.strength(iriss,2,3,M=3)
prediction.strength(iriss,2,3,M=3,clustermethod=claraCBI)
# The examples are fast, but of course M should really be larger.
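The pairwise agreement underlying the prediction strength can be illustrated by a small hypothetical helper (not part of fpc): given the clustering of the points of one half (cl.half) and their classification to the clusters of the other half (cl.pred), it returns the minimum over clusters of the relative frequency of correctly predicted pairs.

pairagree <- function(cl.half, cl.pred) {
  vals <- sapply(unique(cl.half), function(j) {
    idx <- which(cl.half == j)
    if (length(idx) < 2) return(1)
    prs <- combn(idx, 2)  # all pairs of points within cluster j
    mean(cl.pred[prs[1, ]] == cl.pred[prs[2, ]])
  })
  min(vals)
}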
For use within regmix. Generates a random 0-1-matrix with n rows and cln columns so that every row contains exactly one 1 and every column contains at least p+3 ones.
randcmatrix(n,cln,p)
n |
positive integer. Number of rows. |
cln |
positive integer. Number of columns. |
p |
positive integer. See above. |
An n*cln matrix.
Christian Hennig [email protected] https://www.unibo.it/sitoweb/christian.hennig/en/
set.seed(111)
randcmatrix(10,2,1)
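A rough sketch of the idea (hypothetical, not the actual code): draw a random assignment of rows to columns and retry until every column has at least p+3 ones.

randc_sketch <- function(n, cln, p) {
  repeat {
    g <- sample(cln, n, replace = TRUE)       # random column for every row
    if (all(tabulate(g, cln) >= p + 3)) break
  }
  outer(g, 1:cln, "==") + 0                   # n*cln 0-1 matrix
}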
Generates a logical vector of length n with p TRUEs.
randconf(n, p)
n |
positive integer. |
p |
positive integer. |
A logical vector.
Christian Hennig [email protected] https://www.unibo.it/sitoweb/christian.hennig/en/
randconf(10,3)
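The behaviour is easy to emulate (a sketch, not necessarily the exact package code):

randconf_sketch <- function(n, p) {
  v <- rep(FALSE, n)
  v[sample(n, p)] <- TRUE   # p randomly chosen positions set to TRUE
  v
}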
For a given dataset this simulates random clusterings using
stupidkcentroids
, stupidknn
,
stupidkfn
, and stupidkaven
. It then
computes and stores a set of cluster validity indexes for every
clustering.
randomclustersim(datadist,datanp=NULL,npstats=FALSE,useboot=FALSE, bootmethod="nselectboot", bootruns=25, G,nnruns=100,kmruns=100,fnruns=100,avenruns=100, nnk=4,dnnk=2, pamcrit=TRUE, multicore=FALSE,cores=detectCores()-1,monitor=TRUE)
datadist |
distances on which validation-measures are based, |
datanp |
optional observations times variables data matrix, see
|
npstats |
logical. If |
useboot |
logical. If |
bootmethod |
either |
bootruns |
integer. Number of resampling runs. If
|
G |
vector of integers. Numbers of clusters to consider. |
nnruns |
integer. Number of runs of |
kmruns |
integer. Number of runs of |
fnruns |
integer. Number of runs of |
avenruns |
integer. Number of runs of |
nnk |
|
dnnk |
|
pamcrit |
|
multicore |
logical. If |
cores |
integer. Number of cores for parallelisation. |
monitor |
logical. If |
List with components
nn |
list, indexed by number of clusters. Every entry is
a data frame with |
fn |
list, indexed by number of clusters. Every entry is
a data frame with |
aven |
list, indexed by number of clusters. Every entry is
a data frame with |
km |
list, indexed by number of clusters. Every entry is
a data frame with |
nnruns |
number of involved runs of |
fnruns |
number of involved runs of |
avenruns |
number of involved runs of |
kmruns |
number of involved runs of |
boot |
if |
Christian Hennig [email protected] https://www.unibo.it/sitoweb/christian.hennig/en/
Hennig, C. (2019) Cluster validation by measurement of clustering characteristics relevant to the user. In C. H. Skiadas (ed.) Data Analysis and Applications 1: Clustering and Regression, Modeling-estimating, Forecasting and Data Mining, Volume 2, Wiley, New York 1-24, https://arxiv.org/abs/1703.09282
Akhanli, S. and Hennig, C. (2020) Calibrating and aggregating cluster validity indexes for context-adapted comparison of clusterings. Statistics and Computing, 30, 1523-1544, https://link.springer.com/article/10.1007/s11222-020-09958-2, https://arxiv.org/abs/2002.01822
stupidkcentroids
, stupidknn
, stupidkfn
, stupidkaven
, clustatsum
set.seed(20000)
options(digits=3)
face <- rFace(10,dMoNo=2,dNoEy=0,p=2)
rmx <- randomclustersim(dist(face),datanp=face,npstats=TRUE,G=2:3,
  nnruns=2,kmruns=2, fnruns=1,avenruns=1,nnk=2)
## Not run: 
rmx$km # Produces slightly different but basically identical results on ATLAS
## End(Not run)
rmx$aven
rmx$fn
rmx$nn
Computes an ML-estimator for clusterwise linear regression under a
regression mixture model with Normal errors. Parameters are
proportions, regression coefficients and error variances, all
independent of the values of the independent variable, and all may
differ for different clusters. Computation is by the EM-algorithm. The
number of clusters is estimated via the Bayesian Information Criterion
(BIC). Note that package flexmix
has more sophisticated tools to do the
same thing and is recommended. The functions are kept in here only for
compatibility reasons.
regmix(indep, dep, ir=1, nclust=1:7, icrit=1.e-5, minsig=1.e-6, warnings=FALSE)
regem(indep, dep, m, cln, icrit=1.e-5, minsig=1.e-6, warnings=FALSE)
indep |
numerical matrix or vector. Independent variables. |
dep |
numerical vector. Dependent variable. |
ir |
positive integer. Number of iteration runs for every number of clusters. |
nclust |
vector of positive integers. Numbers of clusters. |
icrit |
positive numerical. Stopping criterion for the iterations (difference of loglikelihoods). |
minsig |
positive numerical. Minimum value for the variance parameters (likelihood is unbounded if variances are allowed to converge to 0). |
warnings |
logical. If |
cln |
positive integer. (Single) number of clusters. |
m |
matrix of positive numericals. Number of columns must be
|
The result of the EM iteration depends on the initial configuration,
which is generated randomly by randcmatrix
for regmix
.
regmix
calls regem
. To provide the initial configuration
manually, use parameter m
of
regem
directly. Take a look at the example about how to
generate m
if you want to specify initial parameters.
The original paper DeSarbo and Cron (1988) suggests the AIC for
estimating the number of clusters. The use of the BIC is advocated by
Wedel and DeSarbo (1995). The BIC is defined here as
2*loglik - log(n)*((p+3)*cln-1)
, p
being the number of
independent variables, i.e., the larger the better.
See the entry for the input parameter warnings
for the
treatment of several numerical problems.
regmix
returns a list
containing the components clnopt, loglik, bic,
coef, var, eps, z, g
.
regem
returns a list
containing the components loglik,
coef, var, z, g, warn
.
clnopt |
optimal number of clusters according to the BIC. |
loglik |
loglikelihood for the optimal model. |
bic |
vector of BIC values for all numbers of clusters in |
coef |
matrix of regression coefficients. First row: intercept parameter. Second row: parameter of first independent variable and so on. Columns corresponding to clusters. |
var |
vector of error variance estimators for the clusters. |
eps |
vector of cluster proportion estimators. |
z |
matrix of estimated a posteriori probabilities of the points
(rows) to be generated by the clusters (columns). Compare input
argument |
g |
integer vector of estimated cluster numbers for the points
(via argmax over |
warn |
logical. |
Christian Hennig [email protected] https://www.unibo.it/sitoweb/christian.hennig/en/
DeSarbo, W. S. and Cron, W. L. (1988) A maximum likelihood methodology for clusterwise linear regression, Journal of Classification 5, 249-282.
Wedel, M. and DeSarbo, W. S. (1995) A mixture likelihood approach for generalized linear models, Journal of Classification 12, 21-56.
Regression mixtures can also (and probably better) be computed with the flexmix package, see flexmix. (When I first wrote the regmix function, flexmix didn't exist.)
fixreg
for fixed point clusters for clusterwise linear regression.
EMclust
for Normal mixture model fitting (non-regression).
## Not run: 
# This apparently gives slightly different
# but data-analytically fine results
# on some versions of R.
set.seed(12234)
data(tonedata)
attach(tonedata)
rmt1 <- regmix(stretchratio,tuned,nclust=1:2)
# nclust=1:2 makes the example fast;
# a more serious application would rather use the default.
rmt1$g
round(rmt1$bic,digits=2)
# start with initial parameter values
cln <- 3
n <- 150
initcoef <- cbind(c(2,0),c(0,1),c(0,2.5))
initvar <- c(0.001,0.0001,0.5)
initeps <- c(0.4,0.3,0.3)
# computation of m from initial parameters
m <- matrix(nrow=n, ncol=cln)
stm <- numeric(0)
for (i in 1:cln)
  for (j in 1:n){
    m[j,i] <- initeps[i]*dnorm(tuned[j],mean=initcoef[1,i]+
                initcoef[2,i]*stretchratio[j], sd=sqrt(initvar[i]))
  }
for (j in 1:n){
  stm[j] <- sum(m[j,])
  for (i in 1:cln)
    m[j,i] <- m[j,i]/stm[j]
}
rmt2 <- regem(stretchratio, tuned, m, cln)
## End(Not run)
Generates "face-shaped" clustered benchmark datasets. This is based on a collaboration with Martin Maechler.
rFace(n, p = 6, nrep.top = 2, smile.coef = 0.6, dMoNo = 1.2, dNoEy = 1)
n |
integer greater or equal to 10. Number of points. |
p |
integer greater or equal to 2. Dimension. |
nrep.top |
integer. Number of repetitions of the hair-top point. |
smile.coef |
numeric. Coefficient for quadratic term used for generation of mouth-points. Positive values=>smile. |
dMoNo |
number. Distance from mouth to nose. |
dNoEy |
number. Minimum vertical distance from mouth to eyes. |
The function generates a nice benchmark example for cluster
analysis.
There are six "clusters" in this data, of which the first five are
clearly homogeneous patterns, but with different distributional
shapes and different qualities of separation. The clusters are
distinguished only in the first two dimensions. The attribute
grouping
is a factor giving the cluster numbers, see below.
The sixth group of
points corresponds to some hairs, and is rather a collection of
outliers than a cluster in itself. This group contains
nrep.top+2
points. Of the remaining points, 20% belong to
cluster 1, the chin (quadratic function plus noise).
10% belong to cluster 2, the right eye (Gaussian). 30% belong to
cluster 3, the mouth (Gaussian/squared Gaussian).
20% belong to cluster 4, the nose (Gaussian/gamma), and
20% belong to cluster 5, the left eye (uniform).
The distributions of the further variables are homogeneous over all points. The third dimension is exponentially distributed, the fourth dimension is Cauchy distributed, all further distributions are Gaussian.
Please consider the source code for exact generation of the clusters.
An n
times p
numeric matrix with attributes
grouping |
a factor giving the cluster memberships of the points. |
indexlist |
a list of six vectors containing the indices of points belonging to the six groups. |
Christian Hennig [email protected] https://www.unibo.it/sitoweb/christian.hennig/en/
set.seed(4634)
face <- rFace(600,dMoNo=2,dNoEy=0)
grface <- as.integer(attr(face,"grouping"))
plot(face, col = grface)
# pairs(face, col = grface, main ="rFace(600,dMoNo=2,dNoEy=0)")
Computes the ridgeline point
x*(alpha) = ((1-alpha)*Sigma1^(-1) + alpha*Sigma2^(-1))^(-1) ((1-alpha)*Sigma1^(-1) mu1 + alpha*Sigma2^(-1) mu2),
as required for the computation of the ridgeline (Ray and Lindsay, 2005) to find all density extrema of a two-component Gaussian mixture with mean vectors mu1 and mu2 and covariance matrices Sigma1, Sigma2.
ridgeline(alpha, mu1, mu2, Sigma1, Sigma2)
alpha |
numeric between 0 and 1. |
mu1 |
mean vector of component 1. |
mu2 |
mean vector of component 2. |
Sigma1 |
covariance matrix of component 1. |
Sigma2 |
covariance matrix of component 2. |
A vector. See above.
Christian Hennig [email protected] https://www.unibo.it/sitoweb/christian.hennig/en/
Ray, S. and Lindsay, B. G. (2005) The Topography of Multivariate Normal Mixtures, Annals of Statistics, 33, 2042-2065.
ridgeline(0.5,c(1,1),c(2,5),diag(2),diag(2))
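Under the formula given above, a minimal sketch could look as follows (illustration only, not necessarily fpc's exact parameterisation of alpha):

ridge_sketch <- function(alpha, mu1, mu2, Sigma1, Sigma2) {
  S1i <- solve(Sigma1); S2i <- solve(Sigma2)
  drop(solve((1-alpha)*S1i + alpha*S2i) %*%
       ((1-alpha)*S1i %*% mu1 + alpha*S2i %*% mu2))
}
ridge_sketch(0.5,c(1,1),c(2,5),diag(2),diag(2))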
Computes ridgeline ratios and unimodality checks for pairs of components given the parameters of a Gaussian mixture. Produces ridgeline plots.
ridgeline.diagnosis (propvector,muarray,Sigmaarray, k=length(propvector), ipairs="all", compute.ratio=TRUE,by=0.001, ratiocutoff=NULL,ridgelineplot="matrix")
propvector |
vector of component proportions. Length must be number of components, and must sum up to 1. |
muarray |
matrix of component means (different components are in different columns). |
Sigmaarray |
three dimensional array with component covariance matrices (the third dimension refers to components). |
k |
integer. Number of components. |
ipairs |
|
compute.ratio |
logical. If |
by |
real between 0 and 1. Interval width for density computation along the ridgeline. |
ratiocutoff |
real between 0 and 1. If not |
ridgelineplot |
one of |
A list with components
merged.clusters |
vector of integers, stating for every mixture
component the number of the cluster of components that would be merged
by merging connectivity components of the graph specified by
|
connection.matrix |
zero-one matrix, in which a one means that the
mixture of the corresponding pair of components of the original
mixture is either unimodal (if |
ratio.matrix |
matrix with entries between 0 und 1, giving the ridgeline ratio, which is the density minimum of the mixture of the corresponding pair of components along the ridgeline divided by the minimum of the two maxima closest to the beginning and the end of the ridgeline. |
Christian Hennig [email protected] https://www.unibo.it/sitoweb/christian.hennig/en/
Hennig, C. (2010a) Methods for merging Gaussian mixture components, Advances in Data Analysis and Classification, 4, 3-34.
Hennig, C. (2010b) Ridgeline plot and clusterwise stability as tools for merging Gaussian mixture components. To appear in Classification as a Tool for Research, Proceedings of IFCS 2009.
Ray, S. and Lindsay, B. G. (2005) The Topography of Multivariate Normal Mixtures, Annals of Statistics, 33, 2042-2065.
ridgeline
, dridgeline
,
piridge
, piridge.zeroes
muarray <- cbind(c(0,0),c(0,0.1),c(10,10))
sigmaarray <- array(c(diag(2),diag(2),diag(2)),dim=c(2,2,3))
rd <- ridgeline.diagnosis(c(0.5,0.3,0.2),muarray,sigmaarray,ridgelineplot="matrix",by=0.1)
# Much slower but more precise with default by=0.001.
Extracts the information about the size of the intersections
between representative
Fixed Point Clusters (FPCs) of stable groups from the output of
the FPC-functions fixreg
and fixmahal
.
simmatrix(fpcobj)
fpcobj |
an object of class |
A non-negative real-valued vector giving the number of points in the intersections of the representative FPCs of stable groups.
The intersection between representative FPCs no. i
and
j
is at position sseg(i,j)
.
Christian Hennig [email protected] https://www.unibo.it/sitoweb/christian.hennig/en/
set.seed(190000)
data(tonedata)
# Note: If you do not use the installed package, replace this by
# tonedata <- read.table("(path/)tonedata.txt", header=TRUE)
attach(tonedata)
tonefix <- fixreg(stretchratio,tuned,mtf=1,ir=20)
simmatrix(tonefix)[sseg(2,3)]
Tries to invert a matrix by solve
. If this fails because of
singularity, an
eigenvector decomposition is computed, and eigenvalues below
1/cmax
are replaced by 1/cmax
, i.e., cmax
will be
the corresponding eigenvalue of the inverted matrix.
solvecov(m, cmax = 1e+10)
m |
a numeric symmetric matrix. |
cmax |
a positive value, see above. |
A list with the following components:
inv |
the inverted matrix |
coll |
|
Christian Hennig [email protected] https://www.unibo.it/sitoweb/christian.hennig/en/
x <- c(1,0,0,1,0,1,0,0,1)
dim(x) <- c(3,3)
solvecov(x)
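The described behaviour can be sketched as follows (simplified illustration, not the exact implementation):

solvecov_sketch <- function(m, cmax = 1e+10) {
  out <- try(solve(m), silent = TRUE)
  if (!inherits(out, "try-error")) return(list(inv = out, coll = FALSE))
  e <- eigen(m, symmetric = TRUE)
  ev <- pmax(e$values, 1/cmax)   # eigenvalues below 1/cmax are lifted to 1/cmax
  list(inv = e$vectors %*% diag(1/ev, length(ev)) %*% t(e$vectors), coll = TRUE)
}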
sseg(i,j)
gives the position of the similarity of objects
i
and j
in the similarity vectors produced by
fixreg
and fixmahal
.
sseg
should only be used as an auxiliary function in
fixreg
and fixmahal
.
sseg(i, j)
i |
positive integer. |
j |
positive integer. |
A positive integer.
Christian Hennig [email protected] https://www.unibo.it/sitoweb/christian.hennig/en/
sseg(3,4)
Picks k random starting points from the given dataset to initialise k clusters. Then, one by one, the point not yet assigned to any cluster that has the smallest average dissimilarity to the points of an already existing cluster is assigned to that cluster, until all points are assigned. This is a random version of average linkage clustering, see Akhanli and Hennig (2020).
stupidkaven(d,k)
d |
|
k |
integer. Number of clusters. |
The clustering vector (values 1 to k, length equal to the number of objects behind d).
Christian Hennig [email protected] https://www.unibo.it/sitoweb/christian.hennig/en/
Akhanli, S. and Hennig, C. (2020) Calibrating and aggregating cluster validity indexes for context-adapted comparison of clusterings. Statistics and Computing, 30, 1523-1544, https://link.springer.com/article/10.1007/s11222-020-09958-2, https://arxiv.org/abs/2002.01822
stupidkcentroids, stupidknn, stupidkfn
set.seed(20000)
options(digits=3)
face <- rFace(200,dMoNo=2,dNoEy=0,p=2)
stupidkaven(dist(face),3)
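As a rough illustration of the greedy scheme described above (not the package code; greedy_stupid is a hypothetical name, and d is assumed to be a dissimilarity object or matrix):

greedy_stupid <- function(d, k, crit = mean) {
  dm <- as.matrix(d)
  n <- nrow(dm)
  cl <- rep(NA_integer_, n)
  cl[sample(n, k)] <- 1:k                    # k random starting points
  while (any(is.na(cl))) {
    best <- c(Inf, NA, NA)                   # (criterion value, point, cluster)
    for (p in which(is.na(cl))) for (g in 1:k) {
      val <- crit(dm[p, which(cl == g)])     # crit = mean: average dissimilarity
      if (val < best[1]) best <- c(val, p, g)
    }
    cl[best[2]] <- as.integer(best[3])       # assign the best point/cluster pair
  }
  cl
}

With crit = mean this mimics the average linkage flavour described here; the same skeleton is reused for the related functions below.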
Picks k random centroids from the given dataset and assigns every point to the closest centroid. This is called stupid k-centroids in Hennig (2019).
stupidkcentroids(xdata, k, distances = inherits(xdata, "dist"))
xdata |
cases*variables data, |
k |
integer. Number of clusters. |
distances |
logical. If |
A list with components
partition |
vector of integers 1 to |
centroids |
vector of integers of length |
distances |
as argument |
Christian Hennig [email protected] https://www.unibo.it/sitoweb/christian.hennig/en/
Hennig, C. (2019) Cluster validation by measurement of clustering characteristics relevant to the user. In C. H. Skiadas (ed.) Data Analysis and Applications 1: Clustering and Regression, Modeling-estimating, Forecasting and Data Mining, Volume 2, Wiley, New York 1-24, https://arxiv.org/abs/1703.09282
Akhanli, S. and Hennig, C. (2020) Calibrating and aggregating cluster validity indexes for context-adapted comparison of clusterings. Statistics and Computing, 30, 1523-1544, https://link.springer.com/article/10.1007/s11222-020-09958-2, https://arxiv.org/abs/2002.01822
stupidknn, stupidkfn, stupidkaven
set.seed(20000)
options(digits=3)
face <- rFace(200,dMoNo=2,dNoEy=0,p=2)
stupidkcentroids(dist(face),3)
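For coordinate data, the idea described above amounts to the following minimal sketch (stupid_kcentroids_sketch is an illustrative name; the package function additionally handles dissimilarity input):

stupid_kcentroids_sketch <- function(x, k) {
  x <- as.matrix(x)
  cent <- sample(nrow(x), k)                        # indices of k random centroids
  dmat <- as.matrix(dist(x))[, cent, drop = FALSE]  # distances to the centroids
  list(partition = apply(dmat, 1, which.min),       # closest centroid per point
       centroids = cent)
}
set.seed(123)
x <- matrix(rnorm(60), ncol = 2)
stupid_kcentroids_sketch(x, 3)$partition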
Picks k random starting points from the given dataset to initialise k clusters. Then, one by one, a point not yet assigned to any cluster is assigned to one of the existing clusters, until all points are assigned. At each step the point/cluster pair is chosen that has the smallest distance of the point to the farthest point from it in the cluster, as in complete linkage clustering, see Akhanli and Hennig (2020).
stupidkfn(d,k)
d |
|
k |
integer. Number of clusters. |
The clustering vector (values 1 to k, length equal to the number of objects behind d).
Christian Hennig [email protected] https://www.unibo.it/sitoweb/christian.hennig/en/
Akhanli, S. and Hennig, C. (2020) Calibrating and aggregating cluster validity indexes for context-adapted comparison of clusterings. Statistics and Computing, 30, 1523-1544, https://link.springer.com/article/10.1007/s11222-020-09958-2, https://arxiv.org/abs/2002.01822
stupidkcentroids, stupidknn, stupidkaven
set.seed(20000)
options(digits=3)
face <- rFace(200,dMoNo=2,dNoEy=0,p=2)
stupidkfn(dist(face),3)
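In terms of the greedy_stupid sketch given after stupidkaven above, the scheme described here differs only in the criterion: the farthest-point (complete linkage) dissimilarity instead of the average.

# face as generated in the example above, greedy_stupid from the sketch
# after stupidkaven (both illustrative, not package code):
cl <- greedy_stupid(dist(face), 3, crit = max)   # complete-linkage flavour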
Picks k random starting points from the given dataset to initialise k clusters. Then, one by one, the point not yet assigned to any cluster that is closest to an already assigned point is added to that point's cluster, until all points are assigned. This is called stupid nearest neighbour clustering in Hennig (2019).
stupidknn(d,k)
d |
|
k |
integer. Number of clusters. |
The clustering vector (values 1 to k, length equal to the number of objects behind d).
Christian Hennig [email protected] https://www.unibo.it/sitoweb/christian.hennig/en/
Hennig, C. (2019) Cluster validation by measurement of clustering characteristics relevant to the user. In C. H. Skiadas (ed.) Data Analysis and Applications 1: Clustering and Regression, Modeling-estimating, Forecasting and Data Mining, Volume 2, Wiley, New York 1-24, https://arxiv.org/abs/1703.09282
Akhanli, S. and Hennig, C. (2020) Calibrating and aggregating cluster validity indexes for context-adapted comparison of clusterings. Statistics and Computing, 30, 1523-1544, https://link.springer.com/article/10.1007/s11222-020-09958-2, https://arxiv.org/abs/2002.01822
stupidkcentroids, stupidkfn, stupidkaven
set.seed(20000)
options(digits=3)
face <- rFace(200,dMoNo=2,dNoEy=0,p=2)
stupidknn(dist(face),3)
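Likewise, the nearest neighbour scheme described here corresponds to the minimum dissimilarity in the same greedy_stupid sketch:

# face and greedy_stupid as in the sketches above (illustrative only)
cl <- greedy_stupid(dist(face), 3, crit = min)   # closest assigned point decides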
Computes the transposed eigenvectors of matrix m multiplied by the diagonal matrix of the square roots of its eigenvalues, with eigenvalues smaller than 1e-6 set to 1e-6.
tdecomp(m)
m |
a symmetric matrix of minimum format 2*2. |
a matrix.
Intended for use within discrcoord only.
Christian Hennig [email protected] https://www.unibo.it/sitoweb/christian.hennig/en/
x <- rnorm(10)
y <- rnorm(10)
z <- cov(cbind(x,y))
round(tdecomp(z),digits=2)
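The wording above admits more than one ordering of the factors; under one plausible reading (a sketch only, with the hypothetical name tdecomp_sketch), the result acts as a matrix square root of m with eigenvalues floored at 1e-6:

tdecomp_sketch <- function(m) {
  e <- eigen(m, symmetric = TRUE)
  diag(sqrt(pmax(e$values, 1e-6))) %*% t(e$vectors)  # sqrt eigenvalues times t(eigenvectors)
}
z <- cov(cbind(rnorm(10), rnorm(10)))
r <- tdecomp_sketch(z)
max(abs(t(r) %*% r - z))   # close to zero: r behaves like a square root of z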
The tone perception data stem
from an experiment of Cohen (1980) and have been analyzed in de Veaux
(1989).
A pure fundamental tone was played to a
trained musician. Electronically generated overtones were added, determined
by a stretching ratio of stretchratio
. stretchratio=2.0
corresponds to the harmonic pattern
usually heard in traditional definite pitched instruments. The musician was
asked to tune an adjustable tone to the octave above the fundamental tone.
tuned
gives the ratio of the adjusted tone to the fundamental,
i.e. tuned=2.0
would be the correct tuning for all
stretchratio
-values.
The data analyzed here belong to 150 trials
with the same musician. In the original study, there were four further
musicians.
data(tonedata)
A data frame with 2 variables, stretchratio and tuned, and 150 cases.
Cohen, E. A. (1980) Inharmonic tone perception. Unpublished Ph.D. dissertation, Stanford University
de Veaux, R. D. (1989) Mixtures of Linear Regressions, Computational Statistics and Data Analysis 8, 227-245.
Checks whether a series of fitted density values (such as the y component returned by density) is unimodal.
unimodal.ind(y)
y |
numeric vector of fitted density values in order of
increasing x-values such as given out as
|
Logical. TRUE if unimodal.
Christian Hennig [email protected] https://www.unibo.it/sitoweb/christian.hennig/en/
unimodal.ind(c(1,3,3,4,2,1,0,0))
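The check amounts to requiring that the fitted density values never decrease before their maximum and never increase after it. A minimal sketch (unimodal_sketch is a hypothetical name, not the package function):

unimodal_sketch <- function(y) {
  m <- which.max(y)
  all(diff(y[seq_len(m)]) >= 0) && all(diff(y[m:length(y)]) <= 0)
}
unimodal_sketch(c(1, 3, 3, 4, 2, 1, 0, 0))   # TRUE
unimodal_sketch(c(1, 3, 1, 4, 2, 1))         # FALSE: two separate peaks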
The objects of class "valstat"
store cluster validation
statistics from various clustering methods run with various numbers of
clusters.
A legitimate valstat object is a list. The format of the list depends on the number of involved clustering methods, say nmethods, i.e., the length of the method component explained below. The first nmethods elements of the valstat list are just numbered. These are themselves lists, numbered between 1 and the maxG component defined below. Element [[i]][[j]] refers to the clustering from clustering method number i with number of clusters j. Every such element is a list with components avewithin, mnnd, cvnnd, maxdiameter, widestgap, sindex, minsep, asw, dindex, denscut, highdgap, pearsongamma, withinss, entropy.
Further optional components are pamc, kdnorm, kdunif, dmode, aggregated. All these are cluster validation indexes, as follows.
avewithin |
average distance within clusters (reweighted so that every observation, rather than every distance, has the same weight). |
mnnd |
average distance to |
cvnnd |
coefficient of variation of dissimilarities to
|
maxdiameter |
maximum cluster diameter. |
widestgap |
widest within-cluster gap or average of cluster-wise
widest within-cluster gap, depending on parameter |
sindex |
separation index. Defined based on the distances for
every point to the
closest point not in the same cluster. The separation index is then
the mean of the smallest proportion |
minsep |
minimum cluster separation. |
asw |
average silhouette
width. See |
dindex |
this index measures to what extent the density decreases from the cluster mode to the outskirts; I-densdec in Sec. 3.6 of Hennig (2019); low values are good. |
denscut |
this index measures whether cluster boundaries run through density valleys; I-densbound in Sec. 3.6 of Hennig (2019); low values are good. |
highdgap |
this measures whether there is a large within-cluster gap with high density on both sides; I-highdgap in Sec. 3.6 of Hennig (2019); low values are good. |
pearsongamma |
correlation between distances and a 0-1-vector where 0 means same cluster, 1 means different clusters. "Normalized gamma" in Halkidi et al. (2001). |
withinss |
a generalisation of the within clusters sum
of squares (k-means objective function), which is obtained if
|
entropy |
entropy of the distribution of cluster memberships, see Meila(2007). |
pamc |
average distance to cluster centroid, which is the observation that minimises this average distance. |
kdnorm |
Kolmogorov distance between distribution of within-cluster Mahalanobis distances and appropriate chi-squared distribution, aggregated over clusters (I am grateful to Agustin Mayo-Iscar for the idea). |
kdunif |
Kolmogorov distance between distribution of distances to
|
dmode |
aggregated density mode index equal to
|
Furthermore, a valstat
object
has the following list components:
maxG |
maximum number of clusters. |
minG |
minimum number of clusters (list entries below that number are empty lists). |
method |
vector of names (character strings) of clustering
CBI-functions, see |
name |
vector of names (character strings) of clustering
methods. These can be user-chosen names (see argument
|
statistics |
vector of names (character strings) of cluster validation indexes. |
These objects are generated as part of the clusterbenchstats output. The valstat class has methods for the generic functions print and plot, see plot.valstat.
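To illustrate the structure described above, here is a toy list with the same shape (not a real valstat object, which is produced by clusterbenchstats and carries more components):

vs <- list(
  list(list(), list(asw = 0.4, sindex = 0.2), list(asw = 0.5, sindex = 0.3)),  # method 1
  list(list(), list(asw = 0.3, sindex = 0.1), list(asw = 0.6, sindex = 0.4)),  # method 2
  minG = 2, maxG = 3,
  method = c("kmeansCBI", "claraCBI"),
  name = c("kmeans", "clara"),
  statistics = c("asw", "sindex")
)
vs[[2]][[3]]$asw   # index value for clustering method 2 with 3 clusters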
Christian Hennig [email protected] https://www.unibo.it/sitoweb/christian.hennig/en/
Hennig, C. (2019) Cluster validation by measurement of clustering characteristics relevant to the user. In C. H. Skiadas (ed.) Data Analysis and Applications 1: Clustering and Regression, Modeling-estimating, Forecasting and Data Mining, Volume 2, Wiley, New York 1-24, https://arxiv.org/abs/1703.09282
Akhanli, S. and Hennig, C. (2020) Calibrating and aggregating cluster validity indexes for context-adapted comparison of clusterings. Statistics and Computing, 30, 1523-1544, https://link.springer.com/article/10.1007/s11222-020-09958-2, https://arxiv.org/abs/2002.01822
clusterbenchstats, plot.valstat.
Ordered posterior plots for Gaussian mixture components, see Hennig (2010).
weightplots(z, clusternumbers="all", clustercol=2,
            allcol=grey(0.2+((1:ncol(z))-1)*0.6/(ncol(z)-1)),
            lty=rep(1,ncol(z)), clusterlwd=3,
            legendposition="none", weightcutoff=0.01, ask=TRUE, ...)
z |
matrix with rows corresponding to observations and columns
corresponding to mixture components. Entries are probabilities that
an observation has been generated by a mixture component. These will
normally be estimated a posteriori probabilities, as generated as
component |
clusternumbers |
|
clustercol |
colour used for the main components for which a plot is drawn. |
allcol |
colours used for respective other components in plots in which they are not main components. |
lty |
line types for components. |
clusterlwd |
numeric. Line width for main component. |
legendposition |
|
weightcutoff |
numeric between 0 and 1. Observations are only taken into account for which the posterior probability for the main component is larger than this. |
ask |
logical. If |
... |
further parameters to be passed on to |
Shows posterior probabilities for observations belonging to all mixture components on the y-axis, with points ordered by posterior probability for main component.
Invisible matrix of posterior probabilities z from mclustsummary.
Christian Hennig [email protected] https://www.unibo.it/sitoweb/christian.hennig/en/
Hennig, C. (2010) Methods for merging Gaussian mixture components, Advances in Data Analysis and Classification, 4, 3-34.
require(mclust)
require(MASS)
data(crabs)
dc <- crabs[,4:8]
cm <- mclustBIC(crabs[,4:8],G=9,modelNames="EEE")
scm <- summary(cm,crabs[,4:8])
weightplots(scm$z,clusternumbers=1:3,ask=FALSE)
weightplots(scm$z,clusternumbers=1:3,allcol=1:9, ask=FALSE,
            legendposition=c(5,0.7))
# Remove ask=FALSE to have time to watch the plots.
Function of the elements of md which is 1 for arguments smaller than ca, 0 for arguments larger than ca2, and linear (default: continuous) in between. Intended for use in fixmahal.
wfu(md, ca, ca2, a1 = 1/(ca - ca2), a0 = -a1 * ca2)
md |
vector of positive numericals. |
ca |
positive numerical. |
ca2 |
positive numerical. |
a1 |
numerical. Slope. |
a0 |
numerical. Intercept. |
A vector of numericals between 0 and 1.
Christian Hennig [email protected] https://www.unibo.it/sitoweb/christian.hennig/en/
md <- seq(0,10,by=0.1)
round(wfu(md,ca=5,ca2=8),digits=2)
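With the default slope and intercept, the function described above is simply a linear function clipped to the interval [0,1]. A minimal sketch (wfu_sketch is a hypothetical name):

wfu_sketch <- function(md, ca, ca2) {
  a1 <- 1/(ca - ca2)                     # default slope
  a0 <- -a1 * ca2                        # default intercept
  pmin(1, pmax(0, a1 * md + a0))         # 1 below ca, 0 above ca2, linear between
}
round(wfu_sketch(seq(0, 10, by = 0.1), ca = 5, ca2 = 8), digits = 2)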
This produces a crosstable between two integer vectors (partitions) of the same length with a given maximum vector entry k, so that the size of the table is k*k with zeroes for missing entries between 1 and k (the command table does pretty much the same thing but will leave out missing entries).
xtable(c1,c2,k)
c1 |
vector of integers. |
c2 |
vector of integers of same length as |
k |
integer. Must be larger or equal to maximum entry in
|
A matrix of dimensions c(k,k)
. Entry [i,j]
gives the
number of places in which c1==i & c2==j
.
Christian Hennig [email protected] https://www.unibo.it/sitoweb/christian.hennig/en/
c1 <- 1:3
c2 <- c(1,1,2)
xtable(c1,c2,3)
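For comparison, base R gives the same zero-padded crosstable if the factor levels are fixed at 1:k:

c1 <- 1:3
c2 <- c(1, 1, 2)
table(factor(c1, levels = 1:3), factor(c2, levels = 1:3))   # keeps empty levels as zero counts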
Matrix of misclassification probabilities in a mixture distribution between two mixture components from estimated posterior probabilities regardless of component parameters, see Hennig (2010).
zmisclassification.matrix(z, pro=NULL, clustering=NULL,
                          ipairs="all", symmetric=TRUE, stat="max")
z |
matrix of posterior probabilities for observations (rows) to belong to mixture components (columns), so entries need to sum up to 1 for each row. |
pro |
vector of component proportions, need to sum up to
1. Computed from |
clustering |
vector of integers giving the estimated mixture
components for every observation. Computed from |
ipairs |
|
symmetric |
logical. If |
stat |
|
A matrix with the (symmetrised, if required) misclassification
probabilities between each pair of mixture components. If
symmetric=FALSE
, matrix entry [i,j]
is the estimated
probability that an observation generated by component
j
is classified to component i
by maximum a posteriori rule.
Christian Hennig [email protected] https://www.unibo.it/sitoweb/christian.hennig/en/
Hennig, C. (2010) Methods for merging Gaussian mixture components, Advances in Data Analysis and Classification, 4, 3-34.
set.seed(12345)
m <- rpois(20,lambda=5)
dim(m) <- c(5,4)
m <- m/apply(m,1,sum)
round(zmisclassification.matrix(m,symmetric=FALSE),digits=2)
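As a hedged sketch of how such probabilities can be estimated from the posterior matrix alone (an illustration of the idea, not necessarily the exact estimator used by the package): classify every observation by the maximum a posteriori rule and, for entry [i,j], take the share of component j's posterior mass that falls into MAP cluster i.

zmisclass_sketch <- function(z) {
  cl <- apply(z, 1, which.max)            # maximum a posteriori classification
  k <- ncol(z)
  m <- matrix(0, k, k)
  for (i in 1:k) for (j in 1:k)
    m[i, j] <- sum(z[cl == i, j]) / sum(z[, j])   # share of component j's mass in cluster i
  m
}
set.seed(12345)
z <- rpois(20, lambda = 5)
dim(z) <- c(5, 4)
z <- z/rowSums(z)
round(zmisclass_sketch(z), digits = 2)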