Title: Determining the Best Number of Clusters in a Data Set
Description: Provides 30 indices for determining the optimal number of clusters in a data set and offers the user the best clustering scheme from the different results.
Authors: Malika Charrad, Nadia Ghazzali, Veronique Boiteau and Azam Niknafs
Maintainer: Malika Charrad <[email protected]>
License: GPL-2
Version: 3.0.1
Built: 2024-10-31 21:24:46 UTC
Source: CRAN
NbClust
The NbClust package provides 30 indices for determining the number of clusters and proposes to the user the best clustering scheme from the different results obtained by varying all combinations of number of clusters, distance measures, and clustering methods.
NbClust(data = NULL, diss = NULL, distance = "euclidean", min.nc = 2, max.nc = 15, method = NULL, index = "all", alphaBeale = 0.1)
data | matrix or dataset.
diss | dissimilarity matrix to be used. By default, diss = NULL, but if it is replaced by a dissimilarity matrix, distance should be "NULL".
distance | the distance measure to be used to compute the dissimilarity matrix. This must be one of: "euclidean", "maximum", "manhattan", "canberra", "binary", "minkowski" or "NULL". By default, distance = "euclidean". If the distance is "NULL", the dissimilarity matrix (diss) should be given by the user. If distance is not "NULL", the dissimilarity matrix should be "NULL".
min.nc | minimal number of clusters, between 1 and (number of objects - 1). By default, min.nc = 2.
max.nc | maximal number of clusters, between 2 and (number of objects - 1), greater or equal to min.nc. By default, max.nc = 15.
method | the cluster analysis method to be used. This should be one of: "ward.D", "ward.D2", "single", "complete", "average", "mcquitty", "median", "centroid", "kmeans".
index | the index to be calculated. This should be one of: "kl", "ch", "hartigan", "ccc", "scott", "marriot", "trcovw", "tracew", "friedman", "rubin", "cindex", "db", "silhouette", "duda", "pseudot2", "beale", "ratkowsky", "ball", "ptbiserial", "gap", "frey", "mcclain", "gamma", "gplus", "tau", "dunn", "hubert", "sdindex", "dindex", "sdbw", "all" (all indices except Gap, Gamma, Gplus and Tau), "alllong" (all indices with Gap, Gamma, Gplus and Tau included).
alphaBeale | significance value for Beale's index.
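Because diss and distance are mutually exclusive, a call supplies either one or the other. A minimal sketch of both routes (toy data and object names are illustrative, not taken from the package documentation):

library(NbClust)

set.seed(42)
x <- matrix(rnorm(60), ncol = 2)   # toy data: 30 objects described by 2 variables

## Route 1: let NbClust compute the dissimilarities from the data matrix
res1 <- NbClust(data = x, distance = "euclidean",
                min.nc = 2, max.nc = 5,
                method = "average", index = "silhouette")

## Route 2: supply a precomputed dissimilarity matrix; distance must then be NULL
d <- dist(x, method = "manhattan")
res2 <- NbClust(data = x, diss = d, distance = NULL,
                min.nc = 2, max.nc = 5,
                method = "average", index = "silhouette")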
Notes on the "distance" argument
The following distance measures are written for two vectors x and y. They are used when each object is described by a d-dimensional vector obtained by measuring d characteristics on each of n objects or individuals.
Euclidean distance : Usual distance between the two vectors (2 norm), i.e. the square root of the sum of the squared differences of the components.
Maximum distance: Maximum distance between two components of x and y (supremum norm).
Manhattan distance : Absolute distance between the two vectors (1 norm).
Canberra distance : Sum of |x_i - y_i| / (|x_i| + |y_i|) over the components. Terms with zero numerator and denominator are omitted from the sum and treated as if the values were missing.
Binary distance : The vectors are regarded as binary bits, so non-zero elements are "on" and zero elements are "off". The distance is the proportion of bits in which only one is on amongst those in which at least one is on.
Minkowski distance : The p norm, i.e. the p-th root of the sum of the p-th powers of the differences of the components.
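These measure names coincide with the method options of stats::dist(), so the corresponding dissimilarity matrices can also be precomputed and passed through diss; a small sketch (object names are illustrative):

set.seed(1)
x <- matrix(rnorm(40), ncol = 2)

d_euc  <- dist(x, method = "euclidean")          # 2 norm
d_max  <- dist(x, method = "maximum")            # supremum norm
d_man  <- dist(x, method = "manhattan")          # 1 norm
d_can  <- dist(x, method = "canberra")
d_mink <- dist(x, method = "minkowski", p = 3)   # p norm, here with p = 3

## Any of these can replace the internal distance computation:
res <- NbClust(x, diss = d_mink, distance = NULL,
               min.nc = 2, max.nc = 5, method = "complete", index = "dunn")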
Notes on the "method" argument
The following aggregation methods are available in this package.
Ward : The Ward method minimizes the total within-cluster variance. At each step the pair of clusters whose merging leads to the minimum increase in total within-cluster variance is merged. Two different algorithms are found in the literature for Ward clustering. The one used by option "ward.D" (equivalent to the only Ward option "ward" in R versions <= 3.0.3) does not implement Ward's (1963) clustering criterion, whereas option "ward.D2" implements that criterion (Murtagh and Legendre 2013). With the latter, the dissimilarities are squared before cluster updating.
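As a side illustration of the difference between the two options, one can compare them directly with hclust(); a minimal sketch (toy data; the expectation that "ward.D" on squared dissimilarities matches "ward.D2" on the raw dissimilarities follows the Murtagh and Legendre (2013) reading quoted above and is an assumption of this sketch):

set.seed(2)
x <- matrix(rnorm(50), ncol = 2)
d <- dist(x, method = "euclidean")

h_ward2 <- hclust(d,   method = "ward.D2")  # Ward's criterion applied to d
h_ward1 <- hclust(d^2, method = "ward.D")   # "ward.D" applied to squared d

## Cross-tabulate the two 3-group cuts; the partitions are expected to coincide
table(cutree(h_ward2, k = 3), cutree(h_ward1, k = 3))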
Single : The distance between two clusters $C_i$ and $C_j$ is the minimum distance between two points $x$ and $y$, with $x \in C_i$ and $y \in C_j$.
A drawback of this method is the so-called chaining phenomenon: clusters may be forced together due to single elements being close to each other, even though many of the elements in each cluster may be very distant from each other.
Complete : The distance between two clusters $C_i$ and $C_j$ is the maximum distance between two points $x$ and $y$, with $x \in C_i$ and $y \in C_j$.
Average : The distance between two clusters $C_i$ and $C_j$ is the mean of the distances between the pairs of points $x$ and $y$, with $x \in C_i$ and $y \in C_j$:
$d(C_i, C_j) = \frac{1}{n_i \, n_j} \sum_{x \in C_i} \sum_{y \in C_j} d(x, y)$,
where $n_i$ and $n_j$ are respectively the number of elements in clusters $C_i$ and $C_j$.
This method has the tendency to form clusters with the same variance and, in particular, small variance.
McQuitty : The distance between clusters $C_k$ and $C_l$ is the weighted mean of the between-cluster dissimilarities:
$d(C_k, C_l) = \frac{d(C_i, C_l) + d(C_j, C_l)}{2}$,
where cluster $C_k$ is formed from the aggregation of clusters $C_i$ and $C_j$.
Median : The distance between two clusters $C_k$ and $C_l$ is given by the following formula:
$d(C_k, C_l) = \frac{d(C_i, C_l)}{2} + \frac{d(C_j, C_l)}{2} - \frac{d(C_i, C_j)}{4}$,
where cluster $C_k$ is formed by the aggregation of clusters $C_i$ and $C_j$.
Centroid : The distance between two clusters $C_i$ and $C_j$ is the squared Euclidean distance between the gravity centers of the two clusters, i.e. between the mean vectors of the two clusters, $\bar{x}_i$ and $\bar{x}_j$ respectively.
This method is more robust to isolated points than the other methods.
Kmeans : This method is said to be a reallocation method. The general principle, sketched in code below, is the following:
1. Select as many points as the number of desired clusters to create initial centers.
2. Associate each observation with the nearest center to create temporary clusters.
3. Compute the gravity center of each temporary cluster; these become the new cluster centers.
4. Reallocate each observation to the cluster with the closest center.
5. Repeat steps 2-4 until convergence.
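A minimal base-R sketch of this reallocation loop; lloyd_sketch is a hypothetical helper name, a fixed number of passes stands in for a proper convergence test, and stats::kmeans() should be preferred in practice:

lloyd_sketch <- function(x, k, iters = 10) {   # hypothetical helper, not from NbClust
  ## step 1: select k observations as the initial centers
  centers <- x[sample(nrow(x), k), , drop = FALSE]
  for (i in seq_len(iters)) {
    ## steps 2/4: associate each observation with its nearest center
    d  <- as.matrix(dist(rbind(centers, x)))[-(1:k), 1:k]
    cl <- max.col(-d)
    ## step 3: the gravity centers of the temporary clusters become the new centers
    centers <- do.call(rbind, lapply(seq_len(k), function(j)
      colMeans(x[cl == j, , drop = FALSE])))
  }
  ## step 5: a full implementation repeats until the assignments stop changing
  ## and also guards against empty clusters, which this sketch does not
  list(cluster = cl, centers = centers)
}

set.seed(3)
x <- rbind(matrix(rnorm(40, sd = 0.3), ncol = 2),
           matrix(rnorm(40, mean = 3, sd = 0.3), ncol = 2))
lloyd_sketch(x, k = 2)$cluster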
Notes on the "index" argument
The table below summarizes indices implemented in NbClust and the criteria used to select the optimal number of clusters.
Index in NbClust | Optimal number of clusters |
1. "kl" or "all" or "alllong" | Maximum value of the index |
(Krzanowski and Lai 1988) | |
2. "ch" or "all" or "alllong" | Maximum value of the index |
(Calinski and Harabasz 1974) | |
3. "hartigan" or "all" or "alllong" | Maximum difference between |
(Hartigan 1975) | hierarchy levels of the index |
4. "ccc" or "all" or "alllong" | Maximum value of the index |
(Sarle 1983) | |
5. "scott" or "all" or "alllong" | Maximum difference between |
(Scott and Symons 1971) | hierarchy levels of the index |
6. "marriot" or "all" or "alllong" | Max. value of second differences |
(Marriot 1971) | between levels of the index |
7. "trcovw" or "all" or "alllong" | Maximum difference between |
(Milligan and Cooper 1985) | hierarchy levels of the index |
8. "tracew" or "all" or "alllong" | Maximum value of absolute second |
(Milligan and Cooper 1985) | differences between levels of the index |
9. "friedman" or "all" or "alllong" | Maximum difference between |
(Friedman and Rubin 1967) | hierarchy levels of the index |
10. "rubin" or "all" or "alllong" | Minimum value of second differences |
(Friedman and Rubin 1967) | between levels of the index |
11. "cindex" or "all" or "alllong" | Minimum value of the index |
(Hubert and Levin 1976) | |
12. "db" or "all" or "alllong" | Minimum value of the index |
(Davies and Bouldin 1979) | |
13. "silhouette" or "all" or "alllong" | Maximum value of the index |
(Rousseeuw 1987) | |
14. "duda" or "all" or "alllong" | Smallest number of clusters such that index > criticalValue |
(Duda and Hart 1973) | |
15. "pseudot2" or "all" or "alllong" | Smallest number of clusters such that index < criticalValue |
(Duda and Hart 1973) | |
16. "beale" or "all" or "alllong" | Number of clusters such that critical value of the index >= alpha |
(Beale 1969) | |
17. "ratkowsky" or "all" or "alllong" | Maximum value of the index |
(Ratkowsky and Lance 1978) | |
18. "ball" or "all" or "alllong" | Maximum difference between hierarchy |
(Ball and Hall 1965) | levels of the index |
19. "ptbiserial" or "all" or "alllong" | Maximum value of the index |
(Milligan 1980, 1981) | |
20. "gap" or "alllong" | Smallest number of clusters such that criticalValue >= 0 |
(Tibshirani et al. 2001) | |
21. "frey" or "all" or "alllong" | Cluster level just before the index value falls below 1.00 |
(Frey and Van Groenewoud 1972) | |
22. "mcclain" or "all" or "alllong" | Minimum value of the index |
(McClain and Rao 1975) | |
23. "gamma" or "alllong" | Maximum value of the index |
(Baker and Hubert 1975) | |
24. "gplus" or "alllong" | Minimum value of the index |
(Rohlf 1974) (Milligan 1981) | |
25. "tau" or "alllong" | Maximum value of the index |
(Rohlf 1974) (Milligan 1981) | |
26. "dunn" or "all" or "alllong" | Maximum value of the index |
(Dunn 1974) | |
27. "hubert" or "all" or "alllong" | Graphical method |
(Hubert and Arabie 1985) | |
28. "sdindex" or "all" or "alllong" | Minimum value of the index |
(Halkidi et al. 2000) | |
29. "dindex" or "all" or "alllong" | Graphical method |
(Lebart et al. 2000) | |
30. "sdbw" or "all" or "alllong" | Minimum value of the index |
(Halkidi and Vazirgiannis 2001) | |
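For the "maximum value of the index" style of rule, the selection can also be read off All.index by hand; a minimal sketch with the Calinski-Harabasz index, assuming a single index is requested so that All.index is a numeric vector over min.nc:max.nc:

res <- NbClust(iris[, -5], distance = "euclidean",
               min.nc = 2, max.nc = 8,
               method = "ward.D2", index = "ch")

res$All.index                    # CH value for each candidate number of clusters
(2:8)[which.max(res$All.index)]  # candidate with the maximum index value
res$Best.nc                      # compare with what the package reports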
All.index | Values of indices for each partition of the dataset obtained with a number of clusters between min.nc and max.nc.
All.CriticalValues | Critical values of some indices for each partition obtained with a number of clusters between min.nc and max.nc.
Best.nc | Best number of clusters proposed by each index and the corresponding index value.
Best.partition | Partition that corresponds to the best number of clusters.
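When index = "all" is used, a common way of combining these components is a majority vote over Best.nc; a minimal sketch, assuming Best.nc is then a matrix whose first row holds the number of clusters proposed by each index (the package itself prints a comparable majority-rule summary):

res <- NbClust(iris[, -5], distance = "euclidean",
               min.nc = 2, max.nc = 8,
               method = "kmeans", index = "all")

votes <- table(res$Best.nc[1, ])   # first row: number of clusters proposed per index
votes                              # how many indices back each candidate
names(which.max(votes))            # the majority choice
table(res$Best.partition)          # cluster sizes in the retained partition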
Malika Charrad, Nadia Ghazzali, Veronique Boiteau and Azam Niknafs
Charrad M., Ghazzali N., Boiteau V., Niknafs A. (2014). NbClust: An R Package for Determining the Relevant Number of Clusters in a Data Set. Journal of Statistical Software, 61(6), 1-36. URL http://www.jstatsoft.org/v61/i06/.
## DATA MATRIX IS GIVEN

## A 2-dimensional example
set.seed(1)
x <- rbind(matrix(rnorm(100, sd = 0.1), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.2), ncol = 2),
           matrix(rnorm(100, mean = 5, sd = 0.1), ncol = 2),
           matrix(rnorm(100, mean = 7, sd = 0.2), ncol = 2))
res <- NbClust(x, distance = "euclidean", min.nc = 2, max.nc = 8,
               method = "complete", index = "ch")
res$All.index
res$Best.nc
res$Best.partition

## A 5-dimensional example
set.seed(1)
x <- rbind(matrix(rnorm(150, sd = 0.3), ncol = 5),
           matrix(rnorm(150, mean = 3, sd = 0.2), ncol = 5),
           matrix(rnorm(150, mean = 1, sd = 0.1), ncol = 5),
           matrix(rnorm(150, mean = 6, sd = 0.3), ncol = 5),
           matrix(rnorm(150, mean = 9, sd = 0.3), ncol = 5))
res <- NbClust(x, distance = "euclidean", min.nc = 2, max.nc = 10,
               method = "ward.D", index = "all")
res$All.index
res$Best.nc
res$All.CriticalValues
res$Best.partition

## A real data example
data <- iris[, -c(5)]
res <- NbClust(data, diss = NULL, distance = "euclidean", min.nc = 2, max.nc = 6,
               method = "ward.D2", index = "kl")
res$All.index
res$Best.nc
res$Best.partition

res <- NbClust(data, diss = NULL, distance = "euclidean", min.nc = 2, max.nc = 6,
               method = "kmeans", index = "hubert")
res$All.index

res <- NbClust(data, diss = NULL, distance = "manhattan", min.nc = 2, max.nc = 6,
               method = "complete", index = "all")
res$All.index
res$Best.nc
res$All.CriticalValues
res$Best.partition

## Examples with a dissimilarity matrix

## Data matrix is given
set.seed(1)
x <- rbind(matrix(rnorm(150, sd = 0.3), ncol = 3),
           matrix(rnorm(150, mean = 3, sd = 0.2), ncol = 3),
           matrix(rnorm(150, mean = 5, sd = 0.3), ncol = 3))
diss_matrix <- dist(x, method = "euclidean", diag = FALSE)
res <- NbClust(x, diss = diss_matrix, distance = NULL, min.nc = 2, max.nc = 6,
               method = "ward.D", index = "ch")
res$All.index
res$Best.nc
res$Best.partition

data <- iris[, -c(5)]
diss_matrix <- dist(data, method = "euclidean", diag = FALSE)
res <- NbClust(data, diss = diss_matrix, distance = NULL, min.nc = 2, max.nc = 6,
               method = "ward.D2", index = "all")
res$All.index
res$Best.nc
res$All.CriticalValues
res$Best.partition

set.seed(1)
x <- rbind(matrix(rnorm(20, sd = 0.1), ncol = 2),
           matrix(rnorm(20, mean = 1, sd = 0.2), ncol = 2),
           matrix(rnorm(20, mean = 5, sd = 0.1), ncol = 2),
           matrix(rnorm(20, mean = 7, sd = 0.2), ncol = 2))
diss_matrix <- dist(x, method = "euclidean", diag = FALSE)
res <- NbClust(x, diss = diss_matrix, distance = NULL, min.nc = 2, max.nc = 6,
               method = "ward.D2", index = "alllong")
res$All.index
res$Best.nc
res$All.CriticalValues
res$Best.partition

## Data matrix is not available. Only the dissimilarity matrix is given.
## In this case, only the following indices can be computed:
## frey, mcclain, cindex, silhouette and dunn.
res <- NbClust(diss = diss_matrix, distance = NULL, min.nc = 2, max.nc = 6,
               method = "ward.D2", index = "silhouette")
res$All.index
res$Best.nc
res$All.CriticalValues
res$Best.partition