Title: | Convex Clustering Methods and Clustering Indexes |
---|---|
Description: | Convex Clustering methods, including K-means algorithm, On-line Update algorithm (Hard Competitive Learning) and Neural Gas algorithm (Soft Competitive Learning), and calculation of several indexes for finding the number of clusters in a data set. |
Authors: | Evgenia Dimitriadou [aut], Kurt Hornik [ctb, cre] |
Maintainer: | Kurt Hornik <[email protected]> |
License: | GPL-2 |
Version: | 0.6-26 |
Built: | 2024-10-29 06:18:37 UTC |
Source: | CRAN |
The data given by x is clustered by the algorithm selected with method.

If centers is a matrix, its rows are taken as the initial cluster centers. If centers is an integer, centers rows of x are randomly chosen as initial values.

The algorithm stops if no cluster center has changed during the last iteration or the maximum number of iterations (given by iter.max) is reached.
If verbose is TRUE, the number of the iteration and the number of cluster indices that have changed since the last iteration are displayed for each iteration (only for the "kmeans" method).
If dist is "euclidean", the distance between the cluster center and the data points is the Euclidean distance (ordinary kmeans algorithm). If "manhattan", the distance between the cluster center and the data points is the sum of the absolute values of the coordinate differences.
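The difference between the two distance options can be seen on a tiny example. The following base R sketch does not call the package; the centers and the data point are made-up values chosen so that the two distances pick different closest centers.

```r
# Two hypothetical cluster centers (one per row) and one data point.
centers <- rbind(c(0, 0), c(-2, 3))
point <- c(3, 3)

# Coordinate differences between each center and the point.
diffs <- sweep(centers, 2, point)

# Euclidean distance: square root of the summed squared differences.
d_euclidean <- sqrt(rowSums(diffs^2))

# Manhattan distance: sum of the absolute differences.
d_manhattan <- rowSums(abs(diffs))

# The closest center depends on the distance used:
# center 1 under the Euclidean distance, center 2 under the Manhattan distance.
closest_euclidean <- which.min(d_euclidean)
closest_manhattan <- which.min(d_manhattan)
```

With these values the Euclidean distances are about 4.24 and 5, while the Manhattan distances are 6 and 5, so the choice of dist can change the assignment of a point.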
If method is "kmeans", the kmeans clustering method is used, which works by repeatedly moving all cluster centers to the mean of their Voronoi sets. If "hardcl", the On-line Update (Hard Competitive Learning) method is used, which performs an update directly after each input signal. If "neuralgas", the Neural Gas (Soft Competitive Learning) method is used, which for each input signal sorts the units of the network according to the distance of their reference vectors to the input signal.
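The three update styles can be sketched in base R as follows. This is an illustration of the ideas only, not the package's implementation: the function names, the rate argument and the exp(-rank / lambda) weighting in the neural gas step are assumptions borrowed from the usual textbook formulations.

```r
# Index of the nearest center (squared Euclidean distance) for one point p.
nearest <- function(p, centers) which.min(colSums((t(centers) - p)^2))

# Batch k-means step: assign every point to its nearest center, then move
# each center to the mean of the points assigned to it (its Voronoi set).
kmeans_step <- function(x, centers) {
  cl <- apply(x, 1, nearest, centers = centers)
  for (j in seq_len(nrow(centers))) {
    if (any(cl == j)) centers[j, ] <- colMeans(x[cl == j, , drop = FALSE])
  }
  centers
}

# On-line (hard competitive) step: after a single input signal p, only the
# winning center is moved towards p by a fraction `rate`.
hardcl_step <- function(p, centers, rate) {
  w <- nearest(p, centers)
  centers[w, ] <- centers[w, ] + rate * (p - centers[w, ])
  centers
}

# Neural-gas-style (soft competitive) step: all centers are ranked by their
# distance to p and every center is updated, with a step that shrinks as the
# rank grows. The weighting exp(-rank / lambda) is an assumption.
neuralgas_step <- function(p, centers, rate, lambda) {
  d <- colSums((t(centers) - p)^2)
  rk <- rank(d, ties.method = "first") - 1  # 0 for the closest center
  target <- matrix(p, nrow(centers), length(p), byrow = TRUE)
  centers + rate * exp(-rk / lambda) * (target - centers)
}

# Usage on simulated data with two hypothetical starting centers.
set.seed(1)
x <- rbind(matrix(rnorm(40, sd = 0.3), ncol = 2),
           matrix(rnorm(40, mean = 1, sd = 0.3), ncol = 2))
start <- x[c(1, 21), , drop = FALSE]
after_batch  <- kmeans_step(x, start)
after_online <- hardcl_step(x[5, ], start, rate = 0.5)
after_gas    <- neuralgas_step(x[5, ], start, rate = 0.5, lambda = 1)
```

The batch step looks at the whole data set before moving anything, whereas the two on-line steps react to one input signal at a time; the soft variant differs from the hard one only in that the losing centers also move.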
If rate.method is "polynomial", the polynomial learning rate is used, that is, $1/t$, where $t$ stands for the number of input data for which a particular cluster has been the winner so far. If "exponentially decaying", the exponentially decaying learning rate is used according to $\varepsilon_i (\varepsilon_f/\varepsilon_i)^{\mathrm{iter}/\mathrm{iter.max}}$, where $\varepsilon_i$ and $\varepsilon_f$ are the initial and final values of the learning rate.
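The two schedules can be sketched in base R as below. The sketch assumes that the polynomial rate is 1/t and that the exponential rate moves geometrically from the initial to the final value over the iterations; the function and argument names are made up for the illustration and are not part of the package.

```r
# Polynomial rate: 1/t, where t counts how often the cluster has won so far.
rate_polynomial <- function(t) 1 / t

# Exponentially decaying rate: starts at `initial` (iter = 0) and reaches
# `final` at iter = iter_max. The defaults mirror rate.par = (0.5, 1e-5).
rate_exponential <- function(iter, iter_max, initial = 0.5, final = 1e-5) {
  initial * (final / initial)^(iter / iter_max)
}

# Rates for the first wins of a cluster, and over a run of 100 iterations.
poly_rates <- rate_polynomial(1:5)
exp_rates <- rate_exponential(c(0, 25, 50, 75, 100), iter_max = 100)
```

With the polynomial rate each cluster slows down according to its own win count, while the exponential rate depends only on the global iteration counter.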
rate.par gives the parameters of the learning rate. By default, rate.par=1.0 if rate.method is "polynomial", and rate.par=(0.5,1e-5) otherwise.
cclust (x, centers, iter.max=100, verbose=FALSE, dist="euclidean", method= "kmeans", rate.method="polynomial", rate.par=NULL)
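x | Data matrix where columns correspond to variables and rows to observations |
centers | Number of clusters or initial values for cluster centers |
iter.max | Maximum number of iterations |
verbose | If TRUE, progress information is displayed for each iteration |
dist | If "euclidean", the Euclidean distance is used; if "manhattan", the sum of the absolute coordinate differences |
method | If "kmeans", the kmeans method is used; if "hardcl", the On-line Update (Hard Competitive Learning) method; if "neuralgas", the Neural Gas (Soft Competitive Learning) method |
rate.method | If "polynomial", the polynomial learning rate is used; if "exponentially decaying", the exponentially decaying learning rate |
rate.par | The parameters of the learning rate. |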
cclust returns an object of class "cclust".
centers | The final cluster centers. |
initcenters | The initial cluster centers. |
ncenters | The number of centers. |
cluster | Vector containing the index of the cluster to which each data point is assigned. |
size | The number of data points in each cluster. |
iter | The number of iterations performed. |
changes | The number of changes performed in each iteration step with the Kmeans algorithm. |
dist | The distance measure used. |
method | The algorithm method being used. |
rate.method | The learning rate being used by the Hardcl clustering method. |
rate.par | The parameters of the learning rate. |
call | A call in which all of the arguments are specified by their names. |
withinss | The sum of squared distances within the clusters. |
Evgenia Dimitriadou
    library(cclust)

    ## a 2-dimensional example
    x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
               matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
    cl <- cclust(x, 2, 20, verbose = TRUE, method = "kmeans")
    plot(x, col = cl$cluster)

    ## a 3-dimensional example
    x <- rbind(matrix(rnorm(150, sd = 0.3), ncol = 3),
               matrix(rnorm(150, mean = 1, sd = 0.3), ncol = 3),
               matrix(rnorm(150, mean = 2, sd = 0.3), ncol = 3))
    cl <- cclust(x, 6, 20, verbose = TRUE, method = "kmeans")
    plot(x, col = cl$cluster)

    ## assign classes to some new data
    y <- rbind(matrix(rnorm(33, sd = 0.3), ncol = 3),
               matrix(rnorm(33, mean = 1, sd = 0.3), ncol = 3),
               matrix(rnorm(3, mean = 2, sd = 0.3), ncol = 3))
    ycl <- predict(cl, y)
    plot(y, col = ycl$cluster)
clustIndex calculates the values of several clustering indexes for y, the result of a clustering algorithm of a class such as "cclust". The values of the indexes can be used independently to determine the number of clusters existing in a data set.
clustIndex ( y, x, index = "all" )
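y | Object of a class such as "cclust", as returned by a clustering algorithm |
x | Data matrix where columns correspond to variables and rows to observations |
index | The indexes that are calculated; "all" (the default) computes all of them |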
The description of the indexes is categorized into 3 groups, based on the statistics mainly used to compute them.
The first group is based on the sum of squares within ($SSW$) and between ($SSB$) the clusters. These statistics measure the dispersion of the data points in a cluster and between the clusters respectively. These indexes are:

calinski: $(SSB/(k-1))/(SSW/(n-k))$, where $n$ is the number of data points and $k$ is the number of clusters.

hartigan: $\log(SSB/SSW)$.

ratkowsky: $\mathrm{mean}(\sqrt{varSSB/varSST})$, where $varSSB$ stands for the $SSB$ for every variable and $varSST$ for the total sum of squares for every variable.

ball: $SSW/k$, where $k$ is the number of clusters.
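As a rough illustration of how the first-group statistics can be computed by hand, the base R sketch below derives the within and between sums of squares for a fixed partition and plugs them into the usual textbook forms of the Calinski, Hartigan, Ratkowsky and Ball indexes. The simulated data, the partition in cluster (standing in for the cluster component of a clustering result) and all variable names are assumptions; the package's own code may normalize some of these quantities differently.

```r
set.seed(42)
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
cluster <- rep(1:2, each = 50)  # hypothetical partition of the 100 rows

n <- nrow(x)
k <- length(unique(cluster))

# Cluster centers (one row per cluster) and the overall mean.
centers <- rowsum(x, cluster) / as.vector(table(cluster))
grand_mean <- colMeans(x)

# Per-variable total and within sums of squares; between = total - within.
var_sst <- colSums(sweep(x, 2, grand_mean)^2)
var_ssw <- colSums((x - centers[cluster, ])^2)
var_ssb <- var_sst - var_ssw

ssw <- sum(var_ssw)
ssb <- sum(var_ssb)

calinski  <- (ssb / (k - 1)) / (ssw / (n - k))
hartigan  <- log(ssb / ssw)
ratkowsky <- mean(sqrt(var_ssb / var_sst))
ball      <- ssw / k
```

For well separated groups such as these, the between sum of squares dominates, so calinski is large and hartigan is positive.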
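The second group is based on the statistics of $T$, i.e., the scatter matrix of the data points, and $W$, which is the sum of the scatter matrices in every group. These indexes are:

scott: $n \log(|T|/|W|)$, where $n$ is the number of data points and $|\cdot|$ stands for the determinant of a matrix.

marriot: $k^2 |W|$, where $k$ is the number of clusters.

trcovw: $\mathrm{Trace}\,\mathrm{Cov}\,W$.

tracew: $\mathrm{Trace}\,W$.

friedman: $\mathrm{Trace}\,W^{-1}B$, where $B$ is the scatter matrix of the cluster centers.

rubin: $|T|/|W|$.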
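The scatter matrices behind the second group can be built directly with crossprod. The base R sketch below is an illustration under assumptions: the data and partition are simulated, B is taken as T - W, and the index formulas are the usual textbook forms (Scott, Marriot, trace of Cov W, trace of W, Friedman, Rubin), which may differ in detail from the package's code.

```r
set.seed(42)
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
cluster <- rep(1:2, each = 50)  # hypothetical partition of the 100 rows

n <- nrow(x)
k <- length(unique(cluster))
centers <- rowsum(x, cluster) / as.vector(table(cluster))

# Total scatter T, pooled within-cluster scatter W, and between scatter B.
T_mat <- crossprod(sweep(x, 2, colMeans(x)))
W_mat <- crossprod(x - centers[cluster, ])
B_mat <- T_mat - W_mat

scott    <- n * log(det(T_mat) / det(W_mat))
marriot  <- k^2 * det(W_mat)
trcovw   <- sum(diag(cov(W_mat)))
tracew   <- sum(diag(W_mat))
friedman <- sum(diag(solve(W_mat) %*% B_mat))
rubin    <- det(T_mat) / det(W_mat)
```

Note that the trace of W equals the within sum of squares used by the first group, so tracew and SSW carry the same information.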
The third group consists of four indexes that do not belong to the previous groups and have nothing in common with each other.

cindex: intended for binary data sets. The C-Index is a cluster similarity measure expressed as $[d_w - \min(d_w)]/[\max(d_w) - \min(d_w)]$, where $d_w$ is the sum of all $n_d$ within-cluster distances, $\min(d_w)$ is the sum of the $n_d$ smallest pairwise distances in the data set, and $\max(d_w)$ is the sum of the $n_d$ biggest pairwise distances. In order to compute the C-Index, all pairwise distances in the data set have to be computed and stored. In the case of binary data, storing the distances creates no problems since there are only a few possible distances. However, the computation of all distances can make this index prohibitive for large data sets.

db: $R = (1/k) \sum_i R_i$, where $R_i$ stands for the maximum value of $R_{ij}$ for $i \neq j$, and $R_{ij} = (SSW_i + SSW_j)/DC_{ij}$, where $DC_{ij}$ is the distance between the centers of two clusters $i, j$.

likelihood: under the assumption of independence of the variables within a cluster, a cluster solution can be regarded as a mixture model for the data, where the cluster centers give the probabilities for each variable to be $1$. Therefore, the negative log-likelihood can be computed and used as a quantity measure for a cluster solution. Note that the assumptions for applying special penalty terms, like in AIC or BIC, are not fulfilled in this model, and also they show no effect for these data sets.

ssi: this “Simple Structure Index” combines three elements which influence the interpretability of a solution, i.e., the maximum difference of each variable between the clusters, the sizes of the most contrasting clusters and the deviation of a variable in the cluster centers compared to its overall mean. These three elements are multiplicatively combined and normalized to give a value between $0$ and $1$.
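Two of the third-group indexes lend themselves to a short hand computation. The base R sketch below assumes the usual form of the C-Index (the within-cluster distance sum rescaled between the sums of the smallest and largest pairwise distances) and a Davies-Bouldin-style ratio built from per-cluster within sums of squares and distances between centers. The data, the partition and all names are made up for the example, and continuous rather than binary data are used for brevity; the package's own code may differ in detail.

```r
set.seed(7)
x <- rbind(matrix(rnorm(40, sd = 0.3), ncol = 2),
           matrix(rnorm(40, mean = 1, sd = 0.3), ncol = 2))
cluster <- rep(1:2, each = 20)  # hypothetical partition of the 40 rows

## C-Index: needs all pairwise distances, hence the cost for large data sets.
d <- as.matrix(dist(x))
upper <- upper.tri(d)                          # count each pair once
same_cluster <- outer(cluster, cluster, "==")

all_d <- sort(d[upper])                        # all pairwise distances, ascending
within_d <- d[upper & same_cluster]            # distances inside clusters
n_d <- length(within_d)

d_w <- sum(within_d)
min_d_w <- sum(head(all_d, n_d))               # n_d smallest distances
max_d_w <- sum(tail(all_d, n_d))               # n_d biggest distances
cindex <- (d_w - min_d_w) / (max_d_w - min_d_w)

## Davies-Bouldin-style index from per-cluster SSW and center distances.
centers <- rowsum(x, cluster) / as.vector(table(cluster))
ssw_i <- as.vector(rowsum(rowSums((x - centers[cluster, ])^2), cluster))
dc <- as.matrix(dist(centers))
r_ij <- outer(ssw_i, ssw_i, "+") / dc
diag(r_ij) <- -Inf                             # exclude i == j from the maximum
r_i <- apply(r_ij, 1, max)
db <- mean(r_i)
```

A C-Index close to 0 means the within-cluster distances are among the smallest in the data set, and a small db means compact clusters with well separated centers.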
Returns a vector with the index values.
Evgenia Dimitriadou and Andreas Weingessel
Andreas Weingessel, Evgenia Dimitriadou and Sara Dolnicar, An Examination Of Indexes For Determining The Number Of Clusters In Binary Data Sets, https://epub.wu.ac.at/1542/ and the references therein.
    library(cclust)

    # a 2-dimensional example
    x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
               matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
    cl <- cclust(x, 2, 20, verbose = TRUE, method = "kmeans")
    resultindexes <- clustIndex(cl, x, index = "all")
    resultindexes
Assigns each data point (row in newdata) the cluster corresponding to the closest center found in object.
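The assignment rule can be sketched in base R as follows. This is an illustration only: it assumes the Euclidean distance, uses made-up centers and data, and the helper assign_to_centers is a hypothetical name rather than a function of the package.

```r
# Assign each row of newdata to the nearest center (squared Euclidean distance).
assign_to_centers <- function(newdata, centers) {
  apply(newdata, 1, function(p) which.min(colSums((t(centers) - p)^2)))
}

# Hypothetical centers, standing in for the centers of a fitted object.
centers <- rbind(c(0, 0), c(1, 1))
newdata <- rbind(c(0.1, -0.2), c(0.9, 1.2), c(0.4, 0.3))

cluster <- assign_to_centers(newdata, centers)   # 1, 2, 1
size <- tabulate(cluster, nbins = nrow(centers)) # 2, 1
```

The centers themselves are not updated by this step; only the cluster memberships and the cluster sizes are recomputed for the new data.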
## S3 method for class 'cclust'
predict(object, newdata, ...)
object | Object of class "cclust" |
newdata | Data matrix where columns correspond to variables and rows to observations |
... | Currently not used |
predict.cclust returns an object of class "cclust". Only size is changed as compared to the argument object.
cluster | Vector containing the index of the cluster to which each data point is mapped. |
size | The number of data points in each cluster. |
Evgenia Dimitriadou
    library(cclust)

    # a 2-dimensional example
    x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
               matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
    cl <- cclust(x, 2, 20, verbose = TRUE, method = "kmeans")
    plot(x, col = cl$cluster)

    # a 3-dimensional example
    x <- rbind(matrix(rnorm(150, sd = 0.3), ncol = 3),
               matrix(rnorm(150, mean = 1, sd = 0.3), ncol = 3),
               matrix(rnorm(150, mean = 2, sd = 0.3), ncol = 3))
    cl <- cclust(x, 6, 20, verbose = TRUE, method = "kmeans")
    plot(x, col = cl$cluster)

    # assign classes to some new data
    y <- rbind(matrix(rnorm(33, sd = 0.3), ncol = 3),
               matrix(rnorm(33, mean = 1, sd = 0.3), ncol = 3),
               matrix(rnorm(3, mean = 2, sd = 0.3), ncol = 3))
    ycl <- predict(cl, y)
    plot(y, col = ycl$cluster)