Title: | Fundamental Clustering Problems Suite |
---|---|
Description: | Over sixty clustering algorithms are provided in this package with consistent input and output, which enables the user to try out algorithms swiftly. Additionally, 26 statistical approaches for the estimation of the number of clusters as well as the mirrored density plot (MD-plot) of clusterability are implemented. The package is published in Thrun, M.C., Stier Q.: "Fundamental Clustering Algorithms Suite" (2021), SoftwareX, <DOI:10.1016/j.softx.2020.100642>. Moreover, the fundamental clustering problems suite (FCPS) offers a variety of clustering challenges any algorithm should handle when facing real-world data, see Thrun, M.C., Ultsch A.: "Clustering Benchmark Datasets Exploiting the Fundamental Clustering Problems" (2020), Data in Brief, <DOI:10.1016/j.dib.2020.105501>. |
Authors: | Michael Thrun [aut, cre, cph] , Peter Nahrgang [ctr, ctb], Felix Pape [ctr, ctb], Vasyl Pihur [ctb], Guy Brock [ctb], Susmita Datta [ctb], Somnath Datta [ctb], Luis Winckelmann [com], Alfred Ultsch [dtc, ctb], Quirin Stier [ctb, rev] |
Maintainer: | Michael Thrun <[email protected]> |
License: | GPL-3 |
Version: | 1.3.4 |
Built: | 2024-12-12 07:17:20 UTC |
Source: | CRAN |
The package consists of many algorithms and fundamental datasets for clustering published in [Thrun/Stier, 2021]. Originally, the 'Fundamental Clustering Problems Suite' (FCPS) offered a variety of clustering problems that any algorithm should be able to handle when facing real-world data. Nine of the artificial datasets presented here were previously named FCPS, with a fixed sample size, in Ultsch, A.: "Clustering with SOM: U*C", In Workshop on Self-Organizing Maps, 2005. FCPS has often served in publications as an elementary benchmark for clustering algorithms. The FCPS package extends the datasets, enables variable sample sizes for them, and provides standardized and easy access to many clustering algorithms.
The FCPS datasets consist of data sets with a classification known a priori that is to be reproduced by the algorithms. All data sets are intentionally created to be simple and can be visualized in two or three dimensions. Each data set represents a certain problem that is solved by known clustering algorithms with varying success, in order to reveal the benefits and shortcomings of the algorithms in question. Standard clustering methods, e.g. single-linkage, Ward and k-means, are not able to solve all FCPS problems satisfactorily. "Lsun3D and each of the nine artificial data sets of 'Fundamental Clustering Problems Suite' (FCPS) were defined separately for a specific clustering problem as cited (in [Thrun/Ultsch, 2020]). The original sample size defined in the respective first publication mentioning the data was used in [Thrun/Ultsch, 2020], but using the R function 'ClusterChallenge' (...) any sample size can be drawn for all artificial data sets." [Thrun/Ultsch, 2020]
Michael Thrun [aut, cre, cph] (<https://orcid.org/0000-0001-9542-5543>), Peter Nahrgang [ctr, ctb], Felix Pape [ctr, ctb], Vasyl Pihur [ctb], Guy Brock [ctb], Susmita Datta [ctb], Somnath Datta [ctb], Luis Winckelmann [com], Alfred Ultsch [dtc, ctb], Quirin Stier [ctb, rev]
[Thrun/Ultsch, 2020] Thrun, M. C., & Ultsch, A.: Clustering Benchmark Datasets Exploiting the Fundamental Clustering Problems, Data in Brief, Vol. 30(C), pp. 105501, doi:10.1016/j.dib.2020.105501, 2020.
[Thrun/Stier, 2021] Thrun, M. C., & Stier, Q.: Fundamental Clustering Algorithms Suite, SoftwareX, Vol. 13(C), pp. 100642, doi:10.1016/j.softx.2020.100642, 2021.
[Ultsch, 2005] Ultsch, A.: Clustering with SOM: U*C, In Proc. Workshop on Self-Organizing Maps, pp. 75-82, Paris, France, 2005.
The algorithm was introduced in [Rodriguez/Laio, 2014] and is implemented here by [Wang/Xu, 2017]. The algorithm is adaptive in the sense that only ClusterNo has to be set instead of the parameters of [Rodriguez/Laio, 2014] implemented in ADPclustering.
ADPclustering(Data,ClusterNo=NULL,PlotIt=FALSE,...)
Data |
[1:n,1:d] matrix of dataset to be clustered. It consists of n cases of d-dimensional data points. Every case has d attributes, variables or features. |
ClusterNo |
Optional, either: A number k which defines k different clusters to be built by the algorithm, or a range of |
PlotIt |
default: FALSE, If TRUE plots the first three dimensions of the dataset with colored three-dimensional data points defined by the clustering stored in |
... |
Further arguments to be set for the clustering algorithm, if not set, default arguments are used. |
The ADP algorithm decides the number of clusters k. This is in contrast to the other version of the algorithm, from another package, which can be called with DensityPeakClustering.
List of
Cls |
[1:n] numerical vector with n numbers defining the classification as the main output of the clustering algorithm. It has k unique numbers representing the arbitrary labels of the clustering. |
Object |
Object defined by clustering algorithm as the other output of this algorithm |
Michael Thrun
[Rodriguez/Laio, 2014] Rodriguez, A., & Laio, A.: Clustering by fast search and find of density peaks, Science, Vol. 344(6191), pp. 1492-1496. 2014.
[Wang/Xu, 2017] Wang, X.-F., & Xu, Y.: Fast clustering using adaptive density peak detection, Statistical methods in medical research, Vol. 26(6), pp. 2800-2811. 2017.
data('Hepta')
out=ADPclustering(Hepta$Data,PlotIt=FALSE)
Agglomerative hierarchical clustering (AGNES) of [Rousseeuw/Kaufman, 1990, pp. 199-252]
AgglomerativeNestingClustering(DataOrDistances, ClusterNo, PlotIt = FALSE, Standardization = TRUE, ...)
DataOrDistances |
[1:n,1:d] matrix of dataset to be clustered. It consists of n cases of d-dimensional data points. Every case has d attributes, variables or features. Alternatively, a symmetric [1:n,1:n] distance matrix. |
ClusterNo |
A number k which defines k different clusters to be built by the algorithm.
if |
PlotIt |
Default: FALSE if |
Standardization |
|
... |
Further arguments to be set for the clustering algorithm, if not set, default arguments are used. |
List of
Cls |
[1:n] numerical vector with n numbers defining the classification as the main output of the clustering algorithm. It has k unique numbers representing the arbitrary labels of the clustering. |
Dendrogram |
Dendrogram of hierarchical clustering algorithm |
Object |
Object defined by clustering algorithm as the other output of this algorithm |
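The shape of the two main outputs, Cls and Dendrogram, can be illustrated with base R alone. The following is a minimal sketch under stated assumptions: hclust() with average linkage serves as a stand-in for AGNES, the toy data are hypothetical, and none of this is the code used by AgglomerativeNestingClustering.

```r
# Minimal sketch with base R only; hclust() is a stand-in, not the AGNES
# implementation wrapped by AgglomerativeNestingClustering.
set.seed(42)
# Hypothetical toy data: two groups of 25 points in two dimensions
Data <- rbind(matrix(rnorm(50, mean = 0), ncol = 2),
              matrix(rnorm(50, mean = 8), ncol = 2))
ClusterNo <- 2

Distances <- dist(Data)                        # Euclidean distances between cases
Tree <- hclust(Distances, method = "average")  # agglomerative merging of cases
Cls <- cutree(Tree, k = ClusterNo)             # [1:n] vector with k unique labels
Dendrogram <- as.dendrogram(Tree)              # dendrogram of the hierarchy

table(Cls)                                     # number of cases per cluster
```

The labels in Cls are arbitrary; only the grouping they induce carries meaning.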
Michael Thrun
[Rousseeuw/Kaufman, 1990] Rousseeuw, P. J., & Kaufman, L.: Finding groups in data, Belgium, John Wiley & Sons Inc., ISBN: 0471735787, doi 10.1002/9780470316801, Online ISBN: 9780470316801, 1990.
[Struyf et al., 1996] Struyf, A., Hubert, M. and Rousseeuw, P. J.: Clustering in an Object-Oriented Environment, Journal of Statistical Software, Vol. 1, doi: 10.18637/jss.v001.i04, 1996.
[Struyf et al., 1997] Struyf, A., Hubert, M. and Rousseeuw, P.J.: Integrating Robust Clustering Techniques in S-PLUS, Computational Statistics and Data Analysis, Vol. 26, pp. 17–37, 1997.
data('Hepta')
CA=AgglomerativeNestingClustering(Hepta$Data,ClusterNo=7,PlotIt=FALSE)
## Not run: 
ClusterDendrogram(CA$Dendrogram,7,main='AGNES clustering')
print(CA$Object)
plot(CA$Object)
## End(Not run)
Affinity propagation clustering published by [Frey/Dueck, 2007] and implemented by [Bodenhofer et al., 2011].
APclustering(DataOrDistances, InputPreference=NA,ExemplarPreferences=NA, DistanceMethod="euclidean", Seed=7568,PlotIt=FALSE,Data,...)
DataOrDistances |
[1:n,1:d] with: if d=n and symmetric then a distance matrix is assumed, otherwise: [1:n,1:d] matrix of dataset to be clustered. It consists of n cases of d-dimensional data points. Every case has d attributes, variables or features. In the latter case the Euclidean distances will be calculated. |
InputPreference |
Default parameter set, see apcluster |
ExemplarPreferences |
Default parameter set, see apcluster |
DistanceMethod |
DistanceMethod as in |
Seed |
Set as an integer value to have reproducible results, see apcluster |
PlotIt |
Default: FALSE, If TRUE and dataset of [1:n,1:d] dimensions then a plot of the first three dimensions of the dataset with colored three-dimensional data points defined by the clustering stored in |
Data |
[1:n,1:d] data matrix in the case that |
... |
Further arguments to be set for the clustering algorithm, if not set, default arguments are used. |
A distance matrix D is converted to a similarity matrix S with S=-(D^2).
If a data matrix is used, then Euclidean similarities are calculated by similarities and a specified distance method.
The AP algorithm decides the number of clusters k.
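The conversion from distances to similarities can be checked on a tiny example. This is a sketch in base R only; the three points are hypothetical and the code does not call the apcluster package.

```r
# Sketch of the conversion S = -(D^2) using base R only.
# Three hypothetical points on a line through the origin.
Data <- matrix(c(0, 0,
                 3, 4,
                 6, 8), ncol = 2, byrow = TRUE)

D <- as.matrix(dist(Data, method = "euclidean"))  # symmetric [1:n,1:n] distances
S <- -(D^2)                                       # negative squared distances

S  # larger (less negative) entries mean more similar points
```

Because the squared distance is negated, the most similar pairs have values closest to zero.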
List of
Cls |
[1:n] numerical vector with n numbers defining the classification as the main output of the clustering algorithm. It has k unique numbers representing the arbitrary labels of the clustering. |
Object |
Object defined by clustering algorithm as the other output of this algorithm |
Michael Thrun
[Frey/Dueck, 2007] Frey, B. J., & Dueck, D.: Clustering by passing messages between data points, Science, Vol. 315(5814), pp. 972-976, <doi:10.1126/science.1136800>, 2007.
[Bodenhofer et al., 2011] Bodenhofer, U., Kothmeier, A., & Hochreiter, S.: APCluster: an R package for affinity propagation clustering, Bioinformatics, Vol. 27(17), pp. 2463-2464, 2011.
Further details in http://www.bioinf.jku.at/software/apcluster/
apcluster
data('Hepta')
res=APclustering(Hepta$Data, PlotIt = FALSE)
Two nested spheres with different variances that are not linearly separable. A detailed description of the dataset and its clustering challenge is provided in [Thrun/Ultsch, 2020].
data("Atom")
Size 800, Dimensions 3, stored in Atom$Data
Classes 2, stored in Atom$Cls
[Ultsch, 2004] Ultsch, A.: Strategies for an artificial life system to cluster high dimensional data, Abstracting and Synthesizing the Principles of Living Systems, GWAL-6, U. Brüggemann, H. Schaub, and F. Detje, Eds, pp. 128-137. 2004.
[Thrun/Ultsch, 2020] Thrun, M. C., & Ultsch, A.: Clustering Benchmark Datasets Exploiting the Fundamental Clustering Problems, Data in Brief, Vol. 30(C), pp. 105501, doi:10.1016/j.dib.2020.105501, 2020.
data(Atom)
str(Atom)
Projection-based clustering AutomaticProjectionBasedClustering projects the data (nonlinear) into two dimensions and tries only to preserve relevant neighborhoods prior to clustering. The cluster analysis itself includes the high-dimensional distances in the clustering process. Performs non-interactive projection-based clustering based on non-linear projection methods [Thrun/Ultsch, 2017], [Thrun/Ultsch, 2020a].
AutomaticProjectionBasedClustering(DataOrDistances,ClusterNo,Type="NerV", StructureType = TRUE,PlotIt=FALSE,PlotTree=FALSE,PlotMap=FALSE,...)
DataOrDistances |
Either a nonsymmetric [1:n,1:d] numerical matrix of a dataset to be clustered, consisting of n cases of d-dimensional data points where every case has d attributes, variables or features, or a symmetric [1:n,1:n] distance matrix, e.g. |
ClusterNo |
A number k which defines k different clusters to be built by the algorithm. |
Type |
Type of Projection method, either
|
StructureType |
Either compact (TRUE) or connected (FALSE), see discussion in [Thrun, 2018] |
PlotIt |
Default: FALSE, if TRUE plots the first three dimensions of the dataset with colored three-dimensional data points defined by the clustering stored in |
PlotTree |
TRUE: Plots the dendrogram, FALSE: no plot |
PlotMap |
Plots the topographic map [Thrun et al., 2016]. |
... |
Further arguments to be set for the clustering algorithm, if not set, default arguments are used. |
The first idea of using non-PCA projections for clustering was published by [Bock, 1987] as a definition. However, to the knowledge of the author, it was not applied to any data. The coexistence of projection and clustering was introduced in [Thrun/Ultsch, 2017].
Projection-based clustering is based on a nonlinear projection of high-dimensional data into a two-dimensional space [Thrun/Ultsch, 2020b]. Typical projection methods like t-distributed stochastic neighbor embedding (t-SNE) [Van der Maaten/Hinton, 2008] or the neighbor retrieval visualizer (NerV) [Venna et al., 2010] are used to project data explicitly into two dimensions, disregarding the subspaces of higher dimension than two and preserving only relevant neighborhoods in high-dimensional data. In the next step, the Delaunay graph [Delaunay, 1934] between the projected points is calculated, and each edge between two projected points is weighted with the high-dimensional distance between the corresponding high-dimensional data points. Thereafter, the shortest path between every pair of points is computed using the Dijkstra algorithm [Dijkstra, 1959]. The shortest paths are then used in the clustering process, which involves two choices depending on the structure type in the high-dimensional data [Thrun/Ultsch, 2020b]. This Boolean choice can be decided by looking at the topographic map of high-dimensional structures [Thrun/Ultsch, 2020a]. In a benchmarking of 34 comparable clustering methods, projection-based clustering was the only algorithm that was always able to find the high-dimensional distance or density-based structure of the dataset [Thrun/Ultsch, 2020b].
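The graph-distance idea of this paragraph can be sketched in base R, although only with several stand-ins that are assumptions of this example and not what the package does: classical MDS (cmdscale) replaces NerV or t-SNE, a k-nearest-neighbor graph in the projection replaces the Delaunay graph, Floyd-Warshall replaces repeated runs of Dijkstra, and single linkage replaces the actual clustering step. The toy data are hypothetical.

```r
# Simplified sketch of clustering on graph shortest paths (base R only).
# All building blocks below are stand-ins, see the text above.
set.seed(1)
n_per_group <- 20
# Hypothetical data: two well-separated groups in five dimensions
X <- rbind(matrix(rnorm(n_per_group * 5), ncol = 5),
           matrix(rnorm(n_per_group * 5, mean = 10), ncol = 5))
n <- nrow(X)
D_high <- as.matrix(dist(X))        # high-dimensional distances

P <- cmdscale(D_high, k = 2)        # stand-in projection into two dimensions
D_low <- as.matrix(dist(P))         # distances between projected points

# Stand-in neighborhood graph: connect each projected point to its k nearest
# projected neighbors, but weight the edge with the HIGH-dimensional distance.
k <- 10
W <- matrix(Inf, n, n)
diag(W) <- 0
for (i in seq_len(n)) {
  nb <- order(D_low[i, ])[2:(k + 1)]  # position 1 is the point itself
  W[i, nb] <- D_high[i, nb]
  W[nb, i] <- D_high[nb, i]
}

# All-pairs shortest paths with Floyd-Warshall
SP <- W
for (m in seq_len(n)) {
  SP <- pmin(SP, outer(SP[, m], SP[m, ], "+"))
}

# Unreachable pairs get a large finite distance so that hclust() accepts them
SP[!is.finite(SP)] <- 2 * max(SP[is.finite(SP)])

Cls <- cutree(hclust(as.dist(SP), method = "single"), k = 2)
table(Cls)
```

The point of the sketch is the weighting: neighborhoods are taken from the projection, while the path lengths are measured with the high-dimensional distances.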
It should be noted that it is preferable to use a visualization for the Generalized U-Matrix like the topographic map plotTopographicMap of [Thrun et al., 2016] to evaluate the choice of the Boolean parameter StructureType and the clustering, to improve it, or to set the number of clusters appropriately. A comparison with 32 clustering algorithms showed that PBC is always able to find the correct cluster structure while the best of the 32 clustering algorithms varies depending on the dataset [Thrun/Ultsch, 2020].
The first systematic comparison to other DR clustering methods like projection-pursuit methods ProjectionPursuitClustering, subspace clustering methods SubspaceClustering, and CA-based clustering methods can be found in [Thrun/Ultsch, 2020a]. For PCA-based clustering methods please see TandemClustering.
List of
Cls |
[1:n] numerical vector with n numbers defining the classification as the main output of the clustering algorithm. It has k unique numbers representing the arbitrary labels of the clustering. Points which cannot be assigned to a cluster will be reported with 0. |
Object |
Object defined by clustering algorithm as the other output of this algorithm |
Michael Thrun
[Bock, 1987] Bock, H.: On the interface between cluster analysis, principal component analysis, and multidimensional scaling, Multivariate statistical modeling and data analysis, (pp. 17-34), Springer, 1987.
[Thrun/Ultsch, 2017] Thrun, M. C., & Ultsch, A.: Projection based Clustering, Proc. International Federation of Classification Societies (IFCS), pp. 250-251, Tokai University, Japanese Classification Society (JCS), Tokyo, Japan August 7-10, 2017.
[Thrun/Ultsch, 2020a] Thrun, M. C., & Ultsch, A.: Using Projection based Clustering to Find Distance and Density based Clusters in High-Dimensional Data, Journal of Classification, in press, doi 10.1007/s00357-020-09373-2, 2020.
[Thrun et al., 2016] Thrun, M. C., Lerch, F., Loetsch, J., & Ultsch, A.: Visualization and 3D Printing of Multivariate Data of Biomarkers, in Skala, V. (Ed.), International Conference in Central Europe on Computer Graphics, Visualization and Computer Vision (WSCG), Vol. 24, pp. 7-16, Plzen, http://wscg.zcu.cz/wscg2016/short/A43-full.pdf, 2016.
[McInnes et al., 2018] McInnes, L., Healy, J., & Melville, J.: Umap: Uniform manifold approximation and projection for dimension reduction, arXiv preprint arXiv:1802.03426, 2018.
[Demartines/Herault, 1995] Demartines, P., & Herault, J.: CCA: "Curvilinear component analysis", Proc. 15 Colloque sur le traitement du signal et des images, Vol. 199, GRETSI, Groupe d Etudes du Traitement du Signal et des Images, France 18-21 September, 1995.
[Sammon, 1969] Sammon, J. W.: A nonlinear mapping for data structure analysis, IEEE Transactions on computers, Vol. 18(5), pp. 401-409. doi:10.1109/t-c.1969.222678, 1969.
[Thrun/Ultsch, 2020b] Thrun, M. C., & Ultsch, A.: Swarm Intelligence for Self-Organized Clustering, Artificial Intelligence, in press, doi:10.1016/j.artint.2020.103237, 2020.
[Torgerson, 1952] Torgerson, W. S.: Multidimensional scaling: I. Theory and method, Psychometrika, Vol. 17(4), pp. 401-419. 1952.
[Venna et al., 2010] Venna, J., Peltonen, J., Nybo, K., Aidos, H., & Kaski, S.: Information retrieval perspective to nonlinear dimensionality reduction for data visualization, The Journal of Machine Learning Research, Vol. 11, pp. 451-490. 2010.
[Van der Maaten/Hinton, 2008] Van der Maaten, L., & Hinton, G.: Visualizing Data using t-SNE, Journal of Machine Learning Research, Vol. 9(11), pp. 2579-2605. 2008.
data('Hepta')
out=AutomaticProjectionBasedClustering(Hepta$Data,ClusterNo=7,PlotIt=FALSE)
Two chains of rings. Detailed description of dataset and its clustering challenge is provided in [Thrun/Ultsch, 2020].
data("Chainlink")
Size 1000, Dimensions 3, stored in Chainlink$Data
Classes 2, stored in Chainlink$Cls
[Ultsch et al., 1994] Ultsch, A., Guimaraes, G., Korus, D., & Li, H.: Knowledge extraction from artificial neural networks and applications, Parallele Datenverarbeitung mit dem Transputer, (pp. 148-162), Springer, 1994.
[Ultsch, 1995] Ultsch, A.: Self organizing neural networks perform different from statistical k-means clustering, Proc. Society for Information and Classification (GFKL), Vol. 1995, Basel 8th-10th March 1995.
[Thrun/Ultsch, 2020] Thrun, M. C., & Ultsch, A.: Clustering Benchmark Datasets Exploiting the Fundamental Clustering Problems, Data in Brief, Vol. 30(C), pp. 105501, doi:10.1016/j.dib.2020.105501, 2020.
data(Chainlink)
str(Chainlink)
Clusterability mirrored-density plot. Clusterability aims to quantify the degree of cluster structures [Adolfsson et al., 2019]. A dataset has a high probability to possess cluster structures if the first component of the PCA projection is multimodal [Adolfsson et al., 2019]. As the dip test is less exact than the MDplot [Thrun et al., 2020], p-values above 0.05 can be given for MDplots which are clearly multimodal.
An alternative investigation of clusterability can be performed by inspecting the topographic map of the Generalized U-Matrix for a specific projection method using the ProjectionBasedClustering and GeneralizedUmatrix packages on CRAN, see [Thrun/Ultsch, 2021] for details.
ClusterabilityMDplot(DataOrDistance,Method, na.rm=FALSE,PlotIt=TRUE,...)
DataOrDistance |
Either a dataset[1:n,1:d] of n cases and d features or a symmetric distance matrix [1:d,1:d] or multiple data sets or distances in a list |
Method |
"none" performs no dimension reduction. "pca" uses the scores from the first principal component. "distance" computes pairwise distances (using distance_metric as the metric). |
na.rm |
Statistical testing does not work with missing values. If TRUE, missing values are imputed with averages |
PlotIt |
TRUE: print plot, otherwise do not plot directly, instead use |
... |
Further arguments for function |
Uses the method of [Adolfsson et al., 2019] specified as PCA plus dip test (PCA dip) by default, without scaling or standardization of the data, because this step should never be done automatically. In [Thrun, 2020] the standardization and scaling did not improve the results.
If the list is named, then the names of the list will be used and the MDplots will be re-ordered according to multimodality in the plot; otherwise only the p-values of [Adolfsson et al., 2019] will be the names and the ordering of the MDplots is the same as in the list.
Beware: as shown below, this test fails for the almost touching clusters of Tetra and is difficult to interpret on WingNut (but when overlaid with a robustly estimated unimodal Gaussian distribution, it can be interpreted as multimodal). However, it does not fail for chaining data, contrary to the claim in [Adolfsson et al., 2019].
Based on [Thrun, 2020], the author of this function disagrees with [Adolfsson et al., 2019] as to which clusterability method should be preferred, because the approach "distance" is not preferable for density-based cluster structures.
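The dimension-reduction step behind the "pca" method can be sketched in base R. This is an illustration under assumptions, not the code of ClusterabilityMDplot: the toy data are hypothetical, prcomp() is used without scaling, and the dip test itself is not run because it needs an additional package.

```r
# Sketch of the reduction step only: scores of the first principal component.
set.seed(7)
# Hypothetical data: two groups of 50 cases in three dimensions
Data <- rbind(matrix(rnorm(50 * 3), ncol = 3),
              matrix(rnorm(50 * 3, mean = 6), ncol = 3))

# No scaling or standardization of the features; prcomp() only centers,
# which shifts the scores but does not change the shape of their distribution.
pca <- prcomp(Data, center = TRUE, scale. = FALSE)
pc1 <- pca$x[, 1]          # [1:n] scores on the first principal component

# A kernel density estimate gives a first visual impression of multimodality.
dens <- density(pc1)
# plot(dens)  # uncomment to inspect the estimated density
```

A statistical test of unimodality or the MD-plot would then be applied to pc1 rather than to the full data matrix.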
List of
Handle |
GGobject, plotter handle of ggplot2 |
Pvalue |
One or more p-values of dip test depending on |
"none" seems to call dip.test in clusterabilitytest with high-dimensional data. In that case dip.test just vectorizes the matrix of the data which does not make any sense. Since this could be a bug, the "none" option should not be used.
Imputation does not work for distance matrices. Imputation is still experimental. It is advised to impute missing values before using this function.
Michael Thrun
[Adolfsson et al., 2019] Adolfsson, A., Ackerman, M., & Brownstein, N. C.: To cluster, or not to cluster: An analysis of clusterability methods, Pattern Recognition, Vol. 88, pp. 13-26. 2019.
[Thrun et al., 2020] Thrun, M. C., Gehlert, T. & Ultsch, A.: Analyzing the Fine Structure of Distributions, PLoS ONE, Vol. 15(10), pp. 1-66, doi:10.1371/journal.pone.0238835, 2020.
[Thrun/Ultsch, 2021] Thrun, M. C., and Ultsch, A.: Swarm Intelligence for Self-Organized Clustering, Artificial Intelligence, Vol. 290, pp. 103237, doi:10.1016/j.artint.2020.103237, 2021.
[Thrun, 2020] Thrun, M. C.: Improving the Sensitivity of Statistical Testing for Clusterability with Mirrored-Density Plot, in Archambault, D., Nabney, I. & Peltonen, J. (eds.), Machine Learning Methods in Visualisation for Big Data, The Eurographics Association, https://diglib.eg.org:443/handle/10.2312/mlvis20201102, Norrkoping, Sweden, May, 2020.
##one dataset
data(Hepta)
ClusterabilityMDplot(Hepta$Data)
##multiple datasets
data(Atom)
data(Chainlink)
data(Lsun3D)
data(GolfBall)
data(EngyTime)
data(Target)
data(Tetra)
data(WingNut)
data(TwoDiamonds)
DataV = list(
  Atom = Atom$Data,
  Chainlink = Chainlink$Data,
  Hepta = Hepta$Data,
  Lsun3D = Lsun3D$Data,
  GolfBall = GolfBall$Data,
  EngyTime = EngyTime$Data,
  Target = Target$Data,
  Tetra = Tetra$Data,
  WingNut = WingNut$Data,
  TwoDiamonds = TwoDiamonds$Data
)
ClusterabilityMDplot(DataV)
Applies a given function to each dimension d of data separately for each cluster
ClusterApply(DataOrDistances,FUN,Cls,Simple=FALSE,...)
DataOrDistances |
[1:n,1:d] with: if d=n and symmetric then a distance matrix is assumed, otherwise: [1:n,1:d] matrix defining the dataset that consists of |
FUN |
Function to be applied to each cluster of data and each column of data |
Cls |
[1:n] numerical vector with n numbers defining the classification as the main output of the clustering algorithm. It has k unique numbers representing the arbitrary labels of the clustering. |
Simple |
Boolean, if TRUE, simplifies output |
... |
Additional parameters to be passed on to FUN |
Applies a given function to each feature of each cluster of data using the clustering stored in Cls, which holds the cluster identifiers for all rows in the data. If Cls is missing, all data are in the first cluster. The main output is FUNPerCluster[i], which is the result of FUN for the data points in the cluster UniqueClusters[i], named with the name of the function used.
In case of a distance matrix an automatic classical multidimensional scaling transformation of distances to data is computed. The number of dimensions is selected by the minimal stress w.r.t. the possible output dimensions of cmdscale.
If FUN has no function name, then ResultPerCluster is given back.
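The per-cluster, per-feature application described above can be sketched in a few lines of base R. The helper cluster_apply_sketch is hypothetical and is not the package's ClusterApply; it assumes that FUN returns a single number per column and it skips the distance-matrix case.

```r
# Hypothetical helper illustrating the idea; not the FCPS implementation.
cluster_apply_sketch <- function(Data, FUN, Cls, ...) {
  if (missing(Cls)) Cls <- rep(1, nrow(Data))    # all data in the first cluster
  UniqueClusters <- sort(unique(Cls))
  # One row per cluster: FUN applied to every column of that cluster's cases
  PerCluster <- do.call(rbind, lapply(UniqueClusters, function(k) {
    apply(Data[Cls == k, , drop = FALSE], 2, FUN, ...)
  }))
  rownames(PerCluster) <- UniqueClusters
  list(UniqueClusters = UniqueClusters, ResultPerCluster = PerCluster)
}

# Six cases with two features, three cases per cluster
Data <- matrix(c(1, 2, 3, 10, 20, 30,
                 4, 5, 6, 40, 50, 60), ncol = 2)
Cls <- c(1, 1, 1, 2, 2, 2)

out <- cluster_apply_sketch(Data, mean, Cls)
out$ResultPerCluster   # [1:k,1:d] matrix: rows are clusters, columns features
```

Row i of the result belongs to UniqueClusters[i], which mirrors the correspondence between FUNPerCluster[i] and UniqueClusters[i] stated above.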
if(Simple==FALSE) List with
UniqueClusters |
The unique clusters in Cls |
FUNPerCluster |
a matrix of [1:k,1:d] of d features and k clusters, the list element is named by the function |
if(Simple==TRUE)
a matrix of [1:k,1:d] of d features and k clusters
Felix Pape, Michael Thrun
##one dataset
data(Hepta)
Data=Hepta$Data
Cls=Hepta$Cls
#mean per cluster
ClusterApply(Data,mean,Cls)
#Simplified
ClusterApply(Data,mean,Cls,Simple=TRUE)
# Mean per cluster of MDS transformation
# Beware, this is not the same!
ClusterApply(as.matrix(dist(Data)),mean,Cls)
## Not run: 
Iris=datasets::iris
Distances=as.matrix(Iris[,1:4])
SomeFactors=Iris$Species
V=ClusterCreateClassification(SomeFactors)
Cls=V$Cls
V$ClusterNames
ClusterApply(Distances,mean,Cls)
## End(Not run)
#special case of identity
## Not run: 
suppressPackageStartupMessages(library('prabclus',quietly = TRUE))
data(tetragonula)
#Generated Specific Distance Matrix
ta <- alleleconvert(strmatrix=as.matrix(tetragonula[1:236,]))
tai <- alleleinit(allelematrix=ta,distance="none")
Distance=alleledist((unbuild.charmatrix(tai$charmatrix,236,13)),236,13)
MDStrans=ClusterApply(Distance,identity)$identityPerCluster
## End(Not run)
Adjusted Rand index for two clusterings that should be compared to each other. This index has an expected value of zero for independent clusterings and a maximum value of 1 (for identical clusterings).
ClusterARI(Cls1, Cls2,Fast=TRUE)
Cls1 |
1:n numerical vector of numbers defining the classification as the main output of the first clustering or trial for the n cases of data. It has k unique numbers representing the arbitrary labels of the clustering. |
Cls2 |
1:n numerical vector of numbers defining the classification as the main output of the second clustering algorithm trial for the n cases of data. It has p unique numbers representing the arbitrary labels of the clustering. |
Fast |
TRUE: uses the mclust package, which may not integrate all published insights about ARI; FALSE: uses the partitionComparison package |
"The expected value of the Rand Index of two random partitions does not take a constant value (e.g. zero). Thus, Hubert and Arabie proposed an adjustment [Hubert & Arabie] which assumes a generalized hypergeometric distribution as null hypothesis: the two clusterings are drawn randomly with a fixed number of clusters and a fixed number of elements in each cluster (the number of clusters in the two clusterings need not be the same). Then the adjusted Rand Index is the (normalized) difference of the Rand Index and its expected value under the null hypothesis. The significance of this measure has to be put into question because of the strong assumptions it makes on the distribution. Meila [Meila, 2003] notes, that some pairs of clusterings may result in negative index values" [Wagner and Wagner, 2007].
Value of the adjusted Rand index
The equation of the adjusted Rand index ignores the labels themselves and measures only the agreement. Hence, one can compare clustering solutions with k!=p unique numbers that represent the labels; see the second example.
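The adjustment of Hubert and Arabie can be written directly from the contingency table of the two label vectors. The function adjusted_rand_sketch below is a hypothetical base R illustration of that formula; it is not the code of ClusterARI, mclust or partitionComparison.

```r
# Hypothetical helper: adjusted Rand index from the contingency table.
adjusted_rand_sketch <- function(Cls1, Cls2) {
  stopifnot(length(Cls1) == length(Cls2))
  tab <- table(Cls1, Cls2)                    # k x p contingency table
  n <- sum(tab)
  sum_cells <- sum(choose(tab, 2))            # pairs placed together in both
  sum_rows <- sum(choose(rowSums(tab), 2))    # pairs placed together in Cls1
  sum_cols <- sum(choose(colSums(tab), 2))    # pairs placed together in Cls2
  expected <- sum_rows * sum_cols / choose(n, 2)  # expectation under the null
  maximum <- (sum_rows + sum_cols) / 2
  (sum_cells - expected) / (maximum - expected)
}

# Same partition with different labels: the labels themselves are ignored
adjusted_rand_sketch(c(1, 1, 2, 2, 3, 3), c(7, 7, 9, 9, 8, 8))  # 1

# Two partitions that disagree systematically give a negative value
adjusted_rand_sketch(c(1, 1, 2, 2), c(1, 2, 1, 2))              # -0.5

# Different numbers of clusters (k = 2 versus p = 3) are allowed
adjusted_rand_sketch(c(1, 1, 1, 2, 2, 2), c(1, 1, 2, 3, 3, 3))
```

The sketch returns NaN when both clusterings consist of a single cluster, because the denominator is then zero.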
Michael Thrun
[Rand, 1971] Rand, W. M.: Objective criteria for the evaluation of clustering methods, Journal of the American Statistical Association, Vol. 66(336), pp. 846-850, 1971.
[Hubert & Arabie] Hubert, L. and Arabie, P.: Comparing partitions, Journal of Classification. Vol. 2 (1), pp. 193-218. doi:10.1007/BF01908075, 1985.
[Ball/Geyer-Schulz, 2018] Ball, F., & Geyer-Schulz, A.: Invariant Graph Partition Comparison Measures, Symmetry, Vol. 10(10), pp. 1-27, 2018.
[Meila, 2003] Meila, Marina: Comparing Clusterings. COLT 2003.
[Wagner and Wagner, 2007] Wagner, Silke; Wagner, Dorothea. Comparing clusterings: an overview. Karlsruhe: Universitaet Karlsruhe, Fakultaet für Informatik, 2007.
data(Hepta)
#compare to baseline
Cls2=kmeansClustering(Hepta$Data,7,Type = "Steinley")$Cls
ClusterARI(Hepta$Cls,Cls2)
#compare different solutions
Cls3=kmeansClustering(Hepta$Data,5)$Cls
ClusterARI(Cls3,Cls2)
Lsun3D and FCPS datasets were introduced in various publications for a specific fixed size. This function generalizes them for any sample size.
ClusterChallenge(Name,SampleSize, PlotIt=FALSE,PointSize=1,Plotter3D="rgl",...)
Name |
String, one of 'Atom', 'Chainlink', 'EngyTime', 'GolfBall', 'Hepta', 'Lsun3D', 'Target', 'Tetra', 'TwoDiamonds', 'WingNut' |
SampleSize |
Sample size, higher than 300, preferably above 500 |
PlotIt |
TRUE: Plots the challenge with |
PointSize |
If PlotIt=TRUE: see |
Plotter3D |
If PlotIt=TRUE: see |
... |
If PlotIt=TRUE: further arguments for |
A detailed description of the datasets can be found in [Thrun/Ultsch, 2020]. Sampling works by combining Pareto Density Estimation with rejection sampling.
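The rejection-sampling half of this procedure can be sketched in base R. The example below is an illustration under assumptions and not the sampler of ClusterChallenge: a known analytic density (a Beta(2,5) density) takes the place of the Pareto Density Estimation, the proposal is uniform, and rejection_sample is a hypothetical helper.

```r
# Hypothetical helper: rejection sampling from a one-dimensional density
# with a uniform proposal on [lower, upper].
rejection_sample <- function(n, density_fun, lower, upper, density_max) {
  accepted <- numeric(0)
  while (length(accepted) < n) {
    candidates <- runif(n, min = lower, max = upper)   # proposed points
    heights <- runif(n, min = 0, max = density_max)    # uniform heights
    # Keep a candidate if its random height falls below the target density
    accepted <- c(accepted, candidates[heights <= density_fun(candidates)])
  }
  accepted[seq_len(n)]
}

set.seed(123)
# Stand-in target density; its maximum is below 2.5 on [0, 1]
target <- function(x) dbeta(x, 2, 5)
x <- rejection_sample(500, target, lower = 0, upper = 1, density_max = 2.5)

length(x)  # any requested sample size can be drawn this way
```

Since accepted points follow the target density regardless of how many are requested, the same density can be used to draw samples of any size.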
LIST, with
Name |
[1:SampleSize,1:d] data matrix |
Cls |
[1:SampleSize] numerical vector of classification |
Michael Thrun
[Thrun/Ultsch, 2020] Thrun, M. C., & Ultsch, A.: Clustering Benchmark Datasets Exploiting the Fundamental Clustering Problems, Data in Brief, Vol. 30(C), pp. 105501, doi:10.1016/j.dib.2020.105501, 2020.
## Not run:
ClusterChallenge("Chainlink",2000,PlotIt=TRUE)
## End(Not run)
Calculates statistics for a clustering in each group of the data points.
ClusterCount(Cls,Ordered=TRUE,NonFinite=9999)
Cls |
1:n numerical vector of numbers defining the classification as the main output of the clustering algorithm for the n cases of data. It has k unique numbers representing the arbitrary labels of the clustering. |
Ordered |
Optional, boolean, if TRUE: the output is ordered increasingly by cluster labels in |
NonFinite |
Optional, if non-finite values are given in the numerical vector, they are set to the scalar value defined here |
With Ordered=FALSE, the ordering of the output is defined by the first occurrence of every cluster label in Cls.
The function also accepts non-numerical vectors. In this case, Cls is cast via as.character(), a warning is issued, and the statistics are still computed.
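The counting and ordering behavior described above can be illustrated with a small base-R sketch. This is not the package's implementation: the helper name count_clusters is hypothetical, and reporting the percentages on a 0 to 100 scale is an assumption.

```r
# Hypothetical helper: per-cluster statistics for a label vector (base R only).
count_clusters <- function(cls, ordered = TRUE) {
  labels <- unique(cls)                  # labels in order of first occurrence
  if (ordered) labels <- sort(labels)    # increasing label order
  counts <- vapply(labels, function(l) sum(cls == l), integer(1))
  names(counts) <- labels
  list(UniqueClusters     = labels,
       CountPerCluster    = counts,
       NumberOfClusters   = length(labels),
       ClusterPercentages = 100 * counts / length(cls))  # assumed 0-100 scale
}

cls <- c(3, 1, 1, 2, 3, 3)
res <- count_clusters(cls, ordered = FALSE)
res$UniqueClusters   # labels in order of first occurrence: 3 1 2
res$CountPerCluster  # counts 3, 2, 1 named by label
```

With ordered = TRUE the same call would report the labels as 1 2 3 instead.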
UniqueClusters |
[1:k] numerical vector of the k unique clusters in Cls |
CountPerCluster |
Named vector [1:k] with the number of data points in the corresponding unique clusters. Names are the |
NumberOfClusters |
The number of clusters k |
ClusterPercentages |
[1:k] numerical vector of the percentages of datapoints belonging to a cluster for each cluster |
Michael Thrun
data('Hepta')
Cls=Hepta$Cls
ClusterCount(Cls)
Creates a Cls from an arbitrary list of objects
ClusterCreateClassification(Objects,Decreasing)
Objects |
Listed objects, for example factor |
Decreasing |
Boolean that can be missing. If given, sorts |
ClusterNames can be sorted before the classification stored in Cls is created. See example.
LIST, with
Cls |
[1:n] numerical vector with n numbers defining the labels of the classification. It has 1 to k unique numbers representing the arbitrary labels of the classification. |
ClusterNames |
ClusterNames defines which name belongs to which unique number |
Michael Thrun
## Not run:
Iris=datasets::iris
SomeFactors=Iris$Species
V=ClusterCreateClassification(SomeFactors)
Cls=V$Cls
V$ClusterNames
table(Cls,SomeFactors)
#Increasing alphabetical order
V=ClusterCreateClassification(SomeFactors,Decreasing=FALSE)
Cls=V$Cls
V$ClusterNames
table(Cls,SomeFactors)
## End(Not run)
Internal (i.e. without prior classification) cluster quality measure called Davies Bouldin index for a given clustering published in [Davies/Bouldin, 1979].
ClusterDaviesBouldinIndex(Cls, Data,...)
Cls |
[1:n] numerical vector of numbers defining the classification as the main output of the clustering algorithm for the n cases of data. It has k unique numbers representing the arbitrary labels of the clustering. |
Data |
matrix, [1:n,1:d] dataset of n cases and d variables |
... |
Further arguments passed on to the |
Wrapper for index.DB. The Davies-Bouldin index is defined in [Davies/Bouldin, 1979]. The best clustering scheme minimizes the Davies-Bouldin index because it is defined as a function of the ratio of the within-cluster scatter to the between-cluster separation [Davies/Bouldin, 1979].
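To make the ratio of within-cluster scatter to between-cluster separation concrete, here is a minimal base-R sketch of one common form of the index (mean Euclidean distance to the centroid as scatter, centroid distance as separation). The function name davies_bouldin is hypothetical, and the wrapped index.DB may use different defaults.

```r
# Hypothetical helper: Davies-Bouldin index with Euclidean centroids (base R only).
davies_bouldin <- function(data, cls) {
  labels <- sort(unique(cls))
  k <- length(labels)
  # One centroid per cluster, stored as rows
  centroids <- do.call(rbind, lapply(labels, function(l) {
    colMeans(data[cls == l, , drop = FALSE])
  }))
  # Within-cluster scatter: mean distance of the members to their centroid
  scatter <- sapply(seq_len(k), function(i) {
    members <- data[cls == labels[i], , drop = FALSE]
    mean(sqrt(rowSums(sweep(members, 2, centroids[i, ])^2)))
  })
  # Between-cluster separation: pairwise distances of the centroids
  separation <- as.matrix(dist(centroids))
  ratio <- outer(scatter, scatter, "+") / separation
  diag(ratio) <- -Inf                     # a cluster is not compared with itself
  mean(apply(ratio, 1, max))              # worst-case ratio, averaged over clusters
}

toy <- rbind(c(0, 0), c(0, 2), c(10, 0), c(10, 2))
db <- davies_bouldin(toy, c(1, 1, 2, 2))
db  # scatter 1 per cluster, centroid distance 10, so (1 + 1) / 10 = 0.2
```

Smaller values indicate compact clusters that lie far apart, which is why the index is minimized.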
List of
DaviesBouldinIndex |
scalar, Davies-Bouldin index |
Object |
further information stored in |
Michael Thrun
[Davies/Bouldin, 1979] Davies, D. L., & Bouldin, D. W.: A cluster separation measure, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 1(2), pp. 224-227. doi 10.1109/TPAMI.1979.4766909, 1979.
data("Hepta")
Cls=kmeansClustering(Hepta$Data,ClusterNo = 7,Type="Hartigan")$Cls
ClusterDaviesBouldinIndex(Cls,Hepta$Data)[1]
data("Hepta")
ClsWellSeperated=kmeansClustering(Hepta$Data,ClusterNo = 7,Type="Steinley")$Cls
ClusterDaviesBouldinIndex(ClsWellSeperated,Hepta$Data)[1]
Presents a dendrogram of a given tree using a colorsequence for the branches defined from the highest cluster size to the lowest cluster size.
ClusterDendrogram(TreeOrDendrogram, ClusterNo, Colorsequence,main='Name of Algorithm')
TreeOrDendrogram |
Either an object of class hclust defining the tree (third list element of the hierarchical cluster algorithms of this package) or an object of class dendrogram (second list element of the hierarchical cluster algorithms). |
ClusterNo |
k number of clusters for cutree. |
Colorsequence |
[1:k] character vector of colors, per default the color sequence defined in the DataVisualizations package is used |
main |
Title of plot |
Requires the package dendextend to work correctly.
In mode invisible:
[1:n] numerical vector defining the clustering of k clusters; this classification is the main output of the algorithm.
Michael Thrun
data(Lsun3D)
listofh=HierarchicalClustering(Lsun3D$Data,0,'SingleL')
Tree=listofh$Object
#given colors are per default:
#"magenta" "yellow" "black" "red"
ClusterDendrogram(Tree, 4,main='Single Linkage Clustering')
listofh=HierarchicalClustering(Lsun3D$Data,4)
ClusterCount(listofh$Cls)
#c1 is magenta, c2 is red, c3 is yellow, c4 is black
#because the order of the cluster sizes is
#c1,c3,c4,c2
Computes intra-cluster distances, i.e., the distances between the data points within each cluster.
ClusterDistances(FullDistanceMatrix, Cls, Names, PlotIt = FALSE)
ClusterIntraDistances(FullDistanceMatrix, Cls, Names, PlotIt = FALSE)
FullDistanceMatrix |
[1:n,1:n] symmetric distance matrix |
Cls |
[1:n] numerical vector of k classes |
Names |
Optional [1:k] character vector naming k classes |
PlotIt |
Optional, Plots if TRUE |
Cluster distances are returned as a matrix with one column per cluster plus the vector of the full distance matrix without the diagonal elements and the upper half of the symmetric matrix. Details and definitions can be found in [Thrun, 2021].
Matrix [1:m,1:(k+1)] of k clusters, each column consists of the distances in a cluster, filled up with NaN at the end to be of the same length as the vector of the upper triangle of the complete distance matrix.
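The NaN-padded output layout can be sketched in base R as follows. This is an illustration only: the helper name intra_cluster_distances, the column names and the position of the full-distance column (first) are assumptions, not the package's documented layout.

```r
# Hypothetical helper: intra-cluster distances as NaN-padded columns (base R only).
intra_cluster_distances <- function(dist_matrix, cls) {
  lower <- function(m) m[lower.tri(m)]    # drops the diagonal and the upper half
  full <- lower(dist_matrix)
  labels <- sort(unique(cls))
  out <- matrix(NaN, nrow = length(full), ncol = length(labels) + 1)
  out[, 1] <- full                        # assumed position of the full vector
  for (j in seq_along(labels)) {
    idx <- which(cls == labels[j])
    d <- lower(dist_matrix[idx, idx, drop = FALSE])
    out[seq_along(d), j + 1] <- d         # remaining rows stay NaN
  }
  colnames(out) <- c("Full", paste0("C", labels))
  out
}

x <- c(0, 1, 10, 12)                      # four points on a line
res <- intra_cluster_distances(as.matrix(dist(x)), c(1, 1, 2, 2))
res[1, ]  # first row: one full distance, then the single distance inside C1 and C2
```

Each cluster with n_j members contributes n_j*(n_j-1)/2 distances, so every cluster column is shorter than the full column and is padded.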
Michael Thrun
[Thrun, 2021] Thrun, M. C.: The Exploitation of Distance Distributions for Clustering, International Journal of Computational Intelligence and Applications, Vol. 20(3), pp. 2150016, doi:10.1142/S1469026821500164, 2021.
data(Hepta)
Distance=as.matrix(dist(Hepta$Data))
interdists=ClusterDistances(Distance,Hepta$Cls)
Internal (i.e. without prior classification) cluster quality measure called Dunn index for a given clustering published in [Dunn, 1974].
ClusterDunnIndex(Cls,DataOrDistances, DistanceMethod="euclidean",Silent=TRUE,Force=FALSE,...)
Cls |
[1:n] numerical vector of numbers defining the classification as the main output of the clustering algorithm for the n cases of data. It has k unique numbers representing the arbitrary labels of the clustering. |
DataOrDistances |
matrix, either a symmetric [1:n,1:n] matrix of dissimilarities or, if the matrix is not symmetric, a [1:n,1:d] dataset of n cases and d variables for which the Euclidean distances are calculated |
DistanceMethod |
Optional, one of 39 distance methods of |
Silent |
TRUE: Warnings are shown |
Force |
TRUE: force computing in case of numerical instability |
... |
Further arguments passed on to the |
The Dunn index is defined as Dunn=min(InterDist)/max(IntraDist). Well-separated clusters usually have a Dunn index above 1; for details please see [Dunn, 1974].
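The ratio min(InterDist)/max(IntraDist) can be sketched in base R from a distance matrix. The helper name dunn_index is hypothetical. Taking the largest pairwise distance within a cluster as its intra-cluster distance is an assumption, so the per-cluster vectors returned by the package may be defined differently.

```r
# Hypothetical helper: Dunn index from a full distance matrix (base R only).
dunn_index <- function(dist_matrix, cls) {
  labels <- sort(unique(cls))
  # Assumed intra-cluster distance: largest distance between two members
  intra <- sapply(labels, function(l) max(dist_matrix[cls == l, cls == l]))
  # Inter-cluster distance: smallest distance from a member to any non-member
  inter <- sapply(labels, function(l) min(dist_matrix[cls == l, cls != l]))
  min(inter) / max(intra)
}

x <- c(0, 1, 10, 12)                        # four points on a line
d <- as.matrix(dist(x))
dunn_index(d, c(1, 1, 2, 2))                # gap of 9, largest diameter 2: 4.5
dunn_index(d, c(1, 2, 1, 2))                # interleaved clusters: below 1
```

The first clustering separates the two groups and scores well above 1, while the interleaved one does not.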
List of
Dunn |
scalar, Dunn Index |
IntraDist |
[1:k] numerical vector of minimal intra cluster distances per given cluster |
InterDist |
[1:k] numerical vector of minimal inter cluster distances per given cluster |
Michael Thrun
[Dunn, 1974] Dunn, J. C.: Well-separated clusters and optimal fuzzy partitions, Journal of Cybernetics, Vol. 4(1), pp. 95-104. 1974.
data("Hepta")
Cls=kmeansClustering(Hepta$Data,ClusterNo = 7,Type="Hartigan")$Cls
ClusterDunnIndex(Cls,Hepta$Data)
data("Hepta")
ClsWellSeperated=kmeansClustering(Hepta$Data,ClusterNo = 7,Type="Steinley")$Cls
ClusterDunnIndex(ClsWellSeperated,Hepta$Data)
Weights clusters equally
ClusterEqualWeighting(Cls, Data, MinClusterSize)
Cls |
1:n numerical vector of numbers defining the classification as the main output of the clustering algorithm for the n cases of data. It has k unique numbers representing the arbitrary labels of the clustering. |
Data |
Optional, [1:n,1:d] matrix of dataset consisting of n cases of d-dimensional data points. Every case has d attributes, variables or features. |
MinClusterSize |
Optional, scalar defining the number of cases m that each cluster should have |
Balances clusters such that their sizes are the same by subsampling the larger clusters. If MinClusterSize is missing, the number of cases per cluster is set to the smallest cluster size. For cluster sizes smaller than MinClusterSize, sampling with replacement is turned on, i.e., upsampling. For cluster sizes equal to MinClusterSize, no sampling is performed.
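The three cases above (subsampling, upsampling with replacement, no sampling) can be sketched in base R. The helper name balance_clusters is hypothetical and the sketch only mirrors the described behavior, not the package's code.

```r
# Hypothetical helper: index vector that gives every cluster the same size (base R only).
balance_clusters <- function(cls, min_cluster_size = min(table(cls))) {
  idx <- unlist(lapply(sort(unique(cls)), function(l) {
    members <- which(cls == l)
    n <- length(members)
    if (n == min_cluster_size) return(members)   # equal size: no sampling
    # Larger clusters are subsampled, smaller ones are sampled with replacement
    members[sample.int(n, min_cluster_size, replace = n < min_cluster_size)]
  }))
  list(BalancedCls = cls[idx], BalancedInd = idx)
}

set.seed(1)
cls <- c(rep(1, 10), rep(2, 4), rep(3, 2))
res <- balance_clusters(cls, 4)
table(res$BalancedCls)   # every cluster now has 4 cases
```

Sampling positions with sample.int() rather than the member indices directly avoids the special behavior of sample() for a single-element input.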
List of
BalancedCls |
Vector of Cls such that all clusters have the same sizes specified by |
BalancedInd |
index such that BalancedCls = Cls[BalancedInd] |
BalancedData |
NULL if missing, otherwise, Data[BalancedInd,] |
Alfred Ultsch (matlab), reimplemented by Michael Thrun
data(Hepta)
ClusterEqualWeighting(Hepta$Cls,Hepta$Data,5)
ClusterAccuracy
ClusterAccuracy(PriorCls,CurrentCls,K=9)
PriorCls |
Ground truth,[1:n] numerical vector with n numbers defining the classification. It has k unique numbers representing the arbitrary labels of the clustering. |
CurrentCls |
Main output of the clustering, [1:n] numerical vector with n numbers defining the classification. It has k unique numbers representing the arbitrary labels of the clustering. |
K |
Maximal number of classes for computation. |
Here, accuracy is defined as the normalized sum over all true positive labeled data points of a clustering algorithm. The best of all permutations of labels with the highest accuracy is selected in every trial because algorithms arbitrarily define the labels [Thrun et al., 2018]. Beware that in contrast to ClusterMCC, the labels can be arbitrary. However, accuracy is only a valid quality measure if the clusters are balanced (of nearly equal size). Otherwise please use ClusterMCC.
In contrast to the F-measure, "Accuracy tends to be naturally unbiased, because it can be expressed in terms of a binomial distribution: A success in the underlying Bernoulli trial would be defined as sampling an example for which a classifier under consideration makes the right prediction. By definition, the success probability is identical to the accuracy of the classifier. The i.i.d. assumption implies that each example of the test set is sampled independently, so the expected fraction of correctly classified samples is identical to the probability of seeing a success above. Averaging over multiple folds is identical to increasing the number of repetitions of the Binomial trial. This does not affect the posterior distribution of accuracy if the test sets are of equal size, or if we weight each estimate by the size of each test set." [Forman/Scholz, 2010]
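The permutation step in the definition of accuracy can be sketched in base R by brute force. The helper names permutations and best_accuracy are hypothetical, and the exhaustive search is an illustration rather than the package's algorithm. Because the number of permutations grows factorially, a bound on the number of classes such as K is needed in practice.

```r
# Hypothetical helper: all permutations of a vector, returned as a list.
permutations <- function(v) {
  if (length(v) <= 1) return(list(v))
  out <- list()
  for (i in seq_along(v)) {
    for (p in permutations(v[-i])) out[[length(out) + 1]] <- c(v[i], p)
  }
  out
}

# Hypothetical helper: accuracy under the best relabeling of the clustering.
best_accuracy <- function(prior, current) {
  labels <- sort(unique(c(prior, current)))
  best <- 0
  for (perm in permutations(labels)) {
    relabeled <- perm[match(current, labels)]    # maps labels[i] to perm[i]
    best <- max(best, mean(relabeled == prior))  # fraction of matching cases
  }
  best
}

prior   <- c(1, 1, 1, 2, 2, 3)
current <- c(2, 2, 2, 3, 3, 1)     # same partition, different label names
best_accuracy(prior, current)      # 1, because the labels are arbitrary
```

A clustering that reproduces the prior partition under different label names therefore still reaches an accuracy of one.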
Single scalar of Accuracy between zero and one
Michael Thrun
[Thrun et al., 2018] Michael C. Thrun, Felix Pape, Alfred Ultsch: Benchmarking Cluster Analysis Methods in the Case of Distance and Density-based Structures Defined by a Prior Classification Using PDE-Optimized Violin Plots, ECDA, Potsdam, 2018
[Forman/Scholz, 2010] Forman, G., and Scholz, M.: Apples-to-apples in cross-validation studies: pitfalls in classifier performance measurement, ACM SIGKDD Explorations Newsletter, Vol. 12(1), pp. 49-57. 2010.
#Influence of random sets/ random starts on k-means
data('Hepta')
Cls=kmeansClustering(Hepta$Data,7,Type = "Hartigan",nstart=1)
table(Cls$Cls,Hepta$Cls)
ClusterAccuracy(Hepta$Cls,Cls$Cls)
data('Hepta')
Cls=kmeansClustering(Hepta$Data,7,Type = "Hartigan",nstart=100)
table(Cls$Cls,Hepta$Cls)
ClusterAccuracy(Hepta$Cls,Cls$Cls)
Computes inter-cluster distances which are the distance between each cluster and all other clusters
ClusterInterDistances(FullDistanceMatrix, Cls, Names,PlotIt=FALSE)
FullDistanceMatrix |
[1:n,1:n] symmetric distance matrix |
Cls |
[1:n] numerical vector of numbers defining the classification as the main output of the clustering algorithm for the n cases of data. It has k unique numbers representing the arbitrary labels of the clustering. |
Names |
Optional [1:k] character vector naming k classes |
PlotIt |
Optional, Plots if TRUE |
Cluster distances are returned as a matrix with one column per cluster plus the vector of the full distance matrix without the diagonal elements and the upper half of the symmetric matrix. Details and definitions can be found in [Thrun, 2021].
Matrix [1:m,1:(k+1)] of k clusters, each column consists of the distances between a cluster and all other clusters, filled up with NaN at the end to be of the same length as the vector of the upper triangle of the complete distance matrix.
Michael Thrun
[Thrun, 2021] Thrun, M. C.: The Exploitation of Distance Distributions for Clustering, International Journal of Computational Intelligence and Applications, Vol. 20(3), pp. 2150016, doi:10.1142/S1469026821500164, 2021.
ClusterDistances
data(Hepta)
Distance=as.matrix(dist(Hepta$Data))
interdists=ClusterInterDistances(Distance,Hepta$Cls)
Matthews correlation coefficient generalized to the multiclass case (a.k.a. R_K statistic).
ClusterMCC(PriorCls, CurrentCls,Force=TRUE)
PriorCls |
Ground truth,[1:n] numerical vector with n numbers defining the classification. It has k unique numbers representing the labels of the clustering. |
CurrentCls |
Main output of the clustering, [1:n] numerical vector with n numbers defining the classification. It has k unique numbers representing the labels of the clustering. |
Force |
Boolean, if is TRUE: forces code even if one or more than one of the k numbers given in |
Contrary to accuracy, the MCC is a balanced measure which can be used even if the classes are of very different sizes. When there are more than two labels, the MCC will no longer range between -1 and +1. Instead, the minimum value will be between -1 and 0 depending on the true distribution. The maximum value is always +1.
Beware that in contrast to ClusterAccuracy, the labels cannot be arbitrary. Instead, each label of PriorCls and CurrentCls has to be mapped to the same cluster of data points. Typically this has to be ensured manually.
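The following base-R sketch computes the multiclass MCC (R_K statistic) from a confusion matrix and shows why the labels must be aligned beforehand. The helper name multiclass_mcc is hypothetical. The formula is the standard R_K form and is assumed, not confirmed, to match what ClusterMCC computes internally.

```r
# Hypothetical helper: multiclass MCC (R_K) from two label vectors (base R only).
multiclass_mcc <- function(prior, current) {
  labels <- sort(unique(c(prior, current)))
  # Confusion matrix over the union of labels; absent clusters get zero counts
  cm <- table(factor(prior, levels = labels), factor(current, levels = labels))
  s <- sum(cm)                 # number of cases
  correct <- sum(diag(cm))     # cases with identical labels
  t_k <- rowSums(cm)           # cases per label in prior
  p_k <- colSums(cm)           # cases per label in current
  (correct * s - sum(t_k * p_k)) /
    sqrt((s^2 - sum(p_k^2)) * (s^2 - sum(t_k^2)))
}

prior <- c(1, 1, 2, 2, 3, 3)
multiclass_mcc(prior, prior)                  # identical labels: 1
multiclass_mcc(prior, c(2, 2, 3, 3, 1, 1))    # same partition, shifted labels: -0.5
```

The second call describes the same partition of the data, yet the score drops to -0.5 because the label names no longer correspond, which also illustrates a minimum between -1 and 0 for three classes.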
Single scalar of MCC in a range described in details.
If the number of clusters differs between the two classifications, the number is aligned internally, with zero data points belonging to the missing clusters.
Michael Thrun
Matthews, B. W.: Comparison of the predicted and observed secondary structure of T4 phage lysozyme, Biochimica et Biophysica Acta (BBA), Protein Structure, Vol. 405(2), pp. 442-451, 1975.
Boughorbel, S.B: Optimal classifier for imbalanced data using Matthews Correlation Coefficient metric, PLOS ONE, Vol. 12(6), pp. e0177678, 2017.
Chicco, D.; Toetsch, N. and Jurman, G.: The Matthews correlation coefficient (MCC) is more reliable than balanced accuracy, bookmaker informedness, and markedness in two-class confusion matrix evaluation. BioData Mining. Vol. 14., 2021.
#Beware that the algorithm arbitrarily defines the labels
data(Hepta)
V=kmeansClustering(Hepta$Data,Type = "Hartigan",7)
table(V$Cls,Hepta$Cls)
#result is only valid if the above issue is resolved manually
ClusterMCC(Hepta$Cls,V$Cls)
Calculation of up to 26 indicators and the recommendations based on them for the number of clusters in data sets. For a given dataset and clusterings for this dataset, key indicators mentioned in details are calculated and based on this a recommendation regarding the number of clusters is given for each indicator.
An alternative estimation of the cluster number can be done by counting the valleys of the topographic map of the generalized U-Matrix for a specific projection method using the ProjectionBasedClustering and GeneralizedUmatrix packages on CRAN, see [Thrun/Ultsch, 2021] for details.
ClusterNoEstimation(DataOrDistances, ClsMatrix = NULL, MaxClusterNo, ClusterIndex = "all", Method = NULL, MinClusterNo = 2, Silent = TRUE,PlotIt=FALSE,SelectByABC=TRUE,Colorsequence,...)
DataOrDistances |
Either a [1:n,1:d] matrix of the dataset to be clustered, consisting of n cases of d-dimensional data points where every case has d attributes, variables or features, or a symmetric [1:n,1:n] distance matrix |
ClsMatrix |
[1:n,1:(MaxClusterNo)] matrix of clusterings; each column is defined as a [1:n] numerical vector of numbers defining the classification as the main output of the clustering algorithm for the n cases of data. It has k unique numbers representing the arbitrary labels of the clustering (see also details (2) and (3)). Must be specified if Method = NULL |
MaxClusterNo |
Highest number of clusters to be checked |
Method |
Cluster procedure, with which the clusterings are created (see details (4) for possible methods), must be specified if ClsMatrix = NULL |
Optional:
ClusterIndex |
String or vector of strings with the indicators to be calculated (see details (1)), default = "all" |
MinClusterNo |
Lowest number of clusters to be checked, default = 2 |
Silent |
If TRUE status messages are output, default = TRUE |
PlotIt |
If TRUE plots fanplot with proposed cluster numbers |
SelectByABC |
If PlotIt=TRUE, TRUE: Plots group A of ABCanalysis of the most important ones (highest overlap in indicators), FALSE: plots all indicators |
Colorsequence |
Optional, character vector of sufficient length of colors for the fan plot. If the sequence is too long, the first part of the sequence is used. |
... |
Optional, further arguments used if clustering methods if |
Each column of ClsMatrix has to have at least two unique clusters defined. Otherwise the function will stop.
(1) The following 26 indicators can be calculated: "ball", "beale", "calinski", "ccc", "cindex", "db", "duda", "dunn", "frey", "friedman", "hartigan", "kl", "marriot", "mcclain", "pseudot2", "ptbiserial", "ratkowsky", "rubin", "scott", "sdbw", "sdindex", "silhouette", "ssi", "tracew", "trcovw", "xuindex". These can be specified individually or as a vector via the parameter ClusterIndex. If you enter 'all', all indicators are calculated.
(2) The indicators kl, duda, pseudot2, beale, frey and mcclain require a clustering for MaxClusterNo+1 clusters. If these indicators are to be calculated, this clustering must be specified in ClsMatrix.
(3) The indicator kl requires a clustering for MinClusterNo-1 clusters. If this indicator is to be calculated, this clustering must also be specified in ClsMatrix. For the case MinClusterNo = 2, no clustering for 1 has to be given.
(4) The following methods can be used to create clusterings: "kmeans", "DBSclustering", "DivisiveAnalysisClustering", "FannyClustering", "ModelBasedClustering", "SpectralClustering" or all methods found in HierarchicalClustering.
(5) The indicators duda, pseudot2, beale and frey are only intended for use in hierarchical cluster procedures.
If a distances matrix is given, then ProjectionBasedClustering is required to be accessible.
Indicators |
A table of the calculated indicators except Duda, Pseudot2 and Beale |
ClusterNo |
The recommended number of clusters for each calculated indicator |
ClsMatrix |
[1:n,MinClusterNo:(MaxClusterNo)] Output of the clusterings used for the calculation |
HierarchicalIndicators |
Either NULL or the values for the indicators Duda, Pseudot2 and Beale in case of hierarchical cluster procedures, if calculated |
Code of "calinski", "cindex", "db", "hartigan", "ratkowsky", "scott", "marriot", "ball", "trcovw", "tracew", "friedman", "rubin", "ssi" of package cclust is adapted for the purpose of this function.
Colorsequence works if DataVisualizations 1.1.13 is installed (currently only available on GitHub).
Peter Nahrgang, revised by Michael Thrun (2021)
Charrad, Malika, et al.: Package 'NbClust', J. Stat. Soft., Vol. 61, pp. 1-36, 2014.
Dimitriadou, E.: "cclust: Convex Clustering Methods and Clustering Indexes." R package version 0.6-16, URL https://CRAN.R-project.org/package=cclust, 2009.
[Thrun/Ultsch, 2021] Thrun, M. C., and Ultsch, A.: Swarm Intelligence for Self-Organized Clustering, Artificial Intelligence, Vol. 290, pp. 103237, doi:10.1016/j.artint.2020.103237, 2021.
# Reading the iris dataset from the standard R-Package datasets
data <- as.matrix(iris[,1:4])
MaxClusterNo = 7
# Creating the clusterings for the data set
#(here with method complete) for the number of clusters 2 to 8
hc <- hclust(dist(data), method = "complete")
clsm <- matrix(data = 0, nrow = dim(data)[1], ncol = MaxClusterNo)
for (i in 2:(MaxClusterNo+1)) {
  clsm[,i-1] <- cutree(hc,i)
}
# Calculation of all indicators and recommendations for the number of clusters
indicatorsList=ClusterNoEstimation(DataOrDistances = data, ClsMatrix = clsm, MaxClusterNo = MaxClusterNo)
# Alternatively, the same calculation as above can be executed with the following call
ClusterNoEstimation(DataOrDistances = data, MaxClusterNo = 7, Method = "CompleteL")
# In this variant, ClusterNoEstimation also takes over the clustering
Values in Cls are consistently recoded to positive consecutive integers
ClusterNormalize(Cls)
Cls |
[1:n] numerical vector of numbers defining the classification as the main output of the clustering algorithm for the n cases of data. It has k unique numbers representing the arbitrary labels of the clustering. |
For recoding depending on cluster size please see ClusterRenameDescendingSize.
The renamed classification. A vector of clusters recoded to positive consecutive integers.
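A recoding to positive consecutive integers can be sketched in one line of base R. The helper name normalize_labels is hypothetical, and assigning the new numbers in increasing order of the old labels is an assumption about the ordering, which the text above does not specify.

```r
# Hypothetical helper: recode arbitrary labels to 1..k (base R only).
# Assumption: the new numbers follow the increasing order of the old labels.
normalize_labels <- function(cls) match(cls, sort(unique(cls)))

cls <- c(543, 2, 2, 3, 1, 543)
normalize_labels(cls)   # 4 2 2 3 1 4
```

The partition itself is unchanged. Only the label values are replaced.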
data('Lsun3D')
Cls=Lsun3D$Cls
#not descending cluster numbers
Cls[Cls==1]=543
Cls[Cls==4]=1
# Now ordered consecutively
ClusterNormalize(Cls)
This function uses a projection method to perform dimensionality reduction (DR) in order to visualize the data as 3D data points colored by a clustering.
ClusterPlotMDS(DataOrDistances, Cls, main = "Clustering", DistanceMethod = "euclidean", OutputDimension = 3, PointSize=1,Plotter3D="rgl",Colorsequence, ...)
DataOrDistances |
Either nonsymmetric [1:n,1:d] datamatrix of n cases and d features or symmetric [1:n,1:n] distance matrix |
Cls |
1:n numerical vector of numbers defining the classification as the main output of the clustering algorithm for the n cases of data. It has k unique numbers representing the arbitrary labels of the clustering. |
main |
String, title of plot |
DistanceMethod |
Method to compute distances, default "euclidean" |
OutputDimension |
Either two or three depending on user choice |
PointSize |
Scalar defining the size of points |
Plotter3D |
In case of 3 dimensions, choose either "plotly" or "rgl", |
Colorsequence |
[1:k] character vector of colors, per default the color sequence defined in the DataVisualizations package is used |
... |
Please see |
If the dataset has more than 3 dimensions, MDS is performed as defined in the smacof package [De Leeuw/Mair, 2011]. If the smacof package is not installed, classical metric MDS (see Def. in [Thrun, 2018]) is performed. In both cases, the first OutputDimension dimensions are visualized. Points are colored by the labels (Cls).
In the special case that the dataset has not more than 3 dimensions, all dimensions are visualized and no DR is performed.
The rgl or plotly plot handler depending on Plotter3D
If DataVisualizations is not installed a 2D plot using native plot function is shown.
If MASS is not installed, classical metric MDS is used, see [Thrun, 2018] for definition.
Michael Thrun
[De Leeuw/Mair, 2011] De Leeuw, J., & Mair, P.: Multidimensional scaling using majorization: SMACOF in R, Journal of Statistical Software, Vol. 31(3), pp. 1-30. 2011.
[Thrun, 2018] Thrun, M. C.: Projection Based Clustering through Self-Organization and Swarm Intelligence, doctoral dissertation 2017, Springer, ISBN: 978-3-658-20539-3, Heidelberg, 2018.
data(Hepta)
ClusterPlotMDS(Hepta$Data,Hepta$Cls)
data(Leukemia)
ClusterPlotMDS(Leukemia$DistanceMatrix,Leukemia$Cls)
Redefines some or all clusters of a clustering such that the names of the numerical vectors are defined by
ClusterRedefine(Cls, NewLabels,OldLabels)
Cls |
1:n numerical vector of numbers defining the classification as the main output of the clustering algorithm for the n cases of data. It has k unique numbers representing the arbitrary labels of the clustering. |
NewLabels |
[1:p], p<=k labels (identifiers) of clusters to be changed with |
OldLabels |
Optional, [1:p], p<=k labels(identifiers) of clusters to be changed, default [1:k] unique cluster Ids of |
The same ordering of NewLabels and OldLabels is assumed, i.e., the mapping is defined by OldLabels[i] -> NewLabels[i] with i in [1:p]. NewLabels can also be a vector of strings, for example for plotting.
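The mapping OldLabels[i] -> NewLabels[i] can be sketched in base R with match(). The helper name redefine_labels is hypothetical, and using the sorted unique labels as the default for the old labels is an assumption.

```r
# Hypothetical helper: relabel clusters via old_labels[i] -> new_labels[i] (base R only).
# Assumption: the default for old_labels is the sorted set of unique labels.
redefine_labels <- function(cls, new_labels, old_labels = sort(unique(cls))) {
  stopifnot(length(new_labels) == length(old_labels))
  pos <- match(cls, old_labels)        # NA for labels that are left unchanged
  out <- if (is.character(new_labels)) as.character(cls) else cls
  out[!is.na(pos)] <- new_labels[pos[!is.na(pos)]]
  out
}

cls <- c(1, 2, 2, 3, 1)
redefine_labels(cls, c(20, 30), old_labels = c(2, 3))  # 1 20 20 30 1
redefine_labels(cls, LETTERS[1:3])                     # "A" "B" "B" "C" "A"
```

Labels that do not appear in old_labels are passed through unchanged, which covers the case p < k.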
Cls[1:n] numerical vector named after the row names of data
Michael Thrun
data('Lsun3D')
Cls=Lsun3D$Cls
Data=Lsun3D$Data
#prior
ClsNew=unique(Cls)+10
#Redefined Clustering
NewCls=ClusterRedefine(Cls,ClsNew)
table(Cls,NewCls)
#require(DataVisualizations)
n=length(unique(Cls))
NewCls=ClusterRedefine(Cls,LETTERS[1:n])
#DataVisualizations package required
if(requireNamespace("DataVisualizations"))
  DataVisualizations::Classplot(Data[,1],Data[,2], Cls,Names=NewCls,Plotter="ggplot",Size =1.5)
Renames Clustering such that the names of the numerical vectors are the row names of DataOrDistances
ClusterRename(Cls, DataOrDistances)
Cls |
1:n numerical vector of numbers defining the classification as the main output of the clustering algorithm for the n cases of data. It has k unique numbers representing the arbitrary labels of the clustering. |
DataOrDistances |
Either nonsymmetric [1:n,1:d] datamatrix of n cases and d features or symmetric [1:n,1:n] distance matrix |
If DataOrDistances is missing or of inconsistent length, nothing is done.
Cls[1:n] numerical vector named after the row names of data
Michael Thrun
data('Hepta')
Cls=Hepta$Cls
Data=Hepta$Data
#prior
Cls
#Named Clustering
ClusterRename(Cls,Data)
Renames the clusters of a classification in descending order of cluster size.
ClusterRenameDescendingSize(Cls, ProvideClusterNames=FALSE)
Cls |
[1:n] numerical vector of numbers defining the classification as the main output of the clustering algorithm for the n cases of data. It has k unique numbers representing the arbitrary labels of the clustering. |
ProvideClusterNames |
TRUE: Provides the new and old k numbers in a separate output, FALSE: simple output |
Beware: the output of this function changes depending on ProvideClusterNames in order to be congruent with prior code in a large variety of other packages.
ProvideClusterNames==FALSE:
RenamedCls |
The renamed classification. A vector of clusters, where the largest cluster is C1 and so forth |
ProvideClusterNames==TRUE: List V with
RenamedCls |
The renamed classification. A vector of clusters, where the largest cluster is C1 and so forth |
ClusterName |
[1:k,1:2] matrix of k new numbers and prior numbers |
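Renaming by descending cluster size can be sketched in base R with table() and order(). The helper name rename_by_size is hypothetical, and the treatment of equally sized clusters (kept in the order produced by table()) is an assumption.

```r
# Hypothetical helper: largest cluster becomes 1, second largest 2, and so on (base R only).
rename_by_size <- function(cls) {
  sizes <- table(cls)
  # Old labels from largest to smallest cluster; ties keep the order of table()
  ranked <- names(sizes)[order(-as.vector(sizes))]
  match(as.character(cls), ranked)
}

cls <- c(543, 2, 2, 2, 3, 3, 543, 2, 543)
rename_by_size(cls)   # 2 1 1 1 3 3 2 1 2
```

Here label 2 has four cases, 543 has three and 3 has two, so they become 1, 2 and 3 respectively.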
Michael Thrun, Alfred Ultsch
data('Lsun3D')
Cls=Lsun3D$Cls
#not descending cluster numbers
Cls[Cls==1]=543
Cls[Cls==4]=1
# Now ordered per cluster size and descending
ClusterRenameDescendingSize(Cls)
Shannon Information [Shannon, 1948] for each column in ClsMatrix.
ClusterShannonInfo(ClsMatrix)
ClsMatrix |
[1:n,1:C] matrix of C clusterings; each column is defined as a [1:n] numerical vector of numbers defining the classification as the main output of the clustering algorithm for the n cases of data. It has k unique numbers representing the arbitrary labels of the clustering. |
Info[1:d] = sum(-p * log(p)/MaxInfo) for all unique cases with probability p in ClsMatrix[,c] for a column with k clusters MaxInfo = -(1/k)*log(1/k)
Info |
[1:max.nc,1:C] matrix of Shannin informaton as defined in details, each column represents one |
ClusterNo |
Number of Clusters k found for each |
MaxInfo |
max per column of |
MinInfo |
min per column of |
MedianInfo |
median per column of |
MeanInfo |
mean per column of |
Reimplemented from Alfred Ultsch's Matlab version but not yet verified.
Michael Thrun
[Shannon, 1948] Shannon, C. E.: A Mathematical Theory of Communication, Bell System Technical Journal, Vol. 27(3), pp. 379-423. doi:10.1002/j.1538-7305.1948.tb01338.x, 1948.
# Reading the iris dataset from the standard R-Package datasets
data <- as.matrix(iris[,1:4])
max.nc = 7
# Creating the clusterings for the data set
#(here with method complete) for the number of classes 2 to 8
hc <- hclust(dist(data), method = "complete")
clsm <- matrix(data = 0, nrow = dim(data)[1], ncol = max.nc)
for (i in 2:(max.nc+1)) {
  clsm[,i-1] <- cutree(hc,i)
}
ClusterShannonInfo(clsm)
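The formula in the details can be read as one normalized term per cluster. The sketch below computes these terms for a single clustering vector in base R. It is an interpretation of the formula under that assumption, not the package code, and `shannon_info_terms` is a hypothetical name.

```r
# Hypothetical helper: normalized Shannon information terms of one clustering.
shannon_info_terms <- function(cls) {
  # Relative frequency p of each of the k cluster labels
  p <- as.numeric(table(cls)) / length(cls)
  k <- length(p)
  # MaxInfo as stated in the details; note that it is 0 for k = 1
  max_info <- -(1 / k) * log(1 / k)
  -p * log(p) / max_info
}

balanced <- rep(1:4, each = 5)
unbalanced <- c(rep(1, 17), 2, 3, 4)
shannon_info_terms(balanced)
shannon_info_terms(unbalanced)
```

With perfectly balanced clusters every term equals 1, so deviations from 1 indicate clusters that are larger or smaller than the uniform share.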
Wrapper for one specific internal function by L. Torgo, who implemented the relevant part of the SMOTE algorithm [Chawla et al., 2002] there.
ClusterUpsamplingMinority(Cls, Data, MinorityCluster, Percentage = 200, knn = 5, PlotIt = FALSE)
Cls |
1:n numerical vector of numbers defining the classification as the main output of the clustering algorithm for the n cases of data. It has k unique numbers representing the arbitrary labels of the clustering. |
Data |
[1:n,1:d] datamatrix of n cases and d features |
MinorityCluster |
scalar defining the number of the cluster to be upsampled |
Percentage |
percentage above 100 defining how many samples should be taken |
knn |
k nearest neighbors of SMOTE algorithm |
PlotIt |
TRUE: plots the result using |
The number of items m is defined by the scalar Percentage, and the upsampling is combined with the Data and the Cls to DataExt and ClsExt such that the sample is placed thereafter.
List with
ClsExt |
1:(n+m) numerical vector of numbers defining the classification as the main output of the clustering algorithm for the n cases of data. It has k unique numbers representing the arbitrary labels of the clustering. |
DataExt |
[1:(n+m),1:d] datamatrix of n cases and d features |
L. Torgo
[Chawla et al., 2002] Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P.: SMOTE: synthetic minority over-sampling technique, Journal of artificial intelligence research, Vol. 16, pp. 321-357. 2002.
data(Lsun3D)
Data=Lsun3D$Data
Cls=Lsun3D$Cls
table(Cls)
V=ClusterUpsamplingMinority(Cls,Data,4,1000)
table(V$ClsExt)
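The core step of SMOTE [Chawla et al., 2002] is to place each synthetic case on the line segment between a minority case and one of its k nearest minority neighbors. The following base R sketch illustrates that step only. `smote_sketch` and its argument `n_new_per_case` are hypothetical, and how the wrapped function maps Percentage to a number of new cases is not shown here.

```r
# Hypothetical helper: SMOTE-style interpolation within one minority cluster.
smote_sketch <- function(X, n_new_per_case = 2, knn = 5) {
  X <- as.matrix(X)
  n <- nrow(X)
  knn <- min(knn, n - 1)
  D <- as.matrix(dist(X))
  diag(D) <- Inf                      # a case is not its own neighbor
  out <- matrix(NA_real_, nrow = n * n_new_per_case, ncol = ncol(X))
  row <- 1
  for (i in seq_len(n)) {
    nn <- order(D[i, ])[seq_len(knn)] # indices of the knn nearest neighbors
    for (j in seq_len(n_new_per_case)) {
      nb <- nn[sample.int(length(nn), 1)]
      gap <- runif(1)                 # random position on the segment
      out[row, ] <- X[i, ] + gap * (X[nb, ] - X[i, ])
      row <- row + 1
    }
  }
  out
}

set.seed(1)
minority <- matrix(rnorm(20), ncol = 2)
synthetic <- smote_sketch(minority, n_new_per_case = 2, knn = 3)
dim(synthetic)
```

Because every synthetic case is a convex combination of two existing cases, the new cases stay within the range spanned by the minority cluster.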
Cross-entropy clustering published by [Tabor/Spurek, 2014] and implemented by [Spurek et al., 2017].
CrossEntropyClustering(Data, ClusterNo,PlotIt=FALSE,...)
Data |
[1:n,1:d] matrix of dataset to be clustered. It consists of n cases of d-dimensional data points. Every case has d attributes, variables or features. |
ClusterNo |
A number k which defines k different clusters to be built by the algorithm. |
PlotIt |
Default: FALSE, If TRUE plots the first three dimensions of the dataset with colored three-dimensional data points defined by the clustering stored in |
... |
Further arguments to be set for the clustering algorithm, if not set, default arguments are used. |
Contrary to most of the other implemented algorithms in this package, the results on the easiest clustering challenge of Hepta are unstable for cross-entropy clustering in the sense that the clustering is not always correct. Reproducibility experiments should be performed (see [Tabor/Spurek, 2014]).
List of
Cls |
[1:n] numerical vector with n numbers defining the classification as the main output of the clustering algorithm. It has k unique numbers representing the arbitrary labels of the clustering. |
Object |
Object defined by clustering algorithm as the other output of this algorithm |
Michael Thrun
[Spurek et al., 2017] Spurek, P., Kamieniecki, K., Tabor, J., Misztal, K., & Śmieja, M.: R package cec, Neurocomputing, Vol. 237, pp. 410-413. 2017.
[Tabor/Spurek, 2014] Tabor, J., & Spurek, P.: Cross-entropy clustering, Pattern Recognition, Vol. 47(9), pp. 3046-3059. 2014.
data('Hepta')
out=CrossEntropyClustering(Hepta$Data,ClusterNo=7,PlotIt=FALSE)
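A reproducibility experiment of the kind recommended in the details can be sketched by repeating a stochastic clustering and comparing the resulting partitions pairwise. The example below uses only base R and takes `kmeans` as a stand-in for any stochastic algorithm. `rand_index` and `stability_trials` are hypothetical helpers and not part of this package.

```r
# Rand index: share of case pairs on which two partitions agree
# (same cluster in both, or different clusters in both).
rand_index <- function(a, b) {
  same_a <- outer(a, a, "==")
  same_b <- outer(b, b, "==")
  agree <- same_a == same_b
  mean(agree[upper.tri(agree)])
}

# Run the clustering several times and compare all pairs of trials.
stability_trials <- function(Data, ClusterNo, trials = 10) {
  runs <- lapply(seq_len(trials), function(i) kmeans(Data, centers = ClusterNo)$cluster)
  pairs <- combn(trials, 2)
  apply(pairs, 2, function(ij) rand_index(runs[[ij[1]]], runs[[ij[2]]]))
}

set.seed(123)
Data <- rbind(matrix(rnorm(60, mean = 0), ncol = 2),
              matrix(rnorm(60, mean = 6), ncol = 2))
agreement <- stability_trials(Data, ClusterNo = 2, trials = 5)
summary(agreement)
```

The Rand index is invariant to the arbitrary numbering of clusters, so values clearly below 1 point to trials that produced different partitions.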
Density-Based Spatial Clustering of Applications with Noise of [Ester et al., 1996].
DBSCAN(Data,Radius,minPts,Rcpp=TRUE, PlotIt=FALSE,UpperLimitRadius,...)
Data |
[1:n,1:d] matrix of dataset to be clustered. It consists of n cases of d-dimensional data points. Every case has d attributes, variables or features. |
Radius |
Eps [Ester et al., 1996, p. 227] (neighborhood in the R-ball graph/unit disk graph), size of the epsilon neighborhood. If NULL, automatic estimation is performed using insights of [Ultsch, 2005]. |
minPts |
Number of minimum points in the eps region (for core points). In principle minimum number of points in the unit disk, if the unit disk is within the cluster (core) [Ester et al., 1996, p. 228]. If NULL, 2.5 percent of points is selected. |
Rcpp |
If TRUE: fast Rcpp implementation of mlpack is used. FALSE uses dbscan package. |
PlotIt |
Default: FALSE, If TRUE plots the first three dimensions of the dataset with colored three-dimensional data points defined by the clustering stored in |
UpperLimitRadius |
Limit for radius search, experimental |
... |
Further arguments to be set for the clustering algorithm, if not set, default arguments are used. |
List of
Cls |
[1:n] numerical vector defining the clustering; this classification is the main output of the algorithm. Points which cannot be assigned to a cluster will be reported as members of the noise cluster with 0. |
Object |
Object defined by clustering algorithm as the other output of this algorithm |
Michael Thrun
[Ester et al., 1996] Ester, M., Kriegel, H.-P., Sander, J., & Xu, X.: A density-based algorithm for discovering clusters in large spatial databases with noise, Proc. Kdd, Vol. 96, pp. 226-231, 1996.
[Ultsch, 2005] Ultsch, A.: Pareto density estimation: A density estimation for knowledge discovery, In Baier, D. & Werrnecke, K. D. (Eds.), Innovations in classification, data science, and information systems, (Vol. 27, pp. 91-100), Berlin, Germany, Springer, 2005.
data('Hepta')
out=DBSCAN(Hepta$Data,Radius=NULL,minPts=NULL,PlotIt=FALSE)

## Not run:
#search for right parameter setting by grid search
data("WingNut")
Data = WingNut$Data
DBSGrid <- expand.grid(
  Radius = seq(from = 0.01, to = 0.3, by = 0.02),
  minPTs = seq(from = 1, to = 50, by = 2)
)
BestAcc = c()
for (i in seq_len(nrow(DBSGrid))) {
  parameters <- DBSGrid[i,]
  Cls9 = DBSCAN(
    Data,
    minPts = parameters$minPTs,
    Radius = parameters$Radius,
    PlotIt = FALSE,
    UpperLimitRadius = parameters$Radius
  )$Cls
  if (length(unique(Cls9)) < 5)
    BestAcc[i] = ClusterAccuracy(WingNut$Cls, Cls9) * 100
  else
    BestAcc[i] = 50
}
max(BestAcc)
which.max(BestAcc)
parameters <- DBSGrid[13,]
Cls9 = DBSCAN(
  Data,
  minPts = parameters$minPTs,
  Radius = parameters$Radius,
  UpperLimitRadius = parameters$Radius,
  PlotIt = TRUE
)$Cls
## End(Not run)
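How Radius and minPts interact can be seen from the core-point condition of [Ester et al., 1996]: a point is a core point if its Eps neighborhood, which includes the point itself, contains at least minPts points. The base R sketch below checks only this condition and does not perform the full clustering. `core_points` is a hypothetical helper.

```r
# Hypothetical helper: flag core points for a given Radius and minPts.
core_points <- function(Data, Radius, minPts) {
  D <- as.matrix(dist(Data))
  # Size of each Eps neighborhood; the point itself has distance 0 and is counted
  neighbours <- rowSums(D <= Radius)
  neighbours >= minPts
}

set.seed(42)
dense <- matrix(rnorm(100, sd = 0.1), ncol = 2)  # 50 tightly packed points
outlier <- matrix(c(5, 5), ncol = 2)             # one isolated point
is_core <- core_points(rbind(dense, outlier), Radius = 0.5, minPts = 5)
table(is_core)
```

The isolated point has only itself in its neighborhood, so it cannot be a core point. In a full DBSCAN run it would end up in the noise cluster unless it lay within Radius of a core point.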
Swarm-based clustering by exploiting self-organization, emergence, swarm intelligence and game theory published in [Thrun/Ultsch, 2021].
DatabionicSwarmClustering(DataOrDistances, ClusterNo = 0, StructureType = TRUE, DistancesMethod = NULL, PlotTree = FALSE, PlotMap = FALSE,PlotIt=FALSE, Parallel = FALSE)
DataOrDistances |
Either nonsymmetric [1:n,1:d] numerical matrix of a dataset to be clustered. It consists of n cases of d-dimensional data points. Every case has d attributes, variables or features. Or symmetric [1:n,1:n] distance matrix, e.g. |
ClusterNo |
Number of clusters; if zero, the topographic map is plotted. Number of valleys equals number of clusters. |
StructureType |
Either TRUE or FALSE, has to be tested against the visualization. If colored points of clusters are divided by mountain ranges, parameter is incorrect. |
DistancesMethod |
Optional, if data matrix given, a non-Euclidean distance can be selected |
PlotTree |
Optional, if TRUE: dendrogram is plotted. |
PlotMap |
Optional, if TRUE: topographic map is plotted if GeneralizedUmatrix is installed. See details. |
PlotIt |
Default: FALSE, If TRUE and dataset of [1:n,1:d] dimensions then a plot of the first three dimensions of the dataset with colored three-dimensional data points defined by the clustering stored in |
Parallel |
FALSE: default implementation, TRUE: faster Cpp parallel implementation; for this, the subsequent packages have to be installed from GitHub, as they are not available on CRAN yet. |
Contrary to the DatabionicSwarm package, this function does not enable the user to first project the data and then test the Boolean parameter defining the type of structure, which makes it an inappropriate approach for exploratory data analysis.
Instead, this function is implemented for the purpose of automatic benchmarking because in such a case nobody will investigate many trials with one visualization per trial.
If one would like to perform a clustering exploratively (in the sense that a prior clustering is not given for evaluation purposes), then please use the DatabionicSwarm package directly and read the vignette there. Like k-means, Databionic swarm is a stochastic algorithm, meaning that the clustering and visualization may change between trials.
If PlotMap==TRUE and ClusterNo=0, a top view of the topographic map is shown, in which the points are not labeled, i.e. colored by the same color. If PlotMap==TRUE and ClusterNo>0, then the points are colored by their cluster labels. If you would like to look at a 3D topographic map that can be interactively rotated or use 3D printing of the high-dimensional structures [Thrun et al., 2016], please see plotTopographicMap for further details.
List of
Cls |
1:n numerical vector of numbers defining the classification as the main output of the clustering algorithm for the n cases of data. It has k unique numbers representing the arbitrary labels of the clustering. |
Object |
List of further output of DBS |
Current implementation is not efficient enough to cluster more than N=4000 cases as in that case it takes longer than a day for a result.
Michael Thrun
[Thrun/Ultsch, 2021] Thrun, M. C., and Ultsch, A.: Swarm Intelligence for Self-Organized Clustering, Artificial Intelligence, Vol. 290, pp. 103237, doi:10.1016/j.artint.2020.103237, 2021.
[Thrun/Ultsch, 2021] Thrun, M. C., & Ultsch, A.: Swarm Intelligence for Self-Organized Clustering (Extended Abstract), in Bessiere, C. (Ed.), 29th International Joint Conference on Artificial Intelligence (IJCAI), Vol. IJCAI-20, pp. 5125–5129, doi:10.24963/ijcai.2020/720, Yokohama, Japan, Jan., 2021.
[Thrun et al., 2016] Thrun, M. C., Lerch, F., Lötsch, J., & Ultsch, A. : Visualization and 3D Printing of Multivariate Data of Biomarkers, in Skala, V. (Ed.), International Conference in Central Europe on Computer Graphics, Visualization and Computer Vision (WSCG), Vol. 24, Plzen, 2016.
Pswarm, DBSclustering, GeneratePswarmVisualization
# Generate random but small non-structured data set
data = cbind(
  sample(1:100, 300, replace = TRUE),
  sample(1:100, 300, replace = TRUE),
  sample(1:100, 300, replace = TRUE)
)
# Make sure there are no structures
# (sample size is small and still could generate structures randomly)
if(requireNamespace('DataVisualizations',quietly = TRUE)){
  Data = DataVisualizations::RobustNormalization(data, Centered = TRUE)
  #DataVisualizations::Plot3D(Data)
  # No structures are visible
  # Topographic map looks like "egg carton"
  # with every point in its own valley
  ClsV = DatabionicSwarmClustering(Data, 0, PlotMap = TRUE)
}else{
  # only for testing purposes of CRAN!
  # in case CRAN tests with no suggest packages available
  # please always use some kind of standardization!
  ClsV = DatabionicSwarmClustering(data, 0, PlotMap = TRUE)
}

# Distance based cluster structures
# 7 valleys are visible, thus ClusterNo=7
data(Hepta)
#DataVisualizations::Plot3D(Hepta$Data)
ClsV = DatabionicSwarmClustering(Hepta$Data, 0, PlotMap = TRUE)

#entangled, complex, and non-linear separable structures
## Not run:
#takes too long for CRAN tests
data(Chainlink)
#DataVisualizations::Plot3D(Chainlink$Data)
# 2 valleys are visible, thus ClusterNo=2
ClsV = DatabionicSwarmClustering(Chainlink$Data, 0, PlotMap = TRUE)
# Experiment with parameter StructureType only
# reveals that clustering is appropriate
# if StructureType=FALSE
ClsV2 = DatabionicSwarmClustering(Chainlink$Data, 2, StructureType = FALSE, PlotMap = TRUE)
# Here clusters (colored points)
# are not separated by valleys
ClsV = DatabionicSwarmClustering(Chainlink$Data, 2, StructureType = TRUE, PlotMap = TRUE)
## End(Not run)
Density peaks clustering of [Rodriguez/Laio, 2014] is here implemented by [Pedersen et al., 2017] with estimation of [Wang et al., 2015], meaning it is non-adaptive in the sense of ADPclustering.
DensityPeakClustering(DataOrDistances, Rho,Delta,Dc,Knn=7, DistanceMethod = "euclidean", PlotIt = FALSE, Data, ...)
DataOrDistances |
Either [1:n,1:n] symmetric distance matrix or [1:n,1:d] non symmetric data matrix of n cases and d variables |
Rho |
Local density of a point, see [Rodriguez/Laio, 2014] for explanation |
Delta |
Minimum distance between a point and any other point with higher density, see [Rodriguez/Laio, 2014] for explanation |
Dc |
Optional, cutoff distance, will either be estimated by [Pedersen et al., 2017] or [Wang et al., 2015] (see example below) |
Knn |
Optional k nearest neighbors |
DistanceMethod |
Optional distance method of data, default is "euclidean", see |
PlotIt |
Optional TRUE: Plots 2d or 3d result with clustering |
Data |
[1:n,1:d] data matrix in the case that |
... |
Optional, further arguments for |
The densityClust algorithm does not decide the number of clusters k; this has to be done by the parameter setting. This is contrary to the other version of the algorithm from another package, which can be called with ADPclustering.
The plot shows the density peaks (Cluster centers). Set Rho and Delta as boundaries below the number of relevant cluster centers for your problem. (see example below).
If Rho and Delta are set:
list of
Cls |
[1:n] numerical vector of numbers defining the classification as the main output of the clustering algorithm for the n cases of data. It has k unique numbers representing the arbitrary labels of the clustering. |
Object |
output of [Pedersen et al., 2017] algorithm |
If Rho and Delta are missing:
p |
object of |
Michael Thrun
[Wang et al., 2015] Wang, S., Wang, D., Li, C., & Li, Y.: Comment on "Clustering by fast search and find of density peaks", arXiv preprint arXiv:1501.04267, 2015.
[Pedersen et al., 2017] Thomas Lin Pedersen, Sean Hughes and Xiaojie Qiu: densityClust: Clustering by Fast Search and Find of Density Peaks. R package version 0.3. https://CRAN.R-project.org/package=densityClust, 2017.
[Rodriguez/Laio, 2014] Rodriguez, A., & Laio, A.: Clustering by fast search and find of density peaks, Science, Vol. 344(6191), pp. 1492-1496. 2014.
data(Hepta)
H=EntropyOfDataField(Hepta$Data, seq(from=0,to=1.5,by=0.05),PlotIt=FALSE)
Sigmamin=names(H)[which.min(H)]
Dc=3/sqrt(2)*as.numeric(names(H)[which.min(H)])
# Look at the plot and estimate rho and delta
DensityPeakClustering(Hepta$Data, Knn = 7,Dc=Dc)
Cls=DensityPeakClustering(Hepta$Data,Dc=Dc,Rho = 0.028, Delta = 22,Knn = 7,PlotIt = TRUE)$Cls
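The two quantities that are read off the decision plot can be computed directly from a distance matrix. The base R sketch below follows the cutoff-kernel definition of [Rodriguez/Laio, 2014]: rho counts the points closer than Dc, and delta is the distance to the nearest point of higher density. It is an illustration only. `density_peaks` is a hypothetical helper, and densityClust may use a different kernel or tie handling.

```r
# Hypothetical helper: local density rho and separation delta per point.
density_peaks <- function(Data, Dc) {
  D <- as.matrix(dist(Data))
  n <- nrow(D)
  # Number of other points closer than Dc (the point itself is subtracted)
  rho <- rowSums(D < Dc) - 1
  delta <- numeric(n)
  for (i in seq_len(n)) {
    higher <- which(rho > rho[i])
    # Points of maximal density get the largest distance to any point by convention
    delta[i] <- if (length(higher) > 0) min(D[i, higher]) else max(D[i, ])
  }
  data.frame(rho = rho, delta = delta)
}

set.seed(7)
Data <- rbind(matrix(rnorm(60, 0, 0.3), ncol = 2),
              matrix(rnorm(60, 10, 0.3), ncol = 2))
dp <- density_peaks(Data, Dc = 0.5)
head(dp[order(-dp$delta), ])
```

Cluster centers are the points that combine a high rho with a high delta, which is why Rho and Delta are set as lower boundaries just below the relevant peaks.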
Divisive Analysis Clustering (diana) of [Rousseeuw/Kaufman, 1990, pp. 253-279]
DivisiveAnalysisClustering(DataOrDistances, ClusterNo, PlotIt=FALSE,Standardization=TRUE,PlotTree=FALSE,Data,...)
DataOrDistances |
[1:n,1:d] matrix of dataset to be clustered. It consists of n cases of d-dimensional data points. Every case has d attributes, variables or features. Alternatively, symmetric [1:n,1:n] distance matrix |
ClusterNo |
A number k which defines k different clusters to be built by the algorithm.
if |
PlotIt |
Default: FALSE, If TRUE plots the first three dimensions of the dataset with colored three-dimensional data points defined by the clustering stored in |
Standardization |
|
PlotTree |
TRUE: Plots the dendrogram, FALSE: no plot |
Data |
[1:n,1:d] data matrix in the case that |
... |
Further arguments to be set for the clustering algorithm, if not set, default arguments are used. |
List of
Cls |
[1:n] numerical vector with n numbers defining the classification as the main output of the clustering algorithm. It has k unique numbers representing the arbitrary labels of the clustering. |
Dendrogram |
Dendrogram of hierarchical clustering algorithm |
Object |
Object defined by clustering algorithm as the other output of this algorithm |
Michael Thrun
[Rousseeuw/Kaufman, 1990] Rousseeuw, P. J., & Kaufman, L.: Finding groups in data, Belgium, John Wiley & Sons Inc., ISBN: 0471735787, doi: 10.1002/9780470316801, Online ISBN: 9780470316801, 1990.
data('Hepta')
CA=DivisiveAnalysisClustering(Hepta$Data,ClusterNo=7,PlotIt=FALSE)
print(CA$Object)
plot(CA$Object)
ClusterDendrogram(CA$Dendrogram,7,main='DIANA')
Gaussian mixture. Detailed description of dataset and its clustering challenge is provided in [Thrun/Ultsch, 2020].
data("EngyTime")
Size 4096, Dimensions 2, stored in EngyTime$Data
Classes 2, stored in EngyTime$Cls
[Baggenstoss, 2002] Baggenstoss, P. M.: Statistical modeling using gaussian mixtures and hmms with matlab, Naval Undersea Warfare Center, Newport RI, 2002.
[Thrun/Ultsch, 2020] Thrun, M. C., & Ultsch, A.: Clustering Benchmark Datasets Exploiting the Fundamental Clustering Problems, Data in Brief, Vol. 30(C), pp. 105501, doi:10.1016/j.dib.2020.105501, 2020.
data(EngyTime)
str(EngyTime)
Calculates the potential entropy of a data field for a given range of impact factors sigma.
EntropyOfDataField(Data, sigmarange = c(0.01, 0.1, 0.5, 1, 2, 5, 8, 10, 100) , PlotIt = FALSE)
Data |
[1:n,1:d] data matrix |
sigmarange |
Numeric vector [1:s] of relevant sigmas |
PlotIt |
FALSE: disable plot, TRUE: Plot with upper boundary of H after [Wang et al., 2011]. |
In theory there should be a curve with a clear minimum of entropy [Wang et al., 2011]. The impact factor sigma at the minimum of the entropy is then chosen to define the correct data field. It follows that the influence radius is 3/sqrt(2)*sigma (3-sigma rule of the Gaussian distribution) for clustering algorithms like density peak clustering [Wang et al., 2011].
[1:s] named vector of the Entropy of data field. The names are the impact factor sigma.
Michael Thrun
[Wang et al., 2015] Wang, S., Wang, D., Li, C., & Li, Y.: Comment on "Clustering by fast search and find of density peaks", arXiv preprint arXiv:1501.04267, 2015.
[Wang et al., 2011] Wang, S., Gan, W., Li, D., & Li, D.: Data field for hierarchical clustering, International Journal of Data Warehousing and Mining (IJDWM), Vol. 7(4), pp. 43-63. 2011.
data(Hepta)
H=EntropyOfDataField(Hepta$Data,PlotIt=FALSE)
Sigmamin=names(H)[which.min(H)]
Dc=3/sqrt(2)*as.numeric(names(H)[which.min(H)])
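The selection of sigma can be sketched in base R under one assumption about the data field: each point's potential is taken as the sum of Gaussian-like contributions exp(-(d/sigma)^2) over all points, and the entropy is computed on the normalized potentials. This form is assumed from the data field literature and is not taken from the package code. `potential_entropy` is a hypothetical helper.

```r
# Hypothetical helper: potential entropy of a data field for one sigma.
potential_entropy <- function(Data, sigma) {
  D <- as.matrix(dist(Data))
  # Potential of each point: contributions of all points (itself included)
  phi <- rowSums(exp(-(D / sigma)^2))
  p <- phi / sum(phi)
  -sum(p * log(p))
}

set.seed(1)
Data <- rbind(matrix(rnorm(40, 0, 0.2), ncol = 2),
              matrix(rnorm(40, 3, 0.2), ncol = 2))
sigmarange <- c(0.01, 0.1, 0.5, 1, 2, 5)
H <- sapply(sigmarange, function(s) potential_entropy(Data, s))
names(H) <- sigmarange   # named vector, names are the impact factors
H
# Influence radius from the sigma with minimal entropy
Dc <- 3 / sqrt(2) * as.numeric(names(H)[which.min(H)])
Dc
```

The entropy cannot exceed log(n), which it approaches for very small sigma because all potentials become equal, so the informative sigma lies at the minimum in between.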
Published in [Thrun et al., 2016] for the case of automatically estimating the radius of the P-matrix. Can also be used to estimate the radius parameter for distance-based clustering algorithms.
EstimateRadiusByDistance(DistanceMatrix)
DistanceMatrix |
[1:n,1:n] symmetric distance Matrix of n cases |
For density-based clustering algorithms like DBSCAN it is not always useful.
Numerical scalar defining the radius
Symmetric matrix is assumed.
Michael Thrun
[Thrun et al., 2016] Thrun, M. C., Lerch, F., Loetsch, J., & Ultsch, A.: Visualization and 3D Printing of Multivariate Data of Biomarkers, in Skala, V. (Ed.), International Conference in Central Europe on Computer Graphics, Visualization and Computer Vision (WSCG), Vol. 24, pp. 7-16, Plzen, http://wscg.zcu.cz/wscg2016/short/A43-full.pdf, 2016.
data('Hepta')
DistanceMatrix=as.matrix(dist(Hepta$Data))
Radius=EstimateRadiusByDistance(DistanceMatrix)
...
FannyClustering(DataOrDistances,ClusterNo, PlotIt=FALSE,Standardization=TRUE,...)
DataOrDistances |
[1:n,1:d] matrix of dataset to be clustered. It consists of n cases of d-dimensional data points. Every case has d attributes, variables or features. Alternatively, symmetric [1:n,1:n] distance matrix |
ClusterNo |
A number k which defines k different clusters to be built by the algorithm. |
PlotIt |
Default: FALSE, If TRUE plots the first three dimensions of the dataset with colored three-dimensional data points defined by the clustering stored in |
Standardization |
|
... |
Further arguments to be set for the clustering algorithm, if not set, default arguments are used. |
...
List of
Cls |
[1:n] numerical vector with n numbers defining the classification as the main output of the clustering algorithm. It has k unique numbers representing the arbitrary labels of the clustering. Points which cannot be assigned to a cluster will be reported with 0. |
Object |
Object defined by clustering algorithm as the second output of this algorithm |
Michael Thrun
[Rousseeuw/Kaufman, 1990] Rousseeuw, P. J., & Kaufman, L.: Finding groups in data, Belgium, John Wiley & Sons Inc., ISBN: 0471735787, doi: 10.1002/9780470316801, Online ISBN: 9780470316801, 1990.
data('Hepta')
out=FannyClustering(Hepta$Data,ClusterNo=7,PlotIt=FALSE)
Gap Statistic
GapStatistic(Data, ClusterNoMax, ClusterFun, ...)
Data |
[1:n,1:d] data matrix |
ClusterNoMax |
Maximum number of clusters to be investigated |
ClusterFun |
which clustering algorithm to investigate |
... |
further arguments passed on |
Does not work on Hepta, see example.
To be documented.
Wrapper only
Michael Thrun
Tibshirani, R., Walther, G. and Hastie, T.: Estimating the number of data clusters via the Gap statistic, Journal of the Royal Statistical Society B, Vol. 63, pp. 411-423, 2001.
data(Hepta)
#GapStatistic(Hepta$Data,10,ClusterFun = kmeans)
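Since the wrapper itself is not documented yet, the idea of the gap statistic of Tibshirani et al. can be sketched independently in base R: the log within-cluster sum of squares of the data is compared with its expectation under uniform reference data drawn over the range of each feature. `log_wk` and `gap_sketch` are hypothetical helpers with `kmeans` as the clustering function. This is not the code called by GapStatistic.

```r
# log of the pooled within-cluster sum of squares for k clusters
log_wk <- function(X, k) {
  if (k == 1) return(log(sum(scale(X, scale = FALSE)^2)))
  log(kmeans(X, centers = k, nstart = 10)$tot.withinss)
}

# Hypothetical helper: gap and its simulation error for k = 1..k_max
gap_sketch <- function(X, k_max = 5, B = 20) {
  X <- as.matrix(X)
  ks <- seq_len(k_max)
  observed <- sapply(ks, function(k) log_wk(X, k))
  # B reference data sets, uniform over the range of each feature
  reference <- replicate(B, {
    U <- apply(X, 2, function(v) runif(length(v), min(v), max(v)))
    sapply(ks, function(k) log_wk(U, k))
  })
  data.frame(k = ks,
             gap = rowMeans(reference) - observed,
             se = apply(reference, 1, sd) * sqrt(1 + 1 / B))
}

set.seed(2024)
X <- rbind(matrix(rnorm(60, 0, 0.3), ncol = 2),
           matrix(rnorm(60, 4, 0.3), ncol = 2),
           matrix(rnorm(60, 8, 0.3), ncol = 2))
gap_table <- gap_sketch(X, k_max = 5, B = 10)
gap_table
```

The rule proposed in the paper picks the smallest k for which gap[k] >= gap[k+1] - se[k+1]. Because both the clustering and the reference data are random, the table changes with the seed.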
Outlier Resistant Hierarchical Clustering Algorithm of [Gagolewski/Bartoszuk, 2016].
GenieClustering(DataOrDistances, ClusterNo = 0, DistanceMethod="euclidean", ColorTreshold = 0,...)
DataOrDistances |
[1:n,1:d] matrix of dataset to be clustered. It consists of n cases of d-dimensional data points. Every case has d attributes, variables or features. Alternatively, symmetric [1:n,1:n] distance matrix |
ClusterNo |
A number k which defines k different clusters to be built by the algorithm. |
DistanceMethod |
See |
ColorTreshold |
Draws cutline w.r.t. dendrogram y-axis (height), height of line as scalar should be given |
... |
Further arguments to genie like:
|
Wrapper for Genie algorithm.
List of
Cls |
If ClusterNo>0: [1:n] numerical vector with n numbers defining the classification as the main output of the clustering algorithm. It has k unique numbers representing the arbitrary labels of the clustering. Otherwise for ClusterNo=0: NULL |
Dendrogram |
Dendrogram of hierarchical clustering algorithm |
Object |
Ultrametric tree of hierarchical clustering algorithm |
Michael Thrun
[Gagolewski/Bartoszuk, 2016] Gagolewski M., Bartoszuk M., Cena A., Genie: A new, fast, and outlier-resistant hierarchical clustering algorithm, Information Sciences, Vol. 363, pp. 8-23, 2016.
data('Hepta')
Clust=GenieClustering(Hepta$Data,ClusterNo=7)
No clusters at all. Detailed description of dataset and its clustering challenge is provided in [Thrun/Ultsch, 2020].
data("GolfBall")
Size 4002, Dimensions 3, stored in GolfBall$Data
Classes 1, stored in GolfBall$Cls
[Ultsch, 2005] Ultsch, A.: Clustering with SOM: U*C, Proceedings of the 5th Workshop on Self-Organizing Maps, Vol. 2, pp. 75-82, 2005.
[Thrun/Ultsch, 2020] Thrun, M. C., & Ultsch, A.: Clustering Benchmark Datasets Exploiting the Fundamental Clustering Problems, Data in Brief, Vol. 30(C), pp. 105501, doi:10.1016/j.dib.2020.105501, 2020.
data(GolfBall)
str(GolfBall)
Hard Competitive learning clustering published by [Ripley, 2007].
HCLclustering(Data, ClusterNo,PlotIt=FALSE,...)
Data |
[1:n,1:d] matrix of dataset to be clustered. It consists of n cases of d-dimensional data points. Every case has d attributes, variables or features. |
ClusterNo |
A number k which defines k different clusters to be built by the algorithm. |
PlotIt |
Default: FALSE, If TRUE plots the first three dimensions of the dataset with colored three-dimensional data points defined by the clustering stored in |
... |
Further arguments to be set for the clustering algorithm, if not set, default arguments are used. |
List of
Cls |
[1:n] numerical vector with n numbers defining the classification as the main output of the clustering algorithm. It has k unique numbers representing the arbitrary labels of the clustering. |
Object |
Object defined by clustering algorithm as the other output of this algorithm |
Michael Thrun
[Dimitriadou, 2002] Dimitriadou, E.: cclust-convex clustering methods and clustering indexes. R package, 2002.
[Ripley, 2007] Ripley, B. D.: Pattern recognition and neural networks, Cambridge university press, ISBN: 0521717701, 2007.
data('Hepta')
out=HCLclustering(Hepta$Data,ClusterNo=7,PlotIt=FALSE)
HDD clustering is based on the Gaussian Mixture Model and on the idea that the data lives in subspaces with a lower dimension than the dimension of the original space. It uses the EM algorithm to estimate the parameters of the model [Berge et al., 2012].
HDDClustering(Data, ClusterNo, PlotIt=F,...)
Data |
[1:n,1:d] matrix of dataset to be clustered. It consists of n cases of d-dimensional data points. Every case has d attributes, variables or features. |
ClusterNo |
Optional, Numeric indicating either the number of cluster or a vector of 1:k to indicate the maximal expected number of clusters. |
PlotIt |
(optional) Boolean. Default = FALSE = No plotting performed. |
... |
Further arguments to be set for the clustering algorithm, if not set, default arguments are used, see |
HDD clustering maximises the BIC criterion for a range of possible numbers of clusters up to ClusterNo. Per default the most general model is used; alternatively, the parameter model="ALL" can be used to evaluate all possible models with BIC [Berge et al., 2012]. If specific properties of Data are known a priori, please see hddc for specific model selection.
List of
Cls |
[1:n] numerical vector with n numbers defining the classification as the main output of the clustering algorithm. It has k unique numbers representing the arbitrary labels of the clustering. |
Object |
Object defined by clustering algorithm as the other output of this algorithm |
Quirin Stier
[Berge et al., 2012] L. Berge, C. Bouveyron and S. Girard, HDclassif: an R Package for Model-Based Clustering and Discriminant Analysis of High-Dimensional Data, Journal of Statistical Software, vol. 42 (6), pp. 1-29, 2012.
[Bouveyron et al., 2007] Bouveyron, C. Girard, S. and Schmid, C: High-Dimensional Data Clustering, Computational Statistics and Data Analysis, vol. 52 (1), pp. 502-519, 2007.
# Hepta
data("Hepta")
Data = Hepta$Data
#Non-default parameter model
#can be set to evaluate all possible models
V = HDDClustering(Data=Data,ClusterNo=7,model="ALL")
Cls = V$Cls
ClusterAccuracy(Hepta$Cls, Cls)

## Not run:
library(HDclassif)
data(Crabs)
Data = Crabs[,-1]
V = HDDClustering(Data=Data,ClusterNo=4,com_dim=1)
## End(Not run)
Clearly defined clusters, different variances. Detailed description of dataset and its clustering challenge is provided in [Thrun/Ultsch, 2020].
data("Hepta")
Size 212, Dimensions 3, stored in Hepta$Data
Classes 7, stored in Hepta$Cls
[Ultsch, 2003] Ultsch, A.: Maps for the visualization of high-dimensional data spaces, Proc. Workshop on Self organizing Maps (WSOM), pp. 225-230, Kyushu, Japan, 2003.
[Thrun/Ultsch, 2020] Thrun, M. C., & Ultsch, A.: Clustering Benchmark Datasets Exploiting the Fundamental Clustering Problems, Data in Brief, Vol. 30(C), pp. 105501, doi:10.1016/j.dib.2020.105501, 2020.
data(Hepta)
str(Hepta)
Please use HierarchicalClustering. Hierarchical cluster analysis on a set of dissimilarities and methods for analyzing it. Uses the stats package function 'hclust'.
HierarchicalClusterData(Data,ClusterNo=0, Type="ward.D2",DistanceMethod="euclidean", ColorTreshold=0,Fast=FALSE,Cls=NULL,...)
Data |
[1:n,1:d] matrix of dataset to be clustered. It consists of n cases of d-dimensional data points. Every case has d attributes, variables or features. |
ClusterNo |
A number k which defines k different clusters to be built by the algorithm. |
Type |
Method of cluster analysis: "ward.D", "ward.D2", "single", "complete", "average", "mcquitty", "median" or "centroid". |
DistanceMethod |
see |
ColorTreshold |
Draws a cut line at the given height of the dendrogram y-axis; the height of the line should be given as a scalar. |
Fast |
If TRUE and fastcluster installed, then a faster implementation of the methods above can be used |
Cls |
[1:n] classification vector for coloring of dendrogram in plot |
... |
In case of plotting further argument for |
List of
Cls |
If ClusterNo>0: [1:n] numerical vector with n numbers defining the classification as the main output of the clustering algorithm. It has k unique numbers representing the arbitrary labels of the clustering. Otherwise, for ClusterNo=0: NULL |
Dendrogram |
Dendrogram of hierarchical clustering algorithm |
Object |
Ultrametric tree of hierarchical clustering algorithm |
Michael Thrun
data('Hepta')
#out=HierarchicalClusterData(Hepta$Data,ClusterNo=7)
Please use HierarchicalClustering. Cluster analysis on a set of dissimilarities and methods for analyzing it. Uses the stats package function 'hclust'.
HierarchicalClusterDists(pDist,ClusterNo=0,Type="ward.D2", ColorTreshold=0,Fast=FALSE,...)
pDist |
Distances as either matrix [1:n,1:n] or dist object |
ClusterNo |
A number k which defines k different clusters to be built by the algorithm. |
Type |
Method of cluster analysis: "ward.D", "ward.D2", "single", "complete", "average", "mcquitty", "median" or "centroid". |
ColorTreshold |
Draws a cut line at the given height of the dendrogram y-axis; the height of the line should be given as a scalar. |
Fast |
If TRUE and fastcluster installed, then a faster implementation of the methods above can be used |
... |
In case of plotting further argument for |
List of
Cls |
If ClusterNo>0: [1:n] numerical vector with n numbers defining the classification as the main output of the clustering algorithm. It has k unique numbers representing the arbitrary labels of the clustering. Otherwise, for ClusterNo=0: NULL |
Dendrogram |
Dendrogram of hierarchical clustering algorithm |
Object |
Ultrametric tree of hierarchical clustering algorithm |
Michael Thrun
data('Hepta')
#out=HierarchicalClusterDists(as.matrix(dist(Hepta$Data)),ClusterNo=7)
Wrapper for various agglomerative hierarchical clustering algorithms.
HierarchicalClustering(DataOrDistances,ClusterNo,Type='SingleL',Fast=TRUE,Data,...)
DataOrDistances |
Either a nonsymmetric [1:n,1:d] numerical matrix of a dataset to be clustered, consisting of n cases of d-dimensional data points (every case has d attributes, variables or features), or a symmetric [1:n,1:n] distance matrix, e.g. |
ClusterNo |
A number k which defines k different clusters to be built by the algorithm. |
Type |
Method of cluster analysis: "Ward", "SingleL", "CompleteL", "AverageL" (UPGMA), "WPGMA" (mcquitty), "MedianL" (WPGMC), "CentroidL" (UPGMC), "Minimax", "MinEnergy", "Gini","HDBSCAN", or "Sparse" |
Fast |
If TRUE and fastcluster installed, then a faster implementation of the methods above can be used except for "Minimax", "MinEnergy", "Gini" or "HDBSCAN" |
Data |
[1:n,1:d] data matrix in the case that |
... |
Further arguments passed on to either |
Please see HierarchicalClusterData and HierarchicalClusterDists or the other functions listed above.
It should be noted that in case of "HDBSCAN" the number of clusters is manually selected by cutree to have the same convention as the other algorithms. Usually, "HDBSCAN" selects the number of clusters automatically.
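As a rough illustration of the hclust-based workflow described above, the following base R sketch builds a single-linkage tree, cuts it with cutree and collects the three outputs. It is not the package's implementation: the synthetic data merely stand in for a dataset such as Hepta$Data, and the list layout only mirrors the documented Cls, Dendrogram and Object entries.

```r
# Synthetic stand-in for a [1:n,1:d] data matrix: three well-separated groups
set.seed(42)
Data <- rbind(
  matrix(rnorm(60, mean = 0,  sd = 0.3), ncol = 3),
  matrix(rnorm(60, mean = 5,  sd = 0.3), ncol = 3),
  matrix(rnorm(60, mean = 10, sd = 0.3), ncol = 3)
)

# Agglomerative clustering on Euclidean distances with single linkage
Distances <- dist(Data, method = "euclidean")
Tree <- hclust(Distances, method = "single")

# Cut the ultrametric tree into k = 3 clusters
Cls <- cutree(Tree, k = 3)

# Assumed output layout, mirroring the documented list entries
out <- list(Cls = Cls, Dendrogram = as.dendrogram(Tree), Object = Tree)
table(out$Cls)
```

In hclust, method = "average" and method = "mcquitty" correspond to the UPGMA and WPGMA variants named under Type.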
List of
Cls |
If ClusterNo>0: [1:n] numerical vector with n numbers defining the classification as the main output of the clustering algorithm. It has k unique numbers representing the arbitrary labels of the clustering. Otherwise, for ClusterNo=0: NULL |
Dendrogram |
Dendrogram of hierarchical clustering algorithm |
Object |
Ultrametric tree of hierarchical clustering algorithm |
Michael Thrun
data('Hepta')
out=HierarchicalClustering(Hepta$Data,ClusterNo=7)
Hierarchical DBSCAN clustering [Campello et al., 2015].
HierarchicalDBSCAN(DataOrDistances,minPts=4, PlotTree=FALSE,PlotIt=FALSE,...)
DataOrDistances |
Either a [1:n,1:d] matrix of a dataset to be clustered, consisting of n cases of d-dimensional data points (every case has d attributes, variables or features), or a [1:n,1:n] symmetric distance matrix. |
minPts |
Classic smoothing factor in density estimates [Campello et al., 2015, p.9] |
PlotIt |
Default: FALSE, If TRUE plots the first three dimensions of the dataset with colored three-dimensional data points defined by the clustering stored in |
PlotTree |
Default: FALSE, If TRUE plots the dendrogram. If minPts is missing, PlotTree is set to TRUE. |
... |
Further arguments to be set for the clustering algorithm, if not set, default arguments are used. |
"Computes the hierarchical cluster tree representing density estimates along with the stability-based flat cluster extraction proposed by Campello et al. (2013). HDBSCAN essentially computes the hierarchy of all DBSCAN* clusterings, and then uses a stability-based extraction method to find optimal cuts in the hierarchy, thus producing a flat solution."[Hahsler et al., 2019]
The inventors claim that the minPts parameter is noncritical [Campello et al., 2015, p.35]. minPts is reported to be set to 4 in all experiments [Campello et al., 2015, p.35].
List of
Cls |
[1:n] numerical vector defining the clustering; this classification is the main output of the algorithm. Points which cannot be assigned to a cluster will be reported as members of the noise cluster with 0. |
Dendrogram |
Dendrogram of hierarchical clustering algorithm |
Tree |
Ultrametric tree of hierarchical clustering algorithm |
Object |
Object defined by clustering algorithm as the other output of this algorithm |
Michael Thrun
[Campello et al., 2015] Campello, R. J., Moulavi, D., Zimek, A., & Sander, J.: Hierarchical density estimates for data clustering, visualization, and outlier detection, ACM Transactions on Knowledge Discovery from Data (TKDD), Vol. 10(1), pp. 1-51. 2015.
[Hahsler et al., 2019] Hahsler M, Piekenbrock M, Doran D: dbscan: Fast Density-Based Clustering with R. Journal of Statistical Software, 91(1), pp. 1-30. doi: 10.18637/jss.v091.i01, 2019
data('Hepta')
out=HierarchicalDBSCAN(Hepta$Data,PlotIt=FALSE)
data('Leukemia')
set.seed(1234)
CA=HierarchicalDBSCAN(Leukemia$DistanceMatrix)
#ClusterCount(CA$Cls)
#ClusterDendrogram(CA$Dendrogram,5,main='H-DBscan')
Perform k-means clustering on a data matrix.
kmeansClustering(DataOrDistances, ClusterNo, Type = 'LBG',RandomNo=5000, CategoricalData, PlotIt=FALSE, Verbose = FALSE,... )
DataOrDistances |
Either nonsymmetric [1:n,1:d] datamatrix of n cases and d numerical features or symmetric [1:n,1:n] distance matrix |
ClusterNo |
A number k which defines k different clusters to be built by the algorithm. |
Type |
Choice of Kmeans algorithm, currently either " |
RandomNo |
Only for " |
CategoricalData |
Only for " |
PlotIt |
Default: FALSE, If TRUE plots the first three dimensions of the dataset with colored three-dimensional data points defined by the clustering stored in |
Verbose |
Print details, if true |
... |
Further arguments like |
Uses either the stats package function 'kmeans', the cclust package implementation, the flexclust package implementation or own code. In case of a distance matrix, RandomNo should be significantly lower than 5000, otherwise a long computation time is to be expected.
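For orientation, a minimal sketch of the stats 'kmeans' route mentioned above, written in base R only. It is an assumption about how the outputs could be assembled, not the wrapper's actual code; the synthetic data and the use of nstart for repeated random starts are illustrative choices.

```r
# Synthetic stand-in for a numerical [1:n,1:d] data matrix
set.seed(42)
Data <- rbind(
  matrix(rnorm(60, mean = 0,  sd = 0.3), ncol = 3),
  matrix(rnorm(60, mean = 5,  sd = 0.3), ncol = 3),
  matrix(rnorm(60, mean = 10, sd = 0.3), ncol = 3)
)

# stats::kmeans with several random starts; the start with the lowest
# total within-cluster sum of squares is kept
Object <- kmeans(Data, centers = 3, nstart = 10, algorithm = "Hartigan-Wong")

# Assumed output layout, mirroring the documented Cls, Object and Centroids
out <- list(Cls = Object$cluster, Object = Object, Centroids = Object$centers)

# Within-cluster sum of squares, one component per cluster
out$Object$withinss
```

The labels in Cls are arbitrary, so a comparison with a prior classification should be permutation-invariant.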
List V of
Cls |
[1:n] numerical vector with n numbers defining the classification as the main output of the clustering algorithm. It has k unique numbers representing the arbitrary labels of the clustering. |
Object |
Object of the clustering algorithm used if existent, otherwise SumDistsToCentroids: Vector of within-cluster sum of squares, one component per cluster |
Centroids |
the final cluster centers. |
The version using a distance matrix is still in the test phase and not yet verified.
Michael Thrun
[Hartigan/Wong, 1979] Hartigan, J. A., & Wong, M. A.: Algorithm AS 136: A k-means clustering algorithm, Journal of the Royal Statistical Society. Series C (Applied Statistics), Vol. 28(1), pp. 100-108. 1979.
[Linde et al., 1980] Linde, Y., Buzo, A., & Gray, R.: An algorithm for vector quantizer design, IEEE Transactions on communications, Vol. 28(1), pp. 84-95. 1980.
[Steinley/Brusco, 2007] Steinley, D., & Brusco, M. J.: Initializing k-means batch clustering: A critical evaluation of several techniques, Journal of Classification, Vol. 24(1), pp. 99-121. 2007.
[Forgy, 1965] Forgy, E. W.: Cluster analysis of multivariate data: efficiency versus interpretability of classifications, Biometrics, Vol. 21, pp. 768-769. 1965.
[MacQueen, 1967] MacQueen, J.: Some methods for classification and analysis of multivariate observations, Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1, pp. 281-297, Oakland, CA, USA, 1967.
[Pelleg/Moore, 2000] Pelleg, D., & Moore, A. W.: X-means: Extending k-means with efficient estimation of the number of clusters, ICML, Vol. 1, 2000.
[Elkan, 2003] Elkan, C.: Using the triangle inequality to accelerate k-means, in Fawcett, T. and Mishra, N. (Eds.), ICML, Vol. 3, pp. 147-153, AAAI Press, 2003.
[Lloyd, 1982] Lloyd, S.: Least squares quantization in PCM, IEEE transactions on information theory, Vol. 28(2), pp. 129-137. 1982.
[Leisch, 2006] Leisch, F.: A toolbox for k-centroids cluster analysis, Computational Statistics & Data Analysis, Vol. 51(2), pp. 526-544. 2006.
[Arthur/Vassilvitskii, 2007] Arthur, D., & Vassilvitskii, S.: k-means++: the advantages of careful seeding, Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, 2007.
[Witten/Tibshirani, 2010] Witten, D. and Tibshirani, R.: A Framework for Feature Selection in Clustering. Journal of the American Statistical Association, Vol. 105(490), pp. 713-726, 2010.
[Hamerly, 2010] Hamerly, Greg: Making k-means even faster, Proceedings of the 2010 SIAM international conference on data mining, Society for Industrial and Applied Mathematics, pp. 130-140, 2010.
[Szepannek, 2018] Szepannek, G.: clustMixType: User-Friendly Clustering of Mixed-Type Data in R, The R Journal, Vol. 10/2, pp. 200-208, doi:10.32614/RJ-2018-048, 2018.
[Curtin, 2017] Curtin, Ryan R: A dual-tree algorithm for fast k-means clustering with large k, Proceedings of the 2017 SIAM International Conference on Data Mining, Society for Industrial and Applied Mathematics, 2017.
data('Hepta')
out=kmeansClustering(Hepta$Data,ClusterNo=7,PlotIt=FALSE)
data('Leukemia')
# As expected does not perform well
# for non-spherical cluster structures:
out=kmeansClustering(Leukemia$DistanceMatrix,ClusterNo=6,RandomNo =10,PlotIt=TRUE)
data('Hepta')
out=kmeansClustering(Hepta$Data,ClusterNo=7,PlotIt=FALSE,Type="Steinley")
data('Hepta')
out=kmeansClustering(Hepta$Data,ClusterNo = 7, Type = "kprototypes",CategoricalData = as.matrix(Hepta$Cls))
Perform k-means clustering on a distance matrix
kmeansDist(Distance, ClusterNo=2,Centers=NULL, RandomNo=1,maxIt = 2000, PlotIt=FALSE,verbose = F)
Distance |
Distance matrix; for n data points it has the dimension n x n. |
ClusterNo |
A number k which defines k different clusters to be built by the algorithm. |
Centers |
Default: NULL. A set of initial (distinct) cluster centres. |
RandomNo |
If >1: scalar defining the number of random initializations, among which the solution with minimal SSE is searched for. |
maxIt |
Optional: Maximum number of iterations before the algorithm terminates. |
PlotIt |
Default: FALSE, If TRUE plots the first three dimensions of the dataset with colored three-dimensional data points defined by the clustering stored in |
verbose |
Optional: Algorithm always outputs current iteration. |
Cls[1:n] |
[1:n] numerical vector with n numbers defining the classification as the main output of the clustering algorithm. It has k unique numbers representing the arbitrary labels of the clustering. |
centerids[1:k] |
Indices of the centroids from which the cluster Cls was created |
Currently an experimental version
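To make the idea of a k-means-like procedure on a distance matrix more concrete, here is a hedged base R sketch in which cluster centres are restricted to data points and returned as indices, in the spirit of the documented centerids output. The function name kmeans_on_distances and its update rule (pick the member with the smallest sum of squared distances to its cluster) are assumptions for illustration and are not taken from the package source.

```r
# Hypothetical helper, not part of FCPS: alternating assignment and update
# steps on a symmetric [1:n,1:n] distance matrix
kmeans_on_distances <- function(Distance, ClusterNo, Centers = NULL, maxIt = 100) {
  n <- nrow(Distance)
  # Centers are indices of data points; draw them at random if none are given
  if (is.null(Centers)) Centers <- sample.int(n, ClusterNo)

  # Assignment step: every point joins the cluster of its nearest centre
  assign_cls <- function(Centers) {
    unname(apply(Distance[, Centers, drop = FALSE], 1, which.min))
  }

  for (it in seq_len(maxIt)) {
    Cls <- assign_cls(Centers)
    NewCenters <- Centers
    for (k in seq_len(ClusterNo)) {
      members <- which(Cls == k)
      if (length(members) > 0) {
        # Update step: member with the smallest sum of squared distances
        cost <- colSums(Distance[members, members, drop = FALSE]^2)
        NewCenters[k] <- members[which.min(cost)]
      }
    }
    if (all(NewCenters == Centers)) break  # centres are stable
    Centers <- NewCenters
  }
  list(Cls = assign_cls(Centers), centerids = Centers)
}

# Usage on synthetic data with one initial centre per group
set.seed(42)
Data <- rbind(
  matrix(rnorm(60, mean = 0,  sd = 0.3), ncol = 3),
  matrix(rnorm(60, mean = 5,  sd = 0.3), ncol = 3),
  matrix(rnorm(60, mean = 10, sd = 0.3), ncol = 3)
)
Distance <- as.matrix(dist(Data))
out <- kmeans_on_distances(Distance, ClusterNo = 3, Centers = c(1, 21, 41))
table(out$Cls)
```

With random initial centres, repeating the procedure several times and keeping the run with the lowest cost would play the role that RandomNo is described as having above.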
Felix Pape, Michael Thrun
data('Hepta')
#out=kmeansDist(as.matrix(dist(Hepta$Data)),ClusterNo=7,PlotIt=FALSE,RandomNo = 10)
## Not run:
data('Leukemia')
#as expected does not perform well
#for non-spherical cluster structures:
#out=kmeansDist(Leukemia$DistanceMatrix,ClusterNo=6,PlotIt=TRUE,RandomNo=10)
## End(Not run)
Clustering Large Applications (clara) of [Rousseeuw/Kaufman, 1990, pp. 126-163]
LargeApplicationClustering(Data, ClusterNo, PlotIt=FALSE,Standardization=TRUE,Samples=50,Random=TRUE,...)
Data |
[1:n,1:d] matrix of dataset to be clustered. It consists of n cases of d-dimensional data points. Every case has d attributes, variables or features. |
ClusterNo |
A number k which defines k different clusters to be built by the algorithm. |
PlotIt |
Default: FALSE, If TRUE plots the first three dimensions of the dataset with colored three-dimensional data points defined by the clustering stored in |
Standardization |
|
Samples |
Integer, say N, the number of samples to be drawn from the dataset. Default value set as recommended by documentation of |
Random |
Logical indicating if R's random number generator should be used instead of the primitive clara()-builtin one. |
... |
Further arguments to be set for the clustering algorithm, if not set, default arguments are used. |
If the clustering output should always be the same, it is recommended to use set.seed instead of setting Random=FALSE, which would use the primitive clara()-builtin random number generator.
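The reproducibility advice above can be sketched directly with cluster::clara. The mapping of Standardization, Samples and Random to clara's stand, samples and rngR arguments is an assumption based on the argument descriptions, and the helper run_clara and the synthetic data are purely illustrative.

```r
library(cluster)  # recommended package, shipped with most R installations

# Synthetic stand-in for a [1:n,1:d] data matrix
set.seed(42)
Data <- rbind(
  matrix(rnorm(60, mean = 0,  sd = 0.3), ncol = 3),
  matrix(rnorm(60, mean = 5,  sd = 0.3), ncol = 3),
  matrix(rnorm(60, mean = 10, sd = 0.3), ncol = 3)
)

# Hypothetical helper: fix the seed, then let clara draw its samples
# with R's random number generator (rngR = TRUE)
run_clara <- function(seed) {
  set.seed(seed)
  clara(Data, k = 3, stand = TRUE, samples = 50, rngR = TRUE)
}

a <- run_clara(1234)
b <- run_clara(1234)

# Same seed, same sub-samples, same clustering vector
identical(a$clustering, b$clustering)
```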
List of
Cls |
[1:n] numerical vector with n numbers defining the classification as the main output of the clustering algorithm. It has k unique numbers representing the arbitrary labels of the clustering. |
Object |
Object defined by clustering algorithm as the other output of this algorithm |
Michael Thrun
[Rousseeuw/Kaufman, 1990] Rousseeuw, P. J., & Kaufman, L.: Finding groups in data, Belgium, John Wiley & Sons Inc., ISBN: 0471735787, doi 10.1002/9780470316801, Online ISBN: 9780470316801, 1990.
data('Hepta')
out=LargeApplicationClustering(Hepta$Data,ClusterNo=7,PlotIt=FALSE)
Data is anonymized. The original dataset was published in [Haferlach et al., 2010] and had around 12,000 dimensions. A detailed description of the preprocessed dataset and its clustering challenge is provided in [Thrun/Ultsch, 2020].
data("Leukemia")
554x554 distance matrix. Cls defines the following clusters:
1=APL Outlier
2=APL
3=Healthy
4=AML
5=CLL
6=CLL Outlier
[Thrun, 2018] Thrun, M. C.: Projection Based Clustering through Self-Organization and Swarm Intelligence, doctoral dissertation 2017, Springer, Heidelberg, ISBN: 978-3-658-20539-3, doi:10.1007/978-3-658-20540-9, 2018.
[Haferlach et al., 2010] Haferlach, T., Kohlmann, A., Wieczorek, L., Basso, G., Te Kronnie, G., Bene, M.-C., . . . Mills, K. I.: Clinical utility of microarray-based gene expression profiling in the diagnosis and subclassification of leukemia: report from the International Microarray Innovations in Leukemia Study Group, Journal of Clinical Oncology, Vol. 28(15), pp. 2529-2537. 2010.
[Thrun/Ultsch, 2020] Thrun, M. C., & Ultsch, A.: Clustering Benchmark Datasets Exploiting the Fundamental Clustering Problems, Data in Brief, Vol. 30(C), pp. 105501, doi:10.1016/j.dib.2020.105501, 2020.
data(Leukemia)
str(Leukemia)
Cls=Leukemia$Cls
Distance=Leukemia$DistanceMatrix
isSymmetric(Distance)
Clearly defined clusters, different variances. Detailed description of dataset and its clustering challenge is provided in [Thrun/Ultsch, 2020].
data("Lsun3D")
Size 404, Dimensions 3
Dataset defines discontinuities, where the clusters have different variances. Three main clusters, and four outliers (in cluster 4). For a more detailed description see [Thrun, 2018].
[Thrun, 2018] Thrun, M. C.: Projection Based Clustering through Self-Organization and Swarm Intelligence, doctoral dissertation 2017, Springer, Heidelberg, ISBN: 978-3-658-20539-3, doi:10.1007/978-3-658-20540-9, 2018.
[Thrun/Ultsch, 2020] Thrun, M. C., & Ultsch, A.: Clustering Benchmark Datasets Exploiting the Fundamental Clustering Problems, Data in Brief, Vol. 30(C), pp. 105501, doi:10.1016/j.dib.2020.105501, 2020.
data(Lsun3D)
str(Lsun3D)
Cls=Lsun3D$Cls
Data=Lsun3D$Data
Graph clustering algorithm introduced by [van Dongen, 2000].
MarkovClustering(DataOrDistances=NULL,Adjacency=NULL, Radius=TRUE,DistanceMethod="euclidean",addLoops = TRUE,PlotIt=FALSE,...)
DataOrDistances |
NULL or: Either [1:n,1:n] symmetric distance matrix or [1:n,1:d] not symmetric data matrix of n cases and d variables |
Adjacency |
Used if |
Radius |
Scalar, radius for the unit disk graph (r-ball graph) if the adjacency matrix is missing. Automatic estimation can be done either with Radius=TRUE [Ultsch, 2005] or Radius=FALSE [Thrun et al., 2016] if Data instead of Distances are given. |
DistanceMethod |
Optional distance method of data, default is euclid, see |
addLoops |
Logical; if TRUE, self-loops with weight 1 are added to each vertex of x (see |
PlotIt |
Default: FALSE, If TRUE plots the first three dimensions of the dataset with colored three-dimensional data points defined by the clustering stored in |
... |
Further arguments to be set for the clustering algorithm, if not set, default arguments are used. |
DataOrDistances is used to compute the Adjacency matrix if that input is missing. In this case a unit-disk (r-ball) graph is calculated.
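The construction of a unit-disk graph from distances can be sketched in a few lines of base R. This is only an illustration under stated assumptions: the radius is fixed by hand here rather than estimated automatically, and how the package builds its adjacency matrix internally is not shown in this documentation.

```r
# Synthetic stand-in for a [1:n,1:d] data matrix
set.seed(42)
Data <- rbind(
  matrix(rnorm(60, mean = 0,  sd = 0.3), ncol = 3),
  matrix(rnorm(60, mean = 5,  sd = 0.3), ncol = 3),
  matrix(rnorm(60, mean = 10, sd = 0.3), ncol = 3)
)
Distances <- as.matrix(dist(Data, method = "euclidean"))

Radius <- 3        # hypothetical fixed radius, chosen by hand
addLoops <- TRUE   # mirrors the documented self-loop option

# Two points are connected if they lie within the radius of each other
Adjacency <- (Distances <= Radius) * 1

# Self-loops with weight 1 if requested, none otherwise
diag(Adjacency) <- if (addLoops) 1 else 0

# Number of neighbours per vertex (including the self-loop)
summary(rowSums(Adjacency))
```

The resulting symmetric 0/1 matrix is the kind of input a graph clustering algorithm such as Markov clustering operates on.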
List of
Cls |
[1:n] numerical vector with n numbers defining the classification as the main output of the clustering algorithm. It has k unique numbers representing the arbitrary labels of the clustering. Points which cannot be assigned to a cluster will be reported with 0. |
Object |
Object defined by clustering algorithm as the other output of this algorithm |
Michael Thrun
[van Dongen, 2000] van Dongen, S.M.: Graph Clustering by Flow Simulation. Ph.D. thesis, University of Utrecht. Utrecht University Repository: http://dspace.library.uu.nl/handle/1874/848, 2000.
[Thrun et al., 2016] Thrun, M. C., Lerch, F., Loetsch, J., & Ultsch, A. : Visualization and 3D Printing of Multivariate Data of Biomarkers, in Skala, V. (Ed.), International Conference in Central Europe on Computer Graphics, Visualization and Computer Vision (WSCG), Vol. 24, Plzen, 2016.
[Ultsch, 2005] Ultsch, A.: Pareto density estimation: A density estimation for knowledge discovery, In Baier, D. & Wernecke, K. D. (Eds.), Innovations in classification, data science, and information systems, (Vol. 27, pp. 91-100), Berlin, Germany, Springer, 2005.
data('Hepta')
out=MarkovClustering(Data=Hepta$Data,PlotIt=FALSE)
Mean Shift Clustering of [Cheng, 1995]
MeanShiftClustering(Data, PlotIt=FALSE,...)
Data |
[1:n,1:d] matrix of dataset to be clustered. It consists of n cases of d-dimensional data points. Every case has d attributes, variables or features. |
PlotIt |
Default: FALSE, If TRUE plots the first three dimensions of the dataset with colored three-dimensional data points defined by the clustering stored in |
... |
Further arguments to be set for the clustering algorithm, if not set, default arguments are used. |
The radius used for search can be specified with the "radius" parameter. The maximum number of iterations before algorithm termination is controlled with the "max_iterations" parameter.
If the distance between two centroids is less than the given radius, one will be removed. A radius of 0 or less means an estimate will be calculated and used for the radius. Default value "0" (numeric).
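The roles of the radius and of the iteration limit can be illustrated with a small flat-kernel mean shift written in base R. This is a sketch under assumptions, not the algorithm called by the package: the function mean_shift_sketch is hypothetical, the radius must be positive here (no automatic estimate), and modes closer than the radius are merged via single linkage as a stand-in for the centroid removal described above.

```r
# Hypothetical helper, not part of FCPS: flat-kernel mean shift
mean_shift_sketch <- function(Data, radius, max_iterations = 100) {
  Modes <- Data  # every data point starts as its own mode candidate
  for (it in seq_len(max_iterations)) {
    # Shift each candidate to the mean of all data points within the radius
    NewModes <- t(apply(Modes, 1, function(m) {
      d <- sqrt(rowSums(sweep(Data, 2, m)^2))
      colMeans(Data[d <= radius, , drop = FALSE])
    }))
    converged <- max(abs(NewModes - Modes)) < 1e-8
    Modes <- NewModes
    if (converged) break
  }
  # Assumed merging rule: candidates closer than the radius share a cluster
  Cls <- cutree(hclust(dist(Modes), method = "single"), h = radius)
  list(Cls = unname(Cls), Modes = Modes)
}

# Usage on synthetic data with three well-separated groups
set.seed(42)
Data <- rbind(
  matrix(rnorm(60, mean = 0,  sd = 0.3), ncol = 3),
  matrix(rnorm(60, mean = 5,  sd = 0.3), ncol = 3),
  matrix(rnorm(60, mean = 10, sd = 0.3), ncol = 3)
)
out <- mean_shift_sketch(Data, radius = 2)
table(out$Cls)
```

Note that the number of clusters is not an input: it follows from the radius, which is why the function above has no ClusterNo argument.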
List of
Cls |
[1:n] numerical vector with n numbers defining the classification as the main output of the clustering algorithm. It has k unique numbers representing the arbitrary labels of the clustering. |
Object |
Object defined by clustering algorithm as the other output of this algorithm |
Michael Thrun
[Cheng, 1995] Cheng, Yizong: Mean Shift, Mode Seeking, and Clustering, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 17 (8), pp. 790-799, doi:10.1109/34.400568, 1995.
data('Hepta')
out=MeanShiftClustering(Hepta$Data,PlotIt=FALSE,radius=1)
Hierarchical clustering using the minimal energy approach of [Szekely/Rizzo, 2005].
MinimalEnergyClustering(DataOrDistances, ClusterNo = 0, DistanceMethod="euclidean", ColorTreshold = 0,Data,...)
DataOrDistances |
[1:n,1:d] matrix of dataset to be clustered. It consists of n cases of d-dimensional data points. Every case has d attributes, variables or features. Alternatively, symmetric [1:n,1:n] distance matrix |
ClusterNo |
A number k which defines k different clusters to be built by the algorithm. |
DistanceMethod |
See |
ColorTreshold |
Draws a cut line at the given height of the dendrogram y-axis; the height of the line should be given as a scalar. |
Data |
[1:n,1:d] data matrix in the case that |
... |
In case of plotting further argument for |
List of
Cls |
If ClusterNo>0: [1:n] numerical vector with n numbers defining the classification as the main output of the clustering algorithm. It has k unique numbers representing the arbitrary labels of the clustering. Otherwise, for ClusterNo=0: NULL |
Dendrogram |
Dendrogram of hierarchical clustering algorithm |
Object |
Ultrametric tree of hierarchical clustering algorithm |
Michael Thrun
[Szekely/Rizzo, 2005] Szekely, G. J. and Rizzo, M. L.: Hierarchical Clustering via Joint Between-Within Distances: Extending Ward's Minimum Variance Method, Journal of Classification, Vol. 22(2), pp. 151-183, doi:10.1007/s00357-005-0012-9, 2005.
data('Hepta')
out=MinimalEnergyClustering(Hepta$Data,ClusterNo=7)
In the minimax linkage hierarchical clustering every cluster has an associated prototype element that represents that cluster [Bien/Tibshirani, 2011].
MinimaxLinkageClustering(DataOrDistances, ClusterNo = 0, DistanceMethod="euclidean", ColorTreshold = 0,...)
DataOrDistances |
[1:n,1:d] matrix of dataset to be clustered. It consists of n cases of d-dimensional data points. Every case has d attributes, variables or features. Alternatively, symmetric [1:n,1:n] distance matrix |
ClusterNo |
A number k which defines k different clusters to be built by the algorithm. |
DistanceMethod |
See |
ColorTreshold |
Draws a cut line at the given height of the dendrogram y-axis; the height of the line should be given as a scalar. |
... |
In case of plotting further argument for |
List of
Cls |
If ClusterNo>0: [1:n] numerical vector with n numbers defining the classification as the main output of the clustering algorithm. It has k unique numbers representing the arbitrary labels of the clustering. Otherwise, for ClusterNo=0: NULL |
Dendrogram |
Dendrogram of hierarchical clustering algorithm |
Object |
Ultrametric tree of hierarchical clustering algorithm |
Michael Thrun
[Bien/Tibshirani, 2011] Bien, J., and Tibshirani, R.: Hierarchical Clustering with Prototypes via Minimax Linkage, The Journal of the American Statistical Association, Vol. 106(495), pp. 1075-1084, 2011.
data('Hepta')
out=MinimaxLinkageClustering(Hepta$Data,ClusterNo=7)
Calls Model based clustering of [Fraley/Raftery, 2006] which models a Mixture Of Gaussians (MoG).
ModelBasedClustering(Data,ClusterNo=2,PlotIt=FALSE,...)
Data |
[1:n,1:d] matrix of dataset to be clustered. It consists of n cases of d-dimensional data points. Every case has d attributes, variables or features. |
ClusterNo |
A number k which defines k different clusters to be built by the algorithm. |
PlotIt |
Default: FALSE, If TRUE plots the first three dimensions of the dataset with colored three-dimensional data points defined by the clustering stored in |
... |
Further arguments to be set for the clustering algorithm, if not set, default arguments are used. |
See [Thrun, 2017, p. 23] or [Fraley/Raftery, 2002] and [Fraley/Raftery, 2006].
List of
Cls |
[1:n] numerical vector with n numbers defining the classification as the main output of the clustering algorithm. It has k unique numbers representing the arbitrary labels of the clustering. |
Object |
Object defined by clustering algorithm as the other output of this algorithm |
MoGclustering used in [Thrun, 2017] was renamed to ModelBasedClustering in this package.
Michael Thrun
[Thrun, 2017] Thrun, M. C.: A System for Projection Based Clustering through Self-Organization and Swarm Intelligence, (Doctoral dissertation), Philipps-Universitaet Marburg, Marburg, 2017.
[Fraley/Raftery, 2002] Fraley, C., and Raftery, A. E.: Model-based clustering, discriminant analysis, and density estimation, Journal of the American Statistical Association, Vol. 97(458), pp. 611-631. 2002.
[Fraley/Raftery, 2006] Fraley, C., and Raftery, A. E.: MCLUST version 3: an R package for normal mixture modeling and model-based clustering, DTIC Document, 2006.
data('Hepta')
out=ModelBasedClustering(Hepta$Data,PlotIt=FALSE)
Model-based clustering with variable selection and estimation of the number of clusters, which is based either on [Marbac/Sedki, 2017], [Marbac et al., 2020], or on [Scrucca and Raftery, 2014].
ModelBasedVarSelClustering(Data,ClusterNo,Type,PlotIt=FALSE, ...)
Data |
[1:n,1:d] matrix of dataset to be clustered. It consists of n cases of d-dimensional data points. Every case has d attributes, variables or features. |
ClusterNo |
Numeric which defines the number of clusters to search for. |
Type |
String, either |
PlotIt |
(optional) Boolean. Default = FALSE = No plotting performed. |
... |
Further arguments passed on to VarSelCluster or clustvarsel. |
List of
Cls |
[1:n] numerical vector with n numbers defining the classification as the main output of the clustering algorithm. It has k unique numbers representing the arbitrary labels of the clustering. |
Object |
Object defined by clustering algorithm as the other output of this algorithm |
Quirin Stier, Michael Thrun
[Marbac/Sedki, 2017] Marbac, M. and Sedki, M.: Variable selection for model-based clustering using the integrated complete-data likelihood. Statistics and Computing, 27(4), pp. 1049-1063, 2017.
[Marbac et al., 2020] Marbac, M., Sedki, M., & Patin, T.: Variable selection for mixed data clustering: application in human population genomics, Journal of Classification, Vol. 37(1), pp. 124-142. 2020.
# Hepta
data("Hepta")
Data = Hepta$Data
V = ModelBasedVarSelClustering(Data, ClusterNo=7,Type="VarSelLCM")
Cls = V$Cls
ClusterAccuracy(Hepta$Cls, Cls, K = 7)
V = ModelBasedVarSelClustering(Data, ClusterNo=7,Type="clustvarsel")
Cls = V$Cls
ClusterAccuracy(Hepta$Cls, Cls, K = 7)
## Not run:
# Hearts
heart=VarSelLCM::heart
ztrue <- heart[,"Class"]
Data <- heart[,-13]
V <- ModelBasedVarSelClustering(Data,2,Type="VarSelLCM")
Cls = V$Cls
ClusterAccuracy(ztrue, Cls, K = 2)
## End(Not run)
MixtureOfGaussians (MoG) clustering based on Expectation Maximization (EM) of [Chen et al., 2012] or algorithms closely resembling EM of [Benaglia et al., 2009].
MoGclustering(Data,ClusterNo=2,Type,PlotIt=FALSE,Silent=TRUE,...)
Data |
[1:n,1:d] matrix of dataset to be clustered. It consists of n cases of d-dimensional data points. Every case has d attributes, variables or features. |
ClusterNo |
A number k which defines k different clusters to be built by the algorithm. |
Type |
string defining the approach to select: initialization approach of "EM" or "kmeans" of [Chen et al., 2012], or other methods "mvnormalmixEM" [McLachlan/Peel, 2000], "npEM" [Benaglia et al., 2009] or its extension "mvnpEM" [Chauveau/Hoang, 2016]. |
PlotIt |
Default: FALSE, if TRUE plots the first three dimensions of the dataset with colored three-dimensional data points defined by the clustering stored in |
Silent |
(optional) Boolean: print output or not (the usage above sets the default to TRUE) |
... |
Further arguments to be set for the clustering algorithm, if not set, default arguments are used, see the packages EMCluster or mixtools for details. |
Algorithms for clustering through EM or methods closely resembling it.
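To make the EM idea concrete, the following is a minimal base R sketch of expectation maximization for a two-component, one-dimensional Gaussian mixture. It is an illustration only and not the implementation used by EMCluster or mixtools; the toy data, the starting values and the fixed number of 50 iterations are assumptions chosen for this example.

```r
# Minimal EM for a two-component, one-dimensional Gaussian mixture (illustration only)
set.seed(1)
x <- c(rnorm(100, mean = 0, sd = 1), rnorm(100, mean = 10, sd = 1))

# Starting values (hypothetical choice): data extremes, pooled spread, equal weights
mu     <- c(min(x), max(x))
sigma  <- rep(sd(x), 2)
weight <- c(0.5, 0.5)

for (iter in 1:50) {
  # E-step: responsibility of each component for each point
  dens <- cbind(weight[1] * dnorm(x, mu[1], sigma[1]),
                weight[2] * dnorm(x, mu[2], sigma[2]))
  resp <- dens / rowSums(dens)

  # M-step: re-estimate weights, means and standard deviations
  nk     <- colSums(resp)
  weight <- nk / length(x)
  mu     <- colSums(resp * x) / nk
  sigma  <- sqrt(colSums(resp * outer(x, mu, "-")^2) / nk)
}

# Hard assignment of every point to its most responsible component,
# analogous to the Cls vector returned by the wrappers
Cls <- max.col(resp)
```

On such well-separated data the estimated means end up close to 0 and 10, and the hard assignment recovers the two groups.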
List of
Cls |
[1:n] numerical vector with n numbers defining the classification as the main output of the clustering algorithm. It has k unique numbers representing the arbitrary labels of the clustering. |
Object |
Object defined by clustering algorithm as the other output of this algorithm |
MoG used in [Thrun, 2017] was renamed to ModelBasedClustering in this package. Type="mvnormalmixEM" sometimes fails.
Michael Thrun
[Chen et al., 2012] Chen, W., Maitra, R., & Melnykov, V.: EMCluster: EM Algorithm for Model-Based Clustering of Finite Mixture Gaussian Distribution, R Package, URL http://cran.r-project.org/package=EMCluster, 2012.
[Chauveau/Hoang, 2016] Chauveau, D., & Hoang, V. T. L.: Nonparametric mixture models with conditionally independent multivariate component densities, Computational Statistics & Data Analysis, Vol. 103, pp. 1-16. 2016.
[Benaglia et al., 2009] Benaglia, T., Chauveau, D., and Hunter, D. R.: An EM-like algorithm for semi-and nonparametric estimation in multivariate mixtures. Journal of Computational and Graphical Statistics, 18(2), pp. 505-526, 2009.
[McLachlan/Peel, 2000] McLachlan, G. J. and Peel, D.: Finite Mixture Models, John Wiley and Sons, Inc, 2000.
data('Hepta')
Data = Hepta$Data
out=MoGclustering(Data,ClusterNo=7,Type="EM",PlotIt=FALSE)
V=out$Cls
V1 = MoGclustering(Data,ClusterNo=7,Type="mvnpEM")
Cls1 = V1$Cls
V2 = MoGclustering(Data,ClusterNo=7,Type="npEM")
Cls2 = V2$Cls
## Not run:
# does not always work
V3 = MoGclustering(Data,ClusterNo=7,Type="mvnormalmixEM")
Cls3 = V3$Cls
## End(Not run)
Performs the MST-kNN clustering algorithm, which generates a clustering solution with automatic k determination using two proximity graphs, Minimal Spanning Tree (MST) and k-Nearest Neighbor (kNN), which are recursively intersected.
MSTclustering(DataOrDistances, DistanceMethod = "euclidean",PlotIt=FALSE, ...)
DataOrDistances |
Either a [1:n,1:n] symmetric distance matrix or a [1:n,1:d] non-symmetric data matrix of n cases and d variables |
DistanceMethod |
Optional distance method of data, default is "euclidean", see |
PlotIt |
Default: FALSE, if TRUE plots the first three dimensions of the dataset with colored three-dimensional data points defined by the clustering stored in |
... |
Optional, further arguments for |
Does not work on Hepta with euclidean distances.
List of
Cls |
[1:n] numerical vector with n numbers defining the classification as the main output of the clustering algorithm. It has k unique numbers representing the arbitrary labels of the clustering. |
Object |
Object defined by clustering algorithm as the other output of this algorithm |
Michael Thrun
[Inostroza-Ponta, 2008] Inostroza-Ponta, M.: An integrated and scalable approach based on combinatorial optimization techniques for the analysis of microarray data, University of Newcastle, ISBN, 2008
data(Hepta)
MSTclustering(Hepta$Data)
Either Leiden [Traag et al., 2019] or Louvain [Blondel et al., 2008] clustering.
NetworkClustering(DataOrDistances=NULL,Adjacency=NULL, Type="louvain",Radius=FALSE,PlotIt=FALSE,...)
DataOrDistances |
NULL or: [1:n,1:d] matrix of dataset to be clustered. It consists of n cases of d-dimensional data points. Every case has d attributes, variables or features. Alternatively, a symmetric [1:n,1:n] distance matrix. |
Adjacency |
Used if |
Type |
Either "louvain" or "leiden" |
Radius |
Scalar, Radius for unit disk graph (r-ball graph) if adjacency matrix is missing. Automatic estimation can be done either with =TRUE [Ultsch, 2005] or FALSE [Thrun et al., 2016] |
PlotIt |
Default: FALSE, If TRUE plots the first three dimensions of the dataset with colored three-dimensional data points defined by the clustering stored in |
... |
Further arguments to be set for the clustering algorithm, if not set, default arguments are used. |
DataOrDistances is used to compute the Adjacency matrix if this input is missing. Then a unit-disk (R-ball) graph is calculated. Radius=TRUE only works if a data matrix is given.
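As an illustration of how a unit-disk (R-ball) graph can be derived from data, the sketch below builds an adjacency matrix in base R by thresholding Euclidean distances. The toy data and the value of Radius are assumptions made for this example, and the package's automatic radius estimation is not reproduced here.

```r
# Six toy points forming two well-separated groups (not from the package)
Data <- rbind(c(0, 0), c(0, 1), c(1, 0),
              c(10, 10), c(10, 11), c(11, 10))

# Symmetric [1:n,1:n] Euclidean distance matrix
Distances <- as.matrix(dist(Data, method = "euclidean"))

# Hypothetical radius chosen by hand for this example
Radius <- 2

# Unit-disk graph: two distinct points are connected if their distance is at most Radius
Adjacency <- (Distances <= Radius) * 1
diag(Adjacency) <- 0
```

With this radius, every point is connected only to the other two points of its own group, so the graph falls into two components.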
List of
Cls |
[1:n] numerical vector with n numbers defining the classification as the main output of the clustering algorithm. It has k unique numbers representing the arbitrary labels of the clustering. Points which cannot be assigned to a cluster will be reported with 0. |
Object |
Object defined by clustering algorithm as the other output of this algorithm |
Leiden requires the igraph package and an installed Python version. Automatic installation may not work; in that case, run the following manually in the console: conda install -c conda-forge leidenalg
Michael Thrun
[Blondel et al., 2008] Blondel, V. D., Guillaume, J.-L., Lambiotte, R., & Lefebvre, E.: Fast unfolding of communities in large networks, Journal of statistical mechanics: theory and experiment, Vol. 2008(10), pp. P10008. 2008.
[Traag et al., 2019] Traag, V. A., Waltman, L., & van Eck, N. J.: From Louvain to Leiden: guaranteeing well-connected communities, Scientific reports, Vol. 9(1), pp. 1-12. 2019.
data('Hepta')
#out=NetworkClustering(Hepta$Data,PlotIt=FALSE)
Neural gas clustering published by [Martinetz et al., 1993] and implemented by [Dimitriadou, 2002].
NeuralGasClustering(Data, ClusterNo,PlotIt=FALSE,...)
Data |
[1:n,1:d] matrix of dataset to be clustered. It consists of n cases of d-dimensional data points. Every case has d attributes, variables or features. |
ClusterNo |
A number k which defines k different clusters to be built by the algorithm. |
PlotIt |
Default: FALSE, If TRUE plots the first three dimensions of the dataset with colored three-dimensional data points defined by the clustering stored in |
... |
Further arguments to be set for the clustering algorithm, if not set, default arguments are used. |
List of
Cls |
[1:n] numerical vector with n numbers defining the classification as the main output of the clustering algorithm. It has k unique numbers representing the arbitrary labels of the clustering. |
Object |
Object defined by clustering algorithm as the other output of this algorithm |
Michael Thrun
[Dimitriadou, 2002] Dimitriadou, E.: cclust-convex clustering methods and clustering indexes. R package, 2002.
[Martinetz et al., 1993] Martinetz, T. M., Berkovich, S. G., & Schulten, K. J.: 'Neural-gas' network for vector quantization and its application to time-series prediction, IEEE Transactions on Neural Networks, Vol. 4(4), pp. 558-569. 1993.
data('Hepta')
out=NeuralGasClustering(Hepta$Data,ClusterNo=7,PlotIt=FALSE)
OPTICS (Ordering points to identify the clustering structure) clustering algorithm [Ankerst et al.,1999].
OPTICSclustering(Data, MaxRadius,RadiusThreshold, minPts = 5, PlotIt=FALSE,...)
Data |
[1:n,1:d] matrix of dataset to be clustered. It consists of n cases of d-dimensional data points. Every case has d attributes, variables or features. |
MaxRadius |
Upper limit of the neighborhood in the R-ball graph (unit disk graph), i.e., the size of the epsilon neighborhood (eps) [Ester et al., 1996, p. 227]. If NULL, automatic estimation is done using insights of [Ultsch, 2005]. |
RadiusThreshold |
Threshold to identify clusters (RadiusThreshold <= MaxRadius), if NULL |
minPts |
Minimum number of points in the eps region (for core points). In principle, the minimum number of points in the unit disk if the unit disk is within the cluster (core) [Ester et al., 1996, p. 228]. If NULL, it is 2.5 percent of points. |
PlotIt |
Default: FALSE, If TRUE plots the first three dimensions of the dataset with colored three-dimensional data points defined by the clustering stored in |
... |
Further arguments to be set for the clustering algorithm, if not set, default arguments are used. |
List of
Cls |
[1:n] numerical vector defining the clustering; this classification is the main output of the algorithm. Points which cannot be assigned to a cluster will be reported as members of the noise cluster with 0. |
Object |
Object defined by clustering algorithm as the other output of this algorithm |
Michael Thrun
[Ankerst et al.,1999] Mihael Ankerst, Markus M. Breunig, Hans-Peter Kriegel, Joerg Sander: OPTICS: Ordering Points To Identify the Clustering Structure, ACM SIGMOD international conference on Management of data, ACM Press, pp. 49-60, 1999.
[Ester et al., 1996] Ester, M., Kriegel, H.-P., Sander, J., & Xu, X.: A density-based algorithm for discovering clusters in large spatial databases with noise, Proc. Kdd, Vol. 96, pp. 226-231, 1996.
[Ultsch, 2005] Ultsch, A.: Pareto density estimation: A density estimation for knowledge discovery, In Baier, D. & Wernecke, K. D. (Eds.), Innovations in classification, data science, and information systems, (Vol. 27, pp. 91-100), Berlin, Germany, Springer, 2005.
data('Hepta')
out=OPTICSclustering(Hepta$Data,MaxRadius=NULL,RadiusThreshold=NULL,minPts=NULL,PlotIt = FALSE)
Partitioning (clustering) of the data into k clusters around medoids, a more robust version of k-means [Rousseeuw/Kaufman, 1990, p. 68-125].
PAMclustering(DataOrDistances,ClusterNo, PlotIt=FALSE,Standardization=TRUE,Data,...)
DataOrDistances |
[1:n,1:d] matrix of dataset to be clustered. It consists of n cases of d-dimensional data points. Every case has d attributes, variables or features. Alternatively, symmetric [1:n,1:n] distance matrix |
ClusterNo |
A number k which defines k different clusters to be built by the algorithm. |
PlotIt |
Default: FALSE, If TRUE plots the first three dimensions of the dataset with colored three-dimensional data points defined by the clustering stored in |
Standardization |
|
Data |
[1:n,1:d] data matrix in the case that |
... |
Further arguments to be set for the clustering algorithm, if not set, default arguments are used. |
For details, see [Rousseeuw/Kaufman, 1990, chapter 2] or [Reynolds et al., 1992].
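The robustness of medoids compared with means can be seen in a small base R sketch. It only computes the medoid of a single, already given cluster and is not the full PAM procedure; the toy values are an assumption made for this example.

```r
# One cluster of one-dimensional toy values with a single outlier (not from the package)
x <- c(1, 2, 3, 4, 100)

# Pairwise distances between all members of the cluster
Distances <- as.matrix(dist(x))

# The medoid is the member with the smallest total distance to all other members
MedoidIndex <- which.min(rowSums(Distances))
Medoid <- x[MedoidIndex]

# The mean, as used by k-means, is pulled toward the outlier
Centroid <- mean(x)
```

Here the medoid is 3, an actual data point near the bulk of the cluster, whereas the mean is 22.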
List of
Cls |
[1:n] numerical vector with n numbers defining the classification as the main output of the clustering algorithm. It has k unique numbers representing the arbitrary labels of the clustering. |
Object |
Object defined by clustering algorithm as the other output of this algorithm |
Michael Thrun
[Rousseeuw/Kaufman, 1990] Rousseeuw, P. J., & Kaufman, L.: Finding groups in data, Belgium, John Wiley & Sons Inc., ISBN: 0471735787, doi:10.1002/9780470316801, Online ISBN: 9780470316801, 1990.
[Reynolds et al., 1992] Reynolds, A., Richards, G., de la Iglesia, B. and Rayward-Smith, V.: Clustering rules: A comparison of partitioning and hierarchical clustering algorithms, Journal of Mathematical Modelling and Algorithms 5, 475-504, DOI:10.1007/s10852-005-9022-1, 1992.
data('Hepta')
out=PAMclustering(Hepta$Data,ClusterNo=7,PlotIt=FALSE)
Clustering via nonparametric density estimation.
pdfClustering(Data, PlotIt = FALSE, ...)
Data |
[1:n,1:d] matrix of dataset to be clustered. It consists of n cases of d-dimensional data points. Every case has d attributes, variables or features. |
PlotIt |
Default: FALSE, if TRUE plots the first three dimensions of the dataset with colored three-dimensional data points defined by the clustering stored in |
... |
Further arguments to be set for the clustering algorithm, if not set, default arguments are used. |
Cluster analysis is performed by the density-based procedures described in Azzalini and Torelli (2007) and Menardi and Azzalini (2014), and summarized in Azzalini and Menardi (2014).
List of
Cls |
[1:n] numerical vector with n numbers defining the classification as the main output of the clustering algorithm. It has k unique numbers representing the arbitrary labels of the clustering. |
Object |
Object defined by clustering algorithm as the other output of this algorithm |
Michael Thrun
Azzalini, A., Menardi, G. (2014). Clustering via nonparametric density estimation: the R package pdfCluster. Journal of Statistical Software, 57(11), 1-26, URL http://www.jstatsoft.org/v57/i11/.
Azzalini A., Torelli N. (2007). Clustering via nonparametric density estimation. Statistics and Computing. 17, 71-80.
Menardi, G., Azzalini, A. (2014). An advancement in clustering via nonparametric density estimation. Statistics and Computing. DOI: 10.1007/s11222-013-9400-x.
data('Hepta')
out=pdfClustering(Hepta$Data,PlotIt=FALSE)
Clustering is performed through penalized regression with grouping pursuit.
PenalizedRegressionBasedClustering(Data, FirstLambda, SecondLambda, Tau, PlotIt = FALSE, ...)
Data |
[1:n,1:d] matrix of dataset to be clustered. It consists of n cases of d-dimensional data points. Every case has d attributes, variables or features. |
FirstLambda |
Set 1 for quadratic penalty based algorithm, 0.4 for revised ADMM. |
SecondLambda |
The magnitude of grouping penalty. |
Tau |
Tuning parameter: tau, related to grouping penalty. |
PlotIt |
Default: FALSE, if TRUE plots the first three dimensions of the dataset with colored three-dimensional data points defined by the clustering stored in |
... |
Further arguments for |
Parameters are rather challenging to choose.
List of
Cls |
[1:n] numerical vector with n numbers defining the classification as the main output of the clustering algorithm. It has k unique numbers representing the arbitrary labels of the clustering. |
Object |
Object defined by clustering algorithm as the other output of this algorithm |
Data matrix is internally transposed in order to fit the definition of the algorithm.
Michael Thrun
[Pan et al., 2013] Pan, W., Shen, X., & Liu, B.: Cluster analysis: unsupervised learning via supervised learning with a non-convex penalty, The Journal of Machine Learning Research, Vol. 14(1), pp. 1865-1889. 2013.
[Wu et al., 2016] Wu, C., Kwon, S., Shen, X., & Pan, W.: A new algorithm and theory for penalized regression-based clustering, The Journal of Machine Learning Research, Vol. 17(1), pp. 6479-6503. 2016.
data(Hepta)
Data=Hepta$Data
out=PenalizedRegressionBasedClustering(Data,0.4,1,2,PlotIt=FALSE)
table(out$Cls,Hepta$Cls)
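Because the labels in Cls are arbitrary, a clustering is best compared with a prior classification through a contingency table, as in the call to table() above. The following base R sketch shows why; the two label vectors are toy values assumed for this example.

```r
# Two labelings of six points that describe the same partition with permuted labels
ClsA <- c(1, 1, 2, 2, 3, 3)
ClsB <- c(3, 3, 1, 1, 2, 2)

# Element-wise agreement is misleading because the label values are arbitrary
NaiveAgreement <- mean(ClsA == ClsB)

# The contingency table shows the one-to-one correspondence between the labels
Tab <- table(ClsA, ClsB)
print(Tab)
```

The naive agreement is 0 although the partitions are identical, while the table has exactly one non-zero cell in every row and column.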
Summarizes recent projection pursuit methods for clustering based on [Hofmeyr/Pavlidis, 2015], [Hofmeyr, 2016] and [Pavlidis et al., 2016].
ProjectionPursuitClustering(Data,ClusterNo,Type="MinimumDensity", PlotIt=FALSE,PlotSolution=FALSE,...)
Data |
[1:n,1:d] matrix of dataset to be clustered. It consists of n cases of d-dimensional data points. Every case has d attributes, variables or features. |
ClusterNo |
A number k which defines k different clusters to be built by the algorithm. |
Type |
Either
|
PlotIt |
Default: FALSE, if TRUE plots the first three dimensions of the dataset with colored three-dimensional data points defined by the clustering stored in |
PlotSolution |
Plots the partitioning solution as a tree as described in |
... |
Further arguments to be set for the clustering algorithm, if not set, default arguments are used. |
The details of the options for projection pursuit and partitioning of data are defined in [Hofmeyr/Pavlidis, 2019].
"KernelPCA" uses additionally the package kernlab and is implemented as given in the fifth example on page 21, section "extension" of [Hofmeyr/Pavlidis, 2019].
The first idea of using non-PCA projections for clustering was published by [Bock, 1987] as a definition. However, to the knowledge of the author, it was not applied to any data. The first systematic comparison to Projection-Pursuit Methods ProjectionPursuitClustering and AutomaticProjectionBasedClustering can be found in [Thrun/Ultsch, 2018]. For PCA-based clustering methods please see TandemClustering.
List of
Cls |
[1:n] numerical vector with n numbers defining the classification as the main output of the clustering algorithm. It has k unique numbers representing the arbitrary labels of the clustering. Points which cannot be assigned to a cluster will be reported with 0. |
Object |
Object defined by clustering algorithm as the other output of this algorithm |
Michael Thrun
[Hofmeyr/Pavlidis, 2015] Hofmeyr, D., & Pavlidis, N.: Maximum clusterability divisive clustering, Proc. 2015 IEEE Symposium Series on Computational Intelligence, pp. 780-786, IEEE, 2015.
[Hofmeyr/Pavlidis, 2019] Hofmeyr, D., & Pavlidis, N.: PPCI: an R Package for Cluster Identification using Projection Pursuit, The R Journal, 2019.
[Hofmeyr, 2016] Hofmeyr, D. P.: Clustering by minimum cut hyperplanes, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 39(8), pp. 1547-1560. 2016.
[Pavlidis et al., 2016] Pavlidis, N. G., Hofmeyr, D. P., & Tasoulis, S. K.: Minimum density hyperplanes, The Journal of Machine Learning Research, Vol. 17(1), pp. 5414-5446. 2016.
[Thrun/Ultsch, 2018] Thrun, M. C., & Ultsch, A.: Using Projection based Clustering to Find Distance and Density based Clusters in High-Dimensional Data, Journal of Classification, Vol. in revision, 2018.
[Bock, 1987] Bock, H.: On the interface between cluster analysis, principal component analysis, and multidimensional scaling, Multivariate statistical modeling and data analysis, (pp. 17-34), Springer, 1987.
data('Hepta')
out=ProjectionPursuitClustering(Hepta$Data,ClusterNo=7,PlotIt=FALSE)
Stochastic quality clustering of [Heyer et al., 1999] with an improved implementation by [Scharl/Leisch, 2006].
QTclustering(Data,Radius,PlotIt=FALSE,...)
Data |
[1:n,1:d] matrix of dataset to be clustered. It consists of n cases of d-dimensional data points. Every case has d attributes, variables or features. |
Radius |
Maximum radius of clusters. If NULL, automatic estimation can be done with [Thrun et al., 2016] if not otherwise set. |
PlotIt |
Default: FALSE, if TRUE plots the first three dimensions of the dataset with colored three-dimensional data points defined by the clustering stored in |
... |
Further arguments to be set for the clustering algorithm, if not set, default arguments are used. |
List of
Cls |
[1:n] numerical vector with n numbers defining the classification as the main output of the clustering algorithm. It has k unique numbers representing the arbitrary labels of the clustering. Points which cannot be assigned to a cluster will be reported with 0. |
Object |
Object defined by clustering algorithm as the other output of this algorithm |
Michael Thrun
[Heyer et al., 1999] Heyer, L. J., Kruglyak, S., & Yooseph, S.: Exploring expression data: identification and analysis of coexpressed genes, Genome research, Vol. 9(11), pp. 1106-1115. 1999.
[Scharl/Leisch, 2006] Scharl, T., & Leisch, F.: The stochastic QT-clust algorithm: evaluation of stability and variance on time-course microarray data, in Rizzi, A. & Vichi, M. (eds.), Proceedings in Computational Statistics (Compstat), pp. 1015-1022, Physica Verlag, Heidelberg, Germany, 2006.
[Thrun et al., 2016] Thrun, M. C., Lerch, F., Loetsch, J., & Ultsch, A.: Visualization and 3D Printing of Multivariate Data of Biomarkers, in Skala, V. (Ed.), International Conference in Central Europe on Computer Graphics, Visualization and Computer Vision (WSCG), Vol. 24, Plzen, 2016.
[Ultsch, 2005] Ultsch, A.: Pareto density estimation: A density estimation for knowledge discovery, In Baier, D. & Wernecke, K. D. (Eds.), Innovations in classification, data science, and information systems, (Vol. 27, pp. 91-100), Berlin, Germany, Springer, 2005.
data('Hepta')
out=QTclustering(Hepta$Data,Radius=NULL,PlotIt=FALSE)
Robust Trimmed Clustering invented by [Garcia-Escudero et al., 2008] and implemented by [Fritz et al., 2012].
RobustTrimmedClustering(Data, ClusterNo, Alpha=0.05,PlotIt=FALSE,...)
Data |
[1:n,1:d] matrix of dataset to be clustered. It consists of n cases of d-dimensional data points. Every case has d attributes, variables or features. |
ClusterNo |
A number k which defines k different clusters to be built by the algorithm. |
PlotIt |
Default: FALSE, if TRUE plots the first three dimensions of the dataset with colored three-dimensional data points defined by the clustering stored in |
Alpha |
Proportion of data points to be trimmed; Alpha = 0 means that no trimming is done. The usage above sets the default to 0.05. |
... |
Further arguments to be set for the clustering algorithm, e.g. , |
"This iterative algorithm initializes k clusters randomly and performs "concentration steps" in order to improve the current cluster assignment. The number of maximum concentration steps to be performed is given by iter.max. For approximately obtaining the global optimum, the system is initialized nstart times and concentration steps are performed until convergence or iter.max is reached. When processing more complex data sets higher values of nstart and iter.max have to be specified (obviously implying extra computation time). ... The larger restr.fact
is chosen, the looser is the restriction on the scatter matrices, allowing for more heterogeneity among the clusters. On the contrary, small values of restr.fact close to 1 imply very equally scattered clusters. This idea of constraining cluster scatters to avoid spurious solutions goes back to Hathaway (1985), who proposed it in mixture fitting problems" [Fritz et al., 2012]. The type of constraint restr can be set to "eigen", "deter" or "sigma". Please see tclust for further parameter description.
List of
Cls |
[1:n] numerical vector with n numbers defining the classification as the main output of the clustering algorithm. It has k unique numbers representing the arbitrary labels of the clustering. |
Object |
Object defined by clustering algorithm as the other output of this algorithm |
Michael Thrun
[Garcia-Escudero et al., 2008] Garcia-Escudero, L. A., Gordaliza, A., Matran, C., & Mayo-Iscar, A.: A general trimming approach to robust cluster analysis, The annals of Statistics, Vol. 36(3), pp. 1324-1345. 2008.
[Fritz et al., 2012] Fritz, H., Garcia-Escudero, L. A., & Mayo-Iscar, A.: tclust: An R package for a trimming approach to cluster analysis, Journal of statistical Software, Vol. 47(12), pp. 1-26. 2012.
data('Hepta')
out=RobustTrimmedClustering(Hepta$Data,ClusterNo=7,Alpha=0,PlotIt=FALSE)
Either the k-batch or the k-online variant is possible; in both, every unit can be regarded approximately as a cluster.
SOMclustering(Data,LC=c(1,2),ClusterNo=NULL, Mode="online",PlotIt=FALSE,rlen=100,alpha = c(0.05, 0.01),...)
Data |
[1:n,1:d] matrix of dataset to be clustered. It consists of n cases of d-dimensional data points. Every case has d attributes, variables or features. |
LC |
Lines and Columns of a very small SOM, usually every unit is a cluster, will be ignored if ClusterNo is not NULL. |
ClusterNo |
Optional, A number k which defines k different clusters to be built by the algorithm. LC will then be set accordingly. |
Mode |
Either "batch" or "online" |
PlotIt |
Default: FALSE, if TRUE plots the first three dimensions of the dataset with colored three-dimensional data points defined by the clustering stored in |
rlen |
Please see |
alpha |
Please see |
... |
Further arguments to be set for the clustering algorithm in
|
This clustering algorithm is based on very small maps and is, hence, not emergent (cf. [Thrun, 2018, p.37]). A 3x3 map means 9 units leading to 9 clusters.
Batch is a deterministic clustering approach, whereas online is a stochastic one; research indicates that online should be preferred (cf. [Thrun, 2018, p.37]).
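The relation between the grid given by LC and the number of clusters can be written out in a few lines of base R. This is only a sketch of the arithmetic described above; the variable names other than LC are hypothetical.

```r
# Grid dimensions in the sense of the LC argument: lines and columns of the map
LC <- c(3, 3)

# Every unit is treated as one cluster, so the grid size fixes the cluster count
NumberOfUnits <- prod(LC)

# Enumerate the positions of all units on the grid
Units <- expand.grid(Line = seq_len(LC[1]), Column = seq_len(LC[2]))
```

For the 3x3 map this gives 9 units; with LC=c(1,2) as shown in the usage, the same arithmetic gives 2.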
List of
Cls |
[1:n] numerical vector defining the classification as the main output of the clustering algorithm |
Object |
Object defined by clustering algorithm as the other output of this algorithm |
Michael Thrun
[Wehrens/Buydens, 2017] R. Wehrens and L.M.C. Buydens, J. Stat. Softw. 21 (5), 2007; R. Wehrens and J. Kruisselbrink, submitted, 2017.
[Thrun, 2018] Thrun, M.C., Projection Based Clustering through Self-Organization and Swarm Intelligence. 2018, Heidelberg: Springer.
data('Hepta')
out=SOMclustering(Hepta$Data,ClusterNo=7,PlotIt=FALSE)
Self-organizing Tree Algorithm (SOTA) introduced by [Herrero et al., 2001].
SOTAclustering(Data, ClusterNo,PlotIt=FALSE,UnrestGrowth,...)
Data |
[1:n,1:d] matrix of dataset to be clustered. It consists of n cases of d-dimensional data points. Every case has d attributes, variables or features. |
ClusterNo |
A number k which defines k different clusters to be built by the algorithm. |
PlotIt |
Default: FALSE, if TRUE plots the first three dimensions of the dataset with colored three-dimensional data points defined by the clustering stored in |
UnrestGrowth |
TRUE: forces the |
... |
Further arguments to be set for the clustering algorithm, if not set, default arguments are used. |
List of
Cls |
[1:n] numerical vector with n numbers defining the classification as the main output of the clustering algorithm. It has k unique numbers representing the arbitrary labels of the clustering. |
sotaObject |
Object defined by clustering algorithm as the other output of this algorithm |
*Luis Winckelmann integrated several functions from clValid because that package is ORPHANED.
Luis Winckelmann*, Vasyl Pihur, Guy Brock, Susmita Datta, Somnath Datta
[Herrero et al., 2001] Herrero, J., Valencia, A., & Dopazo, J.: A hierarchical unsupervised growing neural network for clustering gene expression patterns, Bioinformatics, Vol. 17(2), pp. 126-136. 2001.
# Works
data('Hepta')
out=SOTAclustering(Hepta$Data,ClusterNo=7)
table(Hepta$Cls,out$Cls)
# Does not work well
data('Lsun3D')
out=SOTAclustering(Lsun3D$Data,ClusterNo=100,PlotIt=FALSE,UnrestGrowth=FALSE)
Implements the sparse clustering methods of [Witten/Tibshirani, 2010].
SparseClustering(DataOrDistances, ClusterNo, Type="Hierarchical", PlotIt=F,Silent=FALSE, NoPerms=10,Wbounds, ...)
DataOrDistances |
Either a [1:n,1:d] matrix of dataset to be clustered, consisting of n cases of d-dimensional data points where every case has d attributes, variables or features, or a [1:n,1:n] symmetric distance matrix. |
ClusterNo |
Numeric indicating the number of clusters to find in the tree/dendrogram in case of Type="Hierarchical", or the number of clusters to use in case of Type="kmeans" |
Type |
(optional) Char selecting methods Hierarchical or kmeans. Default: "Hierarchical" |
PlotIt |
(optional) Boolean. Default = FALSE = No plotting performed. |
Silent |
(optional) Boolean: print output or not (Default = FALSE = no output) |
NoPerms |
(optional), numeric scalar, Number of permutations. |
Wbounds |
(optional) numeric vector, range of tuning parameters to consider. This is the L1 bound on w, the feature weights [Witten/Tibshirani, 2010]. |
... |
Further arguments passed on to sparcl HierarchicalSparseCluster or KMeansSparseCluster depending on |
List of
Cls |
[1:n] numerical vector with n numbers defining the classification as the main output of the clustering algorithm. It has k unique numbers representing the arbitrary labels of the clustering. |
Object |
Object defined by clustering algorithm as the other output of this algorithm |
Tree |
Object Tree if Type="Hierarchical" is used. |
For sparse hierarchical clustering, the quality of the clustering results differs depending on whether data or distances are given.
Quirin Stier, Michael Thrun
[Witten/Tibshirani, 2010] Witten, D. and Tibshirani, R.: A Framework for Feature Selection in Clustering. Journal of the American Statistical Association, Vol. 105(490), pp. 713-726, 2010.
# Hepta
data("Hepta")
Data = Hepta$Data
V1 = SparseClustering(Data, ClusterNo=7, Type="kmeans")
Cls1 = V1$Cls
V2 = SparseClustering(Data, ClusterNo=7, Type="Hierarchical")
Cls2 = V2$Cls
InputDistances = parallelDist::parDist(Data, method="euclidean")
DistanceMatrix = as.matrix(InputDistances)
V3 = SparseClustering(DistanceMatrix, ClusterNo=7, Type="Hierarchical")
Cls3 = V3$Cls
## Not run:
set.seed(1)
Data = matrix(rnorm(100*50),ncol=50)
y = c(rep(1,50),rep(2,50))
Data[y==1,1:25] = Data[y==1,1:25]+2
V1 = SparseClustering(Data, ClusterNo=2, Type="kmeans")
Cls1 = V1$Cls
## End(Not run)
Clusters the Data into "ClusterNo" different clusters using the Spectral Clustering method
SpectralClustering(Data, ClusterNo,PlotIt=FALSE,...)
Data |
[1:n,1:d] matrix of dataset to be clustered. It consists of n cases of d-dimensional data points. Every case has d attributes, variables or features. |
ClusterNo |
A number k which defines k different clusters to be built by the algorithm. |
PlotIt |
default: FALSE, if TRUE plots the first three dimensions of the dataset with colored three-dimensional data points defined by the clustering stored in |
... |
Further arguments to be set for the clustering algorithm, if not set, default arguments are used. e.g.:
|
List of
Cls |
[1:n] numerical vector with n numbers defining the classification as the main output of the clustering algorithm. It has k unique numbers representing the arbitrary labels of the clustering. |
Object |
Object defined by clustering algorithm as the other output of this algorithm |
Michael Thrun
[Ng et al., 2002] Ng, A. Y., Jordan, M. I., & Weiss, Y.: On spectral clustering: Analysis and an algorithm, Advances in neural information processing systems, Vol. 2, pp. 849-856. 2002.
    data('Hepta')
    out=SpectralClustering(Hepta$Data,ClusterNo=7,PlotIt=FALSE)
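To make the idea behind the method concrete, the following is a minimal base-R sketch of normalized spectral clustering in the style of [Ng et al., 2002]. It is not the implementation wrapped by SpectralClustering: the function name `spectral_sketch`, the Gaussian kernel, and the bandwidth `sigma` are assumptions made for illustration, and the data are synthetic.

```r
# Hypothetical helper, not part of FCPS: a sketch of normalized spectral clustering.
spectral_sketch <- function(Data, ClusterNo, sigma = 1) {
  D2 <- as.matrix(dist(Data))^2                 # squared Euclidean distances
  A  <- exp(-D2 / (2 * sigma^2))                # Gaussian affinity (assumed kernel)
  diag(A) <- 0                                  # no self-similarity
  dinv <- 1 / sqrt(rowSums(A))                  # diagonal of D^(-1/2)
  L <- A * outer(dinv, dinv)                    # D^(-1/2) A D^(-1/2)
  ev <- eigen(L, symmetric = TRUE)              # eigenvalues in decreasing order
  X <- ev$vectors[, seq_len(ClusterNo), drop = FALSE]  # leading eigenvectors
  Y <- X / sqrt(rowSums(X^2))                   # rows rescaled to unit length
  km <- kmeans(Y, centers = ClusterNo, nstart = 20)
  list(Cls = km$cluster, Object = km)           # same output shape as the wrappers here
}

set.seed(1)
# Two well-separated Gaussian blobs in 2D (synthetic, not an FCPS dataset)
Data <- rbind(matrix(rnorm(60, mean = 0, sd = 0.3), ncol = 2),
              matrix(rnorm(60, mean = 5, sd = 0.3), ncol = 2))
out <- spectral_sketch(Data, ClusterNo = 2)
table(out$Cls)
```

The sketch returns a list with `Cls` and `Object` only to mirror the output convention documented above; the choice of `sigma` strongly affects the result on real data.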
Spectrum is a self-tuning spectral clustering method for single or multi-view data. This wrapper is restricted to the standard usage shared by the other clustering algorithms.
Spectrum(Data, Type = 2, ClusterNo = NULL, PlotIt = FALSE, Silent = TRUE,PlotResults = FALSE, ...)
Data |
[1:n,1:d] matrix of dataset to be clustered. It consists of n cases of d-dimensional data points. Every case has d attributes, variables or features. |
Type |
Type=1: default eigengap method (Gaussian clusters); Type=2: multimodality gap method (Gaussian/non-Gaussian clusters); Type=3: allows ClusterNo to be set |
ClusterNo |
Optional, A number k which defines k different clusters to be built by the algorithm.
For default |
PlotIt |
Default: FALSE, If TRUE plots the first three dimensions of the dataset with colored three-dimensional data points defined by the clustering stored in |
Silent |
If TRUE, the progress output of the algorithm is suppressed. |
PlotResults |
If TRUE, plots the result of Spectrum with the plot function. |
... |
Method: numerical value: 1 = default eigengap method (Gaussian clusters), 2 = multimodality gap method (Gaussian/non-Gaussian clusters), 3 = no automatic method (see the fixk parameter). Other parameters are defined in the Spectrum package. |
Spectrum is a partitioning algorithm that uses either the eigengap or the multimodality gap heuristic to determine the number of clusters; see the Spectrum package for details.
List of
Cls |
[1:n] numerical vector with n numbers defining the classification as the main output of the clustering algorithm. It has k unique numbers representing the arbitrary labels of the clustering. |
Object |
Object defined by clustering algorithm as the other output of this algorithm |
Michael Thrun
[John et al, 2020] John, C. R., Watson, D., Barnes, M. R., Pitzalis, C., & Lewis, M. J.: Spectrum: Fast density-aware spectral clustering for single and multi-omic data. Bioinformatics, Vol. 36(4), pp. 1159-1166, 2020.
    data('Hepta')
    out=Spectrum(Hepta$Data,PlotIt=FALSE)
    out=Spectrum(Hepta$Data,PlotIt=TRUE)
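The eigengap heuristic mentioned for Spectrum can be illustrated with a short base-R sketch: the number of clusters is estimated as the position of the largest gap between consecutive eigenvalues of a normalized affinity matrix. This is a simplification under stated assumptions; the helper `eigengap_k`, the plain Gaussian kernel, and the parameters `sigma` and `maxk` are hypothetical and do not reproduce the density-aware kernel of the Spectrum package.

```r
# Hypothetical helper, not part of FCPS or the Spectrum package.
eigengap_k <- function(Data, sigma = 1, maxk = 10) {
  A <- exp(-as.matrix(dist(Data))^2 / (2 * sigma^2))  # Gaussian affinity (assumed kernel)
  diag(A) <- 0
  dinv <- 1 / sqrt(rowSums(A))
  L <- A * outer(dinv, dinv)                          # D^(-1/2) A D^(-1/2)
  lambda <- eigen(L, symmetric = TRUE, only.values = TRUE)$values
  gaps <- lambda[1:maxk] - lambda[2:(maxk + 1)]       # gaps between consecutive eigenvalues
  which.max(gaps)                                     # position of the largest gap
}

set.seed(3)
# Three well-separated Gaussian blobs in 2D (synthetic, not an FCPS dataset)
Data <- rbind(matrix(rnorm(60, mean = 0,  sd = 0.3), ncol = 2),
              matrix(rnorm(60, mean = 5,  sd = 0.3), ncol = 2),
              matrix(rnorm(60, mean = 10, sd = 0.3), ncol = 2))
k_est <- eigengap_k(Data)
k_est
```

On clearly separated data the leading eigenvalues are all close to one and the gap after them is large; on overlapping or non-Gaussian clusters the heuristic is less reliable, which is the case the multimodality gap method is described as addressing.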
Density estimation for ggplot with a clear model behind it.
The format is: Classes 'StatPDEdensity', 'Stat', 'ggproto' <ggproto object: Class StatPDEdensity, Stat>

    aesthetics: function
    compute_group: function
    compute_layer: function
    compute_panel: function
    default_aes: uneval
    extra_params: na.rm
    finish_layer: function
    non_missing_aes:
    parameters: function
    required_aes: x y
    retransform: TRUE
    setup_data: function
    setup_params: function
    super: <ggproto object: Class Stat>
PDE was published in [Ultsch, 2005], short explanation in [Thrun, Ultsch 2018] and the PDE optimized violin plot was published in [Thrun et al., 2018].
[Ultsch, 2005] Ultsch, A.: Pareto density estimation: A density estimation for knowledge discovery, in Baier, D.; Wernecke, K. D. (Eds), Innovations in classification, data science, and information systems, Proc. GfKl 2003, pp. 91-100, Springer, Berlin, 2005.
[Thrun, Ultsch 2018] Thrun, M. C., & Ultsch, A.: Effects of the payout system of income taxes to municipalities in Germany, in Papiez, M. & Smiech, S. (eds.), Proc. 12th Professor Aleksander Zelias International Conference on Modelling and Forecasting of Socio-Economic Phenomena, pp. 533-542, Cracow: Foundation of the Cracow University of Economics, Cracow, Poland, 2018.
[Thrun et al, 2018] Thrun, M. C., Pape, F., & Ultsch, A. : Benchmarking Cluster Analysis Methods using PDE-Optimized Violin Plots, Proc. European Conference on Data Analysis (ECDA), accepted, Paderborn, Germany, 2018.
Subspace (projected) clustering is a technique which finds clusters within different subspaces (a selection of one or more dimensions).
SubspaceClustering(Data,ClusterNo,DimSubspace, Type='Orclus',PlotIt=FALSE,OrclusInitialClustersNo=ClusterNo+2,...)
Data |
[1:n,1:d] matrix of dataset to be clustered. It consists of n cases of d-dimensional data points. Every case has d attributes, variables or features. |
ClusterNo |
A number k which defines k different clusters to be built by the ProClus or Orclus algorithm. |
DimSubspace |
Number defining the dimensionality of the subspace in which clusters are searched for by the Orclus algorithm; for ProClus it is an optional parameter. |
Type |
'Orclus': subspace clustering based on arbitrarily oriented projected cluster generation [Aggarwal/Yu, 2000]; 'ProClus': ProClus algorithm for subspace clustering [Aggarwal/Wolf et al., 1999]; 'Clique': Clique algorithm, which finds subspaces of high-density clusters [Agrawal et al., 1999] and [Agrawal et al., 2005]; 'SubClu': SubClu algorithm, a density-connected approach for subspace clustering [Kailing et al., 2004] |
PlotIt |
Default: FALSE, if TRUE plots the first three dimensions of the dataset with colored three-dimensional data points defined by the clustering stored in |
OrclusInitialClustersNo |
Only for Orclus algorithm: Initial number of clusters (that are computed in the entire data space) must be greater than k. The number of clusters is iteratively decreased by a factor until the final number of k clusters is reached. |
... |
Further arguments to be set for the clustering algorithm; if not set, default arguments are used. For SubClu: "epsilon" and "minSupport", see For Clique: "xi" (number of intervals for each dimension) and "tau" (density threshold), see |
Subspace clustering algorithms aim to find one or more subspaces under the assumption that sufficient dimensionality reduction is dimensionality reduction without loss of information. Hence, subspace clustering aims at finding a linear subspace such that the subspace contains as much predictive information as the input space. The dimensionality of the subspace is usually higher than two but lower than that of the input space. In contrast, projection-based clustering (AutomaticProjectionBasedClustering) projects the data nonlinearly into two dimensions and tries to preserve only the relevant neighborhoods.
List of
Cls |
[1:n] numerical vector with n numbers defining the classification as the main output of the clustering algorithm. It has k unique numbers representing the arbitrary labels of the clustering. |
Object |
Object defined by clustering algorithm as the other output of this algorithm |
JAVA_HOME has to be set for rJava in order to use the ProClus algorithm (on Windows, set the PATH environment variable to the .../bin path of Java). The architectures of R and Java have to match. Java automatically downloads the Java version of the browser, which may not be installed in the architecture of R. In such a case, choose a Java version manually.
Michael Thrun
[Aggarwal/Wolf et al., 1999] Aggarwal, C. C., Wolf, J. L., Yu, P. S., Procopiuc, C., & Park, J. S.: Fast algorithms for projected clustering, Proc. ACM SIGMoD Record, Vol. 28, pp. 61-72, ACM, 1999.
[Aggarwal/Yu, 2000] Aggarwal, C. C., & Yu, P. S.: Finding generalized projected clusters in high dimensional spaces, (Vol. 29), ACM, ISBN: 1581132174, 2000.
[Agrawal et al., 1999]: Rakesh Agrawal, Johannes Gehrke, Dimitrios Gunopulos, and Prabhakar Raghavan: Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications, In Proc. ACM SIGMOD, 1999.
[Agrawal et al., 2005] Agrawal, R., Gehrke, J., Gunopulos, D., & Raghavan, P.: Automatic subspace clustering of high dimensional data, Data Mining and Knowledge Discovery, Vol. 11(1), pp. 5-33. 2005.
[Kailing et al.,2004] Kailing, Karin, Hans-Peter Kriegel, and Peer Kroeger: Density-connected subspace clustering for high-dimensional data, Proceedings of the 2004 SIAM international conference on data mining. Society for Industrial and Applied Mathematics, 2004
    data('Hepta')
    out=SubspaceClustering(Hepta$Data,ClusterNo=7,PlotIt=FALSE)
Summarizes clustering methods that combine k-means and PCA.
TandemClustering(Data,ClusterNo,Type="Reduced",PlotIt=FALSE,...)
Data |
[1:n,1:d] matrix of dataset to be clustered. It consists of n cases of d-dimensional data points. Every case has d attributes, variables or features. |
ClusterNo |
A number k which defines k different clusters to be built by the algorithm. |
Type |
|
PlotIt |
Default: FALSE, if TRUE plots the first three dimensions of the dataset with colored three-dimensional data points defined by the clustering stored in |
... |
Further arguments to be set for the clustering algorithm, if not set, default arguments are used. |
If ClusterNo exceeds the number of dimensions, then the function is called recursively with ClusterNo=2. In each iteration, the cluster with the highest number of points is clustered again until the requested number of clusters is reached.
"KernelPCA" additionally uses the package kernlab and is implemented as given in the fifth example on page 18, section "extension", of [Hofmeyr/Pavlidis, 2019].
The first idea of using non-PCA projections for clustering was published by [Bock, 1987] as a definition. However, to the knowledge of the author, it was not applied to any data. The first systematic comparison to projection-pursuit methods (ProjectionPursuitClustering) and AutomaticProjectionBasedClustering can be found in [Thrun/Ultsch, 2018].
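The recursive splitting rule described above (cluster with ClusterNo=2, then split the largest cluster again until the requested number is reached) can be sketched as follows. This is only an illustration of the control flow: the helper `bisect_sketch` is hypothetical and uses plain `kmeans` as the two-cluster step, which is an assumption and not necessarily what TandemClustering does internally.

```r
# Hypothetical helper, not part of FCPS: repeated two-way splits of the largest cluster.
bisect_sketch <- function(Data, ClusterNo) {
  Cls <- rep(1L, nrow(Data))                      # start with a single cluster
  while (length(unique(Cls)) < ClusterNo) {
    biggest <- as.integer(names(which.max(table(Cls))))  # label of the largest cluster
    idx <- which(Cls == biggest)                  # its member rows
    split <- kmeans(Data[idx, , drop = FALSE], centers = 2, nstart = 10)$cluster
    Cls[idx[split == 2]] <- max(Cls) + 1L         # one half receives a new label
  }
  Cls
}

set.seed(4)
Data <- matrix(rnorm(60 * 2), ncol = 2)           # synthetic data, not an FCPS dataset
Cls <- bisect_sketch(Data, ClusterNo = 4)
table(Cls)
```

Because each step splits exactly one cluster into two non-empty parts, the loop adds one cluster per iteration and stops once ClusterNo labels exist.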
List of
Cls |
[1:n] numerical vector with n numbers defining the classification as the main output of the clustering algorithm. It has k unique numbers representing the arbitrary labels of the clustering. Points which cannot be assigned to a cluster will be reported with 0. |
Object |
Object defined by clustering algorithm as the other output of this algorithm |
Michael Thrun
[De Soete/Carroll, 1994] De Soete, G., & Carroll, J. D.: K-means clustering in a low-dimensional Euclidean space, New approaches in classification and data analysis, (pp. 212-219), Springer, 1994.
[Hofmeyr/Pavlidis, 2019] Hofmeyr, D., & Pavlidis, N.: PPCI: an R Package for Cluster Identification using Projection Pursuit, The R Journal, 2019.
[Vichi/Kiers, 2001] Vichi, M., & Kiers, H. A.: Factorial k-means analysis for two-way data, Computational Statistics & Data Analysis, Vol. 37(1), pp. 49-64. 2001.
[Thrun/Ultsch, 2018] Thrun, M. C., & Ultsch, A.: Using Projection based Clustering to Find Distance and Density based Clusters in High-Dimensional Data, Journal of Classification, Vol. in revision, 2018.
[Bock, 1987] Bock, H.: On the interface between cluster analysis, principal component analysis, and multidimensional scaling, Multivariate statistical modeling and data analysis, (pp. 17-34), Springer, 1987.
    data('Hepta')
    out=TandemClustering(Hepta$Data,ClusterNo=7,PlotIt=FALSE)
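For orientation, the simplest way of combining PCA and k-means is the sequential ("tandem") approach: project the data onto the leading principal components, then run k-means on the scores. The base-R sketch below shows only this sequential variant; the helper `tandem_sketch` and its argument `ndim` are hypothetical, and the jointly optimized variants such as reduced k-means [De Soete/Carroll, 1994] and factorial k-means [Vichi/Kiers, 2001] differ from it.

```r
# Hypothetical helper, not part of FCPS: PCA followed by k-means on the scores.
tandem_sketch <- function(Data, ClusterNo, ndim = 2) {
  pca <- prcomp(Data, center = TRUE, scale. = FALSE)   # principal component analysis
  scores <- pca$x[, seq_len(ndim), drop = FALSE]       # keep the first ndim components
  km <- kmeans(scores, centers = ClusterNo, nstart = 20)
  list(Cls = km$cluster, Object = list(pca = pca, kmeans = km))
}

set.seed(2)
# Three Gaussian groups of 20 points each in 5 dimensions (synthetic data)
centers <- rbind(c(0, 0, 0, 0, 0), c(6, 6, 0, 0, 0), c(0, 6, 6, 0, 0))
Data <- do.call(rbind, lapply(1:3, function(i) {
  matrix(rnorm(20 * 5, sd = 0.5), ncol = 5) +
    matrix(centers[i, ], nrow = 20, ncol = 5, byrow = TRUE)
}))
out <- tandem_sketch(Data, ClusterNo = 3)
table(out$Cls)
```

The sequential approach can miss cluster structure that lies in low-variance directions, which is one motivation for the jointly optimized methods cited in the references.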
Detailed description of dataset and its clustering challenge of outliers is provided in [Thrun/Ultsch, 2020]
data("Target")
Size 770, Dimensions 2, stored in Target$Data
Classes 6, stored in Target$Cls
[Ultsch, 2005] Ultsch, A.: U* C: Self-organized Clustering with Emergent Feature Maps, Proc. Lernen, Wissensentdeckung und Adaptivitaet (LWA/FGML), pp. 240-244, Saarbruecken, Germany, 2005.
[Thrun/Ultsch, 2020] Thrun, M. C., & Ultsch, A.: Clustering Benchmark Datasets Exploiting the Fundamental Clustering Problems, Data in Brief, Vol. 30(C), pp. 105501, doi:10.1016/j.dib.2020.105501, 2020.
    data(Target)
    str(Target)
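The benchmark datasets in this package are documented as lists with a `Data` element and a `Cls` element of prior classes. A small sanity check of the documented size, dimensionality, and number of classes could look like the sketch below. The helper `check_benchmark` is hypothetical, and the toy list merely stands in for a dataset so that the snippet runs without FCPS installed.

```r
# Hypothetical helper, not part of FCPS: verify the documented shape of a benchmark list.
check_benchmark <- function(ds, n, d, k) {
  stopifnot(is.list(ds),
            nrow(ds$Data) == n,              # number of cases
            ncol(ds$Data) == d,              # number of dimensions
            length(ds$Cls) == n,             # one class label per case
            length(unique(ds$Cls)) == k)     # number of distinct classes
  invisible(TRUE)
}

# Synthetic stand-in with the same list layout (not an FCPS dataset)
toy <- list(Data = matrix(rnorm(20), ncol = 2), Cls = rep(1:2, each = 5))
ok <- check_benchmark(toy, n = 10, d = 2, k = 2)

# With FCPS installed, the documented values for Target could be checked like this:
# data("Target", package = "FCPS")
# check_benchmark(Target, n = 770, d = 2, k = 6)
```

The same call pattern applies to the other datasets documented here by substituting their stated size, dimensions, and number of classes.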
Almost touching clusters. Detailed description of dataset and its clustering challenge is provided in [Thrun/Ultsch, 2020].
data("Tetra")
Size 400, Dimensions 3, stored in Tetra$Data
Classes 4, stored in Tetra$Cls
[Ultsch, 1993] Ultsch, A.: Self-organizing neural networks for visualisation and classification, Information and classification, (pp. 307-313), Springer, 1993.
[Thrun/Ultsch, 2020] Thrun, M. C., & Ultsch, A.: Clustering Benchmark Datasets Exploiting the Fundamental Clustering Problems, Data in Brief, Vol. 30(C), pp. 105501, doi:10.1016/j.dib.2020.105501, 2020.
    data(Tetra)
    str(Tetra)
Cluster border defined by density. Detailed description of dataset and its clustering challenge is provided in [Thrun/Ultsch, 2020].
data("TwoDiamonds")
Size 800, Dimensions 2, stored in TwoDiamonds$Data
Classes 2, stored in TwoDiamonds$Cls
[Ultsch, 2003a] Ultsch, A.: Optimal density estimation in data containing clusters of unknown structure, technical report, Vol. 34, University of Marburg, Department of Mathematics and Computer Science, 2003.
[Ultsch, 2003b] Ultsch, A.: U*-matrix: a tool to visualize clusters in high dimensional data, Fachbereich Mathematik und Informatik, 2003.
[Thrun/Ultsch, 2020] Thrun, M. C., & Ultsch, A.: Clustering Benchmark Datasets Exploiting the Fundamental Clustering Problems, Data in Brief, Vol. 30(C), pp. 105501, doi:10.1016/j.dib.2020.105501, 2020.
    data(TwoDiamonds)
    str(TwoDiamonds)
Density vs. distance. Detailed description of dataset and its clustering challenge is provided in [Thrun/Ultsch, 2020].
data("WingNut")
Size 1016, Dimensions 2, stored in WingNut$Data
Classes 2, stored in WingNut$Cls
[Ultsch, 2005] Ultsch, A.: Clustering with SOM: U* C, Proc. 5th Workshop on Self-Organizing Maps, Vol. 2, pp. 75-82, 2005.
[Thrun/Ultsch, 2020] Thrun, M. C., & Ultsch, A.: Clustering Benchmark Datasets Exploiting the Fundamental Clustering Problems, Data in Brief, Vol. 30(C), pp. 105501, doi:10.1016/j.dib.2020.105501, 2020.
    data(WingNut)
    str(WingNut)