Title: | Interface to OpenSubspace |
---|---|
Description: | An interface to 'OpenSubspace', an open source framework for evaluation and exploration of subspace clustering algorithms in WEKA (see <http://dme.rwth-aachen.de/de/opensubspace> for more information). Also performs visualization. |
Authors: | Marwan Hassani [aut, cre], Matthias Hansen [aut], Emmanuel Müller [ctb], Ira Assent [ctb], Stephan Günnemann [ctb], Timm Jansen [ctb], Thomas Seidl [ctb], University of Waikato [ctb, cph] |
Maintainer: | Marwan Hassani <[email protected]> |
License: | GPL-2 |
Version: | 1.0.4 |
Built: | 2024-11-07 06:33:55 UTC |
Source: | CRAN |
The CLIQUE Algorithm finds clusters by first dividing each dimension into xi equal-width intervals and saving those intervals where the density is greater than tau as clusters. Then each set of two dimensions is examined: If there are two intersecting intervals in these two dimensions and the density in the intersection of these intervals is greater than tau, the intersection is again saved as a cluster. This is repeated for all sets of three, four, five,... dimensions. After every step adjacent clusters are replaced by a joint cluster and in the end all of the clusters are output.
CLIQUE(data, xi = 10, tau = 0.2)
CLIQUE(data, xi = 10, tau = 0.2)
data |
A Matrix of input data. |
xi |
Number of Intervals. |
tau |
Density Threshold. |
Rakesh Agrawal, Johannes Gehrke, Dimitrios Gunopulos, and Prabhakar Raghavan. Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications. In Proc. ACM SIGMOD, 1999.
Other subspace.clustering.algorithms: FIRES
;
P3C
; ProClus
;
SubClu
data("subspace_dataset") CLIQUE(subspace_dataset,xi=40,tau=0.06)
data("subspace_dataset") CLIQUE(subspace_dataset,xi=40,tau=0.06)
Reads a File and creates an object of class subspace_clustering from that.
clustering_from_file(file_path, index_starts_at = 0, dim = NULL)
clustering_from_file(file_path, index_starts_at = 0, dim = NULL)
file_path |
The path to the File from which the clustering should be read |
index_starts_at |
The index used in the file to refer to the first object in The data matrix |
dim |
if provided, overrides any value for dim that is found in the first line of the file |
Files must have the following Format: The first line contains the substring "DIM=*dim*;" where *dim* is the number of dimensions of the data set.
Each subsequent line corresponds to a cluster and contains only numbers separated by spaces. The first *dim* of these values have to be either '0' or '1' and indicate in which subspace a cluster exists. All other values in the line have to be the row numbers of the objects that the cluster contains. Row numbers in the file are assumed to be 0-indexed and are changed to 1-indexed as they are loaded into R. This can be changed with the parameter index_starts_at. E.g. a clustering for a three-dimensional dataset with one cluster that is in the first and third dimension and contains the first, second and 1337-th object has to be represented as:
DIM=3;
1 0 1 0 1 1336
Write a Subspace Clustering to Disk
clustering_to_file(clustering, file_path, index_should_start_at = 0)
clustering_to_file(clustering, file_path, index_should_start_at = 0)
clustering |
a subspace clustering object as generated by one of the functions from the subspace package |
file_path |
the path to the file into which the clustering should be written |
index_should_start_at |
the value that is used to refer to the first value in the dataset. |
By default, R uses the value 1 when referring to the first object
in a data frame or array, while most other languages use 0. To make
working with this convention easy, clusterings written to disk are converted
to this 0-indexing System. The standard parameter for the corresponding
function clustering_from_file
is set in such a way that files
read will automatically be converted to 1-indexes, which means that you
should never need to change this parameter if you work exclusively with the
subspace package.
The FIRES Algorithm follows a three phase framework: In a first phase, base-clusters are generated using a clustering-algorithm on each dimension in isolation. Then these base-clusters are merged in a second phase to find multidimensional cluster-approximations. These approximations are then refined in the third phase.
FIRES(data, base_dbscan_epsilon = 1, base_dbscan_minpts = 4, minimumpercent = 25, k = 1, mu = 1, minclu = 1, split = 0.66, post_dbscan_epsilon = 1, post_dbscan_minpts = 1)
FIRES(data, base_dbscan_epsilon = 1, base_dbscan_minpts = 4, minimumpercent = 25, k = 1, mu = 1, minclu = 1, split = 0.66, post_dbscan_epsilon = 1, post_dbscan_minpts = 1)
data |
A matrix or data frame of input data. |
base_dbscan_epsilon |
parameter for the dbscan execution that generates the base clusters |
base_dbscan_minpts |
parameter for the dbscan execution that generates the base clusters |
minimumpercent |
size a base-cluster must have relative to the average size of base clusters so that it is not discarded |
k |
amount of base-clusters that every base-cluster is compared to for merging purposes |
mu |
number of most similar clusters in which two clusters must overlap in order to be considered best-merge-clusters of each other |
minclu |
number of best-merge-candidates a cluster must have to be considered a best-merge-cluster |
split |
a base-cluster is split to merge it with two different clusters iff both clusters resulting from the split have at least this size in proportion to the average size of base-clusters |
post_dbscan_epsilon |
parameter for the dbscan execution that turns the cluster-approximations into the clusters that are output at the end |
post_dbscan_minpts |
parameter for the dbscan execution that turns the cluster-approximations into the clusters that are output at the end |
In this implementation, the first phase consists of a run of DBSCAN with the parameters base_dbscan_epsilon and base_dbscan_minpts on the objects as they appear from each particular dimension. Then, all of these base-clusters whose size is smaller than minimumpercent of the average cluster size, e.g. 25 because they are not likely to contain important information for the clustering.
In the second phase, these base clusters are merged to produce subspace cluster approximations. This is achieved by computing the k-most-similar clusters for each base-cluster. Then the set of best-merge-candidates for each base-cluster is determined, which contains those clusters whose k-most-similar clusters overlap the k-most similar clusters of the cluster by at least mu. If a cluster has at least minclu best-merge-candidates,it is considered a best-merge cluster. Finally, every pair of best-merge-clusters that are best-merge-candidates of each other is grouped together with all of their best-merge-candidates to form the cluster approximations.
Note that some clusters need to be split and merged with two different other clusters. This is done before the merging by determining whether the intersection between a cluster and its most similar cluster as well as the size of the cluster without this intersection are both larger than split times the average size of the base clusters.
In the third phase, base-clusters are repeatedly removed from cluster-approximations if their removal improves the amount of objects that are shared by all base-clusters in the approximation. Finally, to generate the definitive clusters from the cluster approximation, for each approximation all base-clusters in the approximation are combined and the a clustering algorithm is performed on these points. In this implementation, DBSCAN was chosen for this purpose and will be performed with the parameters post_dbscan_epsilon and post_dbscan_minpts.
Hans-Peter Kriegel, Peer Kröger, Matthias Renz and Sebastian Wurst A Generic Framework for Efficient Subspace Clustering of High-Dimensional Data In Proc. 5th IEEE International Conference on Data Mining, 2005
Other subspace.clustering.algorithms: CLIQUE
;
P3C
; ProClus
;
SubClu
data("subspace_dataset") FIRES(subspace_dataset)
data("subspace_dataset") FIRES(subspace_dataset)
The main idea of the P3C algorithm is to use statistical distributions for the task of finding clusters. To this end each dimension is first split into 1+log_2(nrow(data)) bins and the chi-square test is used to compute the probability that the sizes of these bins are uniformly distributed. If this probability is bigger than 1-ChiSquareAlpha, nothing happens. Otherwise the largest bins will be removed until this is the case. The bins that were removed in this way are then used to find clusters. To this end, bins that are adjacent are merged. Then clusters are formed by taking a bin from one dimension and determining the probability of sharing as many points as it does with each bin from another dimension. The bin is then intersected with the bin from another dimension where this probability is the lowest, given that it is at least lower than 1.00E-PoissonThreshold and this is repeated until no such bin is found.
P3C(data, ChiSquareAlpha = 0.005, PoissonThreshold = 19)
P3C(data, ChiSquareAlpha = 0.005, PoissonThreshold = 19)
data |
A Matrix of input data. |
ChiSquareAlpha |
probability of not being uniformly distributed that the points in a dimension are allowed to have without assuming that there is a cluster visible from this dimension |
PoissonThreshold |
maximum probability for a bin in another dimension to deviate from the observed bin as much as it does that is allowed. The value used for this will be 1.00*10^-PoissonThreshold. |
Gabriela Moise, Jörg Sander and Martin Ester P3C: A Robust Projected Clustering Algorithm In Proc. 6th IEEE International Conference on Data Mining 2006
Other subspace.clustering.algorithms: CLIQUE
;
FIRES
; ProClus
;
SubClu
data("subspace_dataset") P3C(subspace_dataset,PoissonThreshold=3)
data("subspace_dataset") P3C(subspace_dataset,PoissonThreshold=3)
Plotting for Subspace clusterings as generated by the package subspace.
Generates a 2d-scatterplot with interactive controls to select the dimensions that should be plotted.
This visualization is created using the ggvis package and is therefore also compatible with shiny.
## S3 method for class 'subspace_clustering' plot(x, data, color_by = "mix", standardcolors = c("#1F77B4", "#FF7F0E", "#2CA02C", "#D62728", "#9467BD", "#8C564B", "#E377C2", "#7F7F7F", "#BCBD22", "#17BECF", "#000000"), tooltip_on = "hover", ...)
## S3 method for class 'subspace_clustering' plot(x, data, color_by = "mix", standardcolors = c("#1F77B4", "#FF7F0E", "#2CA02C", "#D62728", "#9467BD", "#8C564B", "#E377C2", "#7F7F7F", "#BCBD22", "#17BECF", "#000000"), tooltip_on = "hover", ...)
x |
an S3-Object of type subspace_clustering as generated by any of the functions of the subspace package |
data |
The original data matrix on which the clustering was performed. |
color_by |
a parameter indicating how a point that is in multiple clusters should be colored. If "mix" is selected, the point will be colored as a mixture of the colors of both of the clusters that the point is in. If "any" is selected, a random color is selected from the colors of all the clusters that the point is in. |
standardcolors |
a vector of strings representing HTML-Colors that will be used to color the points by cluster assignment. Noise will be colored with the last color in the vector. |
tooltip_on |
decides if tooltips should be shown on "hover" or on "click" |
... |
this is passed on to ggvis::layer_points and can be used to change, for example the size of the points |
a ggvis object. If the return value is not used, a plot will be shown, but the returned plot can also be altered using ggvis
When passing ellipsis parameters, the ":=" syntax from ggvis may get in your way, but you can work around this by manually creating a props object as seen in the example.
#Load the example dataset for this package data("subspace_dataset") #Load the true clustering for this dataset path_to_clustering <- paste(path.package("subspace"),"/extdata/subspace_dataset.true",sep="") clustering <- clustering_from_file(file_path=path_to_clustering) #also generate a clustering with one of the algorithms clustering2 <- CLIQUE(subspace_dataset,tau=0.2) #now plot the generated clustering plot(clustering2,subspace_dataset) #plot the true clustering with small points plot(clustering,subspace_dataset,size=0.1) #Now plot the points with a different shape. #This requires the workaround that was discussed in "Notes" p <- ggvis::prop(property="shape",x="cross") plot(clustering,subspace_dataset,props=p)
#Load the example dataset for this package data("subspace_dataset") #Load the true clustering for this dataset path_to_clustering <- paste(path.package("subspace"),"/extdata/subspace_dataset.true",sep="") clustering <- clustering_from_file(file_path=path_to_clustering) #also generate a clustering with one of the algorithms clustering2 <- CLIQUE(subspace_dataset,tau=0.2) #now plot the generated clustering plot(clustering2,subspace_dataset) #plot the true clustering with small points plot(clustering,subspace_dataset,size=0.1) #Now plot the points with a different shape. #This requires the workaround that was discussed in "Notes" p <- ggvis::prop(property="shape",x="cross") plot(clustering,subspace_dataset,props=p)
The ProClus algorithm works in a manner similar to K-Medoids. Initially, a set of medoids of a size that is proportional to k is chosen. Then medoids that are likely to be outliers or are part of a cluster that is better represented by another medoid are removed until k medoids are left. Clusters are then assumed to be around these medoids.
ProClus(data, k = 4, d = 3)
ProClus(data, k = 4, d = 3)
data |
A Matrix of input data. |
k |
Number of Clusters to be found. |
d |
Average number of dimensions in which the clusters reside |
C. C. Aggarwal and C. Procopiuc Fast Algorithms for Projected Clustering. In Proc. ACM SIGMOD 1999.
Other subspace.clustering.algorithms: CLIQUE
;
FIRES
; P3C
;
SubClu
data("subspace_dataset") ProClus(subspace_dataset,k=12,d=2.5)
data("subspace_dataset") ProClus(subspace_dataset,k=12,d=2.5)
The SubClu Algorithm follows a bottom-up framework, in which one-dimensional clusters are generated with DBSCAN and then each cluster is expanded one dimension at a time into a dimension that is known to have a cluster that only differs in one dimension from this cluster. This expansion is done using DBSCAN with the same parameters that were used for the original DBSCAN that produced the clusters.
SubClu(data, epsilon = 4, minSupport = 4)
SubClu(data, epsilon = 4, minSupport = 4)
data |
A Matrix of input data. |
epsilon |
size of environment parameter for DBSCAN |
minSupport |
minimum number of points parameter for DBSCAN |
Karin Kailing, Hans-Peter Kriegel and Peer Kröger Density-Connected Subspace Clustering for High-Dimensional Data
Other subspace.clustering.algorithms: CLIQUE
;
FIRES
; P3C
;
ProClus
data("subspace_dataset") SubClu(subspace_dataset,epsilon=1,minSupport=5)
data("subspace_dataset") SubClu(subspace_dataset,epsilon=1,minSupport=5)
This package provides access to five of the most popular clustering algorithms in the subspace paradigm.
The algorithms CLIQUE
, P3C
, ProClus
,
SubClu
and FIRES
can be applied to data.frames and matrices and will
return S3 objects representing clusterings. For example, using the built-in
demo dataset, you can do:
>data("subspace_dataset")
>clustering <-
P3C(subspace_dataset,PoissonThreshold=2)
>clustering
Subspace clustering generated by the package Subspace, containing 12
clusters.
These subspace_clustering objects are actually just lists of subspace_cluster objects, which can be accessed as follows.
>clustering[[1]]
Subspace cluster generated by the package
Subspace. This cluster consists of 140 objects in a 3 dimensional subspace.
Each of these clusters then holds a vector representing its subspace and a
vector with the indexes of the objects the belong in this cluster. In this
example, these could be accessed as clustering[[1]]$objects
and
clustering[[1]]$subspace
.
This package also provides a plot
method for
subspace_clustering objects:
>plot(clustering,subspace_dataset)
Showing dynamic
visualisation. Press Escape/Ctrl + C to stop.
These plots are created using the ggvis
package.
Finally, you can save clusterings to a file using the
clustering_from_file
and clustering_to_file
functions.
For example you could save the clustering from this example to a file and load the true clustering of the demo dataset:
>clustering_to_file(clustering,file_path="clustering.txt")
>path_to_clustering <- paste(path.package("subspace"),"/extdata/subspace_dataset.true",sep="")
true_clustering <- clustering_from_file(file_path=path_to_clustering)
Müller E., Günnemann S., Assent I., Seidl T.: Evaluating Clustering in Subspace Projections of High Dimensional Data http://dme.rwth-aachen.de/OpenSubspace/ In Proc. 35th International Conference on Very Large Data Bases (VLDB 2009), Lyon, France. (2009)
This is a synthetic dataset to demonstrate the subspace clustering algorithms in the package subspace