Title: | Tools for Cluster Analysis |
---|---|
Description: | Cluster analysis is one of the most fundamental problems in data science. We provide a variety of algorithms, from clustering to learning on the space of partitions. See Hennig, Meila, and Rocci (2016, ISBN:9781466551886) for a general exposition of cluster analysis. |
Authors: | Kisung You [aut, cre] |
Maintainer: | Kisung You <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.1.2 |
Built: | 2024-11-27 06:56:24 UTC |
Source: | CRAN |
Compute the Adjusted Rand Index between two clusterings. Note that the index can take negative values.
compare.adjrand(x, y)
x | 1st cluster label vector of length-n. |
y | 2nd cluster label vector of length-n. |
Adjusted Rand Index value.
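As a quick illustration of the negative range (a minimal sketch, separate from the documented example below), comparing two independent random labelings typically yields an index near zero and occasionally below it:

set.seed(1)
x = sample(1:3, 50, replace=TRUE)   # random labeling 1
y = sample(1:3, 50, replace=TRUE)   # random labeling 2
compare.adjrand(x, y)               # close to 0, possibly negative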
# -------------------------------------------------------------
# true label vs. clustering with 'iris' dataset
# -------------------------------------------------------------
## PREPARE
data(iris)
X   = as.matrix(iris[,1:4])
lab = as.integer(as.factor(iris[,5]))

## CLUSTERING WITH DIFFERENT K VALUES
vec_k  = 2:7
vec_cl = list()
for (i in 1:length(vec_k)){
  vec_cl[[i]] = T4cluster::kmeans(X, k=round(vec_k[i]))$cluster
}

## COMPUTE COMPARISON INDICES
vec_comp = rep(0, length(vec_k))
for (i in 1:length(vec_k)){
  vec_comp[i] = compare.adjrand(vec_cl[[i]], lab)
}

## VISUALIZE
opar <- par(no.readonly=TRUE)
plot(vec_k, vec_comp, type="b", lty=2, xlab="number of clusters",
     ylab="comparison index", main="Adjusted Rand Index with true k=3")
abline(v=3, lwd=2, col="red")
par(opar)
Compute the Rand Index between two clusterings. It takes a value between 0 and 1, where 0 indicates that the two clusterings do not agree and 1 that they are exactly the same.
compare.rand(x, y)
x | 1st cluster label vector of length-n. |
y | 2nd cluster label vector of length-n. |
Rand Index value.
Rand WM (1971). “Objective Criteria for the Evaluation of Clustering Methods.” Journal of the American Statistical Association, 66(336), 846. ISSN 01621459.
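For intuition, the Rand index is the fraction of point pairs on which the two labelings agree, i.e., pairs placed together in both or apart in both. The following minimal sketch (a hypothetical helper, not part of the package) computes it by direct pair counting:

rand_by_pairs <- function(x, y){
  same_x = outer(x, x, "==")
  same_y = outer(y, y, "==")
  up = upper.tri(same_x)
  mean(same_x[up] == same_y[up])    # agreement over all pairs
}
## rand_by_pairs(c(1,1,2,2), c(2,2,1,1)) returns 1: identical partitions up to relabeling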
# -------------------------------------------------------------
# true label vs. clustering with 'iris' dataset
# -------------------------------------------------------------
## PREPARE
data(iris)
X   = as.matrix(iris[,1:4])
lab = as.integer(as.factor(iris[,5]))

## CLUSTERING WITH DIFFERENT K VALUES
vec_k  = 2:7
vec_cl = list()
for (i in 1:length(vec_k)){
  vec_cl[[i]] = T4cluster::kmeans(X, k=round(vec_k[i]))$cluster
}

## COMPUTE COMPARISON INDICES
vec_comp = rep(0, length(vec_k))
for (i in 1:length(vec_k)){
  vec_comp[i] = compare.rand(vec_cl[[i]], lab)
}

## VISUALIZE
opar <- par(no.readonly=TRUE)
plot(vec_k, vec_comp, type="b", lty=2, xlab="number of clusters",
     ylab="comparison index", main="Rand Index with true k=3")
abline(v=3, lwd=2, col="red")
par(opar)
DP-means is a non-parametric clustering method motivated by the DP mixture model, in which the number of clusters is determined by a parameter lambda: the larger the lambda value, the smaller the number of clusters obtained. In addition to the original algorithm, we add an option to randomly permute the order in which each observation's membership is updated, a common heuristic in the cluster analysis literature.
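For intuition, the assignment rule creates a new cluster whenever a point is farther than lambda from every existing centroid. The following is a minimal single-pass sketch of that rule (squared Euclidean distances, no re-estimation step; not the package's internal code):

dpmeans_assign <- function(X, lambda){
  centroids = X[1, , drop=FALSE]
  labels = rep(1, nrow(X))
  for (i in 2:nrow(X)){
    d2 = apply(centroids, 1, function(m) sum((X[i,]-m)^2))
    if (min(d2) > lambda){                 # too far from all centroids: open a new cluster
      centroids = rbind(centroids, X[i,])
      labels[i] = nrow(centroids)
    } else {
      labels[i] = which.min(d2)
    }
  }
  list(cluster=labels, centers=centroids)
}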
dpmeans(data, lambda = 0.1, ...)
data | an (n x p) matrix of row-stacked observations. |
lambda | a threshold to define a new cluster (default: 0.1). |
... | extra parameters for the algorithm. |
a named list of S3 class T4cluster
containing
a length-n vector of class labels (from 1 to k).
name of the algorithm.
Kulis B, Jordan MI (2012). “Revisiting K-Means: New Algorithms via Bayesian Nonparametrics.” In Proceedings of the 29th International Conference on Machine Learning, ICML'12, 1131–1138. ISBN 978-1-4503-1285-1.
# -------------------------------------------------------------
# clustering with 'iris' dataset
# -------------------------------------------------------------
## PREPARE
data(iris)
X   = as.matrix(iris[,1:4])
lab = as.integer(as.factor(iris[,5]))

## EMBEDDING WITH PCA
X2d = Rdimtools::do.pca(X, ndim=2)$Y

## CLUSTERING WITH DIFFERENT LAMBDA VALUES
dpm1 = dpmeans(X, lambda=1)$cluster
dpm2 = dpmeans(X, lambda=5)$cluster
dpm3 = dpmeans(X, lambda=25)$cluster

## VISUALIZATION
opar <- par(no.readonly=TRUE)
par(mfrow=c(1,4), pty="s")
plot(X2d, col=lab,  pch=19, main="true label")
plot(X2d, col=dpm1, pch=19, main="dpmeans: lambda=1")
plot(X2d, col=dpm2, pch=19, main="dpmeans: lambda=5")
plot(X2d, col=dpm3, pch=19, main="dpmeans: lambda=25")
par(opar)
The Ensembles of K-Subspaces (EKSS) method exploits multiple runs of K-subspace clustering and uses a consensus framework to aggregate the results, mitigating the effect of random initialization. When the results are merged, the q smallest values of the co-occurrence matrix are zeroed out. The paper suggests using a large number of runs (B), where each run may not require many iterations (iter), since the main assumption of the algorithm is to combine multiple partially correct pieces of information. In the extreme case, iter may be set to 0, which the paper denotes as EKSS-0.
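One plausible reading of the consensus step (a sketch only; the package's internal details may differ): average the co-occurrence of labels across the B runs, then zero out the q smallest entries of each row before clustering the resulting affinity matrix.

consensus_affinity <- function(runs, q){
  n = length(runs[[1]])
  A = matrix(0, n, n)
  for (lab in runs){
    A = A + outer(lab, lab, "==")          # co-occurrence of one run
  }
  A = A / length(runs)
  for (i in 1:n){
    A[i, order(A[i,])[1:q]] = 0            # zero out the q smallest values per row
  }
  (A + t(A)) / 2                           # re-symmetrize
}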
EKSS(data, k = 2, d = 2, q = floor(nrow(data) * 0.75), B = 500, iter = 0)
data | an (n x p) matrix of row-stacked observations. |
k | the number of clusters (default: 2). |
d | candidate dimension for each subspace (default: 2). |
q | threshold; the number of smaller values to be zeroed out (default: 0.75*n). |
B | the number of ensembles/runs (default: 500). |
iter | the number of iterations for each run (default: 0). |
a named list of S3 class T4cluster
containing
a length-n vector of class labels (from 1 to k).
name of the algorithm.
Lipor J, Hong D, Tan YS, Balzano L (2021). “Subspace Clustering Using Ensembles of K-Subspaces.” arXiv:1709.04744.
## generate a toy example
set.seed(10)
tester = genLP(n=100, nl=2, np=1, iso.var=0.1)
data   = tester$data
label  = tester$class

## do PCA for data reduction
proj = base::eigen(stats::cov(data))$vectors[,1:2]
dat2 = data%*%proj

## run EKSS algorithm with k=2,3,4 with EKSS-0 and 5 iterations
out2zero = EKSS(data, k=2)
out3zero = EKSS(data, k=3)
out4zero = EKSS(data, k=4)
out2iter = EKSS(data, k=2, iter=5)
out3iter = EKSS(data, k=3, iter=5)
out4iter = EKSS(data, k=4, iter=5)

## extract label information
lab2zero = out2zero$cluster
lab3zero = out3zero$cluster
lab4zero = out4zero$cluster
lab2iter = out2iter$cluster
lab3iter = out3iter$cluster
lab4iter = out4iter$cluster

## visualize
opar <- par(no.readonly=TRUE)
par(mfrow=c(2,3))
plot(dat2, pch=19, cex=0.9, col=lab2zero, main="EKSS-0:K=2")
plot(dat2, pch=19, cex=0.9, col=lab3zero, main="EKSS-0:K=3")
plot(dat2, pch=19, cex=0.9, col=lab4zero, main="EKSS-0:K=4")
plot(dat2, pch=19, cex=0.9, col=lab2iter, main="EKSS iter:K=2")
plot(dat2, pch=19, cex=0.9, col=lab3iter, main="EKSS iter:K=3")
plot(dat2, pch=19, cex=0.9, col=lab4iter, main="EKSS iter:K=4")
par(opar)
Given a set of curves, perform hierarchical agglomerative clustering with the fastcluster package's implementation of the algorithm. Dissimilarity between curves is measured by the L^p metric.
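The dissimilarity underlying the call is a curve-to-curve L^p distance. The following minimal sketch (not the package's internal code; it assumes curves evaluated column-wise on a common grid, as 'daty' and 'datx' are in the example below) builds such a dissimilarity matrix and feeds it to hclust directly:

## crude Riemann-sum L^p dissimilarity between discretized curves
lp_dissim <- function(Y, tt, p=2){
  dt = mean(diff(tt))                 # uniform grid assumed
  n  = ncol(Y)
  D  = matrix(0, n, n)
  for (i in 1:(n-1)){
    for (j in (i+1):n){
      D[i,j] = D[j,i] = (sum(abs(Y[,i]-Y[,j])^p)*dt)^(1/p)
    }
  }
  stats::as.dist(D)
}
## e.g., stats::hclust(lp_dissim(daty, datx), method="single")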
funhclust( fdobj, p = 2, method = c("single", "complete", "average", "mcquitty", "ward.D", "ward.D2", "centroid", "median"), members = NULL )
fdobj | a 'fd' functional data object. |
p | an exponent of the L^p metric (default: 2). |
method | agglomeration method to be used; one of "single", "complete", "average", "mcquitty", "ward.D", "ward.D2", "centroid", or "median". |
members | NULL or a vector of group sizes, as in hclust. |
an object of class hclust. See hclust for details.
Ferreira L, Hitchcock DB (2009). “A Comparison of Hierarchical Methods for Clustering Functional Data.” Communications in Statistics - Simulation and Computation, 38(9), 1925–1949. ISSN 0361-0918, 1532-4141.
# -------------------------------------------------------------
# two types of curves
#
# type 1 : sin(x) + perturbation; 20 OF THESE ON [0, 2*PI]
# type 2 : cos(x) + perturbation; 20 OF THESE ON [0, 2*PI]
# -------------------------------------------------------------
## PREPARE : USE 'fda' PACKAGE
# Generate Raw Data
datx = seq(from=0, to=2*pi, length.out=100)
daty = array(0,c(100, 40))
for (i in 1:20){
  daty[,i]    = sin(datx) + rnorm(100, sd=0.1)
  daty[,i+20] = cos(datx) + rnorm(100, sd=0.1)
}
# Wrap as 'fd' object
mybasis <- fda::create.bspline.basis(c(0,2*pi), nbasis=10)
myfdobj <- fda::smooth.basis(datx, daty, mybasis)$fd

## RUN THE ALGORITHM
hcsingle = funhclust(myfdobj, method="single")

## VISUALIZE
opar <- par(no.readonly=TRUE)
par(mfrow=c(1,3))
matplot(datx, daty[,1:20],  type="l", main="Curves Type 1")
matplot(datx, daty[,21:40], type="l", main="Curves Type 2")
plot(hcsingle, main="hclust with 'single' linkage")
par(opar)
Given a set of curves, perform k-means clustering on the coefficients of the functional data expanded in a B-spline basis. Note that the original paper used B-splines as the basis of choice due to their nice properties; however, we allow other types of basis as well for convenience.
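The idea reduces to ordinary k-means on the expansion coefficients. A minimal sketch of that reduction (not the package's internal code; 'fd' objects from the fda package store coefficients column-wise per curve):

## cluster curves by k-means on their basis coefficients
cluster_fd_coefs <- function(fdobj, k){
  coefs = t(fdobj$coefs)              # one row of coefficients per curve
  stats::kmeans(coefs, centers=k)$cluster
}
## e.g., cluster_fd_coefs(myfdobj, k=3) with 'myfdobj' built as in the example below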
funkmeans03A(fdobj, k = 2, ...)
fdobj | a 'fd' functional data object. |
k | the number of clusters (default: 2). |
... | extra parameters for the algorithm. |
a named list of S3 class T4cluster
containing
a length-n vector of class labels (from 1 to k).
a 'fd' object of mean curves.
name of the algorithm.
Abraham C, Cornillon PA, Matzner-Lober E, Molinari N (2003). “Unsupervised Curve Clustering Using B-Splines.” Scandinavian Journal of Statistics, 30(3), 581–595. ISSN 0303-6898, 1467-9469.
# -------------------------------------------------------------
# three types of curves
#
# type 1 : sin(x) + perturbation;  20 OF THESE ON [0, 2*PI]
# type 2 : cos(x) + perturbation;  20 OF THESE ON [0, 2*PI]
# type 3 : sin(x) + cos(0.5x);     20 OF THESE ON [0, 2*PI]
# -------------------------------------------------------------
## PREPARE : USE 'fda' PACKAGE
# Generate Raw Data
datx = seq(from=0, to=2*pi, length.out=100)
daty = array(0,c(100, 60))
for (i in 1:20){
  daty[,i]    = sin(datx) + rnorm(100, sd=0.5)
  daty[,i+20] = cos(datx) + rnorm(100, sd=0.5)
  daty[,i+40] = sin(datx) + cos(0.5*datx) + rnorm(100, sd=0.5)
}
# Wrap as 'fd' object
mybasis <- fda::create.bspline.basis(c(0,2*pi), nbasis=10)
myfdobj <- fda::smooth.basis(datx, daty, mybasis)$fd

## RUN THE ALGORITHM WITH K=2,3,4
fk2 = funkmeans03A(myfdobj, k=2)
fk3 = funkmeans03A(myfdobj, k=3)
fk4 = funkmeans03A(myfdobj, k=4)

## FUNCTIONAL PCA FOR VISUALIZATION
embed = fda::pca.fd(myfdobj, nharm=2)$score

## VISUALIZE
opar <- par(no.readonly=TRUE)
par(mfrow=c(1,3))
plot(embed, col=fk2$cluster, pch=19, main="K=2")
plot(embed, col=fk3$cluster, pch=19, main="K=3")
plot(embed, col=fk4$cluster, pch=19, main="K=4")
par(opar)
Generates data from three 5-dimensional subspaces in a 200-dimensional space.
gen3S(n = 50, var = 0.3)
n | the number of data points sampled from each subspace (default: 50). |
var | degree of Gaussian noise (default: 0.3). |
a named list containing
data: a (3n x 200) data matrix.
class: a length-3n vector of class labels.
Wang S, Yuan X, Yao T, Yan S, Shen J (2011). “Efficient Subspace Segmentation via Quadratic Programming.” In Proceedings of the Twenty-Fifth AAAI Conference on Artificial Intelligence, AAAI'11, 519–524.
## a toy example
tester = gen3S(n=100)
data   = tester$data
label  = tester$class
It generates nested donuts, which are just hollow circles. For flexible testing, the parameter k controls the number of circles of varying radii, while n controls the number of observations for each circle.
genDONUTS(n = 50, k = 2, sd = 0.1)
n | the number of data points for each hollow circle (default: 50). |
k | the number of circles (default: 2). |
sd | magnitude of white noise (default: 0.1). |
a named list containing
data: an (nk x 2) data matrix.
label: a length-nk vector (factor) of class labels.
## generate data
donut2 = genDONUTS(k=2)
donut3 = genDONUTS(k=3)
donut4 = genDONUTS(k=4)

## visualize
opar <- par(no.readonly=TRUE)
par(mfrow=c(1,3), pty="s")
plot(donut2$data, col=donut2$label, pch=19, main="k=2")
plot(donut3$data, col=donut3$label, pch=19, main="k=3")
plot(donut4$data, col=donut4$label, pch=19, main="k=4")
par(opar)
This function generates a toy example of 'line and plane' data in R^3, in that data are generated from a mixture of lines (one-dimensional) and planes (two-dimensional). The number of line and plane components is explicitly set by the user for flexible testing.
genLP(n = 100, nl = 1, np = 1, iso.var = 0.1)
n | the number of data points for each line and plane. |
nl | the number of line components. |
np | the number of plane components. |
iso.var | degree of isotropic variance. |
a named list containing
data: an (n*(nl+np) x 3) data matrix.
class: a length-n*(nl+np) vector of class labels.
a length-n*(nl+np) vector of the corresponding dimension from which each observation is created.
## test for visualization
set.seed(10)
tester = genLP(n=100, nl=1, np=2, iso.var=0.1)
data   = tester$data
label  = tester$class

## do PCA for data reduction
proj = base::eigen(stats::cov(data))$vectors[,1:2]
dat2 = data%*%proj

## visualize
opar <- par(no.readonly=TRUE)
par(mfrow=c(2,2), pty="s")
plot(dat2[,1],dat2[,2],pch=19,cex=0.5,col=label,main="PCA")
plot(data[,1],data[,2],pch=19,cex=0.5,col=label,main="Axis 1 vs 2")
plot(data[,1],data[,3],pch=19,cex=0.5,col=label,main="Axis 1 vs 3")
plot(data[,2],data[,3],pch=19,cex=0.5,col=label,main="Axis 2 vs 3")
par(opar)

## Not run:
## visualize in 3d
x11()
scatterplot3d::scatterplot3d(x=data, pch=19, cex.symbols=0.5, color=label)
## End(Not run)
Creates smiley-face data in R^2. This function is a modification of mlbench's mlbench.smiley function.
genSMILEY(n = 496, sd = 0.1)
n | the number of samples to be generated. |
sd | additive Gaussian noise level. |
a list containing
data: an (n x 2) data matrix.
label: a length-n vector (factor) of class labels.
## Generate SMILEY Data with Different Noise Levels
s10 = genSMILEY(200, sd=0.1)
s25 = genSMILEY(200, sd=0.25)
s50 = genSMILEY(200, sd=0.5)

## Visualize
opar <- par(no.readonly=TRUE)
par(mfrow=c(1,3), pty="s")
plot(s10$data, col=s10$label, pch=19, main="sd=0.10")
plot(s25$data, col=s25$label, pch=19, main="sd=0.25")
plot(s50$data, col=s50$label, pch=19, main="sd=0.50")
par(opar)
Finite Gaussian Mixture Model (GMM) is a well-known probabilistic clustering algorithm that fits the following distribution to the data
f(x) = sum_{k=1}^{K} w_k * N(x; mu_k, Sigma_k)
with parameters w_k for cluster weights, mu_k for class means, and Sigma_k for class covariances. This function is a wrapper for Armadillo's GMM function, which supports two types of covariance models.
gmm(data, k = 2, ...)
data | an (n x p) matrix of row-stacked observations. |
k | the number of clusters (default: 2). |
... | extra parameters for the algorithm. |
a named list of S3 class T4cluster
containing
a length-n vector of class labels (from 1 to k).
a (k x p) matrix where each row is a class mean.
a (p x p x k) array where each slice is a class covariance.
a length-k vector of class weights that sum to 1.
log-likelihood of the data for the fitted model.
name of the algorithm.
# -------------------------------------------------------------
# clustering with 'iris' dataset
# -------------------------------------------------------------
## PREPARE
data(iris)
X   = as.matrix(iris[,1:4])
lab = as.integer(as.factor(iris[,5]))

## EMBEDDING WITH PCA
X2d = Rdimtools::do.pca(X, ndim=2)$Y

## CLUSTERING WITH DIFFERENT K VALUES
cl2 = gmm(X, k=2)$cluster
cl3 = gmm(X, k=3)$cluster
cl4 = gmm(X, k=4)$cluster

## VISUALIZATION
opar <- par(no.readonly=TRUE)
par(mfrow=c(1,4), pty="s")
plot(X2d, col=lab, pch=19, main="true label")
plot(X2d, col=cl2, pch=19, main="gmm: k=2")
plot(X2d, col=cl3, pch=19, main="gmm: k=3")
plot(X2d, col=cl4, pch=19, main="gmm: k=4")
par(opar)
When the data lie in a high-dimensional Euclidean space, fitting a model-based clustering algorithm is troublesome. This function implements the algorithm from the reference, which aggregates information from an ensemble of Gaussian mixtures in combination with random projection.
gmm03F(data, k = 2, ...)
data | an (n x p) matrix of row-stacked observations. |
k | the number of clusters (default: 2). |
... | extra parameters for the algorithm. |
a named list of S3 class T4cluster
containing
a length-n vector of class labels (from 1 to k).
name of the algorithm.
Fern XZ, Brodley CE (2003). “Random Projection for High Dimensional Data Clustering: A Cluster Ensemble Approach.” In Proceedings of the Twentieth International Conference on International Conference on Machine Learning, ICML'03, 186–193. ISBN 1577351894.
# -------------------------------------------------------------
# clustering with 'iris' dataset
# -------------------------------------------------------------
## PREPARE
data(iris)
X   = as.matrix(iris[,1:4])
lab = as.integer(as.factor(iris[,5]))

## EMBEDDING WITH PCA
X2d = Rdimtools::do.pca(X, ndim=2)$Y

## CLUSTERING WITH DIFFERENT K VALUES
cl2 = gmm03F(X, k=2)$cluster
cl3 = gmm03F(X, k=3)$cluster
cl4 = gmm03F(X, k=4)$cluster

## VISUALIZATION
opar <- par(no.readonly=TRUE)
par(mfrow=c(2,2), pty="s")
plot(X2d, col=lab, pch=19, main="true label")
plot(X2d, col=cl2, pch=19, main="gmm03F: k=2")
plot(X2d, col=cl3, pch=19, main="gmm03F: k=3")
plot(X2d, col=cl4, pch=19, main="gmm03F: k=4")
par(opar)
Ruan et al. (2011) proposed regularized covariance estimation via the graphical lasso to cope with the high-dimensional scenario, where a conventional GMM might incur singular covariance components. The authors use lambda as a regularization parameter, as is common in sparse covariance/precision estimation problems, and suggest choosing the model with the smallest BIC value.
gmm11R(data, k = 2, lambda = 1, ...)
data | an (n x p) matrix of row-stacked observations. |
k | the number of clusters (default: 2). |
lambda | regularization parameter for graphical lasso (default: 1). |
... | extra parameters for the algorithm. |
a named list of S3 class T4cluster
containing
a length-n vector of class labels (from 1 to k).
a (k x p) matrix where each row is a class mean.
a (p x p x k) array where each slice is a class covariance.
a length-k vector of class weights that sum to 1.
log-likelihood of the data for the fitted model.
name of the algorithm.
Ruan L, Yuan M, Zou H (2011). “Regularized Parameter Estimation in High-Dimensional Gaussian Mixture Models.” Neural Computation, 23(6), 1605–1622. ISSN 0899-7667, 1530-888X.
# -------------------------------------------------------------
# clustering with 'iris' dataset
# -------------------------------------------------------------
## PREPARE
data(iris)
X   = as.matrix(iris[,1:4])
lab = as.integer(as.factor(iris[,5]))

## EMBEDDING WITH PCA
X2d = Rdimtools::do.pca(X, ndim=2)$Y

## COMPARE WITH STANDARD GMM
cl.gmm  = gmm(X, k=3)$cluster
cl.11Rf = gmm11R(X, k=3)$cluster
cl.11Rd = gmm11R(X, k=3, usediag=TRUE)$cluster

## VISUALIZATION
opar <- par(no.readonly=TRUE)
par(mfrow=c(1,3), pty="s")
plot(X2d, col=cl.gmm,  pch=19, main="standard GMM")
plot(X2d, col=cl.11Rf, pch=19, main="gmm11R: full covs")
plot(X2d, col=cl.11Rd, pch=19, main="gmm11R: diagonal covs")
par(opar)
When each observation x_i is associated with a positive weight w_i, the GMM formulation needs to be modified. Gebru et al. (2016) proposed to use a scaled covariance, based on the observation that
N(x; mu, Sigma)^w is proportional to N(x; mu, Sigma/w),
by considering the positive weight as playing the role of a precision. Currently, we provide the fixed-weight case only, while the paper also considers a Bayesian formalism on the weights using a Gamma distribution.
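A quick one-dimensional numerical check of the identity above (a sketch, not package code): raising a Gaussian density to the power w matches, up to a constant factor, a Gaussian whose variance is divided by w.

x = seq(-3, 3, length.out=5)
w = 2.5
ratio = dnorm(x, mean=0, sd=1)^w / dnorm(x, mean=0, sd=1/sqrt(w))
ratio / ratio[1]   # constant (all ones) across x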
gmm16G(data, k = 2, weight = NULL, ...)
data | an (n x p) matrix of row-stacked observations. |
k | the number of clusters (default: 2). |
weight | a positive weight vector of length n (default: NULL). |
... | extra parameters for the algorithm. |
a named list of S3 class T4cluster
containing
a length-n vector of class labels (from 1 to k).
a (k x p) matrix where each row is a class mean.
a (p x p x k) array where each slice is a class covariance.
a length-k vector of class weights that sum to 1.
log-likelihood of the data for the fitted model.
name of the algorithm.
Gebru ID, Alameda-Pineda X, Forbes F, Horaud R (2016). “EM Algorithms for Weighted-Data Clustering with Application to Audio-Visual Scene Analysis.” IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(12), 2402–2415. ISSN 0162-8828, 2160-9292.
# -------------------------------------------------------------
# clustering with 'iris' dataset
# -------------------------------------------------------------
## PREPARE
data(iris)
X   = as.matrix(iris[,1:4])
lab = as.integer(as.factor(iris[,5]))

## EMBEDDING WITH PCA
X2d = Rdimtools::do.pca(X, ndim=2)$Y

## CLUSTERING WITH DIFFERENT K VALUES
cl2 = gmm16G(X, k=2)$cluster
cl3 = gmm16G(X, k=3)$cluster
cl4 = gmm16G(X, k=4)$cluster

## VISUALIZATION
opar <- par(no.readonly=TRUE)
par(mfrow=c(1,4), pty="s")
plot(X2d, col=lab, pch=19, main="true label")
plot(X2d, col=cl2, pch=19, main="gmm16G: k=2")
plot(X2d, col=cl3, pch=19, main="gmm16G: k=3")
plot(X2d, col=cl4, pch=19, main="gmm16G: k=4")
par(opar)
Geodesic spherical k-means is a counterpart of the spherical k-means algorithm, obtained by replacing the cosine similarity with the squared geodesic distance, which is the great-circle distance under the intrinsic geometry of the unit hypersphere. If the data are not normalized, the function normalizes them and proceeds thereafter.
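The dissimilarity in use is the great-circle distance between unit vectors. A minimal sketch (assumptions: inputs are arbitrary nonzero vectors, normalized inside the helper; not the package's internal code):

geodist <- function(x, y){
  x = x / sqrt(sum(x^2))
  y = y / sqrt(sum(y^2))
  acos(min(max(sum(x*y), -1), 1))   # clamp for numerical safety
}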
gskmeans(data, k = 2, ...)
data | an (n x p) matrix of row-stacked observations. |
k | the number of clusters (default: 2). |
... | extra parameters for the algorithm. |
a named list of S3 class T4cluster
containing
a length-n vector of class labels (from 1 to k).
a value of the cost function.
a (k x p) matrix where each row is a unit-norm class mean.
name of the algorithm.
# -------------------------------------------------------------
# clustering with 'household' dataset
# -------------------------------------------------------------
## PREPARE
data(household, package="T4cluster")
X   = household$data
lab = as.integer(household$gender)

## EXECUTE GSKMEANS WITH VARYING K's
vec.rand = rep(0, 9)
for (i in 1:9){
  clust_i     = gskmeans(X, k=(i+1))$cluster
  vec.rand[i] = compare.rand(clust_i, lab)
}

## VISUALIZE THE RAND INDEX
opar <- par(no.readonly=TRUE)
plot(2:10, vec.rand, type="b", pch=19, ylim=c(0.5, 1),
     ylab="Rand index", xlab="number of clusters",
     main="clustering quality index over varying k's.")
par(opar)
The data is taken from HSAUR3 package's household
data. We use
housing, service, and food variables and normalize them to be unit-norm so
that each observation is projected onto the 2-dimensional sphere. The data
consists of 20 males and 20 females and has been used for clustering
on the unit hypersphere.
data(household)
a named list containing
data: a (40 x 3) data matrix whose rows are unit-norm.
gender: a length-40 factor for class labels.
## Load the data
data(household, package="T4cluster")

## Visualize the data on the sphere
opar <- par(no.readonly=TRUE)
scatterplot3d::scatterplot3d(household$data, color=rep(c("red","blue"), each=20),
                             pch=19, main="household expenditure on the 2-dimensional sphere",
                             xlim=c(0,1.2), ylim=c(0,1.2), zlim=c(0,1.2), angle=45)
par(opar)
The k-means algorithm we provide is a wrapper of Armadillo's k-means routine. Two types of initialization schemes are employed; please see the parameters section for more details.
kmeans(data, k = 2, ...)
data | an (n x p) matrix of row-stacked observations. |
k | the number of clusters (default: 2). |
... | extra parameters for the algorithm. |
a named list of S3 class T4cluster
containing
a length-n vector of class labels (from 1 to k).
a (k x p) matrix where each row is a class mean.
within-cluster sum of squares (WCSS).
name of the algorithm.
Sanderson C, Curtin R (2016). “Armadillo: A Template-Based C++ Library for Linear Algebra.” The Journal of Open Source Software, 1(2), 26. ISSN 2475-9066.
# -------------------------------------------------------------
# clustering with 'iris' dataset
# -------------------------------------------------------------
## PREPARE
data(iris)
X   = as.matrix(iris[,1:4])
lab = as.integer(as.factor(iris[,5]))

## EMBEDDING WITH PCA
X2d = Rdimtools::do.pca(X, ndim=2)$Y

## CLUSTERING WITH DIFFERENT K VALUES
cl2 = kmeans(X, k=2)$cluster
cl3 = kmeans(X, k=3)$cluster
cl4 = kmeans(X, k=4)$cluster

## VISUALIZATION
opar <- par(no.readonly=TRUE)
par(mfrow=c(1,4), pty="s")
plot(X2d, col=lab, pch=19, main="true label")
plot(X2d, col=cl2, pch=19, main="k-means: k=2")
plot(X2d, col=cl3, pch=19, main="k-means: k=3")
plot(X2d, col=cl4, pch=19, main="k-means: k=4")
par(opar)
Apply the k-means clustering algorithm on top of the lightweight coreset as proposed in the paper. The smaller the coreset, the faster the execution, at the cost of potentially larger quantization error.
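For intuition, the lightweight coreset of the reference samples points from a mixture of the uniform distribution and the squared-distance-to-mean distribution, with importance weights. The following is a minimal sketch of that construction as we read it (not the package's internal code):

lw_coreset <- function(X, m){
  mu  = colMeans(X)
  d2  = rowSums(sweep(X, 2, mu)^2)          # squared distances to the data mean
  q   = 0.5/nrow(X) + 0.5*d2/sum(d2)        # sampling distribution
  idx = sample(nrow(X), m, replace=TRUE, prob=q)
  list(points = X[idx, , drop=FALSE], weights = 1/(m*q[idx]))
}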
kmeans18B(data, k = 2, m = round(nrow(data)/2), ...)
data | an (n x p) matrix of row-stacked observations. |
k | the number of clusters (default: 2). |
m | the size of the coreset (default: round(n/2)). |
... | extra parameters for the algorithm. |
a named list of S3 class T4cluster
containing
a length-n vector of class labels (from 1 to k).
a (k x p) matrix where each row is a class mean.
within-cluster sum of squares (WCSS).
name of the algorithm.
Bachem O, Lucic M, Krause A (2018). “Scalable k-Means Clustering via Lightweight Coresets.” In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 1119–1127. ISBN 978-1-4503-5552-0.
# -------------------------------------------------------------
# clustering with 'iris' dataset
# -------------------------------------------------------------
## PREPARE
data(iris)
X   = as.matrix(iris[,1:4])
lab = as.integer(as.factor(iris[,5]))

## EMBEDDING WITH PCA
X2d = Rdimtools::do.pca(X, ndim=2)$Y

## CLUSTERING WITH DIFFERENT CORESET SIZES WITH K=3
core1 = kmeans18B(X, k=3, m=25)$cluster
core2 = kmeans18B(X, k=3, m=50)$cluster
core3 = kmeans18B(X, k=3, m=100)$cluster

## VISUALIZATION
opar <- par(no.readonly=TRUE)
par(mfrow=c(1,4), pty="s")
plot(X2d, col=lab,   pch=19, main="true label")
plot(X2d, col=core1, pch=19, main="kmeans18B: m=25")
plot(X2d, col=core2, pch=19, main="kmeans18B: m=50")
plot(X2d, col=core3, pch=19, main="kmeans18B: m=100")
par(opar)
The k-means++ algorithm is usually used as a fast initialization scheme, though it can also serve as a standalone clustering algorithm by first choosing the centroids and then assigning each point to its nearest centroid.
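The seeding itself follows the D^2 rule: each additional center is drawn with probability proportional to the squared distance to the nearest center chosen so far. A minimal sketch (assuming k >= 2; not the package's internal code):

kmeanspp_seed <- function(X, k){
  centers = X[sample(nrow(X), 1), , drop=FALSE]
  for (j in 2:k){
    d2 = apply(X, 1, function(x) min(colSums((t(centers) - x)^2)))
    centers = rbind(centers, X[sample(nrow(X), 1, prob=d2), ])
  }
  centers
}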
kmeanspp(data, k = 2)
data | an (n x p) matrix of row-stacked observations. |
k | the number of clusters (default: 2). |
a named list of S3 class T4cluster
containing
a length-n vector of class labels (from 1 to k).
name of the algorithm.
Arthur D, Vassilvitskii S (2007). “K-Means++: The Advantages of Careful Seeding.” In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '07, 1027–1035. ISBN 978-0-89871-624-5.
# -------------------------------------------------------------
# clustering with 'iris' dataset
# -------------------------------------------------------------
## PREPARE
data(iris)
X   = as.matrix(iris[,1:4])
lab = as.integer(as.factor(iris[,5]))

## EMBEDDING WITH PCA
X2d = Rdimtools::do.pca(X, ndim=2)$Y

## CLUSTERING WITH DIFFERENT K VALUES
cl2 = kmeanspp(X, k=2)$cluster
cl3 = kmeanspp(X, k=3)$cluster
cl4 = kmeanspp(X, k=4)$cluster

## VISUALIZATION
opar <- par(no.readonly=TRUE)
par(mfrow=c(1,4), pty="s")
plot(X2d, col=lab, pch=19, main="true label")
plot(X2d, col=cl2, pch=19, main="k-means++: k=2")
plot(X2d, col=cl3, pch=19, main="k-means++: k=3")
plot(X2d, col=cl4, pch=19, main="k-means++: k=4")
par(opar)
Low-Rank Representation (LRR) constructs the connectivity of the data by solving
minimize ||C||_* subject to X = XC
for the column-stacked data matrix X, where ||.||_* is the nuclear norm, a relaxation of the rank condition. If you are interested in a full implementation of the algorithm with sparse outliers and noise, please contact the maintainer.
LRR(data, k = 2, rank = 2)
data | an (n x p) matrix of row-stacked observations. |
k | the number of clusters (default: 2). |
rank | sum of dimensions for all k subspaces (default: 2). |
a named list of S3 class T4cluster
containing
a length-n vector of class labels (from 1 to k).
name of the algorithm.
Liu G, Lin Z, Yu Y (2010). “Robust Subspace Segmentation by Low-Rank Representation.” In Proceedings of the 27th International Conference on International Conference on Machine Learning, ICML'10, 663–670. ISBN 978-1-60558-907-7.
## generate a toy example
set.seed(10)
tester = genLP(n=100, nl=2, np=1, iso.var=0.1)
data   = tester$data
label  = tester$class

## do PCA for data reduction
proj = base::eigen(stats::cov(data))$vectors[,1:2]
dat2 = data%*%proj

## run LRR algorithm with k=2, 3, and 4 with rank=4
output2 = LRR(data, k=2, rank=4)
output3 = LRR(data, k=3, rank=4)
output4 = LRR(data, k=4, rank=4)

## extract label information
lab2 = output2$cluster
lab3 = output3$cluster
lab4 = output4$cluster

## visualize
opar <- par(no.readonly=TRUE)
par(mfrow=c(1,3))
plot(dat2, pch=19, cex=0.9, col=lab2, main="LRR:K=2")
plot(dat2, pch=19, cex=0.9, col=lab3, main="LRR:K=3")
plot(dat2, pch=19, cex=0.9, col=lab4, main="LRR:K=4")
par(opar)
Low-Rank Subspace Clustering (LRSC) constructs the connectivity of the data by solving
minimize ||C||_* subject to X = XC, C = C'
for the uncorrupted data scenario, where X is a column-stacked data matrix. In the current implementation, the first equality constraint for reconstructiveness of the data can be relaxed by solving
minimize ||C||_* + (tau/2)*||X - XC||_F^2 subject to C = C',
controlled by the regularization parameter tau. If you are interested in enabling a more general class of the problem suggested by the authors, please contact the maintainer of the package.
LRSC(data, k = 2, type = c("relaxed", "exact"), tau = 1)
data | an (n x p) matrix of row-stacked observations. |
k | the number of clusters (default: 2). |
type | type of the problem to be solved; either "relaxed" (default) or "exact". |
tau | regularization parameter for the relaxed-constraint problem (default: 1). |
a named list of S3 class T4cluster
containing
a length-n vector of class labels (from 1 to k).
name of the algorithm.
Vidal R, Favaro P (2014). “Low Rank Subspace Clustering (LRSC).” Pattern Recognition Letters, 43, 47–61. ISSN 01678655.
## generate a toy example
set.seed(10)
tester = genLP(n=100, nl=2, np=1, iso.var=0.1)
data   = tester$data
label  = tester$class

## do PCA for data reduction
proj = base::eigen(stats::cov(data))$vectors[,1:2]
dat2 = data%*%proj

## run LRSC algorithm with k=2,3,4 with relaxed/exact solvers
out2rel = LRSC(data, k=2, type="relaxed")
out3rel = LRSC(data, k=3, type="relaxed")
out4rel = LRSC(data, k=4, type="relaxed")
out2exc = LRSC(data, k=2, type="exact")
out3exc = LRSC(data, k=3, type="exact")
out4exc = LRSC(data, k=4, type="exact")

## extract label information
lab2rel = out2rel$cluster
lab3rel = out3rel$cluster
lab4rel = out4rel$cluster
lab2exc = out2exc$cluster
lab3exc = out3exc$cluster
lab4exc = out4exc$cluster

## visualize
opar <- par(no.readonly=TRUE)
par(mfrow=c(2,3))
plot(dat2, pch=19, cex=0.9, col=lab2rel, main="LRSC Relaxed:K=2")
plot(dat2, pch=19, cex=0.9, col=lab3rel, main="LRSC Relaxed:K=3")
plot(dat2, pch=19, cex=0.9, col=lab4rel, main="LRSC Relaxed:K=4")
plot(dat2, pch=19, cex=0.9, col=lab2exc, main="LRSC Exact:K=2")
plot(dat2, pch=19, cex=0.9, col=lab3exc, main="LRSC Exact:K=3")
plot(dat2, pch=19, cex=0.9, col=lab4exc, main="LRSC Exact:K=4")
par(opar)
For subspace clustering, the traditional method of least squares regression is used to build a coefficient matrix Z that reconstructs the data points by solving
minimize ||X - XZ||_F^2 + lambda*||Z||_F^2,
where X is a column-stacked data matrix. As seen from the equation, we use a denoising version controlled by lambda and provide an option to abide by the constraint diag(Z) = 0 via the zerodiag parameter.
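Without the zero-diagonal constraint, the denoised problem has the closed-form solution Z = (X'X + lambda*I)^{-1} X'X. A minimal sketch under the column-stacked convention above (not the package's internal code):

lsr_coef <- function(X, lambda=1e-5){
  G = crossprod(X)                          # X'X, an n x n Gram matrix
  solve(G + lambda*diag(nrow(G)), G)        # (X'X + lambda*I)^{-1} X'X
}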
LSR(data, k = 2, lambda = 1e-05, zerodiag = TRUE)
data | an (n x p) matrix of row-stacked observations. |
k | the number of clusters (default: 2). |
lambda | regularization parameter (default: 1e-5). |
zerodiag | a logical; TRUE (default) to enforce zero diagonal entries in the coefficient matrix, FALSE otherwise. |
a named list of S3 class T4cluster
containing
a length-n vector of class labels (from 1 to k).
name of the algorithm.
Lu C, Min H, Zhao Z, Zhu L, Huang D, Yan S (2012). “Robust and Efficient Subspace Segmentation via Least Squares Regression.” In Hutchison D, Kanade T, Kittler J, Kleinberg JM, Mattern F, Mitchell JC, Naor M, Nierstrasz O, Pandu Rangan C, Steffen B, Sudan M, Terzopoulos D, Tygar D, Vardi MY, Weikum G, Fitzgibbon A, Lazebnik S, Perona P, Sato Y, Schmid C (eds.), Computer Vision -ECCV 2012, volume 7578, 347–360. Springer Berlin Heidelberg, Berlin, Heidelberg. ISBN 978-3-642-33785-7 978-3-642-33786-4.
## generate a toy example
set.seed(10)
tester = genLP(n=100, nl=2, np=1, iso.var=0.1)
data   = tester$data
label  = tester$class

## do PCA for data reduction
proj = base::eigen(stats::cov(data))$vectors[,1:2]
dat2 = data%*%proj

## run LSR for k=3 with different lambda values
out1 = LSR(data, k=3, lambda=1e-2)
out2 = LSR(data, k=3, lambda=1)
out3 = LSR(data, k=3, lambda=1e+2)

## extract label information
lab1 = out1$cluster
lab2 = out2$cluster
lab3 = out3$cluster

## visualize
opar <- par(no.readonly=TRUE)
par(mfrow=c(1,3))
plot(dat2, pch=19, cex=0.9, col=lab1, main="LSR:lambda=1e-2")
plot(dat2, pch=19, cex=0.9, col=lab2, main="LSR:lambda=1")
plot(dat2, pch=19, cex=0.9, col=lab3, main="LSR:lambda=1e+2")
par(opar)
MSM is a Bayesian model inferring mixtures of subspaces that are of possibly different dimensions. For simplicity, this function returns only a handful of quantities that are most important in representing the mixture model, including projection, location, and hard assignment parameters.
MSM(data, k = 2, ...)
data | an (n x p) matrix of row-stacked observations. |
k | the number of mixtures (default: 2). |
... | extra parameters including iter, the number of iterations (see the example below). |
a list whose elements are S3 class "MSM" instances, which are also lists of the following elements:
a length-k list of projection matrices.
a length-k list of orthonormal bases.
a length-k list of center locations of each mixture.
a length-n vector of cluster labels.
## generate a toy example
set.seed(10)
tester = genLP(n=100, nl=2, np=1, iso.var=0.1)
data   = tester$data
label  = tester$class

## do PCA for data reduction
proj = base::eigen(stats::cov(data))$vectors[,1:2]
dat2 = data%*%proj

## run MSM algorithm with k=2, 3, and 4
maxiter = 500
output2 = MSM(data, k=2, iter=maxiter)
output3 = MSM(data, k=3, iter=maxiter)
output4 = MSM(data, k=4, iter=maxiter)

## extract final clustering information
nrec  = length(output2)
finc2 = output2[[nrec]]$cluster
finc3 = output3[[nrec]]$cluster
finc4 = output4[[nrec]]$cluster

## visualize
opar <- par(no.readonly=TRUE)
par(mfrow=c(3,4))
plot(dat2[,1],dat2[,2],pch=19,cex=0.3,col=finc2+1,main="K=2:PCA")
plot(data[,1],data[,2],pch=19,cex=0.3,col=finc2+1,main="K=2:Axis(1,2)")
plot(data[,1],data[,3],pch=19,cex=0.3,col=finc2+1,main="K=2:Axis(1,3)")
plot(data[,2],data[,3],pch=19,cex=0.3,col=finc2+1,main="K=2:Axis(2,3)")
plot(dat2[,1],dat2[,2],pch=19,cex=0.3,col=finc3+1,main="K=3:PCA")
plot(data[,1],data[,2],pch=19,cex=0.3,col=finc3+1,main="K=3:Axis(1,2)")
plot(data[,1],data[,3],pch=19,cex=0.3,col=finc3+1,main="K=3:Axis(1,3)")
plot(data[,2],data[,3],pch=19,cex=0.3,col=finc3+1,main="K=3:Axis(2,3)")
plot(dat2[,1],dat2[,2],pch=19,cex=0.3,col=finc4+1,main="K=4:PCA")
plot(data[,1],data[,2],pch=19,cex=0.3,col=finc4+1,main="K=4:Axis(1,2)")
plot(data[,1],data[,3],pch=19,cex=0.3,col=finc4+1,main="K=4:Axis(1,3)")
plot(data[,2],data[,3],pch=19,cex=0.3,col=finc4+1,main="K=4:Axis(2,3)")
par(opar)
Let a clustering be a vector of labels obtained from data of n observations, and suppose we are given M such labels. The co-occurrence matrix counts, for each pair of observations i and j, the number of times the two observations belong to the same category/class. PCM serves as a measure of uncertainty embedded in any algorithm with non-deterministic components.
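The counting itself is straightforward. A minimal sketch (assuming 'partitions' is a list of label vectors, as in the example below; not the package's internal code):

cooccurrence <- function(partitions){
  n = length(partitions[[1]])
  C = matrix(0, n, n)
  for (lab in partitions){
    C = C + outer(lab, lab, "==")   # add 1 where i and j share a label
  }
  C
}
## dividing by length(partitions) gives the empirical probability that psm reports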
pcm(partitions)
partitions | clusterings to compare, provided either as (1) an (M x n) matrix whose rows are label vectors, or (2) a length-M list of label vectors. |
an (n x n) matrix whose (i,j) element counts how many times observations i and j belong to the same cluster, ranging from 0 to M.
# -------------------------------------------------------------
# PCM with 'iris' dataset + k-means++
# -------------------------------------------------------------
## PREPARE
data(iris)
X   = as.matrix(iris[,1:4])
lab = as.integer(as.factor(iris[,5]))

## EMBEDDING WITH PCA
X2d = Rdimtools::do.pca(X, ndim=2)$Y

## RUN K-MEANS++ 100 TIMES
partitions = list()
for (i in 1:100){
  partitions[[i]] = kmeanspp(X)$cluster
}

## COMPUTE PCM
iris.pcm = pcm(partitions)

## VISUALIZATION
opar <- par(no.readonly=TRUE)
par(mfrow=c(1,2), pty="s")
plot(X2d, col=lab, pch=19, main="true label")
image(iris.pcm[,150:1], axes=FALSE, main="PCM")
par(opar)
Given an instance of the MSM class from the MSM function, predict the class labels of new data.
## S3 method for class 'MSM' predict(object, newdata, ...)
object | an 'MSM' object obtained from the MSM function. |
newdata | an (m x p) matrix of row-stacked observations. |
... | extra parameters (not necessary). |
a length-m vector of class labels.
Let a clustering be a vector of labels obtained from data of n observations, and suppose we are given M such labels. The posterior similarity matrix, as its name suggests, computes the posterior probability that a pair of observations belongs to the same cluster, under the scenario where multiple clusterings are samples drawn from a posterior distribution within the Bayesian framework. However, it can also be used in non-Bayesian settings, since psm is a measure of uncertainty embedded in any algorithm with non-deterministic components.
psm(partitions)
partitions | clusterings to compare, provided either as (1) an (M x n) matrix whose rows are label vectors, or (2) a length-M list of label vectors. |
an (n x n) matrix whose (i,j) element is the (posterior) probability that observations i and j belong to the same cluster.
# -------------------------------------------------------------
# PSM with 'iris' dataset + k-means++
# -------------------------------------------------------------
## PREPARE
data(iris)
X   = as.matrix(iris[,1:4])
lab = as.integer(as.factor(iris[,5]))

## EMBEDDING WITH PCA
X2d = Rdimtools::do.pca(X, ndim=2)$Y

## RUN K-MEANS++ 100 TIMES
partitions = list()
for (i in 1:100){
  partitions[[i]] = kmeanspp(X)$cluster
}

## COMPUTE PSM
iris.psm = psm(partitions)

## VISUALIZATION
opar <- par(no.readonly=TRUE)
par(mfrow=c(1,2), pty="s")
plot(X2d, col=lab, pch=19, main="true label")
image(iris.psm[,150:1], axes=FALSE, main="PSM")
par(opar)
Zelnik-Manor and Perona proposed a method to define a set of data-driven bandwidth parameters sigma_i, where sigma_i is the distance from a point x_i to its nnbd-th nearest neighbor. The affinity matrix is then defined as
A_ij = exp(-d(x_i, x_j)^2 / (sigma_i * sigma_j)),
and the standard spectral clustering of Ng, Jordan, and Weiss (scNJW) is applied.
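A minimal sketch of the self-tuning affinity with local bandwidths (not the package's internal code):

selftune_affinity <- function(X, nnbd=7){
  D = as.matrix(stats::dist(X))
  sigma = apply(D, 1, function(d) sort(d)[nnbd + 1])   # +1 skips the zero self-distance
  exp(-D^2 / outer(sigma, sigma))
}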
sc05Z(data, k = 2, nnbd = 7, ...)
data | an (n x p) matrix of row-stacked observations. |
k | the number of clusters (default: 2). |
nnbd | neighborhood size to define the data-driven bandwidth parameter (default: 7). |
... | extra parameters for the algorithm. |
a named list of S3 class T4cluster
containing
a length-n vector of class labels (from 1 to k).
eigenvalues of the graph Laplacian's spectral decomposition.
an (n x k) matrix of the low-dimensional embedding.
name of the algorithm.
Zelnik-manor L, Perona P (2005). “Self-Tuning Spectral Clustering.” In Saul LK, Weiss Y, Bottou L (eds.), Advances in Neural Information Processing Systems 17, 1601–1608. MIT Press.
# -------------------------------------------------------------
# clustering with 'iris' dataset
# -------------------------------------------------------------
## PREPARE
data(iris)
X   = as.matrix(iris[,1:4])
lab = as.integer(as.factor(iris[,5]))

## EMBEDDING WITH PCA
X2d = Rdimtools::do.pca(X, ndim=2)$Y

## CLUSTERING WITH DIFFERENT K VALUES
cl2 = sc05Z(X, k=2)$cluster
cl3 = sc05Z(X, k=3)$cluster
cl4 = sc05Z(X, k=4)$cluster

## VISUALIZATION
opar <- par(no.readonly=TRUE)
par(mfrow=c(1,4), pty="s")
plot(X2d, col=lab, pch=19, main="true label")
plot(X2d, col=cl2, pch=19, main="sc05Z: k=2")
plot(X2d, col=cl3, pch=19, main="sc05Z: k=3")
plot(X2d, col=cl4, pch=19, main="sc05Z: k=4")
par(opar)
The algorithm defines a set of data-driven
bandwidth parameters sigma_1, ..., sigma_n, where sigma_i is the average distance from a point x_i to its
nnbd-th nearest neighbor. The affinity matrix is then defined as
A_ij = exp(-d(x_i, x_j)^2 / (sigma_i * sigma_j)), and the standard
spectral clustering of Ng, Jordan, and Weiss (scNJW) is applied.
sc09G(data, k = 2, nnbd = 7, ...)
data |
an (n x p) matrix of row-stacked observations. |
k |
the number of clusters (default: 2). |
nnbd |
neighborhood size to define data-driven bandwidth parameter (default: 7). |
... |
extra parameters including
|
a named list of S3 class T4cluster containing
a length-n vector of class labels (from 1 to k),
the eigenvalues of the graph Laplacian's spectral decomposition,
an (n x k) low-dimensional embedding, and
the name of the algorithm.
Gu R, Wang J (2009). “An Improved Spectral Clustering Algorithm Based on Neighbour Adaptive Scale.” In 2009 International Conference on Business Intelligence and Financial Engineering, 233–236. ISBN 978-0-7695-3705-4.
# -------------------------------------------------------------
# clustering with 'iris' dataset
# -------------------------------------------------------------
## PREPARE
data(iris)
X   = as.matrix(iris[,1:4])
lab = as.integer(as.factor(iris[,5]))

## EMBEDDING WITH PCA
X2d = Rdimtools::do.pca(X, ndim=2)$Y

## CLUSTERING WITH DIFFERENT K VALUES
cl2 = sc09G(X, k=2)$cluster
cl3 = sc09G(X, k=3)$cluster
cl4 = sc09G(X, k=4)$cluster

## VISUALIZATION
opar <- par(no.readonly=TRUE)
par(mfrow=c(1,4), pty="s")
plot(X2d, col=lab, pch=19, main="true label")
plot(X2d, col=cl2, pch=19, main="sc09G: k=2")
plot(X2d, col=cl3, pch=19, main="sc09G: k=3")
plot(X2d, col=cl4, pch=19, main="sc09G: k=4")
par(opar)
The algorithm defines a set of data-driven bandwidth parameters based on the adaptive sort order
of neighbor distances and constructs the corresponding affinity matrix; the standard
spectral clustering of Ng, Jordan, and Weiss (scNJW) is then applied.
sc10Z(data, k = 2, ...)
data |
an (n x p) matrix of row-stacked observations. |
k |
the number of clusters (default: 2). |
... |
extra parameters including
|
a named list of S3 class T4cluster containing
a length-n vector of class labels (from 1 to k),
the eigenvalues of the graph Laplacian's spectral decomposition,
an (n x k) low-dimensional embedding, and
the name of the algorithm.
Zhang Y, Zhou J, Fu Y (2010). “Spectral Clustering Algorithm Based on Adaptive Neighbor Distance Sort Order.” In The 3rd International Conference on Information Sciences and Interaction Sciences, 444–447. ISBN 978-1-4244-7384-7.
# -------------------------------------------------------------
# clustering with 'iris' dataset
# -------------------------------------------------------------
## PREPARE
data(iris)
X   = as.matrix(iris[,1:4])
lab = as.integer(as.factor(iris[,5]))

## EMBEDDING WITH PCA
X2d = Rdimtools::do.pca(X, ndim=2)$Y

## CLUSTERING WITH DIFFERENT K VALUES
cl2 = sc10Z(X, k=2)$cluster
cl3 = sc10Z(X, k=3)$cluster
cl4 = sc10Z(X, k=4)$cluster

## VISUALIZATION
opar <- par(no.readonly=TRUE)
par(mfrow=c(1,4), pty="s")
plot(X2d, col=lab, pch=19, main="true label")
plot(X2d, col=cl2, pch=19, main="sc10Z: k=2")
plot(X2d, col=cl3, pch=19, main="sc10Z: k=3")
plot(X2d, col=cl4, pch=19, main="sc10Z: k=4")
par(opar)
As a data-driven method, the algorithm recovers geodesic distances from a k-nearest
neighbor graph whose edge lengths are scaled by an exponential parameter rho, and applies
random-walk spectral clustering. The authors refer to their construction as a
density-sensitive similarity function.
sc11Y(data, k = 2, nnbd = 7, rho = 2, ...)
data |
an (n x p) matrix of row-stacked observations. |
k |
the number of clusters (default: 2). |
nnbd |
neighborhood size to define data-driven bandwidth parameter (default: 7). |
rho |
exponent scaling parameter (default: 2). |
... |
extra parameters including
|
a named list of S3 class T4cluster containing
a length-n vector of class labels (from 1 to k),
the eigenvalues of the graph Laplacian's spectral decomposition,
an (n x k) low-dimensional embedding, and
the name of the algorithm.
Yang P, Zhu Q, Huang B (2011). “Spectral Clustering with Density Sensitive Similarity Function.” Knowledge-Based Systems, 24(5), 621–628. ISSN 09507051.
# -------------------------------------------------------------
# clustering with 'iris' dataset
# -------------------------------------------------------------
## PREPARE
data(iris)
X   = as.matrix(iris[,1:4])
lab = as.integer(as.factor(iris[,5]))

## EMBEDDING WITH PCA
X2d = Rdimtools::do.pca(X, ndim=2)$Y

## CLUSTERING WITH DIFFERENT K VALUES
cl2 = sc11Y(X, k=2)$cluster
cl3 = sc11Y(X, k=3)$cluster
cl4 = sc11Y(X, k=4)$cluster

## VISUALIZATION
opar <- par(no.readonly=TRUE)
par(mfrow=c(1,4), pty="s")
plot(X2d, col=lab, pch=19, main="true label")
plot(X2d, col=cl2, pch=19, main="sc11Y: k=2")
plot(X2d, col=cl3, pch=19, main="sc11Y: k=3")
plot(X2d, col=cl4, pch=19, main="sc11Y: k=4")
par(opar)
Li and Guo proposed to construct a Gaussian-kernel affinity matrix with a common bandwidth parameter sigma
and to adjust it by neighbor propagation. Standard spectral clustering with the symmetric, normalized graph Laplacian is then applied.
sc12L(data, k = 2, sigma = 1, ...)
data |
an (n x p) matrix of row-stacked observations. |
k |
the number of clusters (default: 2). |
sigma |
common bandwidth parameter (default: 1). |
... |
extra parameters including
|
a named list of S3 class T4cluster containing
a length-n vector of class labels (from 1 to k),
the eigenvalues of the graph Laplacian's spectral decomposition,
an (n x k) low-dimensional embedding, and
the name of the algorithm.
Li X, Guo L (2012). “Constructing Affinity Matrix in Spectral Clustering Based on Neighbor Propagation.” Neurocomputing, 97, 125–130. ISSN 09252312.
# -------------------------------------------------------------
# clustering with 'iris' dataset
# -------------------------------------------------------------
## PREPARE
data(iris)
X   = as.matrix(iris[,1:4])
lab = as.integer(as.factor(iris[,5]))

## EMBEDDING WITH PCA
X2d = Rdimtools::do.pca(X, ndim=2)$Y

## CLUSTERING WITH DIFFERENT K VALUES
cl2 = sc12L(X, k=2)$cluster
cl3 = sc12L(X, k=3)$cluster
cl4 = sc12L(X, k=4)$cluster

## VISUALIZATION
opar <- par(no.readonly=TRUE)
par(mfrow=c(1,4), pty="s")
plot(X2d, col=lab, pch=19, main="true label")
plot(X2d, col=cl2, pch=19, main="sc12L: k=2")
plot(X2d, col=cl3, pch=19, main="sc12L: k=3")
plot(X2d, col=cl4, pch=19, main="sc12L: k=4")
par(opar)
The version of Ng, Jordan, and Weiss first constructs the affinity matrix
A_ij = exp(-d(x_i, x_j)^2 / sigma^2), where sigma is a common bandwidth parameter, and then performs
k-means (or, optionally, GMM) clustering on the rows of the eigenvector matrix of the symmetric normalized
graph Laplacian L_sym = D^{-1/2} (D - A) D^{-1/2}, where D is the diagonal degree matrix of A.
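For readers who want to see the pipeline spelled out, a rough base-R sketch follows: Gaussian affinity, symmetric normalized Laplacian, row-normalized eigenvectors, then k-means. This is only an illustration of the idea, not the package's internal code.

## sketch of an NJW-style pipeline in base R (illustration only)
sc_njw_sketch <- function(X, k = 2, sigma = 1){
  D2   <- as.matrix(dist(X))^2
  A    <- exp(-D2 / sigma^2); diag(A) <- 0                      # Gaussian affinity, zero diagonal
  dvec <- rowSums(A)
  Lsym <- diag(nrow(A)) - diag(1/sqrt(dvec)) %*% A %*% diag(1/sqrt(dvec))  # symmetric normalized Laplacian
  eig  <- eigen(Lsym, symmetric = TRUE)
  V    <- eig$vectors[, order(eig$values)[1:k], drop = FALSE]   # k smallest eigenvectors
  V    <- V / sqrt(rowSums(V^2))                                # row-normalization
  stats::kmeans(V, centers = k, nstart = 10)$cluster
}
## e.g., sc_njw_sketch(as.matrix(iris[,1:4]), k = 3)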
scNJW(data, k = 2, sigma = 1, ...)
data |
an (n x p) matrix of row-stacked observations. |
k |
the number of clusters (default: 2). |
sigma |
bandwidth parameter (default: 1). |
... |
extra parameters including
|
a named list of S3 class T4cluster containing
a length-n vector of class labels (from 1 to k),
the eigenvalues of the graph Laplacian's spectral decomposition,
an (n x k) low-dimensional embedding, and
the name of the algorithm.
Ng AY, Jordan MI, Weiss Y (2002). “On Spectral Clustering: Analysis and an Algorithm.” In Dietterich TG, Becker S, Ghahramani Z (eds.), Advances in Neural Information Processing Systems 14, 849–856. MIT Press.
# -------------------------------------------------------------
# clustering with 'iris' dataset
# -------------------------------------------------------------
## PREPARE
data(iris)
X   = as.matrix(iris[,1:4])
lab = as.integer(as.factor(iris[,5]))

## EMBEDDING WITH PCA
X2d = Rdimtools::do.pca(X, ndim=2)$Y

## CLUSTERING WITH DIFFERENT K VALUES
cl2 = scNJW(X, k=2)$cluster
cl3 = scNJW(X, k=3)$cluster
cl4 = scNJW(X, k=4)$cluster

## VISUALIZATION
opar <- par(no.readonly=TRUE)
par(mfrow=c(1,4), pty="s")
plot(X2d, col=lab, pch=19, main="true label")
plot(X2d, col=cl2, pch=19, main="scNJW: k=2")
plot(X2d, col=cl3, pch=19, main="scNJW: k=3")
plot(X2d, col=cl4, pch=19, main="scNJW: k=4")
par(opar)
The version of Shi and Malik first constructs the affinity matrix
A_ij = exp(-d(x_i, x_j)^2 / sigma^2), where sigma is a common bandwidth parameter, and then performs
k-means (or, optionally, GMM) clustering on the rows of the eigenvector matrix of the random-walk
graph Laplacian L_rw = D^{-1} (D - A), where D is the diagonal degree matrix of A.
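Relative to the scNJW sketch above, the only change is the Laplacian. A short sketch of that step, under the same illustrative assumptions:

## sketch: random-walk graph Laplacian L_rw = I - D^{-1} A from an affinity matrix A
## (its eigenvectors equal D^{-1/2} times those of the symmetric normalized Laplacian)
rw_laplacian <- function(A){
  dvec <- rowSums(A)
  diag(nrow(A)) - diag(1/dvec) %*% A
}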
scSM(data, k = 2, sigma = 1, ...)
data |
an (n x p) matrix of row-stacked observations. |
k |
the number of clusters (default: 2). |
sigma |
bandwidth parameter (default: 1). |
... |
extra parameters including
|
a named list of S3 class T4cluster containing
a length-n vector of class labels (from 1 to k),
the eigenvalues of the graph Laplacian's spectral decomposition,
an (n x k) low-dimensional embedding, and
the name of the algorithm.
Shi J, Malik J (Aug./2000). “Normalized Cuts and Image Segmentation.” IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8), 888–905. ISSN 01628828.
# -------------------------------------------------------------
# clustering with 'iris' dataset
# -------------------------------------------------------------
## PREPARE WITH SUBSET OF DATA
data(iris)
sid = sample(1:150, 50)
X   = as.matrix(iris[sid,1:4])
lab = as.integer(as.factor(iris[sid,5]))

## EMBEDDING WITH PCA
X2d = Rdimtools::do.pca(X, ndim=2)$Y

## CLUSTERING WITH DIFFERENT K VALUES
cl2 = scSM(X, k=2)$cluster
cl3 = scSM(X, k=3)$cluster
cl4 = scSM(X, k=4)$cluster

## VISUALIZATION
opar <- par(no.readonly=TRUE)
par(mfrow=c(1,4), pty="s")
plot(X2d, col=lab, pch=19, main="true label")
plot(X2d, col=cl2, pch=19, main="scSM: k=2")
plot(X2d, col=cl3, pch=19, main="scSM: k=3")
plot(X2d, col=cl4, pch=19, main="scSM: k=4")
par(opar)
This version first constructs the affinity matrix
A_ij = exp(-d(x_i, x_j)^2 / sigma^2), where sigma is a common bandwidth parameter, and then performs
k-means (or, optionally, GMM) clustering on the rows of the eigenvector matrix of the unnormalized
graph Laplacian L = D - A, where D is the diagonal degree matrix of A.
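A corresponding sketch with the unnormalized Laplacian, again only to illustrate the construction:

## sketch: spectral clustering with the unnormalized Laplacian L = D - A (illustration only)
spectral_unnormalized <- function(A, k = 2){
  L   <- diag(rowSums(A)) - A                                  # L = D - A
  eig <- eigen(L, symmetric = TRUE)
  V   <- eig$vectors[, order(eig$values)[1:k], drop = FALSE]   # k smallest eigenvectors
  stats::kmeans(V, centers = k, nstart = 10)$cluster
}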
scUL(data, k = 2, sigma = 1, ...)
data |
an (n x p) matrix of row-stacked observations. |
k |
the number of clusters (default: 2). |
sigma |
bandwidth parameter (default: 1). |
... |
extra parameters including
|
a named list of S3 class T4cluster containing
a length-n vector of class labels (from 1 to k),
the eigenvalues of the graph Laplacian's spectral decomposition,
an (n x k) low-dimensional embedding, and
the name of the algorithm.
von Luxburg U (2007). “A Tutorial on Spectral Clustering.” Statistics and Computing, 17(4), 395–416. ISSN 0960-3174, 1573-1375.
# -------------------------------------------------------------
# clustering with 'iris' dataset
# -------------------------------------------------------------
## PREPARE
data(iris)
X   = as.matrix(iris[,1:4])
lab = as.integer(as.factor(iris[,5]))

## EMBEDDING WITH PCA
X2d = Rdimtools::do.pca(X, ndim=2)$Y

## CLUSTERING WITH DIFFERENT K VALUES
cl2 = scUL(X, k=2)$cluster
cl3 = scUL(X, k=3)$cluster
cl4 = scUL(X, k=4)$cluster

## VISUALIZATION
opar <- par(no.readonly=TRUE)
par(mfrow=c(1,4), pty="s")
plot(X2d, col=lab, pch=19, main="true label")
plot(X2d, col=cl2, pch=19, main="scUL: k=2")
plot(X2d, col=cl3, pch=19, main="scUL: k=3")
plot(X2d, col=cl4, pch=19, main="scUL: k=4")
par(opar)
The spherical k-means algorithm clusters data residing
on the unit hypersphere using cosine similarity. If the data are not
normalized, the function normalizes each observation to unit length and proceeds.
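The idea can be sketched with a bare-bones Lloyd-type iteration: assign each unit-norm observation to the most similar unit-norm mean under cosine similarity, then recompute and re-normalize the means. The following is an illustrative sketch only; the package implementation may use different initialization and stopping rules.

## sketch of spherical k-means (illustration only)
spkmeans_sketch <- function(X, k = 2, maxiter = 50){
  X <- X / sqrt(rowSums(X^2))                    # project observations onto the unit sphere
  M <- X[sample(nrow(X), k), , drop = FALSE]     # random initial means (already unit-norm)
  for (it in 1:maxiter){
    cl <- max.col(X %*% t(M))                    # assign by largest cosine similarity
    for (j in 1:k){
      m <- colSums(X[cl == j, , drop = FALSE])
      if (sum(m^2) > 0) M[j, ] <- m / sqrt(sum(m^2))   # unit-norm class mean
    }
  }
  list(cluster = cl, means = M)
}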
spkmeans(data, k = 2, ...)
data |
an (n x p) matrix of row-stacked observations. |
k |
the number of clusters (default: 2). |
... |
extra parameters including
|
a named list of S3 class T4cluster containing
a length-n vector of class labels (from 1 to k),
the value of the cost function,
a (k x p) matrix where each row is a unit-norm class mean, and
the name of the algorithm.
I. S. Dhillon and D. S. Modha (2001). "Concept decompositions for large sparse text data using clustering." Machine Learning, 42:143–175.
# -------------------------------------------------------------
# clustering with 'household' dataset
# -------------------------------------------------------------
## PREPARE
data(household, package="T4cluster")
X   = household$data
lab = as.integer(household$gender)

## EXECUTE SPKMEANS WITH VARYING K's
vec.rand = rep(0, 9)
for (i in 1:9){
  clust_i     = spkmeans(X, k=(i+1))$cluster
  vec.rand[i] = compare.rand(clust_i, lab)
}

## VISUALIZE THE RAND INDEX
opar <- par(no.readonly=TRUE)
plot(2:10, vec.rand, type="b", pch=19, ylim=c(0.5, 1),
     ylab="Rand index", xlab="number of clusters",
     main="clustering quality index over varying k's")
par(opar)
Sparse Subspace Clustering (SSC) assumes that the data points lie in
a union of low-dimensional subspaces. The algorithm constructs local
connectivity and uses that information for spectral clustering. SSC here
is an implementation based on basis pursuit for sparse reconstruction under the
model without systematic noise, which solves
min_C ||C||_1 subject to Y = Y C and diag(C) = 0
for the column-stacked data matrix Y. If you are interested in the
full implementation of the algorithm with sparse outliers and noise, please
contact the maintainer.
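Once a sparse self-representation matrix C has been obtained by some basis-pursuit solver, the remaining steps are generic: symmetrize |C| into an affinity and run spectral clustering. The sketch below assumes such a C is given and illustrates only this back end (with an assumed symmetric normalized Laplacian), not the sparse recovery itself.

## sketch: from a self-representation matrix C to cluster labels (illustration only)
ssc_backend <- function(C, k = 2){
  W    <- abs(C) + t(abs(C))                                   # symmetric affinity from coefficients
  dvec <- pmax(rowSums(W), 1e-12)                              # guard against isolated points
  Lsym <- diag(nrow(W)) - diag(1/sqrt(dvec)) %*% W %*% diag(1/sqrt(dvec))
  eig  <- eigen(Lsym, symmetric = TRUE)
  V    <- eig$vectors[, order(eig$values)[1:k], drop = FALSE]  # k smallest eigenvectors
  stats::kmeans(V, centers = k, nstart = 10)$cluster
}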
SSC(data, k = 2)
data |
an (n x p) matrix of row-stacked observations. |
k |
the number of clusters (default: 2). |
a named list of S3 class T4cluster containing
a length-n vector of class labels (from 1 to k), and
the name of the algorithm.
Elhamifar E, Vidal R (2009). “Sparse Subspace Clustering.” In 2009 IEEE Conference on Computer Vision and Pattern Recognition, 2790–2797. ISBN 978-1-4244-3992-8.
## generate a toy example
set.seed(10)
tester = genLP(n=100, nl=2, np=1, iso.var=0.1)
data   = tester$data
label  = tester$class

## do PCA for data reduction
proj = base::eigen(stats::cov(data))$vectors[,1:2]
dat2 = data%*%proj

## run SSC algorithm with k=2, 3, and 4
output2 = SSC(data, k=2)
output3 = SSC(data, k=3)
output4 = SSC(data, k=4)

## extract label information
lab2 = output2$cluster
lab3 = output3$cluster
lab4 = output4$cluster

## visualize
opar <- par(no.readonly=TRUE)
par(mfrow=c(3,4))
plot(dat2[,1],dat2[,2],pch=19,cex=0.3,col=lab2,main="K=2:PCA")
plot(data[,1],data[,2],pch=19,cex=0.3,col=lab2,main="K=2:Axis(1,2)")
plot(data[,1],data[,3],pch=19,cex=0.3,col=lab2,main="K=2:Axis(1,3)")
plot(data[,2],data[,3],pch=19,cex=0.3,col=lab2,main="K=2:Axis(2,3)")
plot(dat2[,1],dat2[,2],pch=19,cex=0.3,col=lab3,main="K=3:PCA")
plot(data[,1],data[,2],pch=19,cex=0.3,col=lab3,main="K=3:Axis(1,2)")
plot(data[,1],data[,3],pch=19,cex=0.3,col=lab3,main="K=3:Axis(1,3)")
plot(data[,2],data[,3],pch=19,cex=0.3,col=lab3,main="K=3:Axis(2,3)")
plot(dat2[,1],dat2[,2],pch=19,cex=0.3,col=lab4,main="K=4:PCA")
plot(data[,1],data[,2],pch=19,cex=0.3,col=lab4,main="K=4:Axis(1,2)")
plot(data[,1],data[,3],pch=19,cex=0.3,col=lab4,main="K=4:Axis(1,3)")
plot(data[,2],data[,3],pch=19,cex=0.3,col=lab4,main="K=4:Axis(2,3)")
par(opar)
Subspace Segmentation via Quadratic Programming (SSQP) solves the following problem
min_Z ||X Z - X||_F^2 + lambda * ||Z^T Z||_1 subject to Z >= 0 and diag(Z) = 0,
where X is a column-stacked data matrix. The computed Z is
then used as an affinity matrix for spectral clustering.
SSQP(data, k = 2, lambda = 1e-05, ...)
data |
an (n x p) matrix of row-stacked observations. |
k |
the number of clusters (default: 2). |
lambda |
regularization parameter (default: 1e-5). |
... |
extra parameters for the gradient descent algorithm including
|
a named list of S3 class T4cluster containing
a length-n vector of class labels (from 1 to k), and
the name of the algorithm.
Wang S, Yuan X, Yao T, Yan S, Shen J (2011). “Efficient Subspace Segmentation via Quadratic Programming.” In Proceedings of the Twenty-Fifth AAAI Conference on Artificial Intelligence, AAAI'11, 519–524.
## generate a toy example
set.seed(10)
tester = genLP(n=100, nl=2, np=1, iso.var=0.1)
data   = tester$data
label  = tester$class

## do PCA for data reduction
proj = base::eigen(stats::cov(data))$vectors[,1:2]
dat2 = data%*%proj

## run SSQP for k=3 with different lambda values
out1 = SSQP(data, k=3, lambda=1e-2)
out2 = SSQP(data, k=3, lambda=1)
out3 = SSQP(data, k=3, lambda=1e+2)

## extract label information
lab1 = out1$cluster
lab2 = out2$cluster
lab3 = out3$cluster

## visualize
opar <- par(no.readonly=TRUE)
par(mfrow=c(1,3))
plot(dat2, pch=19, cex=0.9, col=lab1, main="SSQP:lambda=1e-2")
plot(dat2, pch=19, cex=0.9, col=lab2, main="SSQP:lambda=1")
plot(dat2, pch=19, cex=0.9, col=lab3, main="SSQP:lambda=1e+2")
par(opar)