Title: | Tools for Clustering High-Dimensional Data |
---|---|
Description: | Tools for clustering high-dimensional data. In particular, it contains the methods described in <doi:10.1093/bioinformatics/btaa243>, <arXiv:2010.00950>. |
Authors: | Jakob Raymaekers [aut, cre], Ruben Zamar [aut] |
Maintainer: | Jakob Raymaekers <[email protected]> |
License: | GPL (>= 2) |
Version: | 1.0.2 |
Built: | 2024-12-14 06:41:53 UTC |
Source: | CRAN |
Make diagnostic plots for HTK-means clustering.
diagPlot(HTKmeans.out, type = 1)
HTKmeans.out |
the output of a call to HTKmeans. |
type |
if type = 1, the regularization path is plotted; if type = 2, the differences in WCSS and ARI are plotted against the number of active variables. |
This visualization plots the regularization path or the differences in WCSS and ARI against the number of active variables.
No return value; the plot is produced directly.
J. Raymaekers and R.H. Zamar
Raymaekers, Jakob, and Ruben H. Zamar. "Regularized K-means through hard-thresholding." arXiv preprint arXiv:2010.00950 (2020).
X <- iris[, -5]
lambdas <- seq(0, 1, by = 0.01)
HTKmeans.out <- HTKmeans(X, 3, lambdas)
diagPlot(HTKmeans.out, 1)
diagPlot(HTKmeans.out, 2)
Select the regularization parameter for HTK-means clustering based on information criteria.
getLambda(HTKmeans.out, type = "AIC")
HTKmeans.out |
the output of a call to HTKmeans. |
type |
either "AIC" or "BIC". |
This function selects the best value of lambda (based on the AIC or BIC information criterion) out of the HTKmeans.out$inputargs$lambdas sequence of values.
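As a rough illustration of how such a criterion-based selection can work, the sketch below scores each lambda by a penalized within-cluster sum of squares and picks the minimizer. The scoring formula is a common generic form, not necessarily the exact criterion computed by getLambda(); the helper name and toy numbers are invented for illustration.

```r
# Generic sketch of information-criterion-based lambda selection.
# NOTE: this is a common generic penalized-fit form, not necessarily
# the exact formula used internally by getLambda().
select_lambda <- function(wcss, nactive, n, lambdas, type = c("AIC", "BIC")) {
  type <- match.arg(type)
  penalty <- if (type == "AIC") 2 else log(n)   # AIC vs BIC penalty per parameter
  ic <- n * log(wcss / n) + penalty * nactive   # penalize active variables
  lambdas[which.min(ic)]
}

# Toy inputs: WCSS grows as larger lambdas shrink more variables to zero
lambdas <- c(0, 0.1, 0.2, 0.3)
wcss    <- c(100, 110, 150, 300)  # within-cluster sum of squares per lambda
nactive <- c(4, 3, 2, 1)          # number of active variables per lambda
select_lambda(wcss, nactive, n = 150, lambdas, type = "BIC")  # selects 0 here
```

The trade-off is visible in the toy numbers: each dropped variable saves one penalty term but must be paid for by the increase in WCSS.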
The selected value for lambda
J. Raymaekers and R.H. Zamar
Raymaekers, Jakob, and Ruben H. Zamar. "Regularized K-means through hard-thresholding." arXiv preprint arXiv:2010.00950 (2020).
X <- mclust::banknote
y <- as.numeric(as.factor(X[, 1]))
lambdas <- seq(0, 1, by = 0.01)
X <- X[, -1]
HTKmeans.out <- HTKmeans(X, 2, lambdas)
# Both AIC and BIC suggest a lambda of 0.02 here:
getLambda(HTKmeans.out, "AIC")
getLambda(HTKmeans.out, "BIC")
Perform HTK-means clustering (Raymaekers and Zamar, 2022) on a data matrix.
HTKmeans(X, k, lambdas = NULL, standardize = TRUE, iter.max = 100, nstart = 100, nlambdas = 50, lambda_max = 1, verbose = FALSE)
X |
a matrix containing the data. |
k |
the number of clusters. |
lambdas |
a vector of values for the regularization parameter lambda. If NULL, a sequence of nlambdas values up to lambda_max is generated automatically. |
standardize |
logical flag for standardization to mean 0 and variance 1 of the data in X. Defaults to TRUE. |
iter.max |
the maximum number of iterations allowed. |
nstart |
number of starts used when k-means is applied to generate the starting values for HTK-means. See below for more info. |
nlambdas |
number of lambda values to generate automatically when lambdas is NULL. Defaults to 50. |
lambda_max |
maximum value for the regularization parameter lambda. Defaults to 1. |
verbose |
whether or not to print progress. Defaults to FALSE. |
The algorithm starts by generating a number of sparse starting values. This is done using k-means on subsets of variables. See Raymaekers and Zamar (2022) for details.
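A minimal sketch of that starting-value idea (a hypothetical helper, not the package's internal code): run plain k-means on a random subset of variables and embed the resulting centers back into the full-dimensional space, with the excluded coordinates set to zero so they start out inactive on standardized data.

```r
# Sketch: sparse starting centers via k-means on a subset of variables.
# Illustration only; HTKmeans() uses its own internal starting scheme.
sparse_start <- function(X, k, nvars, nstart = 10) {
  X <- scale(X)                           # standardized data: mean 0, variance 1
  keep <- sample(ncol(X), nvars)          # random subset of variables
  km <- kmeans(X[, keep, drop = FALSE], centers = k, nstart = nstart)
  centers <- matrix(0, nrow = k, ncol = ncol(X))  # excluded coordinates stay 0
  centers[, keep] <- km$centers
  centers
}

set.seed(1)
starts <- sparse_start(iris[, 1:4], k = 3, nvars = 2)
dim(starts)  # 3 x 4: k centers in full dimension, two columns all zero
```

Repeating this over many variable subsets yields a pool of sparse candidate starts, matching the idea of favoring solutions in which some coordinates are exactly zero.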
A list with components:
HTKmeans.out |
A list with length equal to the number of lambda values supplied in lambdas. Each element of this list is in turn a list containing:
centers: a matrix of cluster centers.
cluster: a vector of integers (from 1:k) indicating the cluster to which each point is allocated.
itnb: the number of iterations executed until convergence.
converged: whether the algorithm stopped by converging or by reaching the maximum number of iterations.
inputargs
the input arguments to the function.
J. Raymaekers and R.H. Zamar
Raymaekers, Jakob, and Ruben H. Zamar. "Regularized K-means through hard-thresholding." arXiv preprint arXiv:2010.00950 (2020).
X <- iris[, 1:4]
HTKmeans.out <- HTKmeans(X, k = 3, lambdas = 0.8)
HTKmeans.out[[1]]$centers
pairs(X, col = HTKmeans.out[[1]]$cluster)
The function computes a scale for each variable in the data. The result can then be used to standardize a dataset before applying a clustering algorithm (such as k-means). The scale estimation is based on pooled scale estimators, which result from clustering the individual variables in the data. The method is proposed in Raymaekers and Zamar (2020) <doi:10.1093/bioinformatics/btaa243>.
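The core idea can be sketched for a single variable as follows: cluster the variable, then pool the within-cluster variances. This is a simplified illustration with a hypothetical helper and a fixed number of clusters, whereas PVS() selects the number of clusters per variable (e.g. via the gap statistic).

```r
# Sketch: pooled scale estimate for one variable.
# Simplification: k is fixed here; PVS() chooses the number of clusters
# per variable and handles the surrounding edge cases.
pooled_scale <- function(x, k = 2) {
  cl <- kmeans(x, centers = k, nstart = 25)$cluster
  n <- length(x)
  # pooled within-cluster variance: squared deviations from each
  # cluster's own mean, divided by (n - k) degrees of freedom
  ss <- sum(tapply(x, cl, function(v) sum((v - mean(v))^2)))
  sqrt(ss / (n - k))
}

set.seed(123)
pooled_scale(iris$Petal.Length, k = 2)  # smaller than sd(iris$Petal.Length)
```

Because the between-cluster spread is removed, this pooled scale is small for variables with well-separated clusters, so dividing by it up-weights exactly the variables that carry cluster structure.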
PVS(X, kmax = 3, dist = "euclidean", method = "gap", B = 1000, gapMethod = "firstSEmax", minSize = 0.05, rDist = runif, SE.factor = 1, refDist = NULL)
X |
an n by p data matrix. |
kmax |
maximum number of clusters in one variable. Default is 3. |
dist |
the distance measure used for clustering the individual variables. Defaults to "euclidean". |
method |
the method used to estimate the number of clusters in each variable. Defaults to "gap" (the gap statistic). |
B |
number of bootstrap samples for the reference distribution of the gap statistic. Default is 1000. |
gapMethod |
method to define the number of clusters in the gap statistic. See cluster::maxSE. Defaults to "firstSEmax". |
minSize |
minimum cluster size as a percentage of the total number of observations. Defaults to 0.05. |
rDist |
Optional. Reference distribution (as a function) for the gap statistic. Defaults to runif, the uniform distribution. |
SE.factor |
factor for determining the number of clusters when using the gap statistic. See cluster::maxSE. Defaults to 1. |
refDist |
Optional. A precomputed reference distribution for the gap statistic, used instead of generating one with rDist. |
A vector of length p containing the estimated scales for the variables.
Jakob Raymaekers
Raymaekers, J, Zamar, R.H. (2020). Pooled variable scaling for cluster analysis. Bioinformatics, 36(12), 3849-3855. doi:10.1093/bioinformatics/btaa243
X <- iris[, -5]
y <- unclass(iris[, 5])

# Compute scales using different scale estimators.
# The pooled standard deviation is considerably smaller for variables 3 and 4:
sds <- apply(X, 2, sd); round(sds, 2)
ranges <- apply(X, 2, function(y) diff(range(y))); round(ranges, 2)
psds <- PVS(X); round(psds, 2)

# Now cluster using k-means after scaling the data
nbclus <- 3
kmeans.std <- kmeans(X, nbclus, nstart = 100) # no scaling
kmeans.sd <- kmeans(scale(X), nbclus, nstart = 100)
kmeans.rg <- kmeans(scale(X, scale = ranges), nbclus, nstart = 100)
kmeans.psd <- kmeans(scale(X, scale = psds), nbclus, nstart = 100)

# Calculate the Adjusted Rand Index for each of the clustering outcomes
round(mclust::adjustedRandIndex(y, kmeans.std$cluster), 2)
round(mclust::adjustedRandIndex(y, kmeans.sd$cluster), 2)
round(mclust::adjustedRandIndex(y, kmeans.rg$cluster), 2)
round(mclust::adjustedRandIndex(y, kmeans.psd$cluster), 2)