Title: Minimization of the Convex Clustering Loss Function
Description: Implements the convex clustering through majorization-minimization (CCMM) algorithm described in Touw, Groenen, and Terada (2022) <doi:10.48550/arXiv.2211.01877> to perform minimization of the convex clustering loss function.
Authors: Daniel Touw [aut, cre], Patrick Groenen [aut], Yoshikazu Terada [aut]
Maintainer: Daniel Touw <[email protected]>
License: GPL (>= 3)
Version: 0.2.1
Built: 2024-12-08 07:18:49 UTC
Source: CRAN
Convert a cvxclust object into an hclust object

Converts the output of convex_clustering or convex_clusterpath into an hclust object. Note that a step in the clusterpath from one value for lambda to the next may cause the number of clusters to decrease by more than one. It is a hard requirement that the clusterpath ends in a single cluster, as standard dendrogram plotting methods fail if this is not the case.
## S3 method for class 'cvxclust'
as.hclust(x, ...)
x: A cvxclust object.
...: Unused.
An hclust object.
# Demonstration of converting a clusterpath into a dendrogram, first generate
# data
set.seed(6)
X = matrix(rnorm(14), ncol = 2)
y = rep(1, nrow(X))

# Get sparse distances in dictionary of keys format with k = 3
W = sparse_weights(X, 3, 4.0)

# Sequence for lambda
lambdas = seq(0, 45, 0.02)

# Compute results
res = convex_clusterpath(X, W, lambdas)

# Generate hclust object
hcl = as.hclust(res)
hcl$height = sqrt(hcl$height)

# Plot clusterpath and dendrogram
oldpar = par(mfrow=c(1, 2))
plot(res, y, label = c(1:7))
plot(hcl, ylab = expression(sqrt(lambda)), xlab = NA, sub = NA, main = NA,
     hang = -1)
par(oldpar)
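Because standard dendrogram plotting methods fail when the clusterpath does not end in a single cluster, it can be useful to verify this before converting. A minimal sketch, assuming res is the cvxclust object from the example above:

# Check that the clusterpath ends in a single cluster before converting;
# num_clusters lists all cluster counts found along the path.
if (min(res$num_clusters) == 1) {
  hcl = as.hclust(res)
} else {
  warning("Clusterpath does not end in a single cluster; consider increasing max(lambdas).")
}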
Get a particular clustering of the data. If there is a clustering for n_clusters, it is returned; otherwise the function stops with a message. To know beforehand whether a query will be successful, check the num_clusters attribute of the cvxclust object, which lists all possible options for the number of clusters.
clusters(obj, n_clusters)
obj: A cvxclust object.
n_clusters: An integer that specifies the number of clusters that should be returned.
A vector with the cluster labels for each object in the data.
# Load data
data(two_half_moons)
data = as.matrix(two_half_moons)
X = data[, -3]
y = data[, 3]

# Get sparse distances in dictionary of keys format with k = 5 and phi = 8
W = sparse_weights(X, 5, 8.0)

# Set a sequence for lambda
lambdas = seq(0, 2400, 1)

# Compute results CCMM
res = convex_clusterpath(X, W, lambdas)

# Get labels for three clusters
labels = clusters(res, 3)
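A minimal sketch of the beforehand check mentioned above, assuming res is the cvxclust object from the example:

# Query a clustering only if it is available on the clusterpath
n_clusters = 3
if (n_clusters %in% res$num_clusters) {
  labels = clusters(res, n_clusters)
  table(labels)      # cluster sizes
} else {
  res$num_clusters   # the attainable numbers of clusters
}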
convex_clustering attempts to find the number of clusters specified by the user by means of convex clustering. The algorithm looks for each number of clusters between target_low and target_high. If target_low = target_high, the algorithm searches for a single clustering. It is recommended to specify a range around the desired number of clusters, as not every number of clusters between 1 and nrow(X) may be attainable due to numerical inaccuracies.
convex_clustering(
  X,
  W,
  target_low,
  target_high = NULL,
  max_iter_phase_1 = 2000,
  max_iter_phase_2 = 20,
  lambda_init = 0.01,
  factor = 0.025,
  tau = 0.001,
  center = TRUE,
  scale = TRUE,
  eps_conv = 1e-06,
  burnin_iter = 25,
  max_iter_conv = 5000,
  save_clusterpath = FALSE,
  verbose = 0
)
X: An n x p numeric matrix containing the data to be clustered.
W: A sparseweights object containing the weights, as obtained from sparse_weights.
target_low: Lower bound on the number of clusters that should be searched for. If target_high = NULL, the algorithm searches for exactly target_low clusters.
target_high: Upper bound on the number of clusters that should be searched for. Default is NULL, in which case only target_low is used.
max_iter_phase_1: Maximum number of iterations to find an upper and lower bound for the value for lambda for which the desired number of clusters is attained. Default is 2000.
max_iter_phase_2: Maximum number of iterations to refine the upper and lower bounds for lambda. Default is 20.
lambda_init: The first value for lambda other than 0 to use for convex clustering. Default is 0.01.
factor: The percentage by which to increase lambda in each step (see the sketch after this argument list). Default is 0.025.
tau: Parameter to compute the threshold to fuse clusters. Default is 0.001.
center: If TRUE, center X so that each column has mean zero. Default is TRUE.
scale: If TRUE, scale the loss function so that the cluster solution is invariant to the scale of X. Default is TRUE.
eps_conv: Parameter for determining convergence of the minimization. Default is 1e-6.
burnin_iter: Number of updates of the loss function that are done without step doubling. Default is 25.
max_iter_conv: Maximum number of iterations for minimizing the loss function. Default is 5000.
save_clusterpath: If TRUE, the clusterpath coordinates are stored and returned. Default is FALSE.
verbose: Verbosity of the information printed during clustering. Default is 0, no output.
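To make the roles of lambda_init and factor concrete, the following is a small illustrative sketch of the multiplicative grid that the search could traverse under a literal reading of "increase lambda by a percentage in each step"; the exact schedule used internally by the package may differ.

# Illustration only: candidate values for lambda when each step increases the
# previous value by `factor` (2.5% by default), starting from lambda_init.
# This mirrors the argument descriptions above, not the package's internal code.
lambda_init = 0.01
factor = 0.025
lambdas_phase_1 = lambda_init * (1 + factor)^(0:9)
round(lambdas_phase_1, 5)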
A cvxclust object containing the following:
info: A dataframe containing for each value for lambda: the number of different clusters, and the value of the loss function at the minimum.
merge: The merge table containing the order in which the observations in X are clustered.
height: The value for lambda at which each reduction in the number of clusters occurs.
order: The order of the observations in X for plotting the dendrogram.
elapsed_time: The number of seconds that elapsed while running the code. Note that this does not include the time required for input checking and possibly scaling and centering X.
coordinates: The clusterpath coordinates. Only part of the output if save_clusterpath = TRUE.
lambdas: The values for lambda for which a clustering was found.
eps_fusions: The threshold for cluster fusions that was used by the algorithm.
phase_1_instances: The number of instances of the loss function that were minimized while finding an upper and lower bound for lambda.
phase_2_instances: The number of instances of the loss function that were minimized while refining the value for lambda. The sum of phase_1_instances and phase_2_instances is the total number of minimizations performed.
num_clusters: The different numbers of clusters that have been found.
n: The number of observations in X.
See also: convex_clusterpath, sparse_weights.
# Load data
data(two_half_moons)
data = as.matrix(two_half_moons)
X = data[, -3]
y = data[, 3]

# Get sparse weights in dictionary of keys format with k = 5 and phi = 8
W = sparse_weights(X, 5, 8.0)

# Perform convex clustering with a target number of clusters
res1 = convex_clustering(X, W, target_low = 2, target_high = 5)

# Plot the clustering for 2 to 5 clusters
oldpar = par(mfrow=c(2, 2))
plot(X, col = clusters(res1, 2), main = "2 clusters", pch = 19)
plot(X, col = clusters(res1, 3), main = "3 clusters", pch = 19)
plot(X, col = clusters(res1, 4), main = "4 clusters", pch = 19)
plot(X, col = clusters(res1, 5), main = "5 clusters", pch = 19)

# A more generalized approach to plotting the results of a range of clusters
res2 = convex_clustering(X, W, target_low = 2, target_high = 7)

# Plot the clusterings
k = length(res2$num_clusters)
par(mfrow=c(ceiling(k / ceiling(sqrt(k))), ceiling(sqrt(k))))
for (i in 1:k) {
  labels = clusters(res2, res2$num_clusters[i])
  c = length(unique(labels))
  plot(X, col = labels, main = paste(c, "clusters"), pch = 19)
}
par(oldpar)
Minimizes the convex clustering loss function for a given set of values for lambda.
convex_clusterpath(
  X,
  W,
  lambdas,
  tau = 0.001,
  center = TRUE,
  scale = TRUE,
  eps_conv = 1e-06,
  burnin_iter = 25,
  max_iter_conv = 5000,
  save_clusterpath = TRUE,
  target_losses = NULL,
  save_losses = FALSE,
  save_convergence_norms = FALSE
)
X: An n x p numeric matrix containing the data to be clustered.
W: A sparseweights object containing the weights, as obtained from sparse_weights.
lambdas: A vector containing the values for the penalty parameter.
tau: Parameter to compute the threshold to fuse clusters. Default is 0.001.
center: If TRUE, center X so that each column has mean zero. Default is TRUE.
scale: If TRUE, scale the loss function so that the cluster solution is invariant to the scale of X. Default is TRUE.
eps_conv: Parameter for determining convergence of the minimization. Default is 1e-6.
burnin_iter: Number of updates of the loss function that are done without step doubling. Default is 25.
max_iter_conv: Maximum number of iterations for minimizing the loss function. Default is 5000.
save_clusterpath: If TRUE, the clusterpath coordinates are stored and returned (required for plotting the clusterpath). Default is TRUE.
target_losses: The values of the loss function that are used to determine convergence of the algorithm (tested as: loss - target <= eps_conv). Default is NULL.
save_losses: If TRUE, the values of the loss function computed during each minimization are stored and returned. Default is FALSE.
save_convergence_norms: If TRUE, the convergence norms computed during each minimization are stored and returned. Default is FALSE.
A cvxclust object containing the following:
info: A dataframe containing for each value for lambda: the number of different clusters, and the value of the loss function at the minimum.
merge: The merge table containing the order in which the observations in X are clustered.
height: The value for lambda at which each reduction in the number of clusters occurs.
order: The order of the observations in X for plotting the dendrogram.
elapsed_time: The number of seconds that elapsed while running the code. Note that this does not include the time required for input checking and possibly scaling and centering X.
coordinates: The clusterpath coordinates. Only part of the output if save_clusterpath = TRUE.
lambdas: The values for lambda for which a clustering was found.
eps_fusions: The threshold for cluster fusions that was used by the algorithm.
num_clusters: The different numbers of clusters that have been found.
n: The number of observations in X.
losses: Optional: if save_losses = TRUE, the values of the loss function computed during each minimization.
convergence_norms: Optional: if save_convergence_norms = TRUE, the convergence norms computed during each minimization.
See also: convex_clustering, sparse_weights.
# Load data
data(two_half_moons)
data = as.matrix(two_half_moons)
X = data[, -3]
y = data[, 3]

# Get sparse weights in dictionary of keys format with k = 5 and phi = 8
W = sparse_weights(X, 5, 8.0)

# Set a sequence for lambda
lambdas = seq(0, 2400, 1)

# Compute clusterpath
res = convex_clusterpath(X, W, lambdas)

# Get cluster labels for two clusters
labels = clusters(res, 2)

# Plot the clusterpath with colors based on the cluster labels
plot(res, col = labels)
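As a small follow-up, the returned num_clusters attribute and clusters() can be combined to summarize the solutions along the path. A minimal sketch, assuming res is the cvxclust object from the example above:

# Summarize cluster sizes for a few of the cluster counts found on the path;
# intersect() guarantees that clusters() is only queried for attainable counts.
for (k in intersect(2:4, res$num_clusters)) {
  labels = clusters(res, k)
  cat(k, "clusters, sizes:", paste(table(labels), collapse = ", "), "\n")
}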
Plot a clusterpath for two-dimensional data.
## S3 method for class 'cvxclust'
plot(x, col = NULL, labels = NULL, ...)
x: A cvxclust object.
col: A vector containing cluster membership information. Default is NULL.
labels: A vector containing labels for each object. Default is NULL.
...: Further graphical parameters.
A plot in the console.
# Load data
data(two_half_moons)
data = as.matrix(two_half_moons)
X = data[, -3]
y = data[, 3]

# Get sparse distances in dictionary of keys format with k = 5 and phi = 8
W = sparse_weights(X, 5, 8.0)

# Set a sequence for lambda
lambdas = seq(0, 2400, 1)

# Compute results CCMM
res = convex_clusterpath(X, W, lambdas)
plot(res, y + 1)
Construct a sparse weight matrix in a dictionary-of-keys format. Each nonzero weight is computed as exp(-phi * ||x_i - x_j||^2), where the squared Euclidean distance may be scaled by the average squared Euclidean distance, depending on the argument scale. Sparsity is achieved by setting only those weights to nonzero values that correspond to pairs of objects that are among each other's nearest neighbors.
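For intuition, the weight formula can be evaluated densely in a few lines of R. This is only an illustrative sketch, in which "scaled by the average squared Euclidean distance" is interpreted as dividing by the mean squared distance over all pairs; it is not the package's implementation, which additionally restricts weights to nearest-neighbor pairs and returns a sparseweights object.

# Illustrative dense computation of Gaussian weights exp(-phi * d_ij^2).
# Assumption: the scaling divides each d_ij^2 by the mean over all pairs i < j.
dense_gaussian_weights = function(X, phi, scale = TRUE) {
  D2 = as.matrix(dist(X))^2                # squared Euclidean distances
  if (scale) {
    D2 = D2 / mean(D2[upper.tri(D2)])      # scale by the average squared distance
  }
  W = exp(-phi * D2)
  diag(W) = 0                              # no self-weights
  W
}

W_dense = dense_gaussian_weights(matrix(rnorm(20), ncol = 2), phi = 8)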
sparse_weights(X, k, phi, connected = TRUE, scale = TRUE, connection_type = "SC")
X: An n x p numeric matrix containing the data for which the weights are computed.
k: The number of nearest neighbors to be used for non-zero weights.
phi: Tuning parameter of the Gaussian weights. Input should be a nonnegative value.
connected: If TRUE, ensure that the weight matrix corresponds to a connected graph. Default is TRUE.
scale: If TRUE, scale each squared Euclidean distance by the average squared Euclidean distance. Default is TRUE.
connection_type: Determines the method to ensure a connected weight matrix if connected = TRUE. Default is "SC".
A sparseweights object containing the nonzero weights in dictionary-of-keys format.
# Load data
data(two_half_moons)
data = as.matrix(two_half_moons)
X = data[, -3]
y = data[, 3]

# Get sparse distances in dictionary of keys format with k = 5 and phi = 8
W = sparse_weights(X, 5, 8.0)
A dataset containing 150 observations generated according to the two interlocking half moons data generating process. The first two columns contain the x- and y-coordinates and the third column contains the cluster ID. Each moon contains 75 observations.
data(two_half_moons)
An object of class data.frame with 150 rows and 3 columns.
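A quick look at the data, using only the column layout described above (coordinates in the first two columns, cluster ID in the third):

# Plot the two half moons, colored by the cluster ID in the third column
data(two_half_moons)
plot(two_half_moons[, 1:2], col = two_half_moons[, 3] + 1, pch = 19,
     asp = 1, xlab = "x", ylab = "y")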