Title: | Unsupervised Gaussian Mixture and Minkowski and Spherical K-Means with Constraints |
---|---|
Description: | High performance trainers for parameterizing and clustering weighted data. The Gaussian mixture (GM) module includes the conventional EM (expectation-maximization) trainer, the component-wise EM trainer, and the minimum-message-length EM trainer by Figueiredo and Jain (2002) <doi:10.1109/34.990138>. These trainers accept additional constraints on mixture weights, on covariance eigen ratios, and on which mixture components are subject to update. The K-means (KM) module offers clustering with the options of (i) deterministic and stochastic K-means++ initializations, (ii) upper bounds on cluster weights (sizes), (iii) Minkowski distances, (iv) cosine dissimilarity, and (v) dense and sparse representations of data input. The package improves on typical implementations of GM and KM algorithms in various aspects. It is carefully crafted in multithreaded C++ for modeling large data in industrial applications. |
Authors: | Charlie Wusuo Liu |
Maintainer: | Charlie Wusuo Liu <[email protected]> |
License: | GPL-3 |
Version: | 1.1.5 |
Built: | 2024-11-20 06:29:05 UTC |
Source: | CRAN |
Convert data from dense representation (matrix) to sparse representation (list of data frames).
d2s(X, zero = 0, threshold = 1e-16, verbose = TRUE)
X |
A |
zero |
A numeric value. Elements in |
threshold |
A numeric value, explained above. |
verbose |
A boolean value. |
A list of size N. Value[[i]] is a 2-column data frame. The 1st column is a sorted integer vector of the indexes of nonzero dimensions. Values in these dimensions are stored in the 2nd column as a numeric vector.
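Conceptually, one element of the returned list could be built from a single dense column as sketched below (a minimal illustration; the data frame column names are assumptions, and the actual d2s is implemented in C++ with threshold acting as an absolute tolerance around zero):

# Hypothetical sketch of how one sparse element corresponds to one dense column.
denseColumnToSparse = function(x, zero = 0, threshold = 1e-16)
{
  # Keep dimensions whose values differ from `zero` by more than `threshold`.
  nz = which(abs(x - zero) > threshold)
  data.frame(nonzeroDim = nz, value = x[nz])
}
denseColumnToSparse(c(0, 0, 3.2, 0, -1.5))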
N = 2000L
d = 3000L
X = matrix(rnorm(N * d) + 2, nrow = d)

# Fill many zeros in X:
X = apply(X, 2, function(x) {
  x[sort(sample(d, d * runif(1, 0.95, 0.99)))] = 0; x})

# Get the sparse version of X.
sparseX = GMKMcharlie::d2s(X)
str(sparseX[1:5])
The traditional training algorithm via maximum likelihood for parameterizing weighted data with a mixture of Gaussian PDFs. Bounds on covariance matrix eigen ratios and mixture weights are optional.
GM(
  X, Xw = rep(1, ncol(X)), alpha = numeric(0),
  mu = matrix(ncol = 0, nrow = 0), sigma = matrix(ncol = 0, nrow = 0),
  G = 5L, convergenceEPS = 1e-05, convergenceTail = 10, alphaEPS = 0,
  eigenRatioLim = Inf, embedNoise = 1e-6, maxIter = 1000L, maxCore = 7L,
  tlimit = 3600, verbose = TRUE, updateAlpha = TRUE, updateMean = TRUE,
  updateSigma = TRUE, checkInitialization = FALSE, KmeansFirst = TRUE,
  KmeansPPfirst = FALSE, KmeansRandomSeed = NULL, friendlyOutput = TRUE
)
X |
A |
Xw |
A numeric vector of size |
alpha |
A numeric vector of size |
mu |
A |
sigma |
Either a list of |
G |
An integer. If at least one of the parameters |
convergenceEPS |
A numeric value. If the average change of all parameters in the mixture model is below |
convergenceTail |
If every one of the last |
alphaEPS |
A numeric value. During training, if any Gaussian kernel's weight is no greater than |
eigenRatioLim |
A numeric value. During training, if any Gaussian kernel's max:min eigen value ratio exceeds |
embedNoise |
A small constant added to the diagonal entries of all covariance matrices. This may prevent covariance matrices from collapsing prematurely. A suggested value is 1e-6. Covariance degeneration is detected during Cholesky decomposition and leads the trainer to remove the corresponding mixture component. For high-dimensional problems, setting |
maxIter |
An integer, the maximal number of iterations. |
maxCore |
An integer. The maximal number of threads to invoke. Should be no more than the total number of logical processors on the machine. Default 7. |
tlimit |
A numeric value. The program exits with the current model in |
verbose |
A boolean value. |
updateAlpha |
A boolean value or boolean vector. If a boolean value, |
updateMean |
A boolean value or a boolean vector. If a boolean value, |
updateSigma |
A boolean value or a boolean vector. If a boolean value, |
checkInitialization |
Check if any of the input covariance matrices are singular. |
KmeansFirst |
A boolean value. Run K-means clustering for finding means. |
KmeansPPfirst |
A boolean value. Run stochastic K-means++ for K-means initialization. |
KmeansRandomSeed |
An integer or |
friendlyOutput |
|
An in-place Cholesky decomposition of the covariance matrix is implemented for space and speed efficiency. Only the upper triangle of the Cholesky factor is stored for each Gaussian kernel, and it feeds a forward-substitution routine for fast computation of Mahalanobis distances. Each of the three main steps in an iteration (Gaussian density computation, parameter inference, parameter update) is multithreaded and hand-scheduled using Intel TBB. Extensive effort has gone into cache-friendliness and into avoiding extra multithreading overheads such as memory allocation.
If eigenRatioLim is finite, eigen decomposition employs routines in RcppArmadillo.
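To illustrate the idea (a minimal R sketch, not the package's C++ code), the squared Mahalanobis distance can be obtained from the upper-triangular Cholesky factor with one forward substitution:

# Sketch: squared Mahalanobis distance of observation x from mean mu with covariance S.
# chol(S) returns upper-triangular U with t(U) %*% U == S; solving t(U) %*% y = (x - mu)
# by forward substitution gives sum(y ^ 2) == t(x - mu) %*% solve(S) %*% (x - mu).
mahalanobisSq = function(x, mu, S)
{
  U = chol(S)
  y = forwardsolve(t(U), x - mu)
  sum(y ^ 2)
}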
A list of size 5:
alpha |
a numeric vector of size |
mu |
a |
sigma |
a |
fitted |
a numeric vector of size |
clusterMember |
a list of |
For one-dimensional data, X should still follow the data structure requirements: a matrix where each column is an observation.
# =============================================================================
# Examples below use 1 thread to pass CRAN check. Speed advantage of multiple
# threads will be more pronounced for larger data.
# =============================================================================


# =============================================================================
# Parameterize the iris data. Let the function initialize Gaussian kernels.
# =============================================================================
X = t(iris[1:4])
# CRAN check only allows 2 threads at most. Increase `maxCore` for
# acceleration.
gmmRst = GMKMcharlie::GM(X, G = 4L, maxCore = 1L, friendlyOutput = FALSE)
str(gmmRst)


# =============================================================================
# Parameterize the iris data given Gaussian kernels.
# =============================================================================
G = 3L
d = nrow(X) # Dimensionality.
alpha = rep(1, G) / G
mu = X[, sample(ncol(X), G)] # Sample observations as initial means.
# Take the average variance and create initial covariance matrices.
meanVarOfEachDim = sum(diag(var(t(X)))) / d
covar = diag(meanVarOfEachDim / G, d)
covars = matrix(rep(as.numeric(covar), G), nrow = d * d)
# Models are sensitive to initialization.
gmmRst2 = GMKMcharlie::GM(
  X, alpha = alpha, mu = mu, sigma = covars, maxCore = 1L,
  friendlyOutput = FALSE)
str(gmmRst2)


# =============================================================================
# For fun, fit Rosenbrock function with a Gaussian mixture.
# =============================================================================
set.seed(123)
rosenbrock <- function(x, y) {(1 - x) ^ 2 + 100 * (y - x ^ 2) ^ 2}
N = 2000L
x = runif(N, -2, 2)
y = runif(N, -1, 3)
z = rosenbrock(x, y)
X = rbind(x, y)
Xw = z * (N / sum(z)) # Weights on observations should sum up to N.
gmmFit = GMKMcharlie::GM(X, Xw = Xw, G = 5L, maxCore = 1L, verbose = FALSE,
                         friendlyOutput = FALSE)
oldpar = par()$mfrow
par(mfrow = c(1, 2))
plot3D::points3D(x, y, z, pch = 20)
plot3D::points3D(x, y, gmmFit$fitted, pch = 20)
par(mfrow = oldpar)


# =============================================================================
# For fun, fit a 3D spiral distribution.
# =============================================================================
N = 2000
t = runif(N) ^ 2 * 15
x = cos(t) + rnorm(N) * 0.1
y = sin(t) + rnorm(N) * 0.1
z = t + rnorm(N) * 0.1
X = rbind(x, y, z)
d = 3L
G = 10L
gmmFit = GMKMcharlie::GM(X, G = G, maxCore = 1L, verbose = FALSE,
                         KmeansFirst = TRUE, KmeansPPfirst = TRUE,
                         KmeansRandomSeed = 42, friendlyOutput = TRUE)

# Sample N points from the Gaussian mixture.
ns = as.integer(round(N * gmmFit$alpha))
sampledPoints = list()
for(i in 1:G)
{
  sampledPoints[[i]] = MASS::mvrnorm(
    ns[i], mu = gmmFit$mu[, i], Sigma = matrix(gmmFit$sigma[[i]], nrow = d))
}
sampledPoints = matrix(unlist(lapply(sampledPoints, function(x) t(x))),
                       nrow = d)

# Plot the original data and the samples from the mixture model.
oldpar = par()$mfrow
par(mfrow = c(1, 2))
plot3D::points3D(x, y, z, pch = 20)
plot3D::points3D(x = sampledPoints[1, ], y = sampledPoints[2, ],
                 z = sampledPoints[3, ], pch = 20)
par(mfrow = oldpar)


# =============================================================================
# For fun, fit a 3D spiral distribution. Fix parameters at random.
# =============================================================================
N = 2000
t = runif(N) ^ 2 * 15
x = cos(t) + rnorm(N) * 0.1
y = sin(t) + rnorm(N) * 0.1
z = t + rnorm(N) * 0.1
X = rbind(x, y, z); dimnames(X) = NULL
d = 3L
G = 10L
mu = X[, sample(ncol(X), G)]
s = matrix(rep(as.numeric(cov(t(X))), G), ncol = G)
alpha = rep(1 / G, G)
updateAlpha = sample(c(TRUE, FALSE), G, replace = TRUE)
updateMean = sample(c(TRUE, FALSE), G, replace = TRUE)
updateSigma = sample(c(TRUE, FALSE), G, replace = TRUE)
gmmFit = GMKMcharlie::GM(X, alpha = alpha, mu = mu, sigma = s, G = G,
                         maxCore = 2, verbose = FALSE,
                         updateAlpha = updateAlpha, updateMean = updateMean,
                         updateSigma = updateSigma, convergenceEPS = 1e-5,
                         alphaEPS = 1e-8, friendlyOutput = TRUE)
The component-wise variant of GM().
GMcw(
  X, Xw = rep(1, ncol(X)), alpha = numeric(0),
  mu = matrix(ncol = 0, nrow = 0), sigma = matrix(ncol = 0, nrow = 0),
  G = 5L, convergenceEPS = 1e-05, alphaEPS = 0, eigenRatioLim = Inf,
  maxIter = 1000L, maxCore = 7L, tlimit = 3600, verbose = TRUE
)
X |
A |
Xw |
A numeric vector of size |
alpha |
A numeric vector of size |
mu |
A |
sigma |
A |
G |
An integer. If at least one of the parameters |
convergenceEPS |
A numeric value. If the average change of all parameters in the mixture model is below |
alphaEPS |
A numeric value. During training, if any Gaussian kernel's weight is no greater than |
eigenRatioLim |
A numeric value. During training, if any Gaussian kernel's max:min eigen value ratio exceeds |
maxIter |
An integer, the maximal number of iterations. |
maxCore |
An integer. The maximal number of threads to invoke. Should be no more than the total number of logical processors on the machine. Default 7. |
tlimit |
A numeric value. The program exits with the current model in |
verbose |
A boolean value. |
Relevant details can be found in GM(). In GMcw(), an update of any Gaussian kernel triggers an update of the underlying weighing matrix that directs the update of all Gaussian kernels. Only after that is the next Gaussian kernel updated. See references.

In the actual implementation, the N x K weighing matrix WEI does not exist in memory. An N x K density matrix DEN instead stores each Gaussian kernel's probability density at every observation in X. Mathematically, the i-th column of WEI equals DEN's i-th column divided by the row-sum vector RS. RS has size N and is memorized and updated in response to the update of each Gaussian kernel: before updating the i-th kernel, the algorithm subtracts the i-th column of DEN from RS; after the kernel is updated and its probability densities are recomputed, the algorithm adds the i-th column of DEN back to RS. To update the (i+1)-th Gaussian kernel, the (i+1)-th column of DEN is then divided by RS to obtain the weighing coefficients.
The above implementation makes the component-wise trainer comparable to the classic one in terms of speed. The component-wise trainer is a key component in Figueiredo and Jain's MML (minimum message length) mixture model trainer, where it helps avoid premature death of Gaussian kernels.
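A minimal R sketch of the row-sum bookkeeping described above (illustrative only; mStepAndDensity is a hypothetical stand-in for re-estimating kernel i from its weighing coefficients and recomputing its density column):

# DEN: N x K matrix of weighted kernel densities at each observation.
# RS:  row sums of DEN, kept current so they never need recomputing from scratch.
updateKernelSketch = function(DEN, RS, i, mStepAndDensity)
{
  wei = DEN[, i] / RS                 # weighing coefficients of kernel i
  RS = RS - DEN[, i]                  # remove kernel i's old contribution
  DEN[, i] = mStepAndDensity(wei)     # re-estimate kernel i, recompute its densities
  RS = RS + DEN[, i]                  # add the new contribution back
  list(DEN = DEN, RS = RS)            # RS again equals the full row sums for kernel i + 1
}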
A list of size 5:
alpha |
a numeric vector of size |
mu |
a |
sigma |
a |
fitted |
a numeric vector of size |
clusterMember |
a list of |
For one-dimensional data, X should still follow the data structure requirements: a matrix where each column is an observation.
Celeux, Gilles, et al. "A Component-Wise EM Algorithm for Mixtures." Journal of Computational and Graphical Statistics, vol. 10, no. 4, 2001, pp. 697-712. JSTOR, www.jstor.org/stable/1390967.
# =============================================================================
# Examples below use 1 thread to pass CRAN check. Speed advantage of multiple
# threads will be more pronounced for larger data.
# =============================================================================


# =============================================================================
# Parameterize the iris data. Let the function initialize Gaussian kernels.
# =============================================================================
X = t(iris[1:4])
# CRAN check only allows 2 threads at most. Increase `maxCore` for
# acceleration.
gmmRst = GMKMcharlie::GMcw(X, G = 3L, maxCore = 1L)
str(gmmRst)


# =============================================================================
# Parameterize the iris data given Gaussian kernels.
# =============================================================================
G = 3L
d = nrow(X) # Dimensionality.
alpha = rep(1, G) / G
mu = X[, sample(ncol(X), G)] # Sample observations as initial means.
# Take the average variance and create initial covariance matrices.
meanVarOfEachDim = sum(diag(var(t(X)))) / d
covar = diag(meanVarOfEachDim / G, d)
covars = matrix(rep(as.numeric(covar), G), nrow = d * d)
# Models could be different given a different initialization.
gmmRst2 = GMKMcharlie::GMcw(
  X, alpha = alpha, mu = mu, sigma = covars, maxCore = 1L)
str(gmmRst2)


# =============================================================================
# For fun, fit Rosenbrock function with a Gaussian mixture.
# =============================================================================
set.seed(123)
rosenbrock <- function(x, y) {(1 - x) ^ 2 + 100 * (y - x ^ 2) ^ 2}
N = 2000L
x = runif(N, -2, 2)
y = runif(N, -1, 3)
z = rosenbrock(x, y)
X = rbind(x, y)
Xw = z * (N / sum(z)) # Weights on observations should sum up to N.
gmmFit = GMKMcharlie::GMcw(X, Xw = Xw, G = 5L, maxCore = 1L, verbose = FALSE)
oldpar = par()$mfrow
par(mfrow = c(1, 2))
plot3D::points3D(x, y, z, pch = 20)
plot3D::points3D(x, y, gmmFit$fitted, pch = 20)
par(mfrow = oldpar)


# =============================================================================
# For fun, fit a 3D spiral distribution.
# =============================================================================
N = 2000
t = runif(N) ^ 2 * 15
x = cos(t) + rnorm(N) * 0.1
y = sin(t) + rnorm(N) * 0.1
z = t + rnorm(N) * 0.1
X = rbind(x, y, z)
d = 3L
G = 10L
gmmFit = GMKMcharlie::GMcw(X, G = G, maxCore = 1L, verbose = FALSE)

# Sample N points from the Gaussian mixture.
ns = as.integer(round(N * gmmFit$alpha))
sampledPoints = list()
for(i in 1L : G)
{
  sampledPoints[[i]] = MASS::mvrnorm(
    ns[i], mu = gmmFit$mu[, i], Sigma = matrix(gmmFit$sigma[, i], nrow = d))
}
sampledPoints = matrix(unlist(lapply(sampledPoints, function(x) t(x))),
                       nrow = d)

# Plot the original data and the samples from the mixture model.
oldpar = par()$mfrow
par(mfrow = c(1, 2))
plot3D::points3D(x, y, z, pch = 20)
plot3D::points3D(x = sampledPoints[1, ], y = sampledPoints[2, ],
                 z = sampledPoints[3, ], pch = 20)
par(mfrow = oldpar)
Figueiredo and Jain's Gaussian mixture trainer with all options in GM().
GMfj(
  X, Xw = rep(1, ncol(X)), alpha = numeric(0),
  mu = matrix(ncol = 0, nrow = 0), sigma = matrix(ncol = 0, nrow = 0),
  G = 5L, Gmin = 2L, convergenceEPS = 1e-05, alphaEPS = 0,
  eigenRatioLim = Inf, maxIter = 1000L, maxCore = 7L, tlimit = 3600,
  verbose = TRUE
)
X |
A |
Xw |
A numeric vector of size |
alpha |
A numeric vector of size |
mu |
A |
sigma |
A |
G |
An integer. If at least one of the parameters |
Gmin |
An integer. The final model should have at least |
convergenceEPS |
A numeric value. If the average change of all parameters in the mixture model is below |
alphaEPS |
A numeric value. During training, if any Gaussian kernel's weight is no greater than |
eigenRatioLim |
A numeric value. During training, if any Gaussian kernel's max:min eigen value ratio exceeds |
maxIter |
An integer, the maximal number of iterations. |
maxCore |
An integer. The maximal number of threads to invoke. Should be no more than the total number of logical processors on the machine. Default 7. |
tlimit |
A numeric value. The program exits with the current model in |
verbose |
A boolean value. |
Although heavily cited, the paper has some misleading information and the algorithm's performance does not live up to its reputation. See <https://stats.stackexchange.com/questions/423935/figueiredo-and-jains-gaussian-mixture-em-convergence-criterion>. Nevertheless, it is a worthwhile algorithm to try in practice.
A list of size 5:
alpha |
a numeric vector of size |
mu |
a |
sigma |
a |
fitted |
a numeric vector of size |
clusterMember |
a list of |
For one-dimensional data, X should still follow the data structure requirements: a matrix where each column is an observation.
Mario A.T. Figueiredo & Anil K. Jain (2002): "Unsupervised learning of finite mixture models." IEEE Transactions on Pattern Analysis and Machine Intelligence 24(3): 381-396.
# =============================================================================
# Examples below use 1 thread to pass CRAN check. Speed advantage of multiple
# threads will be more pronounced for larger data.
# =============================================================================


# =============================================================================
# Parameterize the iris data. Let the function initialize Gaussian kernels.
# =============================================================================
X = t(iris[1:4])
# CRAN check only allows 2 threads at most. Increase `maxCore` for
# acceleration.
system.time({gmmRst = GMKMcharlie::GMfj(
  X, G = 25L, Gmin = 2L, maxCore = 1L, verbose = FALSE)})
str(gmmRst)


# =============================================================================
# Parameterize the iris data given Gaussian kernels.
# =============================================================================
G = 25L
d = nrow(X) # Dimensionality.
alpha = rep(1, G) / G
mu = X[, sample(ncol(X), G)] # Sample observations as initial means.
# Take the average variance and create initial covariance matrices.
meanVarOfEachDim = sum(diag(var(t(X)))) / d
covar = diag(meanVarOfEachDim / G, d)
covars = matrix(rep(as.numeric(covar), G), nrow = d * d)
# Models are sensitive to initialization.
system.time({gmmRst2 = GMKMcharlie::GMfj(
  X, alpha = alpha, mu = mu, sigma = covars, maxCore = 1L, verbose = FALSE)})
str(gmmRst2)


# =============================================================================
# For fun, fit Rosenbrock function with a Gaussian mixture.
# =============================================================================
set.seed(123)
rosenbrock <- function(x, y) {(1 - x) ^ 2 + 100 * (y - x ^ 2) ^ 2}
N = 2000L
x = runif(N, -2, 2)
y = runif(N, -1, 3)
z = rosenbrock(x, y)
X = rbind(x, y)
Xw = z * (N / sum(z)) # Weights on observations should sum up to N.
system.time({gmmFit = GMKMcharlie::GMfj(
  X, Xw = Xw, G = 5L, maxCore = 1L, verbose = FALSE)})
oldpar = par()$mfrow
par(mfrow = c(1, 2))
plot3D::points3D(x, y, z, pch = 20)
plot3D::points3D(x, y, gmmFit$fitted, pch = 20)
par(mfrow = oldpar)


# =============================================================================
# For fun, fit a 3D spiral distribution.
# =============================================================================
N = 2000
t = runif(N) ^ 2 * 15
x = cos(t) + rnorm(N) * 0.1
y = sin(t) + rnorm(N) * 0.1
z = t + rnorm(N) * 0.1
X = rbind(x, y, z)
d = 3L
G = 10L
system.time({gmmFit = GMKMcharlie::GMfj(
  X, G = G, maxCore = 1L, verbose = FALSE)})

# Sample N points from the Gaussian mixture.
ns = as.integer(round(N * gmmFit$alpha))
sampledPoints = list()
for(i in 1L : G)
{
  sampledPoints[[i]] = MASS::mvrnorm(
    ns[i], mu = gmmFit$mu[, i], Sigma = matrix(gmmFit$sigma[, i], nrow = d))
}
sampledPoints = matrix(unlist(lapply(sampledPoints, function(x) t(x))),
                       nrow = d)

# Plot the original data and the samples from the mixture model.
oldpar = par()$mfrow
par(mfrow = c(1, 2))
plot3D::points3D(x, y, z, pch = 20)
plot3D::points3D(x = sampledPoints[1, ], y = sampledPoints[2, ],
                 z = sampledPoints[3, ], pch = 20)
par(mfrow = oldpar)
Multithreaded weighted Minkowski and spherical K-means via Lloyd's algorithm over dense representation of data.
KM( X, centroid, Xw = rep(1, ncol(X)), minkP = 2, maxIter = 100L, maxCore = 7L, verbose = TRUE )
X |
A |
centroid |
A |
Xw |
A numeric vector of size |
minkP |
A numeric value or a character string. If numeric, |
maxIter |
An integer. The maximal number of iterations. Default 100. |
maxCore |
An integer. The maximal number of threads to invoke. No more than the total number of logical processors on the machine. Default 7. |
verbose |
A boolean value. |
Implementation highlights include:

(i) In Minkowski distance calculation, an integer power no greater than 30 is computed via repeated multiplication; fractional powers or powers above 30 call std::pow().

(ii) Multithreaded observation-centroid distance calculation. Distances are memorized to avoid unnecessary recomputation if centroids did not change in the last iteration.

(iii) A lookup table stores observation-centroid ID pairs during the assignment step. Observation IDs are then grouped by centroid ID, which allows cluster means to be computed in parallel.

(iv) The function allows non-uniform weights on observations (a minimal weighted Lloyd sketch follows this list).

(v) Template metaprogramming trims branches over different distance functions and other computing methods at compile time.
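The sketch below shows one weighted Lloyd iteration with Euclidean distance in plain R; it is illustrative only and omits the memorization, lookup-table, and multithreading machinery listed above:

# One weighted Lloyd iteration. X: d x N data, centroid: d x K, Xw: weights of size N.
lloydStep = function(X, centroid, Xw)
{
  # Assignment step: squared Euclidean distance from every observation to every centroid.
  dists = sapply(1:ncol(centroid), function(k) colSums((X - centroid[, k]) ^ 2))
  assignment = max.col(-dists)          # index of the nearest centroid per observation
  # Update step: weighted mean of the members of each nonempty cluster.
  for (k in unique(assignment))
  {
    members = which(assignment == k)
    w = Xw[members]
    centroid[, k] = as.numeric(X[, members, drop = FALSE] %*% w) / sum(w)
  }
  list(centroid = centroid, assignment = assignment)
}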
A list of size K, the number of clusters. Each element is a list of 3 vectors:
centroid |
a numeric vector of size |
clusterMember |
an integer vector of indexes of elements grouped to |
member2centroidDistance |
a numeric vector of the same size of |
An empty clusterMember implies an empty cluster.
Although it rarely happens, divergence of K-means with a non-Euclidean distance measure (minkP != 2) remains a theoretical possibility.
# ===========================================================================
# Play random numbers. See speed.
# ===========================================================================
N = 5000L # Number of points.
d = 500L # Dimensionality.
K = 50L # Number of clusters.
dat = matrix(rnorm(N * d) + runif(N * d), nrow = d)

# Use kmeans++ initialization.
centroidInd = GMKMcharlie::KMppIni(
  X = dat, K, firstSelection = 1L, minkP = 2, stochastic = FALSE,
  seed = sample(1e9L, 1), maxCore = 2L, verbose = TRUE)
centroid = dat[, centroidInd]

# Euclidean.
system.time({rst = GMKMcharlie::KM(
  X = dat, centroid = centroid, maxIter = 100, minkP = 2, maxCore = 2,
  verbose = TRUE)})

# Cosine dissimilarity.
dat = apply(dat, 2, function(x) x / sum(x ^ 2) ^ 0.5)
centroid = dat[, centroidInd]
system.time({rst2 = GMKMcharlie::KM(
  X = dat, centroid = centroid, maxIter = 100, minkP = "cosine",
  maxCore = 2, verbose = TRUE)})


# ===========================================================================
# Test against R's built-in kmeans().
# ===========================================================================
dat = t(iris[1:4])
dimnames(dat) = NULL

# Use kmeans++ initialization.
centroidInd = GMKMcharlie::KMppIni(
  X = dat, K = 3, firstSelection = 1L, minkP = 2, stochastic = FALSE,
  seed = sample(1e9L, 1), maxCore = 2L, verbose = TRUE)
centroid = dat[, centroidInd]

rst = GMKMcharlie::KM(X = dat, centroid = centroid, maxIter = 100,
                      minkP = 2, maxCore = 2, verbose = TRUE)
rst = lapply(rst, function(x) sort(x$clusterMember))

rst2 = kmeans(x = t(dat), centers = t(centroid), algorithm = "Lloyd")
rst2 = aggregate(list(1L : length(rst2$cluster)), list(rst2$cluster),
                 function(x) sort(x))[[2]]

setdiff(rst, rst2)
Multithreaded weighted Minkowski and spherical K-means via Lloyd's algorithm over dense representation of data given cluster size (weight) constraints.
KMconstrained(
  X, centroid, Xw = rep(1, ncol(X)),
  clusterWeightUB = rep(ncol(X) + 1, ncol(centroid)),
  minkP = 2, convergenceTail = 5L, tailConvergedRelaErr = 1e-04,
  maxIter = 100L, maxCore = 7L, paraSortInplaceMerge = FALSE, verbose = TRUE
)
X |
A |
centroid |
A |
Xw |
A numeric vector of size |
clusterWeightUB |
An integer vector of size |
minkP |
A numeric value or a character string. If numeric, |
convergenceTail |
An integer. The algorithm may end up with "cyclical convergence" due to the size / weight constraints, that is, every few iterations produce the same clustering. If the cost (total in-cluster distance) of each of the last |
tailConvergedRelaErr |
A numeric value, explained in |
maxIter |
An integer. The maximal number of iterations. Default 100. |
maxCore |
An integer. The maximal number of threads to invoke. No more than the total number of logical processors on the machine. Default 7. |
paraSortInplaceMerge |
A boolean value. |
verbose |
A boolean value. |
See details in KM() for common implementation highlights. Weight upper bounds are implemented as follows:

In each iteration, all the (observation ID, centroid ID, distance) tuples are sorted by distance. From the first to the last tuple, the algorithm puts the observation in the cluster labeled by the centroid ID if (i) the observation has not already been assigned and (ii) the cluster size has not exceeded its upper bound. The actual implementation differs slightly; a parallel merge sort is crafted for speed.
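A plain R sketch of the capped assignment idea (unit-weight observations assumed, so the bounds act as cluster sizes; the package's parallel merge sort and tie handling differ in detail):

# dist: N x K matrix of observation-to-centroid distances.
# capacity: integer vector of size K, the cluster size upper bounds (sum should cover N).
cappedAssignment = function(dist, capacity)
{
  N = nrow(dist); K = ncol(dist)
  tuples = data.frame(obs = rep(1:N, K), ctr = rep(1:K, each = N),
                      d = as.numeric(dist))
  tuples = tuples[order(tuples$d), ]       # sort all (obs, centroid, distance) tuples
  assignment = integer(N)                  # 0 means not yet assigned
  remaining = capacity
  for (t in seq_len(nrow(tuples)))
  {
    i = tuples$obs[t]; k = tuples$ctr[t]
    if (assignment[i] == 0L && remaining[k] > 0L)
    {
      assignment[i] = k
      remaining[k] = remaining[k] - 1L
    }
  }
  assignment
}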
A list of size K, the number of clusters. Each element is a list of 3 vectors:
centroid |
a numeric vector of size |
clusterMember |
an integer vector of indexes of elements grouped to |
member2centroidDistance |
a numeric vector of the same size of |
An empty clusterMember implies an empty cluster.
Although it rarely happens, divergence of K-means with a non-Euclidean distance measure (minkP != 2) remains a theoretical possibility. Bounding the cluster weights / sizes increases the chance of divergence.
N = 3000L # Number of points.
d = 500L # Dimensionality.
K = 50L # Number of clusters.
dat = matrix(rnorm(N * d) + runif(N * d), nrow = d)

# Use kmeans++ initialization.
centroidInd = GMKMcharlie::KMppIni(
  X = dat, K, firstSelection = 1L, minkP = 2, stochastic = FALSE,
  seed = sample(1e9L, 1), maxCore = 2L, verbose = TRUE)
centroid = dat[, centroidInd]

# Each cluster size should not be greater than N / K * 2.
sizeConstraints = as.integer(rep(N / K * 2, K))
system.time({rst = GMKMcharlie::KMconstrained(
  X = dat, centroid = centroid, clusterWeightUB = sizeConstraints,
  maxCore = 2L, tailConvergedRelaErr = 1e-6, verbose = TRUE)})

# Size upper bounds vary in [N / K * 1.5, N / K * 2].
sizeConstraints = as.integer(round(runif(K, N / K * 1.5, N / K * 2)))
system.time({rst = GMKMcharlie::KMconstrained(
  X = dat, centroid = centroid, clusterWeightUB = sizeConstraints,
  maxCore = 2L, tailConvergedRelaErr = 1e-6, verbose = TRUE)})
Multithreaded weighted Minkowski and spherical K-means via Lloyd's algorithm over sparse representation of data given cluster size (weight) constraints.
KMconstrainedSparse(
  X, d, centroid, Xw = rep(1, length(X)),
  clusterWeightUB = rep(length(X) + 1, length(centroid)),
  minkP = 2, convergenceTail = 5L, tailConvergedRelaErr = 1e-04,
  maxIter = 100L, maxCore = 7L, paraSortInplaceMerge = FALSE, verbose = TRUE
)
X |
A list of size |
d |
An integer. The dimensionality of |
centroid |
A list of size |
Xw |
A numeric vector of size |
clusterWeightUB |
An integer vector of size |
minkP |
A numeric value or a character string. If numeric, |
convergenceTail |
An integer. The algorithm may end up with "cyclical convergence" due to the size / weight constraints, that is, every few iterations produce the same clustering. If the cost (total in-cluster distance) of each of the last |
tailConvergedRelaErr |
A numeric value, explained in |
maxIter |
An integer. The maximal number of iterations. Default 100. |
maxCore |
An integer. The maximal number of threads to invoke. No more than the total number of logical processors on the machine. Default 7. |
paraSortInplaceMerge |
A boolean value. |
verbose |
A boolean value. |
See the details sections of KMconstrained() and KM().
A list of size K, the number of clusters. Each element is a list of 3 vectors:
centroid |
a numeric vector of size |
clusterMember |
an integer vector of indexes of elements grouped to |
member2centroidDistance |
a numeric vector of the same size of |
An empty clusterMember implies an empty cluster.
Although it rarely happens, divergence of K-means with a non-Euclidean distance measure (minkP != 2) remains a theoretical possibility. Bounding the cluster weights / sizes increases the chance of divergence.
N = 5000L # Number of points.
d = 500L # Dimensionality.
K = 50L # Number of clusters.

# Create a data matrix, about 95% of which are zeros.
dat = matrix(unlist(lapply(1L : N, function(x)
{
  tmp = numeric(d)
  # Nonzero entries.
  Nnz = as.integer(max(1, d * runif(1, 0, 0.05)))
  tmp[sample(d, Nnz)] = runif(Nnz) + rnorm(Nnz)
  tmp
})), nrow = d); gc()

# Convert to sparse representation.
# GMKMcharlie::d2s() is equivalent.
sparsedat = apply(dat, 2, function(x)
{
  nonz = which(x != 0)
  list(nonz, x[nonz])
}); gc()

centroidInd = sample(length(sparsedat), K)

# Test speed using sparse representation.
sparseCentroid = sparsedat[centroidInd]
# Size upper bounds vary in [N / K * 1.5, N / K * 2].
sizeConstraints = as.integer(round(runif(K, N / K * 1.5, N / K * 2)))
system.time({sparseRst = GMKMcharlie::KMconstrainedSparse(
  X = sparsedat, d = d, centroid = sparseCentroid,
  clusterWeightUB = sizeConstraints,
  tailConvergedRelaErr = 1e-6,
  maxIter = 100, minkP = 2, maxCore = 2, verbose = TRUE)})
Find suitable observations as initial centroids.
KMppIni( X, K, firstSelection = 1L, minkP = 2, stochastic = FALSE, seed = 123, maxCore = 7L, verbose = TRUE )
X |
A |
K |
An integer, the number of centroids. |
firstSelection |
An integer, index of the observation selected as the first initial centroid in |
minkP |
A numeric value or a character string. If numeric, |
stochastic |
A boolean value. |
seed |
Random seed if |
maxCore |
An integer. The maximal number of threads to invoke. No more than the total number of logical processors on the machine. Default 7. |
verbose |
A boolean value. |
In each iteration, the distances between the newly selected centroid and all other observations are computed with multiple threads. Scheduling is hand-rolled to minimize the overhead of thread communication.
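For reference, a compact R sketch of K-means++ selection in the Euclidean case (the package generalizes this to Minkowski and cosine dissimilarity); stochastic = FALSE picks the farthest observation, while the stochastic variant samples proportionally to squared distance:

# X: d x N data matrix; K: number of centroids; first: index of the first centroid.
kmppIniSketch = function(X, K, first = 1L, stochastic = FALSE)
{
  N = ncol(X)
  chosen = integer(K); chosen[1] = first
  # Squared distance from every observation to the nearest chosen centroid.
  minDist = colSums((X - X[, first]) ^ 2)
  for (k in seq_len(K)[-1])
  {
    if (stochastic) chosen[k] = sample(N, 1, prob = minDist)
    else            chosen[k] = which.max(minDist)
    minDist = pmin(minDist, colSums((X - X[, chosen[k]]) ^ 2))
  }
  chosen
}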
An integer vector of size K. The vector contains the indexes of observations selected as the initial centroids.
N = 30000L
d = 300L
K = 30L
X = matrix(rnorm(N * d) + 2, nrow = d)

# CRAN check allows examples invoking 2 threads at most. Change `maxCore`
# for acceleration.
kmppSt = KMppIni(X, K, firstSelection = 1L, minkP = 2,
                 stochastic = TRUE, seed = sample(1e9L, 1), maxCore = 2L)
kmppDt = KMppIni(X, K, firstSelection = 1L, minkP = 2,
                 stochastic = FALSE, maxCore = 2L)
str(kmppSt)
str(kmppDt)
Find suitable observations as initial centroids.
KMppIniSparse( X, d, K, firstSelection = 1L, minkP = 2, stochastic = FALSE, seed = 123, maxCore = 7L, verbose = TRUE )
X |
A list of size |
d |
An integer. The dimensionality of |
K |
An integer, the number of centroids. |
firstSelection |
An integer, index of the observation selected as the first initial centroid in |
minkP |
A numeric value or a character string. If numeric, |
stochastic |
A boolean value. |
seed |
Random seed if |
maxCore |
An integer. The maximal number of threads to invoke. No more than the total number of logical processors on the machine. Default 7. |
verbose |
A boolean value. |
In each iteration, the distances between the newly selected centroid and all other observations are computed with multiple threads. Scheduling is hand-rolled to minimize the overhead of thread communication.
An integer vector of size K. The vector contains the indexes of observations selected as the initial centroids.
N = 2000L
d = 3000L
X = matrix(rnorm(N * d) + 2, nrow = d)

# Fill many zeros in X:
X = apply(X, 2, function(x) {
  x[sort(sample(d, d * runif(1, 0.95, 0.99)))] = 0; x})

# Get the sparse version of X.
sparseX = GMKMcharlie::d2s(X)

K = 30L
seed = 123L

# Time cost of finding the centroids via dense representation.
# CRAN check allows only 2 threads. Increase `maxCore` for more speed.
system.time({kmppViaDense = GMKMcharlie::KMppIni(
  X, K, firstSelection = 1L, minkP = 2, stochastic = TRUE, seed = seed,
  maxCore = 2L)})

# Time cost of finding the initial centroids via sparse representation.
system.time({kmppViaSparse = GMKMcharlie::KMppIniSparse(
  sparseX, d, K, firstSelection = 1L, minkP = 2, stochastic = TRUE,
  seed = seed, maxCore = 2L)})

# Results should be identical.
sum(kmppViaSparse - kmppViaDense)
Multithreaded weighted Minkowski and spherical K-means via Lloyd's algorithm over sparse representation of data.
KMsparse( X, d, centroid, Xw = rep(1, length(X)), minkP = 2, maxIter = 100L, maxCore = 7L, verbose = TRUE )
X |
A list of size |
d |
An integer. The dimensionality of |
centroid |
A list of size |
Xw |
A numeric vector of size |
minkP |
A numeric value or a character string. If numeric, |
maxIter |
An integer. The maximal number of iterations. Default 100. |
maxCore |
An integer. The maximal number of threads to invoke. No more than the total number of logical processors on the machine. Default 7. |
verbose |
A boolean value. |
See details in KM() for implementation highlights. There are further optimizations. For example, except for the maximum norm, the cost of computing the distance between a dense centroid vector and a sparse observation is linear in the number of nonzero entries of the observation, which is usually far smaller than the full dimensionality. This is achieved by letting every centroid memorize its before-root Minkowski norm; the full distance is then obtained by adding the residual norm to the partial distance.
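A small R sketch of the memorized-norm trick for a finite Minkowski power p; centroidNormP stands for the pre-root norm sum(abs(ctr) ^ p) memorized for the centroid (names are illustrative, not the package's API):

# ctr: dense centroid of length d; obsIdx / obsVal: nonzero dimensions and values of a
# sparse observation; centroidNormP: memorized sum(abs(ctr) ^ p).
sparseMinkowskiSketch = function(ctr, obsIdx, obsVal, centroidNormP, p = 2)
{
  # Cost is linear in length(obsIdx): swap the centroid's own contribution on the
  # observation's nonzero dimensions for the true term, keep the residual norm
  # for the dimensions where the observation is zero.
  partial = centroidNormP - sum(abs(ctr[obsIdx]) ^ p) + sum(abs(obsVal - ctr[obsIdx]) ^ p)
  partial ^ (1 / p)
}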
A list of size K, the number of clusters. Each element is a list of 3 vectors:
centroid |
a numeric vector of size |
clusterMember |
an integer vector of indexes of elements grouped to |
member2centroidDistance |
a numeric vector of the same size of |
An empty clusterMember implies an empty cluster.
Although it rarely happens, divergence of K-means with a non-Euclidean distance measure (minkP != 2) remains a theoretical possibility.
# ===========================================================================
# Play random numbers. See speed.
# ===========================================================================
N = 10000L # Number of points.
d = 500L # Dimensionality.
K = 100L # Number of clusters.

# Create a data matrix, about 95% of which are zeros.
dat = matrix(unlist(lapply(1L : N, function(x)
{
  tmp = numeric(d)
  # Nonzero entries.
  Nnz = as.integer(max(1, d * runif(1, 0, 0.05)))
  tmp[sample(d, Nnz)] = runif(Nnz) + rnorm(Nnz)
  tmp
})), nrow = d); gc()

# Convert to sparse representation.
# GMKMcharlie::d2s() achieves the same.
sparsedat = apply(dat, 2, function(x)
{
  nonz = which(x != 0)
  list(nonz, x[nonz])
}); gc()

centroidInd = sample(length(sparsedat), K)

# Test speed using dense representation.
centroid = dat[, centroidInd]
system.time({rst = GMKMcharlie::KM(
  X = dat, centroid = centroid, maxIter = 100, minkP = 2, maxCore = 2,
  verbose = TRUE)})

# Test speed using sparse representation.
sparseCentroid = sparsedat[centroidInd]
system.time({sparseRst = GMKMcharlie::KMsparse(
  X = sparsedat, d = d, centroid = sparseCentroid,
  maxIter = 100, minkP = 2, maxCore = 2, verbose = TRUE)})
Convert data from sparse representation (list of data frames) to dense representation (matrix).
s2d( X, d, zero = 0, verbose = TRUE )
X |
A list of size |
d |
An integer. The dimensionality of |
zero |
A numeric value. In the result matrix, entries not registered in |
verbose |
A boolean value. |
A d x N numeric matrix.
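Conceptually (a minimal R sketch using the same 2-column layout, not the package's C++ code), the conversion back to dense amounts to:

# Rebuild a dense d x N matrix from the sparse list; unregistered entries stay at `zero`.
sparseToDenseSketch = function(X, d, zero = 0)
{
  rst = matrix(zero, nrow = d, ncol = length(X))
  for (i in seq_along(X)) rst[X[[i]][[1]], i] = X[[i]][[2]]
  rst
}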
N = 2000L
d = 3000L
X = matrix(rnorm(N * d) + 2, nrow = d)

# Fill many zeros in X:
X = apply(X, 2, function(x) {
  x[sort(sample(d, d * runif(1, 0.95, 0.99)))] = 0; x})

# Get the sparse version of X.
sparseX = GMKMcharlie::d2s(X)

# Convert it back to dense.
X2 = GMKMcharlie::s2d(sparseX, d)
range(X - X2)