Title: | Covariate Adaptive Clustering |
---|---|
Description: | Implements the predictive k-means method for clustering observations, using a mixture of experts model to allow covariates to influence cluster centers. Motivated by air pollution epidemiology settings, where cluster membership needs to be predicted across space. Includes functions for predicting cluster membership using spatial splines and principal component analysis (PCA) scores using either multinomial logistic regression or support vector machines (SVMs). For method details see Keller et al. (2017) <doi:10.1214/16-AOAS992>. |
Authors: | Joshua Keller |
Maintainer: | Joshua Keller <[email protected]> |
License: | GPL-3 | file LICENSE |
Version: | 0.1.1 |
Built: | 2024-12-30 09:08:04 UTC |
Source: | CRAN |
Clusters multivariate exposures, using a mixture of experts model to allow covariates to influence cluster centers. Motivated by air pollution epidemiology settings, where cluster membership needs to be predicted across space. Includes functions for predicting cluster membership using spatial splines and principal component analysis (PCA) scores using either multinomial logistic regression or support vector machines (SVMs). For method details see Keller et al. (2017) <doi:10.1214/16-AOAS992>
Assigns observation to the nearest cluster center, using squared Euclidean distance.
assignCluster(X, centers)
assignCluster(X, centers)
X |
matrix of observations |
centers |
matrix of cluster centers |
A vector of cluster labels
Joshua Keller
X <- matrix(rnorm(100*5), nrow=100, ncol=5) centers <- matrix(runif(3*5), nrow=3, ncol=5) cl <- assignCluster(X, centers) table(cl)
X <- matrix(rnorm(100*5), nrow=100, ncol=5) centers <- matrix(runif(3*5), nrow=3, ncol=5) cl <- assignCluster(X, centers) table(cl)
Splits a vector of observation names or indices into a list of k groups, to be used as cross-validation (CV) test groups.
createCVgroups(x = NULL, n = length(x), k = 10, useNames = TRUE)
createCVgroups(x = NULL, n = length(x), k = 10, useNames = TRUE)
x |
vector of observation ID's (character or numeric) to split into cv groups. |
n |
number of observations to split into cv groups. Defaults to the length of |
k |
number of cross-validation groups. Must be less than or equal to |
useNames |
logical indicator of whether the names of 'x' should be used to identify observations within cv groups. |
A list of length k
giving the IDs of observations within each test group.
Joshua Keller
predkmeansCVest predkmeansCVpred
# 5-fold groups cv5 <- createCVgroups(n=100, k=5) cv5 # Leave-one-out cvLOO <- createCVgroups(n=100, k=0) cvLOO
# 5-fold groups cv5 <- createCVgroups(n=100, k=5) cv5 # Leave-one-out cvLOO <- createCVgroups(n=100, k=0) cvLOO
Wrapper function for creating PCA scores to be used in a regression analysis.
createPCAmodelmatrix( data, ncomps, covarnames = colnames(data), center = TRUE, scale = TRUE, matrixonly = TRUE )
createPCAmodelmatrix( data, ncomps, covarnames = colnames(data), center = TRUE, scale = TRUE, matrixonly = TRUE )
data |
Matrix or data frame of data |
ncomps |
Number of PCA components to return. |
covarnames |
Names of variables or column numbers in |
center |
Logical indicator of whether |
scale |
Logical indicator of whether |
matrixonly |
Logical indicator of whether only the model matrix should
be returned, or the full output from |
This is a wrapper around prcomp
, which does
the necessary computation.
If matrixonly=TRUE
, a matrix of PCA scores. Otherwise a list containing two elements: X
, a matrix of scores, and pca
, the output from prcomp
.
Joshua Keller
createTPRSmodelmatrix
, predkmeansCVest
n <- 100 d <- 15 X <- matrix(rnorm(n*d), ncol=d, nrow=n) X <- as.data.frame(X) mx <- createPCAmodelmatrix(data=X, ncomps=2)
n <- 100 d <- 15 X <- matrix(rnorm(n*d), ncol=d, nrow=n) X <- as.data.frame(X) mx <- createPCAmodelmatrix(data=X, ncomps=2)
Wrapper function for creating a matrix of thin-plate regression splines (TPRS) to be used in a regression analysis.
createTPRSmodelmatrix( data, df = 5, covarnames = NULL, xname = "x", yname = "y", TPRSfx = TRUE, matrixonly = TRUE )
createTPRSmodelmatrix( data, df = 5, covarnames = NULL, xname = "x", yname = "y", TPRSfx = TRUE, matrixonly = TRUE )
data |
Matrix or data frame of data |
df |
Degrees of freedom for thinplate splines. This does not include an intercept, so the |
covarnames |
Names of other covariates to be included in the model matrix. |
xname |
Name of variable the provides the x-coordinate of location. |
yname |
Name of variable the provides the y-coordinate of location. |
TPRSfx |
Should the TPRS degrees of freedom be fixed. Passed as the |
matrixonly |
Logical indicator of whether only the model matrix should
be returned, or the full output from |
Joshua Keller
createPCAmodelmatrix
, predkmeansCVest
n <- 200 x <- runif(n=n, 0, 100) y <- runif(n=n, 0, 100) d <- data.frame(x=x, y=y) mx <- createTPRSmodelmatrix(data=d, df=5)
n <- 200 x <- runif(n=n, 0, 100) y <- runif(n=n, 0, 100) d <- data.frame(x=x, y=y) mx <- createTPRSmodelmatrix(data=d, df=5)
Solves a multinomial logistic problem using Newton-Raphson method
mlogit( Y, X, beta = NULL, add.intercept = FALSE, betaOnly = FALSE, tol.zero = 1e-08, verbose = T, suppressFittedWarning = FALSE, maxNR.print.level = 0, iterlim = 150, checkY = TRUE )
mlogit( Y, X, beta = NULL, add.intercept = FALSE, betaOnly = FALSE, tol.zero = 1e-08, verbose = T, suppressFittedWarning = FALSE, maxNR.print.level = 0, iterlim = 150, checkY = TRUE )
Y |
A matrix of the outcomes, with K columns for the K groups. Row sums of the matrix should be equal to one, but entries do not have to be 0/1 (but they should be positive). i.e. this is a matrix of hard or soft assignments to K categories. The first column is used as the reference category. |
X |
matrix of covariates for regression. Should have the same number of rows (observations) as Y. Coefficients for all parameters in X are computed for K-1 groups. The coefficients corresponding to the first column of Y are assumed to be zero. |
beta |
starting values for the optimziation. Should be given as a matrix of column vectors, each vector a different starting value. If null, defaults to zeros. |
add.intercept |
a logical indicator of whether an intercept column should be added to X |
betaOnly |
logical indicator of whether only the parameter estimates beta should be returned. Otherwise, beta is returned along with fitted objects. See Output. |
tol.zero |
the tolerance threshold for considering a fitted value as equal to zero. Used for warning about fitted values of 0/1. Is NOT part of the optimization control parameters. |
verbose |
logical indicator that controls whether text indicating progress is output to display |
suppressFittedWarning |
indicator of whether or not warnings about fitted values of 1 are returned |
maxNR.print.level |
numeric value giving the level of
output produced by maxNR. see |
iterlim |
iteration limit for maxNR. Defaults to 150. |
checkY |
indicator for whether Y should be checked to be a valid assignment matrix. Set to |
The optimization is done using the maxNR
function from the maxLik
package. The log-likehood function, along with its gradient and hessian, are implemented as C++ functions (via the RcppArmadillo
package).
A list containing the following:
beta |
a p x K matrix of parameter estimates corresponding to the K columns of Y and p covariates in X |
fitted01 |
indicator of whether fitted values of 1 were present. |
fitted |
the fitted probabilities |
res.best |
the best result from the maxNR fit |
status |
small data frame summarizing the status of the fits |
res. all |
a list containing the results from all maxNR fits |
Joshua Keller
n <- 2000 X <- cbind(1, matrix(rnorm(2*n), nrow=n, ncol=2), rbinom(n, size=1, prob=0.3)) beta <- cbind(rep(0, 4), c(0.5, 1, 0, -1), c(0, 2, 2, 0)) probs <- exp(X %*% beta) probs <- probs/rowSums(probs) Y <- t(apply(probs, 1, function(p) rmultinom(1, 1, p))) mfit <- mlogit(Y=Y, X=X, betaOnly=TRUE) mfit
n <- 2000 X <- cbind(1, matrix(rnorm(2*n), nrow=n, ncol=2), rbinom(n, size=1, prob=0.3)) beta <- cbind(rep(0, 4), c(0.5, 1, 0, -1), c(0, 2, 2, 0)) probs <- exp(X %*% beta) probs <- probs/rowSums(probs) Y <- t(apply(probs, 1, function(p) rmultinom(1, 1, p))) mfit <- mlogit(Y=Y, X=X, betaOnly=TRUE) mfit
Computes several measures of performance for cluster label prediction.
predictionMetrics(centers, cluster.pred, X, labels = TRUE)
predictionMetrics(centers, cluster.pred, X, labels = TRUE)
centers |
Matrix of Cluster centers |
cluster.pred |
Vector of predicted cluster membership. Should be integers
or names corresponding to rows of |
X |
Matrix of observations at prediction locations. |
labels |
Logical indicating whether cluster prediction and |
A list with the following elements:
MSPE |
Mean squared prediction error. Sum of squared distances between observations and predicted cluster centers. |
wSS |
Within-cluster sum-of-squares. Sum of squared distances between observations at prediction locations and best (i.e. closest) cluster center. |
MSME |
Mean squared misclassification error. Sum of squared distances between predicted cluster center and best (i.e. closest) cluster center. |
pred.acc |
Proportion of cluster labels correctly predicted. |
cluster.pred |
Predicted cluster assignments (same as argument provided). |
cluster.assign |
Integer vector of 'best' cluster assignments (i.e. assignment to closest cluster center) |
Joshua Keller
Keller, J.P., Drton, M., Larson, T., Kaufman, J.D., Sandler, D.P., and Szpiro, A.A. (2017). Covariate-adaptive clustering of exposures for air pollution epidemiology cohorts. Annals of Applied Statistics, 11(1):93–113.
n <- 100 d <- 5 # Dimension of exposure K <- 3 # Number of clusters X <- matrix(rnorm(n*d), ncol=d, nrow=n) centers <- matrix(runif(d*K), nrow=K, ncol=d) cluster_pred <- sample(1:K, size=n, replace=TRUE) metrics <- predictionMetrics(centers, cluster.pred=cluster_pred, X=X) metrics[c("MSPE", "wSS", "MSME", "pred.acc")]
n <- 100 d <- 5 # Dimension of exposure K <- 3 # Number of clusters X <- matrix(rnorm(n*d), ncol=d, nrow=n) centers <- matrix(runif(d*K), nrow=K, ncol=d) cluster_pred <- sample(1:K, size=n, replace=TRUE) metrics <- predictionMetrics(centers, cluster.pred=cluster_pred, X=X) metrics[c("MSPE", "wSS", "MSME", "pred.acc")]
Predicts cluster membership using either multinomial logistic regression or SVMs.
## S3 method for class 'predkmeans' predictML( object = NULL, centers = object$centers, K = nrow(centers), R, Rstar, Xstar = NULL, tr.assign = object$cluster, muStart = "random", maxitMlogit = 500, verbose = 1, nMlogitStarts = 1, mlogit.control = list(suppressFittedWarning = TRUE), ... ) ## S3 method for class 'predkmeans' predictSVM( object = NULL, centers = object$centers, R, Rstar, K = nrow(centers), Xstar = NULL, tr.assign = object$cluster, svm.control = list(gamma = c(1/(2:1), 2), cost = seq(20, 100, by = 20)), ... ) ## S3 method for class 'predkmeans' predictMixExp(object, R, Rstar = NULL, ...)
## S3 method for class 'predkmeans' predictML( object = NULL, centers = object$centers, K = nrow(centers), R, Rstar, Xstar = NULL, tr.assign = object$cluster, muStart = "random", maxitMlogit = 500, verbose = 1, nMlogitStarts = 1, mlogit.control = list(suppressFittedWarning = TRUE), ... ) ## S3 method for class 'predkmeans' predictSVM( object = NULL, centers = object$centers, R, Rstar, K = nrow(centers), Xstar = NULL, tr.assign = object$cluster, svm.control = list(gamma = c(1/(2:1), 2), cost = seq(20, 100, by = 20)), ... ) ## S3 method for class 'predkmeans' predictMixExp(object, R, Rstar = NULL, ...)
object |
A predkmeans object, from which the cluster centers will be extracted. |
centers |
Matrix of cluster centers, assumed to be K-by-p |
K |
Number of clusters |
R |
matrix of covariates for observations to be predicted at. |
Rstar |
matrix of covariates at training locations |
Xstar |
matrix of observation at training locations. Either this or |
tr.assign |
vector of cluster assignments at training locations. By default, extracted from |
muStart |
starting value for cluster centers in mlogit optimization (IDEA: change to pull from predkmeans object?). If not provided, starting values are selected randomly. |
maxitMlogit |
Maximum number of iterations for |
verbose |
integer indicating amount of output to be displayed |
nMlogitStarts |
number of mlogit starts to use in estimation of parameters |
mlogit.control |
list of control parameters to be passes to |
... |
Unused additional arguments |
svm.control |
list of options for |
Function for predicting cluster membership in clusters identified by k-means or predictive k-means using multinomial logistic regression or support vector machines (SVMs). For multinomial logitic regression, parameter estimation is handled by mlogit
. The SVMs are fit using best.svm
from e1071
package.
Because this prediction includes return information about cluster assignment
and prediction model parameters, this method is deliberately distinct from
the generic predict
functions.
The predictMixExp
funciton provides predictions from
the 'working' cluster assignments created as part of the
mixture of experts algorithm from predkmeans
.
A list containing some or all of the following elements:
tr.assign |
Cluster assignments at training locations |
mlfit |
A subset of the mlogit object returned by the function of that name |
beta |
Estimated model parameters |
test.pred |
Predicted cluster assignments at test locations |
Joshua Keller
mlogit
, predkmeans
, predictionMetrics
Other methods for predkmeans objects:
relevel.predkmeans()
n <- 200 r1 <- rnorm(n) r2 <- rnorm(n) u1 <- rbinom(n, size=1,prob=0) cluster <- ifelse(r1<0, ifelse(u1, "A", "B"), ifelse(r2<0, "C", "D")) mu1 <- c(A=2, B=2, C=-2, D=-2) mu2 <- c(A=1, B=-1, C=-1, D=-1) x1 <- rnorm(n, mu1[cluster], 4) x2 <- rnorm(n, mu2[cluster], 4) R <- cbind(1, r1, r2) X <- cbind(x1, x2) pkm <- predkmeans(X=cbind(x1, x2), R=R, K=4) n_pred <- 50 Rnew <- cbind(1, r1=rnorm(n_pred), r2=rnorm(n_pred)) pkmPred <- predictML(pkm, R=Rnew, Rstar=R) pkmPred$test.pred
n <- 200 r1 <- rnorm(n) r2 <- rnorm(n) u1 <- rbinom(n, size=1,prob=0) cluster <- ifelse(r1<0, ifelse(u1, "A", "B"), ifelse(r2<0, "C", "D")) mu1 <- c(A=2, B=2, C=-2, D=-2) mu2 <- c(A=1, B=-1, C=-1, D=-1) x1 <- rnorm(n, mu1[cluster], 4) x2 <- rnorm(n, mu2[cluster], 4) R <- cbind(1, r1, r2) X <- cbind(x1, x2) pkm <- predkmeans(X=cbind(x1, x2), R=R, K=4) n_pred <- 50 Rnew <- cbind(1, r1=rnorm(n_pred), r2=rnorm(n_pred)) pkmPred <- predictML(pkm, R=Rnew, Rstar=R) pkmPred$test.pred
Uses a Mixture-of-experts algorithm to find cluster centers that are influenced by prediction covariates.
predkmeans( X, R, K, mu = NULL, muStart = c("kmeans", "random"), sigma2 = 0, sigma2fixed = FALSE, maxitEM = 100, tol = 1e-05, convEM = c("both", "mu", "gamma"), nStarts = 1, maxitMlogit = 500, verbose = 0, muRestart = 1000, returnAll = FALSE, ... )
predkmeans( X, R, K, mu = NULL, muStart = c("kmeans", "random"), sigma2 = 0, sigma2fixed = FALSE, maxitEM = 100, tol = 1e-05, convEM = c("both", "mu", "gamma"), nStarts = 1, maxitMlogit = 500, verbose = 0, muRestart = 1000, returnAll = FALSE, ... )
X |
An |
R |
Covariates used for clustering. Required unless doing k-means
clustering (i.e. |
K |
Number of clusters |
mu |
starting values for cluster centers. If NULL (default),
then value is chosen according to |
muStart |
Character string indicating how initial value
of mu should be selected. Only used if mu=NULL. Possible
values are |
sigma2 |
starting value of sigma2. If set to |
sigma2fixed |
Logical indicating whether sigma2 should be held fixed. If FALSE, then sigma2 is estimated using Maximum Likelihood. |
maxitEM |
Maximum number of EM iterations for
finding the Mixture of Experts solution. If doing regular
k-means, this is passed as |
tol |
convergence criterion |
convEM |
controls the measure of convergence for the EM algorithm. Should be one of "mu", "gamma", or "both". Defaults to "both." The EM algorithm stops when the Frobenius norm of the change in mu, the change in gamma, or the change in mu and the change in gamma is less than 'tol'. |
nStarts |
number of times to perform EM algorithm |
maxitMlogit |
Maximum number of iterations in the mlogit optimization (nested within EM algorithm) |
verbose |
numeric vector indicating how much output to produce |
muRestart |
Gives max number of attempts at picking starting values. Only used when muStart='random'. If selected starting values for mu are constant within each cluster, then the starting values are re-selected up to muRestart times. |
returnAll |
A list containing all |
... |
Additional arguments passed to |
A thorough description of this method is provided in Keller et al. (2017). The algorithm for sovling the mixture of Experts model is based upon the approach presented by Jordan and Jacobs (1994).
If sigma2
is 0 and sigm2fixed
is TRUE, then standard k-means clustering (using kmeans
) is done instead.
An object of class predkmeans
, containing the following elements:
res.best |
A list containing the results from the best-fitting solution to the Mixture of Experts problem:
|
center |
Matrix of cluster centers |
cluster |
Vector of cluster labels assigned to observations |
K |
Number of clusters |
sigma2 |
Final value of sigma^2. |
wSS |
Mean within-cluster sum-of-squares |
sigma2fixed |
Logical indicator of whether sigma2 was held fixed |
Joshua Keller
Keller, J.P., Drton, M., Larson, T., Kaufman, J.D., Sandler, D.P., and Szpiro, A.A. (2017). Covariate-adaptive clustering of exposures for air pollution epidemiology cohorts. Annals of Applied Statistics, 11(1):93–113.
Jordan M. and Jacobs R. (1994). Hierarchical mixtures of experts and the EM algorithm. Neural computation 6(2), 181-214.
predictML.predkmeans, predkmeansCVest
n <- 200 r1 <- rnorm(n) r2 <- rnorm(n) u1 <- rbinom(n, size=1,prob=0) cluster <- ifelse(r1<0, ifelse(u1, "A", "B"), ifelse(r2<0, "C", "D")) mu1 <- c(A=2, B=2, C=-2, D=-2) mu2 <- c(A=1, B=-1, C=-1, D=-1) x1 <- rnorm(n, mu1[cluster], 4) x2 <- rnorm(n, mu2[cluster], 4) R <- model.matrix(~r1 + r2) X <- cbind(x1, x2) pkm <- predkmeans(X=cbind(x1, x2), R=R, K=4) summary(pkm)
n <- 200 r1 <- rnorm(n) r2 <- rnorm(n) u1 <- rbinom(n, size=1,prob=0) cluster <- ifelse(r1<0, ifelse(u1, "A", "B"), ifelse(r2<0, "C", "D")) mu1 <- c(A=2, B=2, C=-2, D=-2) mu2 <- c(A=1, B=-1, C=-1, D=-1) x1 <- rnorm(n, mu1[cluster], 4) x2 <- rnorm(n, mu2[cluster], 4) R <- model.matrix(~r1 + r2) X <- cbind(x1, x2) pkm <- predkmeans(X=cbind(x1, x2), R=R, K=4) summary(pkm)
Performs cross-validation of predictive k-means clustering and cluster prediction.
predkmeansCVest( X, R, K, cv.groups = 10, sigma2 = 0, sigma2fixed = FALSE, scale = TRUE, covarnames = colnames(R), PCA = FALSE, PCAcontrol = list(covarnames = colnames(R), ncomps = 5), TPRS = FALSE, TPRScontrol = list(df = 5, xname = "x", yname = "y"), returnAll = FALSE, ... ) predkmeansCVpred( object, X = object$X, R = object$R, method = c("ML", "MixExp", "SVM"), ... )
predkmeansCVest( X, R, K, cv.groups = 10, sigma2 = 0, sigma2fixed = FALSE, scale = TRUE, covarnames = colnames(R), PCA = FALSE, PCAcontrol = list(covarnames = colnames(R), ncomps = 5), TPRS = FALSE, TPRScontrol = list(df = 5, xname = "x", yname = "y"), returnAll = FALSE, ... ) predkmeansCVpred( object, X = object$X, R = object$R, method = c("ML", "MixExp", "SVM"), ... )
X |
Outcome data |
R |
Covariates. Coerced to data frame. |
K |
Number of clusters |
cv.groups |
A list providing the cross-validation groups for splitting the data. groups for splitting the data. Alternatively, a single number giving the number of groups into which the data are randomly split. A value of '0' implies leave-one-out. Defaults to 10. |
sigma2 |
starting value of sigma2. Setting |
sigma2fixed |
Logical indicating whether sigma2 should be held fixed. If FALSE, then sigma2 is estimated using Maximum Likelihood. |
scale |
Should the outcomes be re-scaled within each training group? |
covarnames |
Names of covariates to be included directly. |
PCA |
Logical indicator for whether PCA components should be computed from R. |
PCAcontrol |
Arguments passed to |
TPRS |
Logical indicator for whether thin-plate regression splines should be created and added to covariates. |
TPRScontrol |
Arguments passed to |
returnAll |
A list containing all |
... |
Additional arguments passed to either |
object |
A |
method |
Character string indicating which prediciton method should be used. Optins are |
These wrappers are designed to simplify cross-validation of a dataset. For models including thin-plate regression splines (TPRS) or principal component analysis (PCA) scores, these functions will re-evaluate the TPRS basis or PCA decomposition on each training set.
Joshua Keller
predkmeans
, createPCAmodelmatrix
, createTPRSmodelmatrix
n <- 200 r1 <- rnorm(n) r2 <- rnorm(n) u1 <- rbinom(n, size=1,prob=0) cluster <- ifelse(r1<0, ifelse(u1, "A", "B"), ifelse(r2<0, "C", "D")) mu1 <- c(A=2, B=2, C=-2, D=-2) mu2 <- c(A=1, B=-1, C=-1, D=-1) x1 <- rnorm(n, mu1[cluster], 4) x2 <- rnorm(n, mu2[cluster], 4) R <- model.matrix(~r1 + r2) X <- cbind(x1, x2) pkmcv <- predkmeansCVest(X=cbind(x1, x2), R=R, K=4, nStarts=4, cv.groups= 5, TPRS=FALSE, PCA=FALSE, covarnames=colnames(R)) pkmcv
n <- 200 r1 <- rnorm(n) r2 <- rnorm(n) u1 <- rbinom(n, size=1,prob=0) cluster <- ifelse(r1<0, ifelse(u1, "A", "B"), ifelse(r2<0, "C", "D")) mu1 <- c(A=2, B=2, C=-2, D=-2) mu2 <- c(A=1, B=-1, C=-1, D=-1) x1 <- rnorm(n, mu1[cluster], 4) x2 <- rnorm(n, mu2[cluster], 4) R <- model.matrix(~r1 + r2) X <- cbind(x1, x2) pkmcv <- predkmeansCVest(X=cbind(x1, x2), R=R, K=4, nStarts=4, cv.groups= 5, TPRS=FALSE, PCA=FALSE, covarnames=colnames(R)) pkmcv
Function for re-ordering the order of clusters in a predkmeans object.
## S3 method for class 'predkmeans' relevel(x, ref = NULL, order = NULL, ...)
## S3 method for class 'predkmeans' relevel(x, ref = NULL, order = NULL, ...)
x |
object of class |
ref |
New reference group ("Cluster 1"). Only used if |
order |
New order of clusters. |
... |
Ignored additional arguments. |
The elements of the order
argument should refer
to the current position of clusters, with the position
giving the new order. So c(3, 1, 2)
moves 1 to 2, 2 to 3, and 3 to 1.
Joshua Keller
Other methods for predkmeans objects:
predictML.predkmeans()
n <- 200 r1 <- rnorm(n) r2 <- rnorm(n) u1 <- rbinom(n, size=1,prob=0) cluster <- ifelse(r1<0, ifelse(u1, "A", "B"), ifelse(r2<0, "C", "D")) mu1 <- c(A=2, B=2, C=-2, D=-2) mu2 <- c(A=1, B=-1, C=-1, D=-1) x1 <- rnorm(n, mu1[cluster], 4) x2 <- rnorm(n, mu2[cluster], 4) R <- model.matrix(~r1 + r2) X <- cbind(x1, x2) pkm <- predkmeans(X=cbind(x1, x2), R=R, K=4) table(pkm$cluster) # Move cluster '4' to be first pkm2 <- relevel(pkm, ref=4) table(pkm2$cluster) # Re-order based upon number of observations in each cluster pkm3 <- relevel(pkm, order=order(table(pkm$cluster), decreasing=TRUE)) table(pkm3$cluster)
n <- 200 r1 <- rnorm(n) r2 <- rnorm(n) u1 <- rbinom(n, size=1,prob=0) cluster <- ifelse(r1<0, ifelse(u1, "A", "B"), ifelse(r2<0, "C", "D")) mu1 <- c(A=2, B=2, C=-2, D=-2) mu2 <- c(A=1, B=-1, C=-1, D=-1) x1 <- rnorm(n, mu1[cluster], 4) x2 <- rnorm(n, mu2[cluster], 4) R <- model.matrix(~r1 + r2) X <- cbind(x1, x2) pkm <- predkmeans(X=cbind(x1, x2), R=R, K=4) table(pkm$cluster) # Move cluster '4' to be first pkm2 <- relevel(pkm, ref=4) table(pkm2$cluster) # Re-order based upon number of observations in each cluster pkm3 <- relevel(pkm, order=order(table(pkm$cluster), decreasing=TRUE)) table(pkm3$cluster)