Title: | Conditional Mixture Modeling and Model-Based Clustering |
---|---|
Description: | Conditional mixture model fitted via EM (Expectation Maximization) algorithm for model-based clustering, including parsimonious procedure, optimal conditional order exploration, and visualization. |
Authors: | Yang Wang [aut, cre], Volodymyr Melnykov [aut], Stephen Moshier [ctb] (eigenvalue calculations in C), Rouben Rostamian [ctb] (memory allocation in C) |
Maintainer: | Yang Wang <[email protected]> |
License: | GPL (>= 2) |
Version: | 0.0.1 |
Built: | 2024-10-31 22:13:53 UTC |
Source: | CRAN |
This package provides fitting of conditional mixture models via the EM (Expectation Maximization) algorithm, model-based clustering based on conditional mixture modeling, parsimonious model selection procedures, optimal conditioning order exploration via either a full search or the proposed search algorithm, and illustration of clustering results through pairwise plots.
Function 'cmb.em' runs parsimonious conditional mixture modeling for a user-specified conditioning order.
Function 'cmb.search' either runs the 'cmb.em' procedure for all possible conditioning orders and determines the optimal order by BIC, or runs the proposed optimal order search algorithm and then 'cmb.em' for the obtained order.
Function 'cmb.plot' builds pairwise plots to present the clustering results from functions 'cmb.em' and 'cmb.search'. A minimal workflow sketch is given below.
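The three functions are designed to be used together. The following sketch illustrates a typical workflow with purely illustrative settings (the iris data and the chosen values of l, K, and the conditioning order are examples, not recommendations):

## Workflow sketch (illustrative settings; see the individual help pages below)
set.seed(1)
x <- as.matrix(iris[, -5])            # numeric data matrix (n x p)
K <- 3                                # number of clusters
l <- 2                                # order of the polynomial regression model
## Fit a parsimonious conditional mixture model for one conditioning order
fit <- cmb.em(x = x, order = c(1, 3, 2, 4), l = l, K = K, method = "stepwise")
## Alternatively, let cmb.search look for a good conditioning order
srch <- cmb.search(x = x, l = l, K = K, method = "stepwise",
                   all.perms = FALSE, Parallel = FALSE)
## Visualize the clustering
cmb.plot(fit)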
Yang Wang and Volodymyr Melnykov.
Maintainer: Yang Wang <[email protected]>
Melnykov, V., and Wang, Y. (2023). Conditional mixture modeling and model-based clustering. Pattern Recognition, 133, p. 108994.
The data set contains body characteristics of 102 male and 100 female athletes at the Australian Institute of Sport. It was collected for a study of how various features vary with sport, body size, and sex of the athlete.
data(ais)
A data frame with 202 observations on the following 13 variables.
Factor with levels: female, male;
Factor with levels: B_Ball, Field, Gym, Netball, Row, Swim, T_400m, Tennis, T_Sprnt, W_Polo;
Red cell count;
White cell count;
Hematocrit;
Hemoglobin;
Plasma ferritin concentration;
Body Mass Index;
Sum of skin folds;
Body fat percentage;
Lean body mass;
Height, cm;
Weight, kg
The data have been made publicly available in connection with the book by Cook, R.D. and Weisberg, S. (1994, ISBN-10:0471008397).
Cook, R.D. and Weisberg, S. (1994). An introduction to regression graphics. John Wiley & Sons.
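This help page has no Examples section, but the data set can be loaded and inspected in the usual way; a minimal sketch:

data(ais)
str(ais)       # 202 observations on 13 variables
summary(ais)   # numeric summaries of the body-characteristic variables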
Runs conditional mixture modeling and model-based clustering via the EM (Expectation Maximization) algorithm for a prespecified variable conditioning order. Runs a variable selection procedure (forward, backward, or stepwise) to achieve a parsimonious mixture model.
cmb.em(x, order = NULL, l, K, method = "stepwise", id0 = NULL, n.em = 200, em.iter = 5, EM.iter = 200, nk.min = NULL, max.spur = 5, tol = 1e-06, silent = FALSE, Parallel = FALSE, n.cores = 4)
x | dataset matrix (n x p)
order | customized variable conditioning order (length p)
l | order of polynomial regression model
K | number of clusters
method | variable selection method (options 'stepwise', 'forward', 'backward' and 'none')
id0 | initial membership vector (length n)
n.em | number of short EM runs in an emEM procedure
em.iter | maximum number of iterations of short EM in an emEM procedure
EM.iter | maximum number of EM iterations
nk.min | spurious output control
max.spur | maximum number of reruns when a spurious solution is obtained
tol | tolerance level
silent | output control (TRUE/FALSE)
Parallel | parallel computing (TRUE/FALSE)
n.cores | number of cores in parallel computing
In conditional mixture modeling, each component is modeled by a product of conditional distributions whose means are expressed by polynomial regression functions of the other variables. The polynomial regression order l and the number of clusters K are prespecified by the user. The model can be initialized either by passing a group membership vector to the argument id0 or by the emEM algorithm (the default). Two arguments control the emEM procedure: the number of short EM runs n.em and the maximum number of iterations per short run em.iter; the defaults are n.em = 200 and em.iter = 5. The variable selection method is specified as method = "stepwise", "forward", "backward", or "none", where method = "none" means no parsimonious procedure is conducted. During the model fitting and variable selection phases the EM algorithm is applied repeatedly, with EM.iter and tol serving as the stopping criteria of each EM run. The spurious output control argument nk.min, by default nk.min = (l x (p - 1) + 1) x 2, can be set by the user. When a spurious solution is obtained, cmb.em is rerun; the maximum number of reruns is max.spur.
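The initialization options described above can be sketched as follows; the k-means starting partition passed to id0 is purely illustrative, and all other settings are examples only:

set.seed(1)
x <- as.matrix(iris[, -5])
K <- 3
l <- 2
## Default initialization: emEM with 200 short runs of at most 5 iterations each
fit1 <- cmb.em(x = x, order = c(1, 3, 2, 4), l = l, K = K,
               n.em = 200, em.iter = 5)
## User-supplied initialization: pass a membership vector of length n to id0
id.start <- kmeans(scale(x), centers = K, nstart = 10)$cluster
fit2 <- cmb.em(x = x, order = c(1, 3, 2, 4), l = l, K = K, id0 = id.start)
## Skip the parsimonious (variable selection) step entirely
fit3 <- cmb.em(x = x, order = c(1, 3, 2, 4), l = l, K = K, method = "none")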
Notation: n - sample size, p - number of variables, l - order of polynomial regression model, K - number of mixture components.
data | input dataset
model | estimated regression models for each cluster (K x p matrix)
id | vector of estimated membership (length n)
loglik | estimated log likelihood
BIC | Bayesian Information Criterion
Pi | vector of estimated mixing proportions (length K)
tau | matrix of estimated posterior probabilities (n x K)
beta | matrix of estimated regression parameters (K x (p + p(p-1)l/2))
s2 | matrix of estimated variance (K x p)
order | applied conditioning order (length p)
n_pars | number of model parameters
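Assuming obj is the object returned by cmb.em (with the components listed above; see the Examples below for how such an object is produced), the fitted quantities can be inspected directly:

obj$BIC          # Bayesian Information Criterion of the fitted model
obj$Pi           # estimated mixing proportions (length K)
head(obj$tau)    # posterior probabilities for the first observations (n x K)
obj$order        # conditioning order that was actually applied
table(obj$id)    # cluster sizes implied by the estimated memberships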
Biernacki C., Celeux G., Govaert G. (2003). Choosing Starting Values for the EM Algorithm for Getting the Highest Likelihood in Multivariate Gaussian Mixture Models. Computational Statistics and Data Analysis, 41(3-4), pp. 561-575.
set.seed(1)
K <- 3
l <- 2
x <- as.matrix(iris[,-5])
id.true <- iris[,5]
# Run EM algorithm for fitting a conditional mixture model
obj <- cmb.em(x = x, order = c(1,3,2,4), l, K, method = "stepwise",
              silent = FALSE, Parallel = FALSE)
id.cmb <- obj$id
table(id.true, id.cmb)
obj$BIC
cmb.plot displays the clustering results of functions cmb.em and cmb.search. It produces a graph combining pairwise scatter plots of the data points, pairwise contour plots of the estimated mixture density, and pairwise regression curves.
cmb.plot(obj, allcolors = NULL, allpch = NULL, lwd = 1, cex.text = 1, cex.point = 0.6, mar = c(0.6,0.6,0.6,0.6), oma = c(3.5,3.5,2.5,14), nlevels = 30)
obj | output object of the function cmb.em or cmb.search
allcolors | colours of clusters (length K)
allpch | styles of data points in clusters (length K)
lwd | line width, a positive number, defaulting to 1
cex.text | magnification of labels and titles, defaulting to 1
cex.point | magnification of plotting symbols, defaulting to 0.6
mar | margin sizes of plots in lines of text (length 4)
oma | outer margin sizes of a pairwise plot in lines of text (length 4)
nlevels | number of contour levels, defaulting to 30
This function generates a graphic.
set.seed(4)
K <- 3
l <- 2
x <- as.matrix(iris[,-5])
# Run EM algorithm for fitting a conditional mixture model
obj <- cmb.em(x = x, order = c(1,2,3,4), l, K, method = "stepwise",
              silent = TRUE, Parallel = FALSE)
cmb.plot(obj)
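Continuing the example above, the appearance arguments can be adjusted as sketched below; the colour and symbol choices are illustrative only:

## Customize cluster colours, plotting symbols, line width, and contour resolution
cmb.plot(obj,
         allcolors = c("firebrick", "steelblue", "darkolivegreen"),  # length K
         allpch    = c(1, 2, 3),                                     # length K
         lwd       = 2,
         nlevels   = 20)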
Runs a forward, backward, or stepwise variable selection procedure to obtain parsimonious conditional mixture models when all conditioning orders are considered. Alternatively, runs the optimal order search algorithm and then parsimonious conditional mixture modeling for the obtained order.
cmb.search(x, l, K, method = "stepwise", all.perms = TRUE, id0 = NULL, n.em = 200, em.iter = 5, EM.iter = 200, nk.min = NULL, max.spur = 5, tol = 1e-06, silent = FALSE, Parallel = TRUE, n.cores = 4)
x | dataset matrix (n x p)
l | order of polynomial regression model
K | number of clusters
method | variable selection method (options 'stepwise', 'forward', 'backward' and 'none')
all.perms | conditioning order search strategy (TRUE for a full search over all orders, FALSE for the proposed search algorithm)
id0 | initial group membership (length n)
n.em | number of short EM runs in the emEM procedure
em.iter | maximum number of short EM iterations in emEM
EM.iter | maximum number of EM iterations
nk.min | spurious output control
max.spur | maximum number of reruns when a spurious solution is obtained
tol | tolerance level
silent | output control (TRUE/FALSE)
Parallel | parallel computing (TRUE/FALSE)
n.cores | number of cores in parallel computing
Functions 'cmb.search' and 'cmb.em' share all arguments except 'all.perms'. With all.perms = TRUE, a full search is applied to the data: parsimonious conditional mixture modeling is run for all conditioning orders and the optimal order is identified by the BIC. Two lists are then returned: best.model stores the results for the conditional mixture model with the optimal order, and models holds the results for all orders. With all.perms = FALSE, the optimal conditioning order search algorithm is applied and only the list best.model is returned. The list models is returned only when all.perms = TRUE.
best.model | membership assignments and estimated parameters of the mixture model with the optimal conditioning order
models | membership assignments and model parameters of the mixture models with all conditioning orders
cmb.em
set.seed(1)
K <- 3
l <- 2
x <- as.matrix(iris[,-5])
obj <- cmb.search(x = x, l, K, method = "stepwise", all.perms = FALSE,
                  Parallel = FALSE, silent = FALSE)
obj$best.model$BIC
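Continuing the example above, a full search with all.perms = TRUE returns both lists described in the Details; note that the number of conditioning orders grows as p!, so this is only practical for small p. A sketch:

## Full search over all p! = 24 conditioning orders of the four iris variables
## (this fits a model for every order and may take a while)
set.seed(1)
obj.full <- cmb.search(x = x, l = 2, K = 3, method = "stepwise",
                       all.perms = TRUE, Parallel = FALSE, silent = TRUE)
obj.full$best.model$BIC    # BIC of the model with the optimal conditioning order
obj.full$best.model$order  # the selected conditioning order
length(obj.full$models)    # results for all orders (assumed: one entry per order)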
Datasets are simulated from conditional mixture models with different numbers of components.
data(smltn)
Two datasets are stored in the data smltn: smltn1 is a data matrix with 200 observations on two variables and one group membership column; smltn2 is a matrix with 300 observations on two variables and one group ID column.
data(smltn)
# View data matrices smltn1 and smltn2
print(smltn1)
print(smltn2)
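The simulated data can also be used to try out cmb.em. The sketch below assumes that the first two columns of smltn1 hold the variables and the last column the group membership, and uses an illustrative polynomial order; both assumptions should be checked against the printed data:

data(smltn)
x1  <- smltn1[, 1:2]   # assumed: the two variables
id1 <- smltn1[, 3]     # assumed: the group membership column
set.seed(1)
fit <- cmb.em(x = x1, order = c(1, 2), l = 1, K = length(unique(id1)),
              method = "none", silent = TRUE)
table(id1, fit$id)     # compare the simulated and estimated memberships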