| Title: | Multiple Imputation in Cluster Analysis |
|---|---|
| Description: | Implementation of a framework for cluster analysis with selection of the final number of clusters and an optional variable selection procedure. The package is designed to integrate the results of multiple imputed datasets while accounting for the uncertainty that the imputations introduce in the final results. In addition, the package can also be used for a cluster analysis of the complete cases of a single dataset. The package also includes specific methods to summarize and plot the results. The methods are described in Basagana et al. (2013) <doi:10.1093/aje/kws289>. |
| Authors: | Jose Barrera-Gomez [aut, cre] (ORCID: <https://orcid.org/0000-0002-2688-6036>), Xavier Basagana [aut] (ORCID: <https://orcid.org/0000-0002-8457-1489>) |
| Maintainer: | Jose Barrera-Gomez <[email protected]> |
| License: | GPL-3 |
| Version: | 1.3.0 |
| Built: | 2026-06-08 20:27:35 UTC |
| Source: | https://github.com/cran/miclust |
Cluster analysis with selection of the final number of clusters and an optional variable selection procedure. The package is designed to integrate the results of multiply imputed datasets while accounting for the uncertainty that the imputations introduce in the final results. See ‘Procedure’ below for further details on how the tool works.
The tool consists of a two-step procedure. In the first
step the user provides the data to be analysed. These can be a single
data.frame or a list of data.frames including the raw data and the imputed
datasets. In the latter case, getdata needs to be used first to
prepare the data. In the second step, the function miclust performs k-means
clustering with selection of the final number of clusters and an optional
(backward or forward) variable selection procedure. Specific summary
and plot methods are provided to summarize and visualize the impact
of the imputations on the results.
Jose Barrera-Gomez (maintainer, <[email protected]>) and Xavier Basagana.
The methodology used in the package is described in
Basagana X, Barrera-Gomez J, Benet M, Anto JM, Garcia-Aymerich J. A Framework for Multiple Imputation in Cluster Analysis. American Journal of Epidemiology. 2013;177(7):718-725.
Creates an object of class midata to be clustered by the function miclust.
    getdata(data)
data |
a |
All variables in the data frames of impdata are standardized by
getdata, so categorical variables need to be coded with numeric
values. Standardization is performed by centering each variable at its mean
and then dividing by its standard deviation (or, for binary variables, by the
difference between the maximum and the minimum values). This
standardization is applied only to the imputed datasets. The raw data are
standardized internally by miclust when needed, that is, when only the
raw data are analysed (complete case analysis).
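The standardization rule described above can be sketched in base R. This is an illustration only, not the package's internal code: the helper name `standardize` and the toy data frame are hypothetical, and a variable is treated as binary when it takes exactly two distinct non-missing values, which is an assumption.

```r
# Hypothetical helper: centre each column at its mean, then divide by the
# standard deviation, or by (max - min) when the column is binary.
standardize <- function(df) {
  as.data.frame(lapply(df, function(v) {
    # Assumption: "binary" means exactly two distinct non-missing values.
    is_binary <- length(unique(v[!is.na(v)])) == 2
    denom <- if (is_binary) diff(range(v, na.rm = TRUE)) else sd(v, na.rm = TRUE)
    (v - mean(v, na.rm = TRUE)) / denom
  }))
}

# Toy data: one continuous and one binary variable, both coded numerically.
toy <- data.frame(bmi = c(22, 27, 31, 25), hyp = c(1, 2, 1, 1))
z <- standardize(toy)
z
```

After this transformation a continuous column has mean 0 and standard deviation 1, while a binary column has mean 0 and a range of exactly 1.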
An object of classes "list" and "midata" including the
following items:
rawdata | a data frame containing the raw data. |
impdata | if data is an object of class list, a list containing the standardized imputed datasets. |
    ### data minhanes:
    data(minhanes)
    class(minhanes)
    ### number of imputed datasets:
    length(minhanes) - 1
    ### raw data with missing values:
    summary(minhanes[[1]])
    ### first imputed dataset:
    minhanes[[2]]
    summary(minhanes[[2]])
    ### data preparation for a complete case cluster analysis:
    data1 <- getdata(minhanes[[1]])
    class(data1)
    names(data1)
    ### there are no imputed datasets:
    data1$impdata
    ### data preparation for a multiple imputation cluster analysis:
    data2 <- getdata(minhanes)
    class(data2)
    names(data2)
    ### number of imputed datasets:
    length(data2$impdata)
    ### imputed datasets are standardized:
    summary(data2$rawdata)
    summary(data2$impdata[[1]])
Creates a ranked selection frequency for all the variables that have been
selected at least once across the analysed imputed datasets.
getvariablesfrequency can be useful for customizing the plot of
these frequencies, as shown in the examples of miclust.
    getvariablesfrequency(x, k = NULL)
x |
an object of class miclust. |
k |
the number of clusters. The default value is the optimal number of
clusters obtained by the function miclust. |
A list including the following items:
percfreq | vector of the selection frequencies (percentage of times) of the variables in decreasing order. |
varnames | names of the variables. |
    ### see examples in miclust.
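As a rough illustration of what a ranked selection frequency looks like, the following base R sketch computes one from a hypothetical list of variables selected in each imputed dataset. The input list `selected` is invented for illustration, and the code is not the package's implementation. Only the element names `percfreq` and `varnames` are taken from the examples in this manual.

```r
# Hypothetical input: variables selected in each of four imputed datasets.
selected <- list(
  c("bmi", "chl"),
  c("bmi", "age", "chl"),
  c("bmi"),
  c("bmi", "chl")
)

# Count how often each variable was selected and express it as a
# percentage of the analysed imputed datasets, in decreasing order.
counts <- table(unlist(selected))
ranked <- sort(100 * counts / length(selected), decreasing = TRUE)

freq <- list(percfreq = as.numeric(ranked), varnames = names(ranked))
freq
```

Variables that were never selected do not appear in the result, which matches the description above.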
Performs cluster analysis in multiply imputed datasets with optional variable
selection. Results can be summarized and visualized with the summary
and plot methods.
    miclust(
      data,
      method = "kmeans",
      search = c("none", "backward", "forward"),
      ks = 2:3,
      maxvars = NULL,
      usedimp = NULL,
      distance = c("manhattan", "euclidean"),
      centpos = c("means", "medians"),
      initcl = c("hc", "rand"),
      verbose = TRUE,
      seed = NULL
    )
    ## S3 method for class 'miclust'
    print(x, ...)
    ## S3 method for class 'miclust'
    plot(
      x,
      k = NULL,
      metric = c("all", "nclfreq", "critcf", "nvarfreq", "varsel"),
      col.nclfreq = "gray",
      col.critcf = "gray",
      col.nvarfreq = "gray",
      col.varsel = "black",
      col.all = NULL,
      ...
    )
data |
object of class |
method |
clustering method. Currently, only |
search |
search algorithm for the variable selection procedure:
|
ks |
the values of the explored number of clusters. Default is exploring 2 and 3 clusters. |
maxvars |
if |
usedimp |
numeric. Which imputed datasets must be included in the
cluster analysis. If |
distance |
two metrics are allowed to compute distances:
|
centpos |
position computation of the cluster centroid. If |
initcl |
starting values for the clustering algorithm. If |
verbose |
a logical value indicating whether status messages are printed. Default is
TRUE. |
seed |
a number. Seed for reproducibility of results. Default is
NULL. |
x |
for |
... |
further arguments for |
k |
for |
metric |
for |
col.nclfreq, col.critcf, col.nvarfreq, col.varsel
|
for |
col.all |
An optional character string or integer specifying a global
color. If provided, it overrides all specific color arguments listed above,
applying the same color across all subplots. Defaults to NULL. |
The optimal number of clusters and the final set of variables are selected according to CritCF. CritCF is defined as
$$\mathrm{CritCF} = \left(\frac{2m}{2m + 1} \cdot \frac{1}{1 + W/B}\right)^{\frac{1 + \log_2(k + 1)}{1 + \log_2(m + 1)}},$$
where $m$ is the number of variables, $k$ is the number of clusters,
and $W$ and $B$ are the within- and between-cluster inertias. Higher
values of CritCF are preferred (Breaban and Luchian, 2011). See References below for further
details about the clustering algorithm.
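To make the selection criterion concrete, here is a minimal base R sketch that scores k-means partitions for 2 and 3 clusters. It assumes the Breaban and Luchian (2011) form of CritCF, written out in the code comment, and uses squared Euclidean inertias. The helpers `critcf` and `inertias` and the simulated data are hypothetical, and the package's own computation (for example with Manhattan distances) may differ.

```r
# Assumed form of the criterion:
#   CritCF = ((2m / (2m + 1)) * 1 / (1 + W / B)) ^
#            ((1 + log2(k + 1)) / (1 + log2(m + 1)))
critcf <- function(W, B, m, k) {
  base <- (2 * m / (2 * m + 1)) * (1 / (1 + W / B))
  expo <- (1 + log2(k + 1)) / (1 + log2(m + 1))
  base^expo
}

# Within- and between-cluster inertias (squared Euclidean; an assumption).
inertias <- function(x, cluster) {
  x <- as.matrix(x)
  grand <- colMeans(x)
  W <- 0
  B <- 0
  for (g in unique(cluster)) {
    xg <- x[cluster == g, , drop = FALSE]
    cg <- colMeans(xg)
    W <- W + sum(sweep(xg, 2, cg)^2)
    B <- B + nrow(xg) * sum((cg - grand)^2)
  }
  c(W = W, B = B)
}

# Simulated, standardized data with two well-separated groups.
set.seed(1)
x <- scale(rbind(matrix(rnorm(40, 0), ncol = 2),
                 matrix(rnorm(40, 4), ncol = 2)))

# Score each candidate number of clusters; the larger value is preferred.
scores <- sapply(2:3, function(k) {
  cl <- kmeans(x, centers = k, nstart = 10)$cluster
  ib <- inertias(x, cl)
  critcf(ib[["W"]], ib[["B"]], m = ncol(x), k = k)
})
names(scores) <- 2:3
scores
```

Because the first factor is below 1 and the second lies between 0 and 1, the score always falls strictly between 0 and 1 when both inertias are positive.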
For computational reasons, option "rand" is suggested instead of "hc"
for high dimensional data.
A list with class "miclust" including the following items:
a list of lists containing the results of the clustering algorithm for each analyzed dataset and for each analyzed number of clusters. Includes information about selected variables and the cluster vector.
if data contains a single data frame, percentage
of complete cases in data.
input data.
the values of the explored number of clusters.
indicator of the imputed datasets used.
optimal number of clusters.
if data contains a single data frame, critcf contains
the optimal (maximum) value of CritCF (see Details) and the number of selected
variables in the reduction procedure for each explored number of clusters. If
data is a list, critcf contains the optimal value of CritCF for
each imputed dataset and for each explored value of the number of clusters.
number of selected variables.
if data is a list, frequency of selection of
each analyzed number of clusters.
input method.
input search.
input maxvars.
input distance.
input centpos.
an object of class kccaFamily needed by the specific
summary method.
input initcl.
Basagana X, Barrera-Gomez J, Benet M, Anto JM, Garcia-Aymerich J. A framework for multiple imputation in cluster analysis. American Journal of Epidemiology. 2013;177(7):718-25.
Breaban M, Luchian H. A unifying criterion for unsupervised clustering and feature selection. Pattern Recognition. 2011;44(4):854-65.
getdata for data preparation before using miclust.
    ### data preparation:
    minhanes1 <- getdata(data = minhanes)
    ##################
    ###
    ### Example 1:
    ###
    ### Multiple imputation clustering process with backward variable selection
    ###
    ##################
    ### using only the imputations 1 to 10 for the clustering process and exploring
    ### 2 vs. 3 clusters:
    minhanes1clust <- miclust(data = minhanes1, search = "backward", ks = 2:3,
                              usedimp = 1:10, seed = 4321)
    minhanes1clust
    minhanes1clust$kfin  ### optimal number of clusters
    ### graphical summary:
    plot(minhanes1clust)
    ### selection frequency of the variables for the optimal number of clusters:
    y <- getvariablesfrequency(minhanes1clust)
    y
    plot(y$percfreq, type = "h", main = "", xlab = "Variable",
         ylab = "Percentage of times selected",
         xlim = 0.5 + c(0, length(y$varnames)), lwd = 15, col = "blue",
         xaxt = "n")
    axis(1, at = 1:length(y$varnames), labels = y$varnames)
    ### default summary for the optimal number of clusters:
    summary(minhanes1clust)
    ## summary forcing 3 clusters:
    summary(minhanes1clust, k = 3)
    ##################
    ###
    ### Example 2:
    ###
    ### Same analysis but without variable selection
    ###
    ##################
    minhanes2clust <- miclust(data = minhanes1, ks = 2:3, usedimp = 1:10,
                              seed = 4321)
    minhanes2clust
    plot(minhanes2clust)
    summary(minhanes2clust)
    ##################
    ###
    ### Example 3:
    ###
    ### Complete cases clustering process with backward variable selection
    ###
    ##################
    nhanes0 <- getdata(data = minhanes[[1]])
    nhanes2clust <- miclust(data = nhanes0, search = "backward", ks = 2:3,
                            seed = 4321)
    nhanes2clust
    summary(nhanes2clust)
    ### nothing to plot for a single dataset analysis
    # plot(nhanes2clust)
    ##################
    ###
    ### Example 4:
    ###
    ### Complete case clustering process without variable selection
    ###
    ##################
    nhanes3clust <- miclust(data = nhanes0, ks = 2:3, seed = 4321)
    nhanes3clust
    summary(nhanes3clust)
A list with 101 datasets. The first dataset contains the nhanes
data from the mice package. The remaining datasets were obtained by
applying the multiple imputation function mice from the same package.
    minhanes
A list of 101 data.frames, each with 25 observations of the following 4 variables:
age | age group (1 = 20-39, 2 = 40-59, 3 = 60+). Treated as numerical. |
bmi | body mass index (kg/m^2). |
hyp | hypertensive (1 = no, 2 = yes). Treated as numerical. |
chl | total serum cholesterol (mg/dL). |
https://CRAN.R-project.org/package=mice
    data(minhanes)
    ### raw data:
    minhanes[[1]]
    summary(minhanes[[1]])
    ### number of imputed datasets:
    length(minhanes) - 1
    ### first imputed dataset:
    minhanes[[2]]
    summary(minhanes[[2]])
Performs a within-cluster descriptive analysis of the variables after the
clustering process performed by the function miclust.
    ## S3 method for class 'miclust'
    summary(object, k = NULL, quantilevars = NULL, ...)
    ## S3 method for class 'summary.miclust'
    print(x, digits = 2, ...)
object |
object of class miclust. |
k |
number of clusters. The default value is the optimal number of
clusters obtained by miclust. |
quantilevars |
numeric. If a variable selection procedure was used,
the cut-off percentile in order to decide the number of selected variables
in the variable reduction procedure by decreasing order of presence along
the imputations results. The default value is |
... |
further arguments for |
x |
for the |
digits |
digits for the |
A list including the following items:
if imputations were analysed, descriptive summary of the probability of cluster assignment.
if imputations were analysed, the individual probabilities of cluster assignment.
if imputations were analysed, the final individual cluster assignment.
if imputations were analysed, size of the imputed cluster and between-imputations summary of the cluster size.
if a single dataset (raw dataset) has been clustered, a vector containing the individual cluster assignments.
if imputed datasets have been clustered, the individual cluster assignment in each imputation.
if a single dataset (raw dataset) has been clustered, the percentage of complete cases in the dataset.
number of clusters.
if imputations were analysed, the Cohen's kappa values after comparing the cluster vector in the first imputation with the cluster vector in each of the remaining imputations.
a summary of kappas.
number of imputations used in the descriptive analysis, which is the total number of imputations provided.
if variable selection was performed, the input value
of quantilevars.
search algorithm for the variable selection procedure.
if variable selection was performed, the selected
variables obtained considering quantilevars.
if imputations were analysed and variable selection was performed, the presence of the selected variables along imputations.
within-cluster descriptive analysis of the selected variables.
indicator of imputations used in the clustering procedure.
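Several of the items above summarize how stable the cluster assignment is across imputations. The following base R sketch shows one way such quantities can be computed from a matrix of cluster labels. It is not the package's implementation: the toy matrix `assignments` and the helper `cohen_kappa` are hypothetical, and cluster labels are assumed to be already aligned across imputations.

```r
# Hypothetical cluster labels: rows are individuals, columns are imputations.
# Assumption: label 1 denotes the same cluster in every imputation.
assignments <- cbind(imp1 = c(1, 1, 2, 2, 1, 2),
                     imp2 = c(1, 1, 2, 2, 2, 2),
                     imp3 = c(1, 1, 2, 1, 1, 2))
k <- 2

# Individual probabilities of cluster assignment: share of imputations in
# which each individual falls in each cluster (one row per individual).
prob <- t(apply(assignments, 1, function(r) tabulate(r, nbins = k) / length(r)))

# Final assignment: the most frequent cluster for each individual.
final <- max.col(prob, ties.method = "first")

# Cohen's kappa between two label vectors with k possible labels.
cohen_kappa <- function(a, b, k) {
  tab <- table(factor(a, levels = seq_len(k)), factor(b, levels = seq_len(k)))
  n <- sum(tab)
  po <- sum(diag(tab)) / n                        # observed agreement
  pe <- sum(rowSums(tab) * colSums(tab)) / n^2    # agreement expected by chance
  (po - pe) / (1 - pe)
}

# Compare the first imputation with each of the remaining ones.
kappas <- sapply(2:ncol(assignments), function(j) {
  cohen_kappa(assignments[, 1], assignments[, j], k)
})
summary(kappas)
```

Kappa values near 1 indicate that the partition barely changes between imputations, while lower values signal that the imputations materially affect cluster membership.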
    ### see examples in miclust.