clusterMI is an R package to perform clustering with missing values. Missing values are addressed by multiple imputation. The package offers various multiple imputation methods dedicated to clustered individuals (Audigier, Niang, and Resche-Rigon (2021)). In addition, it allows pooling results both in terms of partition and of instability (Audigier and Niang (2022)). Among other applications, these functionalities can be used to choose a number of clusters in the presence of missing values.
The wine data set (Asuncion and Newman (2007)) is the result of a chemical analysis of 178 Italian wine samples from three different cultivars. Each wine is described by 13 continuous variables. This data set will be used to illustrate the clusterMI package. To this end, missing values will be added, and the cultivar variable will be omitted.
library(clusterMI) # provides the wine data set and the functions used below
require(stargazer)
set.seed(123456)
data(wine)
stargazer(wine, type = "text")
#>
#> ============================================
#> Statistic N Mean St. Dev. Min Max
#> --------------------------------------------
#> cult 178 1.900 0.780 1 3
#> alco 178 13.000 0.810 11.000 15.000
#> malic 178 2.300 1.100 0.740 5.800
#> ash 178 2.400 0.270 1.400 3.200
#> alca 178 19.000 3.300 11.000 30.000
#> mg 178 100.000 14.000 70 162
#> phe 178 2.300 0.630 0.980 3.900
#> fla 178 2.000 1.000 0.340 5.100
#> nfla 178 0.360 0.120 0.130 0.660
#> pro 178 1.600 0.570 0.410 3.600
#> col 178 5.100 2.300 1.300 13.000
#> hue 178 0.960 0.230 0.480 1.700
#> ratio 178 2.600 0.710 1.300 4.000
#> prol 178 747.000 315.000 278 1,680
#> --------------------------------------------
table(wine$cult)
#>
#> 1 2 3
#> 59 71 48
Missing values are artificially added according to a missing completely at random (MCAR) mechanism, so that each value of the data set is missing with a probability of 1/3 (independently of the values themselves).
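The chunk below is a minimal sketch of how such an incomplete data set (wine.na) could be generated in base R; the exact code may differ from the one used to produce the results shown later. The cultivar variable is kept aside as a reference partition (ref) used for comparisons at the end of this document.
ref <- wine$cult                             # reference partition, kept for later comparisons
wine.na <- wine[, colnames(wine) != "cult"]  # omit the cultivar variable
wine.na <- as.data.frame(lapply(wine.na, function(x) {
  x[runif(length(x)) < 1/3] <- NA            # MCAR: each value missing with probability 1/3
  x
}))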
The clusterMI package offers various multiple imputation methods dedicated to clustered individuals. They can be divided into two categories: joint modelling (JM) imputation and fully conditional specification (FCS). The first assumes a joint distribution for all variables, and imputation is performed using this model. The second proceeds variable by variable in a sequential manner, using regression models. FCS methods are more time consuming, but also more flexible, allowing a better fit of the imputation model. Multiple imputation is performed with the imputedata function. We start by presenting JM approaches in Section @ref(JM), while FCS approaches are presented in Section @ref(FCS).
The package proposes two JM methods: JM-GL and JM-DP. Both are based on a multivariate Gaussian mixture model. JM-GL is implemented in the mix package (Schafer (2022)). This method was initially dedicated to the imputation of mixed data, but it can be used here by considering the partition variable as a fully incomplete categorical variable. The method assumes a constant variance within each cluster (for continuous data).
JM-DP is a joint modelling method implemented in the R packages DPImputeCont (Kim (2020)), NPBayesImputeCat (Wang et al. (2022)) and MixedDataImpute (Murray and Reiter (2015)) for continuous, categorical and mixed data respectively. Such a method has the advantage of automatically determining the number of clusters (although this number needs to be bounded by nb.clust). Furthermore, it allows the covariance matrix to vary across clusters (for continuous data).
JM-GL is the default imputation method used in imputedata. To perform multiple imputation with this default method, we proceed as follows:
m <- 20 # Number of imputed data sets
nb.clust <- 3 # Number of clusters (here, the number of cultivars; see Section @ref(nbclust))
res.imp.JM <- imputedata(data.na = wine.na,
                         nb.clust = nb.clust,
                         m = m)
and we specify method = "JM-DP" for imputation using the other JM method.
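A minimal sketch of such a call (mirroring the previous chunk; the object name res.imp.DP is arbitrary):
res.imp.DP <- imputedata(data.na = wine.na,
                         method = "JM-DP",
                         nb.clust = nb.clust,
                         m = m)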
Both imputation methods rely on a data-augmentation algorithm alternating between data imputation and draws from a posterior distribution. The m imputed data sets are obtained by keeping one imputed data set every L iterations. The first Lstart iterations constitute a burn-in period, which is required to reach convergence to the posterior distribution (given the incomplete data). The L iterations between successive draws guarantee independence between imputed values.
Lstart and L can be checked by graphical investigation. To this end, we track the between inertia of each variable over successive iterations of the data-augmentation algorithm (available in the res.conv output of the imputedata function). In practice, we run imputation for a large number of imputed data sets m, keeping all intermediate imputed data sets, i.e. one at each iteration (L = 1) from the first (Lstart = 1):
res.imp.JM.conv <- imputedata(data.na = wine.na,
method = "JM-GL",
nb.clust = nb.clust,
m = 800,
Lstart = 1, # number of iterations for the burn-in period
L = 1 # number of iterations between each draw
)
and then plot the successive between-inertia values for the first four variables (as an example):
res.conv <- res.imp.JM.conv$res.conv
res.conv.ts <- ts(t(res.conv)) # conversion to a time-series object
plot(res.conv.ts[, 1:4]) # diagnostic for the first 4 variables
Here, convergence seems to be reached at 400 iterations, meaning
Lstart = 400
is a suitable choice.
Next, the number of iterations between successive draws can be
checked by visualising the autocorrelograms. Here, an autocorrelogram
represents the correlation between the vector of the successive between
inertia and its shifted version for several lags. We seek to find a lag
L
sufficiently large to avoid correlation between the
vectors of between inertia. Such graphics can be obtained as
follows:
Lstart <- 400
# extraction of summaries after Lstart iterations for the first 4 variables
res.conv.ts <- res.conv.ts[Lstart:nrow(res.conv.ts), 1:4]
apply(res.conv.ts, 2, acf)
Based on these graphics, L = 20 (the default value) seems sufficient. Thus, the imputation step can be rerun as follows:
Lstart <- 400
L <- 20
res.imp.JM <- imputedata(data.na = wine.na,
nb.clust = nb.clust,
Lstart = Lstart,
L = L,
m = m)
Note that the imputation also requires a pre-specified number of clusters (nb.clust). Here it is set to 3 (corresponding to the number of cultivars). We explain in Section @ref(nbclust) how it can be tuned. Furthermore, the number of imputed data sets is set to m = 20, which is generally enough (this choice is discussed in Section @ref(nbtab)).
Fully conditional specification methods perform variable-by-variable imputation. The two FCS imputation methods proposed are FCS-homo and FCS-hetero (Audigier, Niang, and Resche-Rigon (2021)). They essentially differ in the assumption about the covariance within each cluster (constant across clusters or cluster-specific, respectively).
To perform multiple imputation, we proceed as follows:
maxit <- 20 # Number of iterations for FCS imputation, should be larger in practice
res.imp.FCS <- imputedata(data.na = wine.na,
method = "FCS-homo",
nb.clust = nb.clust,
maxit = maxit,
m = m)
With FCS methods, the imputedata function alternates between cluster analysis and imputation given the partition of individuals. For the imputation step, the imputedata function calls the mice function from the mice R package (van Buuren and Groothuis-Oudshoorn (2011)). The mice package proposes various imputation methods. By default, imputedata uses the default method of mice (predictive mean matching for continuous data), but other methods can be specified through the method.mice argument.
For instance, for imputation under the normal model, use
imputedata(data.na = wine.na,
method = "FCS-homo",
nb.clust = nb.clust,
maxit = maxit,
m = m,
method.mice = "norm")
FCS-hetero allows imputation of continuous variables according to linear mixed models (various methods are available in the micemd R package; Audigier and Resche-Rigon (2023)). Furthermore, contrary to FCS-homo, FCS-hetero updates the partition without assuming a constant variance in each cluster.
FCS imputation consists of imputing each variable sequentially, several times. Many iterations can be required (maxit argument). To check convergence, the within and between inertia of each imputed variable can be plotted at each iteration, as proposed by the choosemaxit function; a minimal call is sketched below.
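Applied to the FCS imputation performed above (this mirrors the usage of choosemaxit shown later in this section):
# plot the within and between inertia of each variable across iterations
choosemaxit(res.imp.FCS)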
Note that by default, only the first five imputed data sets are plotted (corresponding to the number of curves drawn for each variable). The plotm argument can be tuned to modify which curves are drawn.
In this case, the number of iterations could potentially be increased. To do so, the imputation should be rerun with a larger maxit argument, as follows:
res.imp <- imputedata(data.na = wine.na,
method = "FCS-homo",
nb.clust = nb.clust,
maxit = 100,
m = m)
choosemaxit(res.imp)
For computational reasons, the convergence diagnostic can be performed with a smaller number of imputed data sets m. Once the number of iterations maxit has been chosen, multiple imputation can be rerun with a larger value of m.
Fully conditional imputation methods quickly become limited when the number of variables is large, since the imputation models tend to overfit. To address this issue, we can use penalised regression as proposed in mice, for instance by specifying method.mice = "lasso.norm". Another way consists in specifying the conditional imputation models through the predictmat argument. This argument is a binary matrix where each row indicates which explanatory variables (in columns) should be used for imputation.
varselbest procedure
To tune this matrix in an automatic way, the varselbest function performs variable selection following Bar-Hen and Audigier (2022). Briefly, varselbest performs variable selection on random subsets of variables and then combines the results to recover which explanatory variables are related to the response. More precisely, the outline of the algorithm is as follows: consider a random subset of sizeblock variables among the p variables. Then, any variable selection scheme can be applied (lasso, stepwise and knockoff are proposed, via the method.select argument). By resampling (B times) a subset of size sizeblock among the p variables, we can count how many times a variable is considered as significantly related to the response and how many times it is not. A threshold (r) is needed to conclude whether a given variable is significantly related to the response (by default, r = 0.3). The main advantage of this function is that it handles both missing values and high-dimensional data.
By default, the varselbest function performs variable selection by knockoff (Barber and Candès (2015)) based on B = 200 bootstrap subsets. The threshold r is set to 0.3, so that only very poorly predictive variables are omitted. The choices of B and r are discussed in the next sections.
Since the method is time consuming, the function allows parallel computing through the nnodes argument. In the next example, the imputation model for the variable alco is obtained using the algorithm previously described.
# Number of CPU cores used for parallel computation.
# Use parallel::detectCores() to choose an appropriate number.
nnodes <- 2
# variable selection to impute the "alco" variable
B <- 50 # number of bootstrap subsets, should be increased in practice
res.varsel <- varselbest(res.imputedata = res.imp.FCS, B = B, listvar = "alco",
nnodes = nnodes, graph = FALSE)
res.varsel$predictormatrix["alco", ]
#> alco malic ash alca mg phe fla nfla pro col hue ratio prol
#> 0 0 1 1 1 0 1 0 1 1 1 1 1
The function suggests using the variables ash, alca, mg, fla, pro, col, hue, ratio and prol to impute the alco variable. Imputation can then be rerun by specifying the predictmat argument with the predictor matrix returned by the varselbest function:
# multiple imputation with the new model
res.imp.select <- imputedata(data.na = wine.na,
method = "FCS-homo",
nb.clust = nb.clust,
maxit = maxit,
m = m,
predictmat = res.varsel$predictormatrix)
Note that to specify all conditional imputation models at once, variable selection should be run on all variables, as sketched below.
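A possible call (considerably more time consuming than selecting a single variable; here listvar is set to all variable names as an assumption about how to request every model):
# variable selection for every variable of the data set (time consuming)
res.varsel.all <- varselbest(res.imputedata = res.imp.FCS,
                             B = B,
                             listvar = colnames(wine.na),
                             nnodes = nnodes,
                             graph = FALSE)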
The number of iterations B should be large enough so that the proportion of times each variable is selected becomes stable. The chooseB function plots this proportion against the number of iterations, as sketched below.
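A minimal call, assuming chooseB accepts the varselbest output directly:
# plot the selection proportions against the number of iterations
res.B <- chooseB(res.varsel)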
Here, B = 50 iterations does not seem to be enough.
To tune the threshold r
, the vector of proportions can
be graphically investigated.
# check the variable importance
round(res.varsel$proportion["alco",], 2)
#> alco malic ash alca mg phe fla nfla pro col hue ratio prol
#> 0.00 0.26 0.36 0.45 0.33 0.10 0.33 0.14 0.45 0.95 0.71 0.60 0.36
barplot(sort(res.varsel$proportion["alco",], decreasing=TRUE),
ylab = "proportion",
main = "alco",
ylim = c(0, 1),
las = 2,
cex.names = .5)
r <- 0.2 # a new threshold value (r = 0.3 by default)
abline(h = r, col = 2, lty = 2)
Then, the predictor matrix can easily be updated by hand as follows:
predictormatrix <- res.varsel$predictormatrix
predictormatrix[res.varsel$proportion>r] <- 1
predictormatrix[res.varsel$proportion<=r] <- 0
predictormatrix["alco", ]
#> alco malic ash alca mg phe fla nfla pro col hue ratio prol
#> 0 1 1 1 1 0 1 0 1 1 1 1 1
By decreasing the threshold, more variables are used in the imputation model.
A more automatic way to tune r consists in using the chooser function. chooser computes an optimal threshold using K-fold cross-validation. The call to chooser is highly time consuming, but the optimal value of r can be followed during the process through graphical outputs. In this way, the user can stop the process early. This can be achieved as follows:
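A minimal sketch of such a call (the exact arguments of chooser may differ; here we assume it takes the varselbest output and the list of variables to inspect):
# cross-validated choice of the threshold r for the "alco" variable
res.r <- chooser(res.varsel = res.varsel,
                 listvar = "alco",
                 nnodes = nnodes)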
After multiple imputation, cluster analysis and partition pooling can be done through the clusterMI function (Audigier and Niang (2022)). Next, kmeans clustering is applied to the imputed data sets as an example; kmeans is the clustering method used by default.
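A minimal call is sketched below (assuming clusterMI takes the imputedata output as its first argument; here the JM imputed data sets are used):
# cluster analysis on each imputed data set (kmeans by default) and partition pooling
res.pool.kmeans <- clusterMI(res.imp.JM, nnodes = nnodes)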
The clusterMI function returns a consensus partition (part object) as well as an instability measure (instability object). The instability object gathers the instability of each contributory partition (U), their average (Ubar), the between instability (B) and the total instability (Tot).
part <- res.pool.kmeans$part
table(part) #compute cluster sizes
#> part
#> 1 2 3
#> 53 63 62
table(part, ref) #compare the partition with the reference partition
#> ref
#> part 1 2 3
#> 1 0 5 48
#> 2 2 61 0
#> 3 57 5 0
res.pool.kmeans$instability # look at instability measures
#> $U
#> [1] 0.0181 0.0095 0.0225 0.0182 0.0255 0.0191 0.0332 0.0271 0.0182 0.0194
#> [11] 0.0241 0.0170 0.0383 0.0180 0.0179 0.0230 0.0349 0.0310 0.0191 0.0280
#>
#> $Ubar
#> [1] 0.023
#>
#> $B
#> [1] 0.072
#>
#> $Tot
#> [1] 0.095
Among other clustering methods, clusterMI allows cluster analysis by k-medoids (method.clustering = "pam"), clustering large applications (method.clustering = "clara"), hierarchical clustering (method.clustering = "hclust"), fuzzy c-means (method.clustering = "cmeans"), or a model-based method (method.clustering = "mixture").
The user can also use custom clustering methods. For instance, to use
reduced k-means, as implemented in the R package clustrd
,
analysis and pooling can be achieved as follows:
library(clustrd)
res.ana.rkm <- lapply(res.imp.JM$res.imp,
FUN = cluspca,
nclus = nb.clust,
ndim = 2,
method = "RKM")
# extract the set of partitions (as list)
res.ana.rkm <- lapply(res.ana.rkm, "[[", "cluster")
# pooling by NMF
res.pool.rkm <- fastnmf(res.ana.rkm, nb.clust = nb.clust)
# extract the best solution based on several initialisations
part.rkm <- res.pool.rkm$best$clust
Note that in this case, the instability is not computed.
To check whether the imputation model correctly fits the data, a classical approach is overimputation (Blackwell, Honaker, and King (2015)), as proposed by the overimpute function. Overimputation consists of imputing each observed value several times (100 or more) and comparing the observed values with their imputed values.
Overimputation is a time-consuming process. To limit the time required, the user can use parallel computing (nnodes argument), restrict the individuals investigated (plotinds argument), or restrict the variables investigated (plotvars argument). In the next example, we use parallel computing and perform overimputation on the first variable (alco) for only 20 individuals (chosen at random).
# Multiple imputation is rerun with more imputed data sets (m = 100)
res.imp.over <- imputedata(data.na = wine.na,
nb.clust = nb.clust,
m = 100,
Lstart = Lstart,
L = L,
verbose = FALSE)
# selection of 20 complete individuals on variable "alco"
plotinds <- sample(which(!is.na(wine.na[, "alco"])),
size = 20)
res.over <- overimpute(res.imp.over,
nnodes = nnodes,
plotvars = "alco",
plotinds = plotinds)
The graphic represents the observed values (x-axis) versus the 90% prediction intervals (y-axis) for these values. Various colours are used according to the proportion of observed values used to build the interval (which depends on the missing-data pattern of each individual). If the conditional model fits the data well, then 90% of the intervals are expected to cross the first bisector (meaning the observed value lies within its prediction interval).
Here, the imputation model for the alco
variable does
not fit the observed data very well since the coverage is 25%. The fit
could be improved by investigating FCS imputation methods.
Number of imputed data sets (m)
The number of imputed data sets (m) should be sufficiently large to improve the partition accuracy. The choosem function can be used to check whether this number is suitable. This function computes the consensus partition obtained by considering only the first imputed data sets; in this way, a sequence of m consensus partitions is obtained. Then, the Rand index between successive partitions is computed and reported in a graph. The Rand index measures the proximity between two partitions. If the Rand index between the last consensus partitions of the sequence reaches its maximum value (1), then adding more imputed data sets does not modify the consensus partition, and their number can be considered as sufficiently large. A minimal call is sketched below.
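Assuming choosem takes the clusterMI output directly:
# plot the Rand index between successive consensus partitions
res.m <- choosem(res.pool.kmeans)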
Here, the Rand index is equal to 1 at m = 20 imputed data sets, meaning the consensus partition remains unchanged between m = 19 and m = 20, and probably beyond. Consequently, m = 20 was a suitable choice.
In practice, the number of clusters is generally unknown. A way to tune this number consists in inspecting the instability according to the number of clusters (Fang and Wang (2012)); the most stable partition can then be retained. The choosenbclust function browses a grid of values for the number of clusters and, for each one, imputes the data and computes the instability, as sketched below.
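A minimal call, assuming choosenbclust takes the clusterMI output directly (the grid of candidate numbers of clusters can be tuned through a dedicated argument):
# instability of the consensus partition for several numbers of clusters
res.nbclust <- choosenbclust(res.pool.kmeans)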
On the wine data set, the number of clusters suggested is clearly 3.
After building a partition with missing values, a description of each cluster can be performed variable per variable as follows:
require(reshape2)
require(ggplot2)
dat.m <- melt(data.frame(wine.na, part = as.factor(part)), id.var = c("part"))
ggplot(dat.m, aes(part, value, col = part)) +
facet_wrap(variable~., scales = "free_y") +
geom_boxplot(width = 0.7)
or by investigating the relationships between each pair of variables, for instance as follows:
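A minimal base-R sketch showing a scatter-plot matrix of the first four variables coloured by the consensus cluster:
pairs(wine.na[, c("alco", "malic", "ash", "alca")],
      col = part, # colour points by consensus cluster
      pch = 19)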
or from a multivariate point of view by using principal component analysis, as proposed in the R packages FactoMineR (Lê, Josse, and Husson (2008)) and missMDA (Josse and Husson (2016)):
library(FactoMineR)
library(missMDA)
# merge the partition variable with the incomplete data set
data.pca <- cbind.data.frame(class = factor(part, levels = seq(nb.clust)),
wine.na)
# perform PCA with missing values, specifying which column is the partition variable
res.imputepca <- imputePCA(data.pca, quali.sup = 1)
res.pca <- PCA(res.imputepca$completeObs, quali.sup = 1, graph = FALSE)
plot(res.pca, habillage = 1)
Finally, the consensus partition can be analysed by computing external clustering comparison indices, as proposed in the clusterCrit R package (Desgraupes (2023)), as follows:
library(clusterCrit)
res.crit <- extCriteria(part, ref, crit = "all")
round(unlist(res.crit), 2)
#> czekanowski_dice folkes_mallows hubert jaccard
#> 0.87 0.87 0.80 0.77
#> kulczynski mcnemar phi precision
#> 0.87 -2.73 0.00 0.86
#> rand recall rogers_tanimoto russel_rao
#> 0.91 0.88 0.84 0.29
#> sokal_sneath1 sokal_sneath2
#> 0.62 0.95