VarSelLCM performs full model selection (detection of the features relevant for clustering and selection of the number of clusters) in model-based clustering, according to classical information criteria (BIC, MICL or AIC).
The data to be analyzed can be composed of continuous, integer and/or categorical features. Moreover, missing values are handled, without any pre-processing, by the model used for clustering, under the assumption that values are missing completely at random. Thus, VarSelLCM can also be used for data imputation via mixture models.
An R-Shiny application is implemented to make the clustering results easy to interpret.
This section performs the whole analysis of the Heart data set. Warning: the univariate marginal distributions are determined by the class of each feature: numeric columns imply Gaussian distributions, integer columns imply Poisson distributions, and factor (or ordered) columns imply multinomial distributions.
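Because the class of each column drives the choice of distribution, it can help to inspect the classes and coerce them where needed before clustering. The following is a minimal base-R sketch on a small made-up data frame; the column names and values are hypothetical and are not taken from the Heart data.

```r
# Hypothetical toy data: every column starts out as numeric (double).
toy <- data.frame(
  pressure = c(130.5, 115.0, 124.2),  # continuous measurement
  age      = c(70, 67, 57),           # count-like variable stored as double
  sex      = c(1, 0, 1)               # category coded with numbers
)

# Inspect the class of each column; all three are "numeric" at this point.
classes_before <- sapply(toy, class)

# Coerce each column to the class that matches the intended distribution.
toy$age <- as.integer(toy$age)  # integer column
toy$sex <- factor(toy$sex)      # factor column

classes_after <- sapply(toy, class)
print(classes_after)
```

Checking `sapply(x, class)` on the real data before fitting is a cheap way to confirm that no categorical feature has silently been read as numeric.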
library(VarSelLCM)
Attaching package: 'VarSelLCM'
The following object is masked from 'package:stats':
predict
# Data loading:
# x contains the observed variables
# ztrue contains the known status (i.e. 1: absence and 2: presence of heart disease)
data(heart)
ztrue <- heart[,"Class"]
x <- heart[,-13]
# Add a missing value artificially (just to show that it works!)
x[1,1] <- NA
Clustering is performed both without and with variable selection. Model selection is done with BIC because the number of observations is large compared to the number of features. The number of components is between 1 and 3. Do not hesitate to use parallelization (here only two cores are used).
# Cluster analysis without variable selection
res_without <- VarSelCluster(x, gvals = 1:3, vbleSelec = FALSE, crit.varsel = "BIC")
# Cluster analysis with variable selection (with parallelisation)
res_with <- VarSelCluster(x, gvals = 1:3, nbcores = 2, crit.varsel = "BIC")
Comparison of the BIC for both models (without, then with, variable selection): variable selection improves the BIC.
[1] -6516.216
[1] -6509.506
Comparison of the partition accuracy. The Adjusted Rand Index (ARI) is computed between the true partition (ztrue) and its estimators. The ARI equals 1 when the two partitions are identical and is close to 0 when they are independent. Variable selection improves the ARI. Note that the ARI cannot be used for model selection in clustering, because the true partition is unknown in practice.
[1] 0.2218655
[1] 0.2661321
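To make the index concrete, here is a minimal base-R sketch of the Adjusted Rand Index computed from a contingency table. The function name `adjusted_rand_index` is hypothetical and this is not necessarily how the package computes its value; it only illustrates the standard definition.

```r
# Adjusted Rand Index between two label vectors of equal length.
# Note: the result is NaN when both partitions put everything in one cluster.
adjusted_rand_index <- function(a, b) {
  stopifnot(length(a) == length(b))
  tab <- table(a, b)                          # contingency table of the labels
  choose2 <- function(k) k * (k - 1) / 2      # "k choose 2", vectorised
  sum_cells <- sum(choose2(tab))              # pairs grouped together in both
  sum_rows  <- sum(choose2(rowSums(tab)))     # pairs grouped together in a
  sum_cols  <- sum(choose2(colSums(tab)))     # pairs grouped together in b
  expected  <- sum_rows * sum_cols / choose2(length(a))
  maximum   <- (sum_rows + sum_cols) / 2
  (sum_cells - expected) / (maximum - expected)
}

# Same partition up to a relabelling of the clusters.
ari_same <- adjusted_rand_index(c(1, 1, 2, 2), c(2, 2, 1, 1))
# Two partitions that cross each other completely.
ari_crossed <- adjusted_rand_index(c(1, 1, 2, 2), c(1, 2, 1, 2))
print(c(ari_same, ari_crossed))
```

The second call shows that the index can fall below 0 when two partitions agree less than expected by chance.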
To obtain the partition and the classification probabilities (first six observations):
[1] 1 1 2 1 1 1 1 1 1 1 1 1 2 1 2 1 1 1 1 2 1 2 2 2 2 2 1 2 1 1 1 1 2 1 1 1 1
[38] 1 2 2 2 2 2 2 1 2 1 2 1 1 1 2 2 2 2 2 1 1 1 1 2 1 2 2 1 1 2 2 2 2 1 2 2 1
[75] 1 1 1 2 2 2 1 1 1 2 1 2 2 1 2 1 2 2 1 1 2 1 1 1 1 2 2 1 2 1 1 1 1 1 1 2 1
[112] 2 2 2 2 2 2 1 1 1 1 1 1 2 2 2 1 2 2 1 1 1 2 1 2 2 2 1 2 1 1 2 1 1 2 1 2 1
[149] 2 2 2 2 2 1 2 2 1 2 1 1 1 1 2 1 2 1 2 2 1 1 1 1 1 1 2 1 1 2 1 2 2 1 2 1 2
[186] 2 1 1 2 1 2 1 2 2 2 2 1 2 1 1 1 1 1 1 1 2 2 1 1 2 1 2 2 1 2 2 2 1 1 2 1 1
[223] 2 1 2 1 1 1 2 1 1 1 2 1 1 1 2 1 2 2 1 2 1 1 2 1 1 2 2 1 1 2 2 1 2 2 1 1 2
[260] 2 2 1 2 2 2 2 2 2 1 1
class-1 class-2
[1,] 0.9999917 8.261350e-06
[2,] 0.6334731 3.665269e-01
[3,] 0.1755360 8.244640e-01
[4,] 1.0000000 4.442974e-08
[5,] 0.9961153 3.884667e-03
[6,] 0.9547843 4.521572e-02
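The link between the two outputs can be checked by hand. The sketch below assumes that the hard partition is the maximum a posteriori assignment (each observation goes to its most probable class), which is the usual convention but is not stated explicitly here. The matrix `post` simply copies the six rows printed above.

```r
# Posterior probabilities copied from the six rows printed above.
post <- matrix(
  c(0.9999917, 8.261350e-06,
    0.6334731, 3.665269e-01,
    0.1755360, 8.244640e-01,
    1.0000000, 4.442974e-08,
    0.9961153, 3.884667e-03,
    0.9547843, 4.521572e-02),
  ncol = 2, byrow = TRUE,
  dimnames = list(NULL, c("class-1", "class-2"))
)

# Assumed rule: assign each row to the column with the largest probability.
map_labels <- max.col(post, ties.method = "first")
print(map_labels)
```

Under that assumption the result agrees with the first six labels of the partition shown above.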
To get a summary of the selected model.
Model:
Number of components: 2
Model selection has been performed according to the BIC criterion
Variable selection has been performed, 8 ( 66.67 % ) of the variables are relevant for clustering
Discriminative power of the variables (here, the most discriminative variable is MaxHeartRate). The greater this index, the more the variable distinguishes the clusters.
Distribution of the most discriminative variable per cluster
Empirical and theoretical distributions of the most discriminative variable (to check that the distribution is well-fitted)
# Empirical and theoretical distributions (to check that the distribution is well-fitted)
plot(res_with, y="MaxHeartRate", type="cdf")
Distribution of a categorical variable per cluster
To get details about the selected model
Data set:
Number of individuals: 270
Number of continuous variables: 3
Number of count variables: 1
Percentile of missing values for the integer variables: 0.37
Number of categorical variables: 8
Model:
Number of components: 2
Model selection has been performed according to the BIC criterion
Variable selection has been performed, 8 ( 66.67 % ) of the variables are relevant for clustering
Information Criteria:
loglike: -6403.136
AIC: -6441.136
BIC: -6509.506
ICL: -6638.116
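The printed criteria can be cross-checked with a little arithmetic. The sketch below assumes the conventions AIC = loglike - nu and BIC = loglike - (nu / 2) * log(n), where nu is the number of free parameters. Neither formula nor the value of nu is stated in this output; nu is backed out from the printed numbers.

```r
loglike     <- -6403.136  # printed log-likelihood
aic_printed <- -6441.136  # printed AIC
bic_printed <- -6509.506  # printed BIC
n           <- 270        # printed number of individuals

# Assumption: AIC = loglike - nu, so nu can be recovered from the printout.
nu <- round(loglike - aic_printed)

# Assumption: BIC = loglike - (nu / 2) * log(n).
bic_check <- loglike - nu / 2 * log(n)
print(c(nu = nu, bic_check = bic_check))
```

If these conventions hold, larger values of a criterion are better, which is consistent with the earlier statement that the BIC of the model with variable selection is an improvement.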
To print the parameters
An object of class "VSLCMparam"
Slot "pi":
class-1 class-2
0.5221145 0.4778855
Slot "paramContinuous":
An object of class "VSLCMparamContinuous"
Slot "pi":
numeric(0)
Slot "mu":
class-1 class-2
RestBloodPressure 131.3444 131.3444
SerumCholestoral 249.6593 249.6593
MaxHeartRate 135.4167 165.2587
Slot "sd":
class-1 class-2
RestBloodPressure 17.82850 17.82850
SerumCholestoral 51.59043 51.59043
MaxHeartRate 20.98140 13.14845
Slot "paramInteger":
An object of class "VSLCMparamInteger"
Slot "pi":
numeric(0)
Slot "lambda":
class-1 class-2
Age 58.11336 50.32059
Slot "paramCategorical":
An object of class "VSLCMparamCategorical"
Slot "pi":
numeric(0)
Slot "alpha":
$Sex
0 1
class-1 0.2358080 0.7641920
class-2 0.4166342 0.5833658
$ChestPainType
1 2 3 4
class-1 0.08922390 0.03291642 0.1738648 0.7039949
class-2 0.05752211 0.28954511 0.4223088 0.2306240
$FastingBloodSugar
0 1
class-1 0.8518519 0.1481481
class-2 0.8518519 0.1481481
$ResElectrocardiographic
0 1 2
class-1 0.4851852 0.007407407 0.5074074
class-2 0.4851852 0.007407407 0.5074074
$ExerciseInduced
0 1
class-1 0.4484677 0.55153229
class-2 0.9128104 0.08718958
$Slope
1 2 3
class-1 0.2266448 0.6884261 0.08492909
class-2 0.7599036 0.1933824 0.04671403
$MajorVessels
0 1 2 3
class-1 0.4104437 0.2830465 0.17928526 0.127224584
class-2 0.7915996 0.1402682 0.05987792 0.008254222
$Thal
3 6 7
class-1 0.3183113 9.931125e-02 0.5823775
class-2 0.8302575 1.682043e-08 0.1697425
Probabilities of classification for new observations
class-1 class-2
[1,] 0.9999914 8.635437e-06
[2,] 0.6231309 3.768691e-01
[3,] 0.1692185 8.307815e-01
The model can be used for imputation (of the clustered data or of a new observation)
# Imputation by sampling (method = "sampling") for the first observation
not.imputed <- x[1,]
imputed <- VarSelImputation(res_with, x[1,], method = "sampling")
rbind(not.imputed, imputed)
Age Sex ChestPainType RestBloodPressure SerumCholestoral FastingBloodSugar
1 NA 1 4 130 322 0
2 50 1 4 130 322 0
ResElectrocardiographic MaxHeartRate ExerciseInduced Slope MajorVessels Thal
1 2 109 0 2 3 3
2 2 109 0 2 3 3
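Two common ways to impute from a mixture model are the posterior mean and sampling. The sketch below illustrates both for the missing Age value, reusing the posterior probabilities of the first observation and the Poisson means of Age printed earlier. It assumes a simple scheme (a probability-weighted mean, or a class draw followed by a Poisson draw); the package's internal implementation may differ, and the seed is arbitrary.

```r
# Posterior class probabilities of the first observation (printed earlier).
post_1 <- c(0.9999917, 8.261350e-06)
# Class-specific Poisson means for Age (slot "lambda" printed earlier).
lambda_age <- c(58.11336, 50.32059)

# Posterior-mean imputation: probability-weighted average of the class means.
age_postmean <- sum(post_1 * lambda_age)

# Sampling-based imputation: draw a class, then draw Age from its Poisson law.
set.seed(1)  # arbitrary seed, only to make the sketch reproducible
k <- sample(1:2, size = 1, prob = post_1)
age_sampled <- rpois(1, lambda_age[k])

print(age_postmean)
```

A posterior mean is generally not an integer, whereas a sampled value for a count variable always is, which matches the integer Age in the imputed row above.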