`groupsparsewkm` is designed for clustering numerical data described by a set of features, while simultaneously selecting the most discriminant groups of features. The groups of features are supposed to be known a priori and are provided as an argument to the function. This implementation generalizes the sparse k-means algorithm introduced in Witten and Tibshirani (2010), and is based on the optimization of a penalized weighted between-class variance criterion. For more technical details on the penalty term and on the optimization procedure, one may refer to Chavent and Olteanu (2020).

Several arguments may be passed to `groupsparsewkm`, but only the first two are required:
- `X` is the numerical data that will be clustered. It should have the format of a matrix or a data frame, and all entries should be numerical features. Only the features one would include in the clustering should be present. Column and row names may be supplied to ease the interpretation;
- `centers` is the number of clusters to be computed.

The rest of the arguments are related to the choice of the regularization parameter, the prior partitioning of the features into groups, the number of iterations and random starts in the algorithm, and whether the data should be scaled. Default values are provided for these parameters; see `help(groupsparsewkm)` for further details.
The `groupsparsewkm` function returns an object of class `spwkm` (see `help(groupsparsewkm)` for further details on this class).
Mice dataset

The `DataMice` dataset consists of repeated measurements of 68 proteins on a sample of 72 mice (12 or 15 values independently measured for each protein). The data was first described in Higuera et al. (2015), and it was processed here in order to discard some proteins and measurements containing missing data.
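The first measurements of Protein 1, for the first ten mice, may be displayed as follows (a minimal sketch, assuming the `vimpclust` package, which provides `groupsparsewkm` and `DataMice`, is attached):

library(vimpclust)        # assumed to provide groupsparsewkm and DataMice
head(DataMice[, 1:4], 10) # first four measurement columns, first ten mice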
## Protein_1_Meas_1 Protein_1_Meas_2 Protein_1_Meas_3 Protein_1_Meas_4
## 1 0.50 0.51 0.51 0.44
## 2 0.74 0.71 0.70 0.68
## 3 0.63 0.65 0.64 0.57
## 4 0.56 0.52 0.51 0.51
## 5 0.38 0.39 0.40 0.38
## 6 0.65 0.62 0.64 0.58
## 7 0.45 0.46 0.45 0.36
## 8 0.41 0.40 0.43 0.37
## 9 0.74 0.76 0.76 0.67
## 10 0.78 0.78 0.77 0.65
The data may be split a priori as follows:

- according to `Genotype`, 38 mice are in the control (c) group, and 34 in the trisomic (t) group;
- according to `Behaviour`, 35 mice were stimulated to learn (CS), and 37 were not (SC);
- according to `Treatment`, 38 mice received the treatment (m), and 34 did not (s).

When combining all this information, the data may be split a priori into 8 clusters.
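The frequencies of these clusters may be displayed as follows (a sketch, assuming the combined labels are stored in the `Class.mouse` column, which is used again at the end of this section):

table(DataMice$Class.mouse) # frequencies of the 8 prior classes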
## c-CS-m c-CS-s c-SC-m c-SC-s t-CS-m t-CS-s t-SC-m t-SC-s
## 10 9 10 9 9 7 9 9
Further details about this dataset may be found with `help(DataMice)`.
The groupsparsewkm function

For this dataset, the groups of features may be naturally defined: each protein represents a group, and the measurements associated to it represent the features of the group. The `index` vector, containing the index of the group associated to each feature, is created using the column names of `DataMice`.
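For instance, the first twenty column names:

names(DataMice)[1:20] # column names encode the protein (group) and the measurement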
## [1] "Protein_1_Meas_1" "Protein_1_Meas_2" "Protein_1_Meas_3"
## [4] "Protein_1_Meas_4" "Protein_1_Meas_5" "Protein_1_Meas_6"
## [7] "Protein_1_Meas_7" "Protein_1_Meas_8" "Protein_1_Meas_9"
## [10] "Protein_1_Meas_10" "Protein_1_Meas_11" "Protein_1_Meas_12"
## [13] "Protein_2_Meas_1" "Protein_2_Meas_2" "Protein_2_Meas_3"
## [16] "Protein_2_Meas_4" "Protein_2_Meas_5" "Protein_2_Meas_6"
## [19] "Protein_2_Meas_7" "Protein_2_Meas_8"
# split the column names "Protein_k_Meas_j" on "_" (the last 5 columns contain the prior class features)
index <- unlist(strsplit(names(DataMice)[1:(dim(DataMice)[2]-5)], split="_"))
# keep the second of every four tokens, i.e. the protein number k, as the group index
index <- as.numeric(index[(1:length(index)%%4==2)])
The number of clusters, `centers`, is fixed to 8, based on the additional prior knowledge of the dataset described above. Although there is no reason for an unsupervised method to retrieve the partitioning defined by some supplementary features, comparing the unsupervised clustering with a prior partitioning is interesting for illustration.

All features are scaled to have zero mean and unit variance with the default setting of the `groupsparsewkm` function.
set.seed(1407)
res.mice <- groupsparsewkm(X = DataMice[,(1:length(index))], centers = 8, index = index, verbose = 1)
With `verbose = 1`, the progress of the algorithm is printed for each value of the regularization parameter; here, the algorithm converged for all values of `lambda`. In some cases, the stopping criterion may not be satisfied and the convergence over the weights `w` might not be achieved. If this were the case, one should increase the maximum number of iterations `itermaxw`, which is set to 20 by default. Note however that a larger value for this parameter increases the computational time, since a full k-means algorithm is trained at each iteration.
The weights associated to each group of features may be found in the matrix `Wg`. These weights illustrate the contribution of each group of features to the clustering and may be seen as a measure of the importance of each group for the clustering. Each column of `Wg` contains the weights computed for a given value of the regularization parameter `lambda`. The default setting of the `groupsparsewkm` function selects 20 values for the regularization parameter `lambda`, chosen uniformly between 0 (no regularization) and a maximum value tuned automatically.
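For instance, the weights of the first twenty groups, for the first five values of `lambda`, may be displayed as follows (a sketch, assuming the group weights are stored in the `Wg` component of the fitted object):

res.mice$Wg[1:20, 1:5] # group weights for the first five values of lambda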
## Lambda 1 Lambda 2 Lambda 3 Lambda 4 Lambda 5
## Group 1 0.184 0.189 0.195 0.203 0.211
## Group 2 0.175 0.180 0.185 0.191 0.199
## Group 3 0.128 0.129 0.129 0.130 0.130
## Group 4 0.109 0.108 0.106 0.104 0.102
## Group 5 0.133 0.134 0.135 0.136 0.137
## Group 6 0.112 0.111 0.110 0.108 0.106
## Group 7 0.103 0.102 0.100 0.097 0.094
## Group 8 0.065 0.060 0.054 0.047 0.038
## Group 9 0.071 0.066 0.061 0.055 0.047
## Group 10 0.150 0.153 0.155 0.159 0.162
## Group 11 0.178 0.183 0.188 0.195 0.203
## Group 12 0.139 0.140 0.142 0.144 0.146
## Group 13 0.124 0.124 0.124 0.124 0.123
## Group 14 0.111 0.110 0.109 0.108 0.106
## Group 15 0.103 0.102 0.100 0.097 0.094
## Group 16 0.108 0.107 0.106 0.104 0.101
## Group 17 0.126 0.127 0.127 0.127 0.127
## Group 18 0.134 0.135 0.136 0.137 0.139
## Group 19 0.096 0.094 0.091 0.088 0.084
## Group 20 0.129 0.130 0.130 0.131 0.132
Each row of `Wg` contains the weights associated to a group of features for all the values of the regularization parameter, the so-called regularization paths. These may be further illustrated graphically, by calling the `plot` function for the `spwkm` class (see `help(plot.spwkm)` for more details).
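A minimal sketch (the value of the `what` argument below is an assumption about the option name; the exact options are documented in `help(plot.spwkm)`):

plot(res.mice, what="weights.groups") # regularization paths of the group weights (assumed option name)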
For the `Mice` data, one may see that four proteins in particular (associated to groups 1, 2, 11 and 21) appear as the most discriminant as the regularization parameter increases. Other proteins, such as the one associated to group 30, have an interesting behaviour: their weight becomes large for large values of the regularization term, thus for a heavy penalty. Indeed, at the 13th value of `lambda`, a significant drop in the number of selected groups occurs, from 41 selected groups to 13. It is at this point that group 30 becomes extremely significant for the clustering.
If one wants to look into more detail and assess the respective contribution of each feature, one may look at the matrix `W`, which contains the weights associated to each feature. These weights may be read as the relative importance of each feature for the clustering. Depending on the number of features in each group, `W` may potentially be a large matrix, and one may want to focus on the features belonging to groups with non-zero weights.
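For instance, the weights of the features of the first two proteins, for the first five values of `lambda` (a sketch, assuming the feature weights are stored in the `W` component of the fitted object):

res.mice$W[1:20, 1:5] # feature weights for the first five values of lambda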
## Lambda 1 Lambda 2 Lambda 3 Lambda 4 Lambda 5
## Protein_1_Meas_1 0.056 0.058 0.060 0.062 0.065
## Protein_1_Meas_2 0.052 0.054 0.055 0.057 0.060
## Protein_1_Meas_3 0.051 0.053 0.055 0.057 0.059
## Protein_1_Meas_4 0.054 0.056 0.058 0.060 0.062
## Protein_1_Meas_5 0.054 0.056 0.057 0.060 0.062
## Protein_1_Meas_6 0.053 0.055 0.056 0.058 0.061
## Protein_1_Meas_7 0.047 0.049 0.050 0.052 0.055
## Protein_1_Meas_8 0.050 0.052 0.053 0.055 0.058
## Protein_1_Meas_9 0.053 0.055 0.057 0.059 0.061
## Protein_1_Meas_10 0.052 0.054 0.056 0.058 0.060
## Protein_1_Meas_11 0.055 0.057 0.059 0.061 0.064
## Protein_1_Meas_12 0.057 0.058 0.060 0.062 0.065
## Protein_2_Meas_1 0.053 0.054 0.056 0.058 0.060
## Protein_2_Meas_2 0.055 0.057 0.058 0.060 0.062
## Protein_2_Meas_3 0.055 0.056 0.058 0.060 0.062
## Protein_2_Meas_4 0.049 0.050 0.051 0.053 0.055
## Protein_2_Meas_5 0.050 0.052 0.053 0.055 0.057
## Protein_2_Meas_6 0.051 0.052 0.054 0.056 0.058
## Protein_2_Meas_7 0.046 0.047 0.048 0.050 0.052
## Protein_2_Meas_8 0.048 0.050 0.051 0.053 0.055
The regularization path of each feature may also be illustrated using the `plot` function, as sketched below. For the `Mice` dataset, one may easily see that the features within each group are quite redundant (let us recall that each group is made of repeated measurements of the same protein), their regularization paths being very similar.
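A sketch, under the same assumption on the option names:

plot(res.mice, what="weights.features") # regularization paths of the feature weights (assumed option name)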
By specifying the supplementary argument `Which` in the `plot` function, one may focus on the regularization paths of specific groups of features, represented either as the regularization path of the group, or as the regularization paths of the corresponding features. Below, proteins 1, 2 and 30 were selected for illustration.
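A sketch, under the same assumptions on the option names:

plot(res.mice, what="weights.groups", Which=c(1, 2, 30))   # paths of groups 1, 2 and 30
plot(res.mice, what="weights.features", Which=c(1, 2, 30)) # paths of the corresponding features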
A valuable piece of information is given by the number of selected groups or the number of selected features for a given value of the regularization parameter `lambda`. The evolution of the number of selected groups and features may be graphically illustrated as follows:
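(A sketch, assuming the corresponding options of the `what` argument are named "sel.groups" and "sel.features".)

plot(res.mice, what="sel.groups")   # number of selected groups versus lambda (assumed option name)
plot(res.mice, what="sel.features") # number of selected features versus lambda (assumed option name)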
Since the measurements for each protein are quite redundant, and the number of measurements per protein quite similar, the curves representing the evolution of the selected groups and of the selected features are very similar. A significant drop may be noticed after the 12th value of `lambda`, where only 13 proteins among the 68 are preserved for the clustering.

Besides the number of selected groups or the number of selected features, the evolution of some criteria related to the quality of the clustering is equally important for understanding and commenting on the results.
For example, using the argument `what="expl.var"` in the `plot` function, one may illustrate the evolution of the explained variance as a function of the regularization parameter `lambda`. The explained variance is computed as the ratio between the between-class variance and the global variance in the data, and represents the proportion of information explained by the clustering. Let us remark that this criterion is independent of the sparse algorithm trained here, which maximizes the weighted between-class sum-of-squares penalized by a group-penalty term. The explained variance, computed on all features and without applying any weights, illustrates how the information is preserved when an increasing number of features is discarded.
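Using the option mentioned above, the call is simply:

plot(res.mice, what="expl.var") # explained variance versus lambda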
The number-of-selected-groups curve and the explained-variance curve may be used to select an appropriate regularization parameter `lambda` for the clustering. A good choice of `lambda` should preserve a high percentage of variance explained by the clustering, while discarding a large number of features. This amounts to a trade-off between the quality of the model and its parsimony. With this in mind, one may easily see that by selecting `lambda=0.45`, the explained variance remains very close to that obtained with all features, whereas the number of groups and the number of features are reduced by a third. For `lambda=0.48`, if one accepts to “lose” a third of the explained variance (while remaining above 30%), the number of groups and the number of features may be reduced by more than 80% (13 groups are preserved among 68, and 171 features among 900). Hence, one may use this algorithm to drastically reduce the dimensionality of the data, while still preserving a significant clustering.
Another graphical option is the illustration of the weighted explained variance, where the weights computed by the algorithm are applied to the features when computing the between-class and the total sum-of-squares. In this case, since the criterion takes into account the most discriminant features only, it increases with the penalty, and a significant jump may be seen at the same spot as for the explained variance, except that here the weighted explained variance improves by more than 20%.
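A sketch, assuming this criterion is selected through a `what="w.expl.var"` option (an assumption; see `help(plot.spwkm)` for the exact option names):

plot(res.mice, what="w.expl.var") # weighted explained variance versus lambda (assumed option name)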
Other criteria, such as the gap statistic, could be efficiently used for selecting the regularization parameter (and also the number of clusters). They are not implemented here (yet!), but may be easily retrieved from other packages and combined with `spwkm` objects.
As mentioned above, the `DataMice` observations are known to belong to clusters defined a priori by the `Genotype`, the `Treatment` or the `Behaviour`. In order to compare the clusterings produced by the group-sparse algorithm (for the various values of `lambda`) with these prior clusters, the Adjusted Rand Index (ARI) is computed here below.
sapply(1:length(res.mice$lambda), function(x) {mclust::adjustedRandIndex(res.mice$cluster[,x],DataMice$Class.mouse)})
## [1] 0.152 0.152 0.152 0.152 0.152 0.152 0.152 0.152 0.152 0.133 0.133 0.097
## [13] 0.340 0.211 0.229 0.229 0.171 0.171 0.174 0.192
sapply(1:length(res.mice$lambda), function(x) {mclust::adjustedRandIndex(res.mice$cluster[,x],DataMice$Genotype)})
## [1] 0.0405 0.0405 0.0405 0.0405 0.0405 0.0405 0.0405 0.0405 0.0405
## [10] 0.0105 0.0105 -0.0043 0.0670 0.0274 0.0275 0.0330 0.0256 0.0256
## [19] 0.0490 0.0584
sapply(1:length(res.mice$lambda), function(x) {mclust::adjustedRandIndex(res.mice$cluster[,x],DataMice$Treatment)})
## [1] 0.0058 0.0058 0.0058 0.0058 0.0058 0.0058 0.0058 0.0058 0.0058
## [10] 0.0089 0.0089 -0.0043 0.0433 0.0243 0.0243 0.0062 -0.0091 -0.0091
## [19] -0.0062 -0.0047
sapply(1:length(res.mice$lambda), function(x) {mclust::adjustedRandIndex(res.mice$cluster[,x],DataMice$Behaviour)})
## [1] 0.18 0.18 0.18 0.18 0.18 0.18 0.18 0.18 0.18 0.21 0.21 0.20 0.26 0.28 0.28
## [16] 0.28 0.26 0.26 0.24 0.24
According to the above values, the 8 clusters computed with the group-sparse k-means algorithm are not much related to the 8 clusters defined a priori. As already mentioned, there is no prior reason for a correlation between the clustering output and the partitioning defined by the `Genotype`, the `Treatment` and the `Behaviour`. The clusters identified by the algorithm may correspond to a completely different structure in the data. Nevertheless, we should mention here that the proteins identified by the algorithm as the most discriminant, or as having significant weights for all values of `lambda` (groups 1, 2, 10, 11, 21, 25, 30, 32, 68), correspond to those identified in Higuera et al. (2015) as discriminant for `Genotype`, `Treatment` or `Behaviour`.
Furthermore, thanks to the group approach, the algorithm implemented in `groupsparsewkm` has the advantage of fully selecting or discarding a protein together with all its associated measurements. Group-sparse clustering thus offers a complete approach for both clustering and selecting the most discriminant groups of features.