Group-sparse weighted k-means for numerical data

Basic function description

groupsparsewkm is designed for clustering numerical data described by a set of features, while simultaneously selecting the most discriminant groups of features. The groups of features are supposed priorly known and provided as an argument to the function. This implementation generalizes the sparse k-means algorithm introduced in Witten and Tibshirani (2010), and is based on the optimization of a penalized weighted between-class variance criterion. For more technical details on the penalty term and on the optimization procedure, one may refer to M. Chavent and Olteanu (2020).

Arguments

Several arguments may be passed to groupsparsewkm, but only the first two are required:

  • X is the numerical data that will be clustered. It should have the format of a matrix or a data frame, and all entries should be numerical features. Only the features one would include in the clustering should be present. Column and row names may be supplied to ease the interpretation;
  • centers is the number of clusters to be computed.

The rest of the arguments are related to the choices of the regularization parameter, the prior partitioning of the features into groups, the number of iterations and random starts in the algoritm or the fact of scaling the data. Default values are fixed for these parameters, one may see help(groupsparsewkm) for further details.

Output

The groupsparsewkm function returns an object of class spwkm (see help(groupsparsewkm) for further details on this class).

A case study: Mice dataset

The DataMice dataset consists of repeated measurements of 68 proteins on a sample of 72 mice (12 or 15 values independently measured for each protein). The data was first described in C. Higuera and Cios (2015), and it was processed here in order to discard some proteins and measurements containing missing data.

library(vimpclust)
data(DataMice)
DataMice[1:10, 1:4]
##    Protein_1_Meas_1 Protein_1_Meas_2 Protein_1_Meas_3 Protein_1_Meas_4
## 1              0.50             0.51             0.51             0.44
## 2              0.74             0.71             0.70             0.68
## 3              0.63             0.65             0.64             0.57
## 4              0.56             0.52             0.51             0.51
## 5              0.38             0.39             0.40             0.38
## 6              0.65             0.62             0.64             0.58
## 7              0.45             0.46             0.45             0.36
## 8              0.41             0.40             0.43             0.37
## 9              0.74             0.76             0.76             0.67
## 10             0.78             0.78             0.77             0.65

The data may be priorly split as follows:

  • According to the Genotype, 38 mice are in the control (c) group, and 34 in the trisomic (t) group;
  • According to the Behavior, 35 mice were stimulated to learn (CS), and 37 were not (SC);
  • According to the Treatment, 38 mice received one (m), and 34 did not (s).

When mixing all the information, the data may be priorly split into 8 clusters, with the following frequencies:

summary(DataMice$Class.mouse)
## c-CS-m c-CS-s c-SC-m c-SC-s t-CS-m t-CS-s t-SC-m t-SC-s 
##     10      9     10      9      9      7      9      9

Further details about this dataset may be found with help(DataMice).

Training the groupsparsewkm function

For this dataset, the groups of features may be naturally defined: each protein represents a group, and the measurements associated to it represent the features of the group. The index vector, containing the index of the group associated to each feature, is created using the column names of DataMice.

names(DataMice)[1:20]
##  [1] "Protein_1_Meas_1"  "Protein_1_Meas_2"  "Protein_1_Meas_3" 
##  [4] "Protein_1_Meas_4"  "Protein_1_Meas_5"  "Protein_1_Meas_6" 
##  [7] "Protein_1_Meas_7"  "Protein_1_Meas_8"  "Protein_1_Meas_9" 
## [10] "Protein_1_Meas_10" "Protein_1_Meas_11" "Protein_1_Meas_12"
## [13] "Protein_2_Meas_1"  "Protein_2_Meas_2"  "Protein_2_Meas_3" 
## [16] "Protein_2_Meas_4"  "Protein_2_Meas_5"  "Protein_2_Meas_6" 
## [19] "Protein_2_Meas_7"  "Protein_2_Meas_8"
index <- unlist(strsplit(names(DataMice)[1:(dim(DataMice)[2]-5)], split="_"))
index <- as.numeric(index[(1:length(index)%%4==2)])

The number of clusters, centers is fixed to 8, based on the additional prior knowledge on the dataset that was described above. Although there is no reason for an unsupervised method to retrieve the partitioning defined by some supplementary features, the comparison of the unsupervised clustering with a prior partitioning is interesting for illustration.

All features are scaled to have zero mean and unit variance with the default setting of the groupsparsewkm function.

set.seed(1407)
res.mice <- groupsparsewkm(X = DataMice[,(1:length(index))], centers = 8, index = index, verbose = 1)

According to the above, the algorithm converged for all values of lambda. In some cases, the stopping criterion may not be satisfied and the convergence over the weights w might not be achieved. If this were the case, one should increase the number of iterations itermaxw, which by default is set to 20. Note however that setting a larger value for this parameter would increase the computational time, since a full k-means algorithm is trained at each iteration.

Results

The weights associated to each group of features may be found in the matrix Wg. These weights illustrate the contribution of each group of features to the clustering and may be seen as a measure of the importance of each group for the clustering. Each column of Wg contains the weights computed for a given value of the regularization parameter lambda. The default setting in the groupsparsewkm function selects 20 values for the regularization pameter lambda, chosen uniformly between 0 (no regularization) and a maximum value automatically tuned.

res.mice$Wg[1:20,1:5]
##          Lambda 1 Lambda 2 Lambda 3 Lambda 4 Lambda 5
## Group 1     0.184    0.189    0.195    0.203    0.211
## Group 2     0.175    0.180    0.185    0.191    0.199
## Group 3     0.128    0.129    0.129    0.130    0.130
## Group 4     0.109    0.108    0.106    0.104    0.102
## Group 5     0.133    0.134    0.135    0.136    0.137
## Group 6     0.112    0.111    0.110    0.108    0.106
## Group 7     0.103    0.102    0.100    0.097    0.094
## Group 8     0.065    0.060    0.054    0.047    0.038
## Group 9     0.071    0.066    0.061    0.055    0.047
## Group 10    0.150    0.153    0.155    0.159    0.162
## Group 11    0.178    0.183    0.188    0.195    0.203
## Group 12    0.139    0.140    0.142    0.144    0.146
## Group 13    0.124    0.124    0.124    0.124    0.123
## Group 14    0.111    0.110    0.109    0.108    0.106
## Group 15    0.103    0.102    0.100    0.097    0.094
## Group 16    0.108    0.107    0.106    0.104    0.101
## Group 17    0.126    0.127    0.127    0.127    0.127
## Group 18    0.134    0.135    0.136    0.137    0.139
## Group 19    0.096    0.094    0.091    0.088    0.084
## Group 20    0.129    0.130    0.130    0.131    0.132

Each row of Wg contains the weights associated to a group of features for all the values of the regularization parameter, or the so-called regularization paths. These may be further illustrated graphically, by calling the plot function for the spwkm class (see help(plot.spwkm) for more details).

plot(res.mice, what="weights.groups")

For the Mice data, one may see that four proteins particularly (associated to groups 1, 2, 11 and 21) appear as the most discriminant, as the regularization parameter increases. Other proteins, such as the one associated to group 30 have an interesting behaviour: its weight becomes large for large values of the regularization term, thus for a heavy penalty term. Indeed, at the 13th value of the lambda a significant drop in the number of selected groups occurs, from 41 selected groups to 13. It is at this point that group 30 becomes extremely significant for the clustering.

If one wants to look into more details and assess the respective contribution of each feature, she may look at the matrix W, which contains the weights associated to each feature. These weights may be read as the relative importance of each feature for the clustering. Depending on the number of features in each group, W may potentially be a large matrix, and one may also want to focus on the features belonging to non-zero weigthed groups.

res.mice$W[1:20,1:5]
##                   Lambda 1 Lambda 2 Lambda 3 Lambda 4 Lambda 5
## Protein_1_Meas_1     0.056    0.058    0.060    0.062    0.065
## Protein_1_Meas_2     0.052    0.054    0.055    0.057    0.060
## Protein_1_Meas_3     0.051    0.053    0.055    0.057    0.059
## Protein_1_Meas_4     0.054    0.056    0.058    0.060    0.062
## Protein_1_Meas_5     0.054    0.056    0.057    0.060    0.062
## Protein_1_Meas_6     0.053    0.055    0.056    0.058    0.061
## Protein_1_Meas_7     0.047    0.049    0.050    0.052    0.055
## Protein_1_Meas_8     0.050    0.052    0.053    0.055    0.058
## Protein_1_Meas_9     0.053    0.055    0.057    0.059    0.061
## Protein_1_Meas_10    0.052    0.054    0.056    0.058    0.060
## Protein_1_Meas_11    0.055    0.057    0.059    0.061    0.064
## Protein_1_Meas_12    0.057    0.058    0.060    0.062    0.065
## Protein_2_Meas_1     0.053    0.054    0.056    0.058    0.060
## Protein_2_Meas_2     0.055    0.057    0.058    0.060    0.062
## Protein_2_Meas_3     0.055    0.056    0.058    0.060    0.062
## Protein_2_Meas_4     0.049    0.050    0.051    0.053    0.055
## Protein_2_Meas_5     0.050    0.052    0.053    0.055    0.057
## Protein_2_Meas_6     0.051    0.052    0.054    0.056    0.058
## Protein_2_Meas_7     0.046    0.047    0.048    0.050    0.052
## Protein_2_Meas_8     0.048    0.050    0.051    0.053    0.055

The regularization path for each feature may be also illustrated using the plot function. For the Mice dataset, one may easily see that the features within each group are quite redundant (let us recall here that one group is made of repeated measurements of the same protein), their regularization paths being very similar.

plot(res.mice, what = "weights.features")

By specifying the supplementary argument Which in the plot function, one may focus on the regularization paths of specific groups of features, represented either as the regularization path of the group, or the regularization paths of the corresponding features. Here below, proteins 1, 2 and 30 were selected for illustration.

plot(res.mice, what = "weights.groups", Which=c(1,2,30))

plot(res.mice, what = "weights.features", Which=c(1,2,30))

Additional plots

A valuable piece of information is given by the number of selected groups or the number of selected features for a given value of the regularization parameter lambda. The evolution of the number of features may be graphically illustrated as follows:

plot(res.mice, what="sel.groups")

plot(res.mice, what="sel.features")

Since the measurements for each protein are quite redundant, and the number of measurements for each protein quite similar, the curves representing the evolution of the selected groups and of the selected features are very similar. A significant drop may be noticed after the 12th value of lambda, where only 13 proteins among the 68 are preserved for clustering.

Besides the selected number of groups of features or the selected number of features, the evolution of some criteria related to the quality of the clustering are equally important for understanding and commenting the results.

For example, using the argument what="expl.var" in the plot function, one may illustrate the evolution of the explained variance, as a function of the regularization parameter lambda. The explained variance is computed as the ratio between the between-class variance and the global variance in the data, and represents the ratio of information explained by the clustering. Let us remark that this criterion is independent of the sparse algorithm trained here, which maximizes the weighted between sum-of-squares penalized by a group-penalty term. The explained variance, computed on all features and without applying any weights, illustrates how the information is preserved when discarding an increasing number of features.

plot(res.mice, what="expl.var")

The number-of-selected-groups curve and the explained-variance curve may be used to select the appropriate regularization parameter lambda for the clustering. A good choice of lambda should preserve a high percentage of variance explained by the clustering, while discarding a large number of features. This actually amounts to a trade-off between the quality of the model and its relative parcimony. With this is mind, one may easily see that by selecting lambda=0.45, the explained variance remains very close to when using all features, whereas the number of groups and the number of features was reduced by a third. For lambda=0.48, if one accepts to “loose” a third of the explained variance (while remaining above 30%), the number of groups and the number of features may be reduced by more than 80% (13 groups are preserved among 68, and 171 features among 900). Hence, one may use this algorithm for drastically reducing the dimensionality of the data, while still preserving a significant clustering.

Another graphical option includes the illustration of the weighthed explained variance (the weights computed by the algorithm are taken into account and applied to the features when computing the between sum-of-squares and the total sum-of-squares). In this case, since the criterion takes into account the most discriminant features only, it is increasing with the penalty, and a significant jump may be seen at the same spot as for the explained variance, except that here, the weighted explained variance improves by more than 20%.

plot(res.mice, what="w.expl.var")

Other criteria such as the gap statistic could be efficiently used for selecting the regularization parameter (and also the number of clusters). They are not implemented here (yet!), but may be easily retrived in other packages and combined with spwkm objects.

Comparing the clustering with the “ground truth”

As mentioned above, the DataMice observations are known to belong to some priorly defined clusters, defined by the Genotype, Treatment or Behaviour. In order to compare the resulting clusterings of the group-sparse algorithm (for the various values of lambda) with the priorly defined clusters, the Adjusted Rand Index (ARI) is computed here below.

sapply(1:length(res.mice$lambda), function(x) {mclust::adjustedRandIndex(res.mice$cluster[,x],DataMice$Class.mouse)})
##  [1] 0.152 0.152 0.152 0.152 0.152 0.152 0.152 0.152 0.152 0.133 0.133 0.097
## [13] 0.340 0.211 0.229 0.229 0.171 0.171 0.174 0.192
sapply(1:length(res.mice$lambda), function(x) {mclust::adjustedRandIndex(res.mice$cluster[,x],DataMice$Genotype)})
##  [1]  0.0405  0.0405  0.0405  0.0405  0.0405  0.0405  0.0405  0.0405  0.0405
## [10]  0.0105  0.0105 -0.0043  0.0670  0.0274  0.0275  0.0330  0.0256  0.0256
## [19]  0.0490  0.0584
sapply(1:length(res.mice$lambda), function(x) {mclust::adjustedRandIndex(res.mice$cluster[,x],DataMice$Treatment)})
##  [1]  0.0058  0.0058  0.0058  0.0058  0.0058  0.0058  0.0058  0.0058  0.0058
## [10]  0.0089  0.0089 -0.0043  0.0433  0.0243  0.0243  0.0062 -0.0091 -0.0091
## [19] -0.0062 -0.0047
sapply(1:length(res.mice$lambda), function(x) {mclust::adjustedRandIndex(res.mice$cluster[,x],DataMice$Behaviour)})
##  [1] 0.18 0.18 0.18 0.18 0.18 0.18 0.18 0.18 0.18 0.21 0.21 0.20 0.26 0.28 0.28
## [16] 0.28 0.26 0.26 0.24 0.24

According to the above values, the 8 clusters computed with the sparse-group k-means algorithm are not much related to the 8 priorly defined clusters. As we’ve already mentioned, there is no prior reason for correlation between the clustering output and the partitioning defined by the Genotype, the Treatment and the Behaviour. The clusters identified by the algorithm may correspond to a completely different structure in the data. Nevertheless, we should mention here that the proteins identified by the algorithm as the most discriminant or having significant weights for all values of lambda – groups 1, 2, 10, 11, 21, 25, 30, 32, 68 – correspond to those identified in C. Higuera and Cios (2015) as discriminant for Genotype, Treatment or Behaviour. Furthermore, the algorithm implemented in groupsparsewkm has also the advantage of fully selecting or discarding one protein and its associated measurements thanks to the group approach. Group-sparse clustering is thus offering a complete approach for both clustering and selecting the most discriminant groups of features.

Bibliography

C. Higuera, K. J. Gardiner, and K. J Cios. 2015. “Self-Organizing Feature Maps Identify Proteins Critical to Learning in a Mouse Model of down Syndrome.” PLOS ONE 10 (6): 1–28. https://doi.org/10.1371/journal.pone.0129126.
M. Chavent, A. Mourer, J. Lacaille, and M. Olteanu. 2020. “Sparse k-Means for Mixed Data via Group-Sparse Clustering.” In Proceedings of ESANN.
Witten, D. M., and R. Tibshirani. 2010. “A Framework for Feature Selection in Clustering.” Journal of the American Statistical Association 105 (490): 713–26.