sparsewkm
is designed for performing sparse clustering
of a dataset described by numerical, categorical, or mixed features. In
this respect, it generalizes the sparse k-means algorithm introduced in
Witten and Tibshirani (2010) for numerical features
only.
The implementation is based on the optimization of a penalized
between-class variance criterion. If the features used for clustering
are numerical only, the algorithm reduces to the one in Witten and Tibshirani (2010). If some or all of the features
are categorical, sparsewkm
transforms the data using a
factor analysis step (see Chavent, Kuentz-Simonet, Labenne, and Saracco (2014) for the
preprocessing), and then trains a group-sparse k-means algorithm, each group being
defined by the levels of one specific feature. For more technical
details on the cost function and the optimization procedure, one may
refer to Chavent, Lacaille, Mourer, and Olteanu (2020).
Several arguments may be passed to sparsewkm
, but only
the first two are required:
X
is the data to be clustered. It should be a data
frame, and the categorical features should be provided as factors. Only
the features one would include in the clustering should be present.
Column and row names may be supplied to ease the
interpretation.
centers
is the number of clusters to be
computed.
The rest of the arguments are related to the choice of the
regularization parameter, and to the number of iterations and random starts
in the algorithm. Default values are provided for these parameters; one may
see help(sparsewkm)
for further details.
The sparsewkm
function returns an object of class
spwkm
(see help(sparsewkm)
for further details
on this class).
HDdata dataset
The HDdata dataset consists of 270 patients described by six
numerical features and eight categorical ones. It was sampled from the
Cleveland Heart Disease Data found in the UCI Machine Learning Repository.
Further details about this dataset may be found with
help(HDdata)
.
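The first rows of the dataset may be displayed as follows (a minimal sketch, assuming that sparsewkm and HDdata are provided by the vimpclust package, which must be installed and loaded):

# load the package providing sparsewkm() and HDdata (assumed here to be vimpclust)
library(vimpclust)
# display the first rows of the dataset
head(HDdata)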
## age sex cp trestbps chol fbs restecg maxhr exang oldpeak slope numv thal
## 1 70 1 4 130 322 0 2 109 0 2.4 2 3 3
## 2 67 0 3 115 564 0 2 160 0 1.6 2 0 7
## 3 57 1 2 124 261 0 0 141 0 0.3 1 0 7
## 4 64 1 4 128 263 0 0 105 1 0.2 2 1 7
## 5 74 0 2 120 269 0 2 121 1 0.2 1 1 3
## 6 65 1 4 120 177 0 0 140 0 0.4 1 0 7
## HD
## 1 presence
## 2 absence
## 3 presence
## 4 absence
## 5 absence
## 6 absence
sparsewkm function
The sparsewkm function is applied to HDdata
on all features except the last one, HD
, which codes for
the presence or the absence of a heart disease. HD
was
removed from the clustering, and will be used later as a control
variable. Since the control variable has only two classes, the number of
clusters is set to 2 with the argument centers
. We shall
check after the clustering procedure whether the algorithm retrieves
these two classes.
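A call matching this description might look as follows (a minimal sketch; the object name res is chosen here and reused in the rest of this section, and HD is assumed to be the last column of HDdata):

# remove the control variable HD and cluster the remaining features into 2 groups
res <- sparsewkm(X = HDdata[, -which(names(HDdata) == "HD")], centers = 2)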
According to the above, the algorithm converged for all values of
lambda
. In some cases, the stopping criterion may not be
satisfied and the convergence over the weights w
might not
be achieved. If this were the case, one should increase the number of
iterations itermaxw
, which by default is set to 20. Note
however that setting a larger value for this parameter would increase
the computational time, since a full k-means algorithm is trained at each
iteration.
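If convergence issues were observed, the call could be adjusted as sketched below (the value 50 is arbitrary and only illustrates the itermaxw argument):

# increase the maximum number of iterations over the weights w (default 20)
res <- sparsewkm(X = HDdata[, -which(names(HDdata) == "HD")], centers = 2,
                 itermaxw = 50)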
The weights associated with each feature may be found in the matrix
W
. These weights illustrate the contribution of each
feature, numerical or categorical, to the clustering, and may be seen as
a measure of the relative importance of each feature in the
clustering.
Each column of W
contains the weights computed for a
given value of the regularization parameter lambda
. The
default setting in the sparsewkm
function selects 20 values
for lambda
, chosen uniformly between 0 (no regularization)
and an automatically tuned maximum value.
In the following, only the weights associated with the first 5 values
of lambda
are displayed.
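Assuming, as above, that the fitted object is stored in res and that its components may be accessed in list style, these weights may be displayed with:

# weights of each feature for the first 5 values of lambda
res$W[, 1:5]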
## Lambda 1 Lambda 2 Lambda 3 Lambda 4 Lambda 5
## age 1.3e-01 0.138 0.112 0.091 0.046
## trestbps 2.2e-02 0.000 0.000 0.000 0.000
## chol 1.7e-05 0.000 0.000 0.000 0.000
## maxhr 7.0e-01 0.816 0.860 0.879 0.909
## oldpeak 5.4e-01 0.449 0.428 0.424 0.402
## numv 1.4e-01 0.122 0.094 0.071 0.046
## sex 2.6e-02 0.000 0.000 0.000 0.000
## cp 1.3e-01 0.086 0.023 0.000 0.000
## fbs 6.5e-06 0.000 0.000 0.000 0.000
## restecg 3.1e-02 0.000 0.000 0.000 0.000
## exang 2.3e-01 0.182 0.138 0.106 0.046
## slope 3.2e-01 0.231 0.188 0.151 0.077
## thal 1.2e-01 0.058 0.023 0.000 0.000
One may see that, as lambda
increases, the weights of
some features are progressively set to 0. The evolution of the
regularization paths for the features used in the
clustering may be illustrated using the plot
function for
the spwkm
class. With the default settings of this
implementation, the paths associated with numerical
features are drawn with solid lines,
while those associated with categorical features are
drawn with dotted lines.
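For instance (a sketch, assuming the fitted object is res and that the default value of the what argument displays the feature weights):

# regularization paths of the feature weights (default plot for spwkm objects)
plot(res)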
According to the results, the numerical features maxhr
and oldpeak
, and the categorical features
slope
and exang
appear as the most
discriminant for small values of lambda
. As
lambda
increases, only maxhr remains selected by the algorithm.
Furthermore, the weights associated with each level of the categorical
features may also be displayed. These weights are stored in the matrix
Wm
. Similarly to W
, each column of this matrix
contains the weights, associated either with the numerical features or with
the levels of the categorical features, for a given value of
lambda
.
We display here the weights associated with the first 5 values of
lambda
.
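As before, a sketch of how these weights may be displayed, assuming list-style access to the Wm component of the result:

# weights of the numerical features and of the levels of the categorical
# features, for the first 5 values of lambda
res$Wm[, 1:5]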
## Lambda 1 Lambda 2 Lambda 3 Lambda 4 Lambda 5
## age 1.3e-01 1.4e-01 1.1e-01 0.091 0.046
## trestbps 2.2e-02 0.0e+00 0.0e+00 0.000 0.000
## chol 1.7e-05 0.0e+00 0.0e+00 0.000 0.000
## maxhr 7.0e-01 8.2e-01 8.6e-01 0.879 0.909
## oldpeak 5.4e-01 4.5e-01 4.3e-01 0.424 0.402
## numv 1.4e-01 1.2e-01 9.4e-02 0.071 0.046
## sex=0 2.3e-02 0.0e+00 0.0e+00 0.000 0.000
## sex=1 1.1e-02 0.0e+00 0.0e+00 0.000 0.000
## cp=1 2.1e-04 9.6e-05 4.5e-06 0.000 0.000
## cp=2 8.2e-02 5.7e-02 1.5e-02 0.000 0.000
## cp=3 3.7e-02 2.2e-02 5.3e-03 0.000 0.000
## cp=4 9.5e-02 6.2e-02 1.6e-02 0.000 0.000
## fbs=0 1.1e-06 0.0e+00 0.0e+00 0.000 0.000
## fbs=1 6.4e-06 0.0e+00 0.0e+00 0.000 0.000
## restecg=0 2.1e-02 0.0e+00 0.0e+00 0.000 0.000
## restecg=1 1.6e-02 0.0e+00 0.0e+00 0.000 0.000
## restecg=2 1.6e-02 0.0e+00 0.0e+00 0.000 0.000
## exang=0 9.9e-02 8.0e-02 6.1e-02 0.047 0.020
## exang=1 2.0e-01 1.6e-01 1.2e-01 0.095 0.041
## slope=1 2.5e-01 1.8e-01 1.5e-01 0.118 0.061
## slope=2 2.0e-01 1.4e-01 1.2e-01 0.093 0.046
## slope=3 3.1e-02 2.7e-02 2.1e-02 0.017 0.011
## thal=3 8.2e-02 4.0e-02 1.6e-02 0.000 0.000
## thal=6 3.3e-02 2.0e-02 1.0e-02 0.000 0.000
## thal=7 7.8e-02 3.7e-02 1.3e-02 0.000 0.000
The regularization paths for the levels of the categorical features
may be plotted using the argument what=weights.levels
in the
plot
function. This option provides a more detailed picture
of how each level in a categorical feature contributes to the
clustering. One may see here, for instance, that only the first two
levels of slope
and one level of exang
have
particularly significant contributions.
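The corresponding call might be (a sketch, with the same assumptions as above):

# regularization paths for the levels of the categorical features
plot(res, what = "weights.levels")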
Depending on the number of levels in the categorical features,
Wm
may be quite a large matrix. One may choose to
plot the regularization paths for a subset of features only, whether numerical
or categorical. To do this, it suffices to pass the argument
Which
to the plot
function and list the
features (note that after training, the features are ordered differently
than in the input data, with the numerical features listed first; the
argument Which
refers to the order of the features in the
W
output matrix).
For the categorical features, one may also plot the regularization
paths of the associated levels, as we do here for slope
and
exang
.
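A sketch of such calls, assuming that Which takes the positions of the features in the order of the W matrix (maxhr, oldpeak, exang, and slope being in positions 4, 5, 11, and 12 here):

# paths restricted to maxhr, oldpeak, exang and slope
plot(res, Which = c(4, 5, 11, 12))
# paths of the levels of the categorical features exang and slope
plot(res, what = "weights.levels", Which = c(11, 12))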
The number of selected features and the number of selected levels for
a given value of the regularization parameter lambda
provide valuable information. These two criteria may be illustrated
using the options what=sel.features
and
what=sel.levels
in the plot
function.
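For instance, with the same assumptions as above:

# number of selected features as a function of lambda
plot(res, what = "sel.features")
# number of selected levels as a function of lambda
plot(res, what = "sel.levels")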
The two curves are very similar and show how the number of selected
features relevant for the clustering rapidly decreases with
lambda
.
Clustering may also be assessed using criteria based on the evolution
of the explained variance. The latter is computed as the ratio between
the between-class variance and the global variance in the data, and
represents the proportion of information explained by the clustering. The
explained variance may be computed on all features used for the clustering,
without taking the weights into account, or in a weighted fashion, by
taking the computed weights into account and thus suppressing all
features discarded by the algorithm. In the latter case, and for large
lambda
, the weighted explained variance ends up being
computed using only one feature, maxhr
.
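These criteria can presumably also be plotted from the fitted object; the option names used below, what="expl.var" and what="w.expl.var", are assumptions and should be checked against help(plot.spwkm):

# explained variance, computed on all features without the weights
plot(res, what = "expl.var")
# explained weighted variance, taking the computed weights into account
plot(res, what = "w.expl.var")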
These two criteria, combined with the number of selected features,
may be used to select the appropriate regularization parameter
lambda
. A good choice for
lambda
should preserve a high percentage of variance
explained by the clustering, while discarding a large number of
features. This amounts to a trade-off between the quality of the model
and its parsimony.
One may remark that with the fifth value of the regularization
parameter, lambda=0.07
, the number of features is reduced
by half, while the loss in the explained variance is close to 1%. One
may also notice that the percentage of explained variance is very low
for all clusterings, while the percentage of explained weighted
variance is significantly larger and increases with the
penalty. This amounts to saying that most of the features in the dataset
are not discriminant, and the overall quality of the clustering, computed
on all features, is rather poor. If only the most discriminant features
are selected, the explained weighted variance increases substantially, whereas
the loss in the unweighted explained variance is minor.
Furthermore, if one selects the sixth value of the regularization
parameter, lambda=0.09
, only two features are kept for
clustering. The loss in the unweighted explained variance is close to
2%, but the weighted explained variance gains more than 30%.
According to these remarks, it appears that only a few features
actually play a significant part in the clustering procedure. If
one prefers to discard most of the features, she may keep
maxhr
and oldpeak
, with an explained variance
of roughly 6.2% and an explained weighted variance of roughly 60%. If
one wants a larger ratio of explained variance, she would also keep
age
, numv
, exang
and
slope
besides maxhr
and oldpeak
.
In this case, the ratio of explained variance is about 7.5%, while the
ratio of explained weighted variance drops to approximately 45%.
Furthermore, according to the regularization paths, it appears that
these six features are the most discriminant and the most important for
the clustering.
Since we have a control variable for the HDdata
, we
shall use it to compare three clusterings produced by the algorithm, for
three different values of lambda
. The comparison criterion
chosen here is the Adjusted Rand Index (ARI). We shall also display the
confusion matrices for the first, fifth and sixth values of
lambda
.
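A sketch of this comparison, assuming the cluster memberships are stored column-wise in res$cluster (one column per value of lambda), and using the adjustedRandIndex function from the mclust package:

library(mclust)
# Adjusted Rand Index between HD and the clusterings obtained for the
# first, fifth and sixth values of lambda
sapply(c(1, 5, 6), function(i) adjustedRandIndex(HDdata$HD, res$cluster[, i]))
# confusion matrices for the same three values of lambda
table(HDdata$HD, res$cluster[, 1])
table(HDdata$HD, res$cluster[, 5])
table(HDdata$HD, res$cluster[, 6])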
## [1] 0.29 0.21 0.16
##
## 1 2
## absence 128 22
## presence 40 80
##
## 1 2
## absence 123 27
## presence 45 75
##
## 1 2
## absence 34 116
## presence 73 47
According to these results, the quality of the agreement between the clustering and the control variable decreases with larger penalties and fewer features kept for the clustering. Nevertheless, although the accuracy deteriorates, reducing the number of features by half leads to a loss of only 3.7% in accuracy, while reducing the number of features from 13 to 2 leads to a loss of 7%. It appears here that sparse clustering may be particularly useful if one wishes to find a trade-off between clustering quality and dimensionality, while highlighting the features which contribute most to the clustering.
We shall also mention here that the comparison with the control variable is done for illustration purposes only. Since sparse weighted k-means is an unsupervised method, there is no reason the clustering it builds should correspond to a prior partitioning defined by some control variable.
Finally, once one selects an appropriate regularization parameter
lambda
and the associated clustering, she may consider
cluster composition. Using the function info_clust
in the
package, she may display the feature distributions within the clusters
(averages for numerical features, frequencies for categorical ones).
This first insight may be further complemented with a more thorough
analysis, for example using analysis of variance.
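A sketch of such a call; the exact signature of info_clust should be checked with help(info_clust), and we assume here that it takes the spwkm object, the index of the selected value of lambda, and the original data:

# feature distributions within the clusters obtained with the fifth value of lambda
info_clust(res, 5, X = HDdata)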
## $mean.by.clust
## $mean.by.clust$`lambda = 0.0702617509207402`
## cluster=1 cluster=2
## age 52.21 58.1
## trestbps 130.21 133.2
## chol 250.30 248.6
## maxhr 163.90 126.3
## oldpeak 0.55 1.9
## numv 0.44 1.0
##
##
## $freq.by.clust
## $freq.by.clust$`lambda = 0.0702617509207402`
## cluster=1 cluster=2
## sex=0 0.381 0.225
## sex=1 0.619 0.775
## cp=1 0.071 0.078
## cp=2 0.226 0.039
## cp=3 0.351 0.196
## cp=4 0.351 0.686
## fbs=0 0.851 0.853
## fbs=1 0.149 0.147
## restecg=0 0.548 0.382
## restecg=1 0.000 0.020
## restecg=2 0.452 0.598
## exang=0 0.821 0.422
## exang=1 0.179 0.578
## slope=1 0.679 0.157
## slope=2 0.286 0.725
## slope=3 0.036 0.118
## thal=3 0.685 0.363
## thal=6 0.018 0.108
## thal=7 0.298 0.529
## HD=absence 0.732 0.265
## HD=presence 0.268 0.735
##
##
## $lambda
## [1] 0.07