Package 'sparcl'

Title: Perform Sparse Hierarchical Clustering and Sparse K-Means Clustering
Description: Implements the sparse clustering methods of Witten and Tibshirani (2010): "A framework for feature selection in clustering"; published in Journal of the American Statistical Association 105(490): 713-726.
Authors: Daniela M. Witten and Robert Tibshirani
Maintainer: Daniela Witten <[email protected]>
License: GPL-2
Version: 1.0.4
Built: 2024-10-31 19:49:01 UTC
Source: CRAN

Help Index


Performs sparse hierarchical and sparse K-means clustering

Description

Implements the sparse clustering methods of Witten and Tibshirani (2010) "A framework for feature selection in clustering", Journal of the American Statistical Association 105(490): 713-726.

Details

Package: sparcl
Type: Package
Version: 1.0.4
Date: 2013-01-02
License: GPL-2
LazyLoad: yes

The main functions are KMeansSparseCluster and HierarchicalSparseCluster. Tuning parameters for these methods are chosen by KMeansSparseCluster.permute and HierarchicalSparseCluster.permute.
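
For orientation, here is a minimal end-to-end sketch mirroring the Examples on the help pages below (the simulated data are purely illustrative):

set.seed(1)
x <- matrix(rnorm(100*50),ncol=50)
# Sparse K-means: select the tuning parameter by permutation, then cluster
km.perm <- KMeansSparseCluster.permute(x,K=2,nperms=5)
km.out <- KMeansSparseCluster(x,K=2,wbounds=km.perm$bestw)
# Sparse hierarchical clustering: the same two-step pattern
perm.out <- HierarchicalSparseCluster.permute(x,nperms=5)
sparsehc <- HierarchicalSparseCluster(dists=perm.out$dists,
  wbound=perm.out$bestw,method="complete")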

Author(s)

Daniela M. Witten and Robert Tibshirani

Maintainer: Daniela Witten <[email protected]>

References

Witten and Tibshirani (2010) A framework for feature selection in clustering. Journal of the American Statistical Association 105(490): 713-726.


Color the leaves in a hierarchical clustering dendrogram

Description

Pass in the output of "hclust" and a class label for each observation. A colored dendrogram will result, with the leaf colors indicating the classes.

Usage

ColorDendrogram(hc, y, main = "", branchlength = 0.7, labels = NULL,
  xlab = NULL, sub = NULL, ylab = "", cex.main = NULL)

Arguments

hc

The output of running "hclust" on an nxn dissimilarity matrix.

y

A vector of n class labels for the observations that were clustered using "hclust". If the labels are numeric from 1 to K, then colors will be determined automatically. Otherwise the labels can take the form of colors (e.g. c("red", "red", "orange", "orange")).

main

The main title for the dendrogram.

branchlength

How long to make the colored part of the branches. Adjustment will be needed for each dissimilarity matrix.

labels

The labels for the n observations.

xlab

X-axis label.

sub

Sub-x-axis label.

ylab

Y-axis label.

cex.main

The amount by which to enlarge the main title for the figure.

Author(s)

Daniela M. Witten and Robert Tibshirani

References

Witten and Tibshirani (2010) A framework for feature selection in clustering. Journal of the American Statistical Association 105(490): 713-726.

See Also

HierarchicalSparseCluster, HierarchicalSparseCluster.permute

Examples

# Generate 2-class data
set.seed(1)
x <- matrix(rnorm(100*20),ncol=20)
y <- c(rep(1,50),rep(2,50))
x[y==1,] <- x[y==1,]+2
# Perform hierarchical clustering
hc <- hclust(dist(x),method="complete")
# Plot
ColorDendrogram(hc,y=y,main="My Simulated Data",branchlength=3)
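
As noted under the "y" argument, the class labels can instead be supplied directly as colors:

# Pass colors as the class labels (one per observation)
ColorDendrogram(hc,y=c(rep("red",50),rep("blue",50)),
  main="My Simulated Data",branchlength=3)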

Hierarchical sparse clustering

Description

Performs sparse hierarchical clustering. If $d_{ii'j}$ is the dissimilarity between observations $i$ and $i'$ along feature $j$, we seek a sparse weight vector $w$ and then use $\left(\sum_j w_j d_{ii'j}\right)_{ii'}$ as an nxn dissimilarity matrix for hierarchical clustering.

Usage

HierarchicalSparseCluster(x=NULL, dists=NULL,
  method=c("average","complete","single","centroid"),
  wbound=NULL, niter=15,
  dissimilarity=c("squared.distance","absolute.value"),
  uorth=NULL, silent=FALSE, cluster.features=FALSE,
  method.features=c("average","complete","single","centroid"),
  output.cluster.files=FALSE, outputfile.prefix="output",
  genenames=NULL, genedesc=NULL, standardize.arrays=FALSE)
## S3 method for class 'HierarchicalSparseCluster'
print(x,...)
## S3 method for class 'HierarchicalSparseCluster'
plot(x,...)

Arguments

x

An nxp data matrix; n is the number of observations and p the number of features. If NULL, then specify dists instead.

dists

For advanced users, can be entered instead of x. If HierarchicalSparseCluster has already been run on this data, then the dists value of the previous output can be entered here. Under normal circumstances, leave this argument NULL and pass in x instead.

method

The type of linkage to use in the hierarchical clustering - "single", "complete", "centroid", or "average".

wbound

The L1 bound on w to use; this is the tuning parameter for sparse hierarchical clustering. Should be greater than 1.

niter

The number of iterations to perform in the sparse hierarchical clustering algorithm.

dissimilarity

The type of dissimilarity measure to use. One of "squared.distance" or "absolute.value". Only use this if x was passed in (rather than dists).

uorth

If complementary sparse clustering is desired, then this is the nxn dissimilarity matrix obtained in the original sparse clustering.

standardize.arrays

Should the arrays be standardized? Default is FALSE.

silent

If FALSE, progress is printed as the algorithm runs; if TRUE, printing is suppressed.

cluster.features

Not for use.

method.features

Not for use.

output.cluster.files

Not for use.

outputfile.prefix

Not for use.

genenames

Not for use.

genedesc

Not for use.

...

not used.

Details

We seek a p-vector of weights $w$ (one per feature) and an nxn matrix $U$ that optimize

$\mathrm{maximize}_{U,w} \sum_j w_j \sum_{i,i'} d_{ii'j} U_{ii'}$ subject to $\|w\|_2 \le 1$, $\|w\|_1 \le \mathrm{wbound}$, $w_j \ge 0$, $\sum_{i,i'} U_{ii'}^2 \le 1$.

Here, $d_{ii'j}$ is the dissimilarity between observations $i$ and $i'$ along feature $j$. The resulting matrix $U$ is used as a dissimilarity matrix for hierarchical clustering. "wbound" is a tuning parameter for this method, which controls the L1 bound on $w$ and, as a result, the number of features with non-zero weights $w_j$. The non-zero elements of $w$ indicate features that are used in the sparse clustering.

We optimize the above criterion with an iterative approach: hold U fixed and optimize with respect to w. Then, hold w fixed and optimize with respect to U.

Note that the arguments described as "Not for use" are included for the sparcl package to function with GenePattern but should be ignored by the R user.
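
For intuition, here is a rough sketch of the alternating scheme described above. It assumes the w-update is soft-thresholding followed by L2 normalization, as the criterion above implies; it is an illustration, not the package's internal code, and l1l2.project and sparse.hc.sketch are hypothetical names:

l1l2.project <- function(a, s) {
  # Maximize sum(w*a) subject to ||w||_2 <= 1, ||w||_1 <= s, w_j >= 0:
  # soft-threshold a at delta and L2-normalize; delta found by binary search.
  a <- pmax(a, 0)
  w <- a/sqrt(sum(a^2))
  if (sum(w) <= s) return(w)
  lo <- 0; hi <- max(a)
  for (i in 1:50) {
    w <- pmax(a - (lo + hi)/2, 0)
    if (sum(w) > 0) w <- w/sqrt(sum(w^2))
    if (sum(w) > s) lo <- (lo + hi)/2 else hi <- (lo + hi)/2
  }
  w
}
sparse.hc.sketch <- function(d, wbound, niter=15) {
  # d: matrix whose j-th column holds the d_{ii'j} values for feature j
  w <- rep(1/sqrt(ncol(d)), ncol(d))
  for (iter in 1:niter) {
    u <- d %*% w                 # w fixed: the optimal u is proportional to d %*% w
    u <- u/sqrt(sum(u^2))
    a <- as.numeric(t(d) %*% u)  # u fixed: w soft-thresholds a = t(d) %*% u
    w <- l1l2.project(a, wbound)
  }
  list(w=w, u=u)  # u reshapes to the dissimilarity matrix fed to hclust
}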

Value

hc

The output of a call to "hclust", giving the results of hierarchical sparse clustering.

ws

The p-vector of feature weights.

u

The nxn dissimilarity matrix passed into hclust, of the form $\left(\sum_j w_j d_{ii'j}\right)_{ii'}$.

dists

The (n*n)xp dissimilarity matrix for the data matrix x. This is useful if additional calls to HierarchicalSparseCluster will be made.

Author(s)

Daniela M. Witten and Robert Tibshirani

References

Witten and Tibshirani (2010) A framework for feature selection in clustering. Journal of the American Statistical Association 105(490): 713-726.

See Also

HierarchicalSparseCluster.permute, KMeansSparseCluster, KMeansSparseCluster.permute

Examples

# Generate 2-class data
  set.seed(1)
  x <- matrix(rnorm(100*50),ncol=50)
  y <- c(rep(1,50),rep(2,50))
  x[y==1,1:25] <- x[y==1,1:25]+2
  # Do tuning parameter selection for sparse hierarchical clustering
  perm.out <- HierarchicalSparseCluster.permute(x, wbounds=c(1.5,2:6),
    nperms=5)
  print(perm.out)
  plot(perm.out)
  # Perform sparse hierarchical clustering
  sparsehc <- HierarchicalSparseCluster(dists=perm.out$dists,
    wbound=perm.out$bestw, method="complete")
  # This is faster than calling
  #   HierarchicalSparseCluster(x=x, wbound=perm.out$bestw, method="complete")
  par(mfrow=c(1,2))
  plot(sparsehc)
  plot(sparsehc$hc, labels=rep("", length(y)))
  print(sparsehc)
  # Plot using knowledge of class labels in order to compare true class
  #   labels to clustering obtained
  par(mfrow=c(1,1))
  ColorDendrogram(sparsehc$hc,y=y,main="My Simulated Data",branchlength=.007)
  # Now, what if we want to see if our data contains a *secondary*
  #   clustering, after accounting for the one already obtained? We
  #   look for a complementary sparse clustering:
  sparsehc.comp <- HierarchicalSparseCluster(x,wbound=perm.out$bestw,
     method="complete",uorth=sparsehc$u)
  # Redo the analysis, but this time use "absolute value" dissimilarity:
  perm.out <- HierarchicalSparseCluster.permute(x, wbounds=c(1.5,2:6),
    nperms=5, dissimilarity="absolute.value")
  print(perm.out)
  plot(perm.out)
  # Perform sparse hierarchical clustering
  sparsehc <- HierarchicalSparseCluster(dists=perm.out$dists,
    wbound=perm.out$bestw, method="complete", dissimilarity="absolute.value")
  par(mfrow=c(1,2))
  plot(sparsehc)

Choose tuning parameter for sparse hierarchical clustering

Description

The tuning parameter controls the L1 bound on w, the feature weights. A permutation approach is used to select the tuning parameter.

Usage

HierarchicalSparseCluster.permute(x, nperms = 10, wbounds = NULL,
  dissimilarity=c("squared.distance","absolute.value"),
  standardize.arrays=FALSE)
## S3 method for class 'HierarchicalSparseCluster.permute'
plot(x,...) 
## S3 method for class 'HierarchicalSparseCluster.permute'
print(x,...)

Arguments

x

An nxp data matrix, with n observations and p features.

nperms

The number of permutations to perform.

wbounds

The sequence of tuning parameters to consider; each is an L1 bound on w, the feature weights. If NULL, then a default sequence will be used. If non-null, the values should be greater than 1.

dissimilarity

How should dissimilarity be computed? Default is squared.distance.

standardize.arrays

Should the arrays first be standardized? Default is FALSE.

...

not used.

Details

Let $d_{ii'j}$ denote the dissimilarity between observations $i$ and $i'$ along feature $j$.

Sparse hierarchical clustering seeks a p-vector of weights $w$ (one per feature) and an nxn matrix $U$ that optimize $\mathrm{maximize}_{U,w} \sum_j w_j \sum_{i,i'} d_{ii'j} U_{ii'}$ subject to $\|w\|_2 \le 1$, $\|w\|_1 \le s$, $w_j \ge 0$, $\sum_{i,i'} U_{ii'}^2 \le 1$, where $s$ is a value for the L1 bound on $w$. Let $O(s)$ denote the objective function with tuning parameter $s$: i.e. $O(s)=\sum_j w_j \sum_{i,i'} d_{ii'j} U_{ii'}$.

We permute the data as follows: within each feature, we permute the observations. Using the permuted data, we can run sparse hierarchical clustering with tuning parameter $s$, yielding the objective function $O^*(s)$. If we do this repeatedly we can get a number of $O^*(s)$ values.

Then, the Gap statistic is given by $\mathrm{Gap}(s)=\log(O(s))-\mathrm{mean}(\log(O^*(s)))$. The optimal $s$ is that which results in the highest Gap statistic. Or, we can choose the smallest $s$ such that its Gap statistic is within $\mathrm{sd}(\log(O^*(s)))$ of the largest Gap statistic.
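
For instance, the conservative rule in the last sentence can be applied directly to the gaps, sdgaps, and wbounds components documented below:

set.seed(1)
x <- matrix(rnorm(100*50),ncol=50)
perm.out <- HierarchicalSparseCluster.permute(x, wbounds=c(1.5,2:6), nperms=5)
# Default choice: the tuning parameter with the largest gap statistic
perm.out$bestw
# Conservative alternative: the smallest wbound whose gap statistic is
# within one standard deviation of the largest gap
best <- which.max(perm.out$gaps)
min(perm.out$wbounds[perm.out$gaps >= perm.out$gaps[best] - perm.out$sdgaps[best]])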

Value

gaps

The gap statistics obtained (one for each of the tuning parameters tried). If O(s) is the objective function evaluated at the tuning parameter s, and O*(s) is the same quantity but for the permuted data, then Gap(s)=log(O(s))-mean(log(O*(s))).

sdgaps

The standard deviation of log(O*(s)), for each value of the tuning parameter s.

nnonzerows

The number of features with non-zero weights, for each value of the tuning parameter.

wbounds

The tuning parameters considered.

bestw

The value of the tuning parameter corresponding to the highest gap statistic.

Author(s)

Daniela M. Witten and Robert Tibshirani

References

Witten and Tibshirani (2010) A framework for feature selection in clustering. Journal of the American Statistical Association 105(490): 713-726.

See Also

HierarchicalSparseCluster, KMeansSparseCluster, KMeansSparseCluster.permute

Examples

# Generate 2-class data
  set.seed(1)
  x <- matrix(rnorm(100*50),ncol=50)
  y <- c(rep(1,50),rep(2,50))
  x[y==1,1:25] <- x[y==1,1:25]+2
  # Do tuning parameter selection for sparse hierarchical clustering
  perm.out <- HierarchicalSparseCluster.permute(x, wbounds=c(1.5,2:6),
    nperms=5)
  print(perm.out)
  plot(perm.out)
  # Perform sparse hierarchical clustering
  sparsehc <- HierarchicalSparseCluster(dists=perm.out$dists,
    wbound=perm.out$bestw, method="complete")
  par(mfrow=c(1,2))
  plot(sparsehc)
  plot(sparsehc$hc, labels=rep("", length(y)))
  print(sparsehc)
  # Plot using knowledge of class labels in order to compare true class
  #   labels to clustering obtained
  par(mfrow=c(1,1))
  ColorDendrogram(sparsehc$hc, y=y, main="My Simulated Data",
    branchlength=.007)

A wrapper for the hierarchical sparse clustering algorithm

Description

A wrapper for HierarchicalSparseCluster which reads in the data in GCT file format, and then automatically chooses the optimal tuning parameter value using HierarchicalSparseCluster.permute if not specified.

Usage

HierarchicalSparseCluster.wrapper(file,
  method=c("average","complete","single","centroid"),
  wbound=NULL, silent=FALSE, cluster.features=FALSE,
  method.features=c("average","complete","single","centroid"),
  output.cluster.files=TRUE, outputfile.prefix=NULL,
  maxnumgenes=5000, standardize.arrays=TRUE)

Arguments

file

A GCT filename in the working directory containing the data to be clustered.

method

The type of linkage to use in the hierarchical clustering - "single", "complete", "average", or "centroid".

wbound

The L1 bound on w to use; this is the tuning parameter for sparse hierarchical clustering. If NULL, then it will be chosen via HierarchicalSparseCluster.permute.

silent

If FALSE, progress is printed as the algorithm runs; if TRUE, printing is suppressed.

cluster.features

Is a clustering for the features with non-zero weights also desired? Default is FALSE.

method.features

If cluster.features is TRUE, then the type of linkage used to cluster the features with non-zero weights: one of "single", "complete", "average", or "centroid".

output.cluster.files

Should files containing the clustering be output? Default is TRUE.

outputfile.prefix

The prefix for the output files. If NULL, then the prefix of the input file is used.

maxnumgenes

Limit the analysis to some number of genes with highest marginal variance, for computational reasons. This is recommended when the number of genes is very large. If NULL, then all genes are used.

standardize.arrays

Should the arrays first be standardized? Default is TRUE.

Value

hc

The output of a call to "hclust", giving the results of hierarchical sparse clustering.

ws

The p-vector of feature weights.

u

The nxn dissimilarity matrix passed into hclust, of the form $\left(\sum_j w_j d_{ii'j}\right)_{ii'}$.

dists

The (n*n)xp dissimilarity matrix for the data matrix x. This is useful if additional calls to HierarchicalSparseCluster will be made.

Author(s)

Daniela M. Witten and Robert Tibshirani

References

Witten and Tibshirani (2010) A framework for feature selection in clustering. Journal of the American Statistical Association 105(490): 713-726.

See Also

HierarchicalSparseCluster.permute, KMeansSparseCluster, KMeansSparseCluster.permute
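
Examples

A minimal usage sketch; "mydata.gct" is a hypothetical file name, so a real GCT file in the working directory must be substituted before this will run:

## Not run: 
out <- HierarchicalSparseCluster.wrapper("mydata.gct", method="complete",
  wbound=NULL, maxnumgenes=5000, standardize.arrays=TRUE)
# The returned object has the same components as HierarchicalSparseCluster
plot(out$hc)
## End(Not run)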


Performs sparse k-means clustering

Description

This function performs sparse k-means clustering. You must specify a number of clusters K and an L1 bound on w, the feature weights.

Usage

KMeansSparseCluster(x, K=NULL, wbounds = NULL, nstart = 20,
  silent = FALSE, maxiter=6, centers=NULL)
## S3 method for class 'KMeansSparseCluster'
plot(x,...)
## S3 method for class 'KMeansSparseCluster'
print(x,...)

Arguments

x

An nxp data matrix. There are n observations and p features.

K

The number of clusters desired ("K" in K-means clustering). Must provide either K or centers.

wbounds

A single L1 bound on w (the feature weights), or a vector of L1 bounds on w. If wbound is small, then few features will have non-zero weights. If wbound is large then all features will have non-zero weights. Should be greater than 1.

nstart

The number of random starts for the k-means algorithm.

silent

If FALSE, progress is printed as the algorithm runs; if TRUE, printing is suppressed.

maxiter

The maximum number of iterations.

centers

Optional argument. If you want to run the k-means algorithm starting from a particular set of clusters, then you can enter the Kxp matrix of cluster centers here. Default use case involves taking centers=NULL and instead specifying K.

...

not used.

Details

We seek a p-vector of weights $w$ (one per feature) and a set of clusters $C_1,\ldots,C_K$ that optimize

$\mathrm{maximize}_{C_1,\ldots,C_K,w} \sum_j w_j \mathrm{BCSS}_j$ subject to $\|w\|_2 \le 1$, $\|w\|_1 \le \mathrm{wbound}$, $w_j \ge 0$,

where $\mathrm{BCSS}_j$ is the between-cluster sum of squares for feature $j$. An iterative approach is taken: with $w$ fixed, optimize with respect to $C_1,\ldots,C_K$, and with $C_1,\ldots,C_K$ fixed, optimize with respect to $w$. Here, wbound is a tuning parameter which determines the L1 bound on $w$.

The non-zero elements of w indicate features that are used in the sparse clustering.
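
For intuition, $\mathrm{BCSS}_j$ can be computed for a fixed clustering as the total sum of squares minus the within-cluster sum of squares along feature $j$. A minimal sketch (an illustration, not the package's internal code; bcss is a hypothetical name):

bcss <- function(x, Cs) {
  # Between-cluster sum of squares per feature: TSS_j - WCSS_j
  tss <- colSums(scale(x, center=TRUE, scale=FALSE)^2)
  wcss <- rep(0, ncol(x))
  for (k in unique(Cs)) {
    xk <- x[Cs == k, , drop=FALSE]
    wcss <- wcss + colSums(scale(xk, center=TRUE, scale=FALSE)^2)
  }
  tss - wcss
}

Features whose between-cluster sum of squares is large relative to the others receive the largest weights $w_j$.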

Value

If wbounds is a vector, then a list is returned with one element per value of wbounds, each containing the components below. If wbounds is a single value, then the components below are returned directly:

ws

The p-vector of feature weights.

Cs

The clustering obtained.

Author(s)

Daniela M. Witten and Robert Tibshirani

References

Witten and Tibshirani (2010) A framework for feature selection in clustering. Journal of the American Statistical Association 105(490): 713-726.

See Also

KMeansSparseCluster.permute, HierarchicalSparseCluster

Examples

# generate data
set.seed(11)
x <- matrix(rnorm(50*70),ncol=70)
x[1:25,1:20] <- x[1:25,1:20]+1
x <- scale(x, TRUE, TRUE)
# choose tuning parameter
km.perm <- KMeansSparseCluster.permute(x,K=2,wbounds=seq(3,7,len=15),nperms=5)
print(km.perm)
plot(km.perm)
# run sparse k-means
km.out <- KMeansSparseCluster(x,K=2,wbounds=km.perm$bestw)
print(km.out)
plot(km.out)
# run sparse k-means for a range of tuning parameter values
km.out <- KMeansSparseCluster(x,K=2,wbounds=seq(1.3,4,len=8))
print(km.out)
plot(km.out)
# Run sparse k-means starting from a particular set of cluster centers
#in the k-means algorithm.
km.out <- KMeansSparseCluster(x,wbounds=2:7,centers=x[c(1,3,5),])

Choose tuning parameter for sparse k-means clustering

Description

The tuning parameter controls the L1 bound on w, the feature weights. A permutation approach is used to select the tuning parameter.

Usage

KMeansSparseCluster.permute(x, K=NULL, nperms = 25, wbounds = NULL,
  silent = FALSE, nvals = 10, centers=NULL)
## S3 method for class 'KMeansSparseCluster.permute'
print(x,...)
## S3 method for class 'KMeansSparseCluster.permute'
plot(x,...)

Arguments

x

The nxp data matrix; n is the number of observations and p the number of features.

K

The number of clusters desired - that is, the "K" in K-means clustering. Must specify K or centers.

nperms

Number of permutations.

wbounds

The range of tuning parameters to consider. This is the L1 bound on w, the feature weights. If NULL, then a range of values will be chosen automatically. Should be greater than 1 if non-null.

silent

If FALSE, progress is printed as the algorithm runs; if TRUE, printing is suppressed.

nvals

If wbounds is NULL, then the number of candidate tuning parameter values to consider.

centers

Optional argument. If you want to run the k-means algorithm starting from a particular set of clusters, then you can enter the Kxp matrix of cluster centers here. Default use case involves taking centers=NULL and instead specifying K.

...

not used.

Details

Sparse k-means clustering seeks a p-vector of weights $w$ (one per feature) and a set of clusters $C_1,\ldots,C_K$ that optimize $\mathrm{maximize}_{C_1,\ldots,C_K,w} \sum_j w_j \mathrm{BCSS}_j$ subject to $\|w\|_2 \le 1$, $\|w\|_1 \le s$, $w_j \ge 0$, where $\mathrm{BCSS}_j$ is the between-cluster sum of squares for feature $j$, and $s$ is a value for the L1 bound on $w$. Let $O(s)$ denote the objective function with tuning parameter $s$: i.e. $O(s)=\sum_j w_j \mathrm{BCSS}_j$.

We permute the data as follows: within each feature, we permute the observations. Using the permuted data, we can run sparse K-means with tuning parameter $s$, yielding the objective function $O^*(s)$. If we do this repeatedly we can get a number of $O^*(s)$ values.

Then, the Gap statistic is given by $\mathrm{Gap}(s)=\log(O(s))-\mathrm{mean}(\log(O^*(s)))$. The optimal $s$ is that which results in the highest Gap statistic. Or, we can choose the smallest $s$ such that its Gap statistic is within $\mathrm{sd}(\log(O^*(s)))$ of the largest Gap statistic.
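
The permutation scheme in the second paragraph, permuting the observations within each feature, is a one-liner in R (an illustration, not the package's internal code):

set.seed(11)
x <- matrix(rnorm(50*70),ncol=70)
# Permute the observations independently within each feature (column);
# running sparse K-means on xperm yields one O*(s) value per tuning parameter
xperm <- apply(x, 2, sample)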

Value

gaps

The gap statistics obtained (one for each of the tuning parameters tried). If O(s) is the objective function evaluated at the tuning parameter s, and O*(s) is the same quantity but for the permuted data, then Gap(s)=log(O(s))-mean(log(O*(s))).

sdgaps

The standard deviation of log(O*(s)), for each value of the tuning parameter s.

nnonzerows

The number of features with non-zero weights, for each value of the tuning parameter.

wbounds

The tuning parameters considered.

bestw

The value of the tuning parameter corresponding to the highest gap statistic.

Author(s)

Daniela M. Witten and Robert Tibshirani

References

Witten and Tibshirani (2010) A framework for feature selection in clustering. Journal of the American Statistical Association 105(490): 713-726.

See Also

KMeansSparseCluster, HierarchicalSparseCluster, HierarchicalSparseCluster.permute

Examples

# generate data
set.seed(11)
x <- matrix(rnorm(50*70),ncol=70)
x[1:25,1:10] <- x[1:25,1:10]+1.5
x <- scale(x, TRUE, TRUE)
# choose tuning parameter
km.perm <- KMeansSparseCluster.permute(x,K=2,wbounds=seq(2,5,len=8),nperms=3)
print(km.perm)
plot(km.perm)
# run sparse k-means
km.out <- KMeansSparseCluster(x,K=2,wbounds=km.perm$bestw)
print(km.out)
plot(km.out)
# run sparse k-means for a range of tuning parameter values
km.out <- KMeansSparseCluster(x,K=2,wbounds=2:7)
print(km.out)
plot(km.out)
# Repeat, but this time start from a particular choice of cluster centers.
# Passing centers=x[1:4,] gives 4-means clustering started from those centers.
km.perm.out <- KMeansSparseCluster.permute(x,wbounds=2:6, centers=x[1:4,],nperms=3)
print(km.perm.out)
plot(km.perm.out)