Title: | Statistical Significance of Clustering |
---|---|
Description: | SigClust is a statistical method for testing the significance of clustering results. SigClust can be applied to assess the statistical significance of splitting a data set into two clusters. For more than two clusters, SigClust can be used iteratively. |
Authors: | Hanwen Huang, Yufeng Liu & J. S. Marron |
Maintainer: | Hanwen Huang <[email protected]> |
License: | GPL (>= 2) |
Version: | 1.1.0.1 |
Built: | 2024-11-01 11:45:54 UTC |
Source: | CRAN |
Diagnostics and p-value plots from a sigclust object.
## S4 method for signature 'sigclust,missing' plot(x,y,arg="all",...)
## S4 method for signature 'sigclust,missing' plot(x,y,arg="all",...)
x |
An object of class |
y |
not used |
arg |
Type of the individual plot: "background": make background standard deviation diagnostic plots. These plots contain the raw data points as well as the corresponding density plots using kernel and robust Gaussian fits; "qq": the QQ plot assessing the quality of robust fit of a Gaussian distribution; "diag": make a null distribution covariance estimation diagnostic plot; "pvalue": make a clustering significance pvalue plot; "all": make all above plots (default). |
... |
further arguments for |
SigClust diagnostic plots are suggested to monitor the performance of the SigClust method for a given dataset.
Hanwen Huang: [email protected]; Yufeng Liu: [email protected]; J. S. Marron: [email protected]
Liu, Yufeng, Hayes, David Neil, Nobel, Andrew and Marron, J. S, 2008, Statistical Significance of Clustering for High-Dimension, Low-Sample Size Data, Journal of the American Statistical Association 103(483) 1281–1293. See also the vignette included with this package.
## Simulate a dataset from a collection of mixtures of two ## multivariate Gaussian distributions with different means. mu <- 5 n <- 30 p <- 500 dat <- matrix(rnorm(p*2*n),2*n,p) dat[1:n,1] <- dat[1:n,1]+mu dat[(n+1):(2*n),1] <- dat[(n+1):(2*n),1]-mu nsim <- 1000 nrep <- 1 icovest <- 3 pvalue <- sigclust(dat,nsim=nsim,nrep=nrep,labflag=0,icovest=icovest) #sigclust plot plot(pvalue)
## Simulate a dataset from a collection of mixtures of two ## multivariate Gaussian distributions with different means. mu <- 5 n <- 30 p <- 500 dat <- matrix(rnorm(p*2*n),2*n,p) dat[1:n,1] <- dat[1:n,1]+mu dat[(n+1):(2*n),1] <- dat[(n+1):(2*n),1]-mu nsim <- 1000 nrep <- 1 icovest <- 3 pvalue <- sigclust(dat,nsim=nsim,nrep=nrep,labflag=0,icovest=icovest) #sigclust plot plot(pvalue)
Perform a significance analysis of clustering. SigClust studies whether clusters are really there, using the 2-means (k = 2) clustering index as a statistic. It assesses the significance of clustering by simulation from a single null Gaussian distribution. Null Gaussian parameters are estimated from the data.
sigclust(x, nsim, nrep=1, labflag=0, label=0, icovest=1)
sigclust(x, nsim, nrep=1, labflag=0, label=0, icovest=1)
x |
A matrix or data.frame of expression data; each row corresponds to a sample and each column to a variable. Data may be properly normalized and may not contain missing values. |
nsim |
Number of simulated Gaussian samples to estimate the distribution of the clustering index for the main p-value computation. |
nrep |
Number of steps to use in 2-means clustering computations (default=1, chosen to optimize speed). This has no effect, unless labflag=0. |
labflag |
An indicator variable specifying if the p-values is for an assigned cluster or for using 2-means; for user assigned clusters labflag=1, otherwise labflag=0. |
label |
If
labflag=0, SigClust uses labels generated by 2-means clustering. If
labflag=1, label needs to be set as a numeric, integer vector of 1s and
2s with length |
icovest |
Covariance estimation type: 1. Use a soft threshold method as constrained MLE (default); 2. Use sample covariance estimate (recommended when diagnostics fail); 3. Use original background noise thresholded estimate (from Liu, et al, (2008)) ("hard thresholding"). |
The SigClust method addresses the problem of assessing
statistical significance of clustering as a testing procedure. The
null hypothesis of SigClust
is that the data are from a single
Gaussian distribution. The signicance of a given clustering is judged
by calculating an appropriate p-value. The SigClust method uses a test
statistic called the cluster index (CI) which is defined to be the sum
of within-class sums of squares about the mean divided by the total
sum of squares about the overall mean. The null distribution of the CI
can be approximated by simulating from a single Gaussian distribution,
fit to the data. Because CI is mean shift invariant, it is enough to
take the mean to be 0. Because CI is rotation invariant, we take the
covariance to be diagonal. There are three options for estimating the
eigenvalues of the covariance matrix: 1. Soft Thresholding
(recommended for high dimensions, when the diagnostics indicate
assumptions are met). 2. Sample eigenvalues (recommended for low
dimensions, and when assumptions, such as Gaussianity fail, but known
to be generally conservative). 3. Hard Thresholding.
The function returns an object of class sigclust
. See
help for sigclust-class
for more details.
Hanwen Huang: [email protected]; Yufeng Liu: [email protected]; J. S. Marron: [email protected]
Liu, Yufeng, Hayes, David Neil, Nobel, Andrew and Marron, J. S, 2008, Statistical Significance of Clustering for High-Dimension, Low-Sample Size Data, Journal of the American Statistical Association 103(483) 1281–1293.
## Simulate a dataset from a collection of mixtures of two ## multivariate Gaussian distribution with different means. mu <- 5 n <- 30 p <- 500 dat <- matrix(rnorm(p*2*n),2*n,p) dat[1:n,1] <- dat[1:n,1]+mu dat[(n+1):(2*n),1] <- dat[(n+1):(2*n),1]-mu nsim <- 1000 nrep <- 1 icovest <- 3 pvalue <- sigclust(dat,nsim=nsim,nrep=nrep,labflag=0,icovest=icovest) #sigclust plot plot(pvalue)
## Simulate a dataset from a collection of mixtures of two ## multivariate Gaussian distribution with different means. mu <- 5 n <- 30 p <- 500 dat <- matrix(rnorm(p*2*n),2*n,p) dat[1:n,1] <- dat[1:n,1]+mu dat[(n+1):(2*n),1] <- dat[(n+1):(2*n),1]-mu nsim <- 1000 nrep <- 1 icovest <- 3 pvalue <- sigclust(dat,nsim=nsim,nrep=nrep,labflag=0,icovest=icovest) #sigclust plot plot(pvalue)
The class sigclust is the output from the function
sigclust
. It is also the input to the plot function
plot-methods
.
raw.data
:raw data matrix.
veigval
:vector of sample eigen values.
vsimeigval
:vector of eigen values used in simulation
simbackvar
:background variance fit from the data.
icovest
:covariance estimation type.
nsim
:number of simulated Gaussian samples.
simcindex
:vector of cluster indices based on nsim simulated data sets.
pval
:simulated sigclust p-value based on empiriccal quantiles.
pvalnorm
:simulated sigclust p-value based on Gaussian quantiles.
xcindex
:cluster index based on given data set.
Hanwen Huang: [email protected]; Yufeng Liu: [email protected]; J. S. Marron: [email protected]