Title: | Hierarchical Clustering with P-Values via Multiscale Bootstrap Resampling |
---|---|
Description: | An implementation of multiscale bootstrap resampling for assessing the uncertainty in hierarchical cluster analysis. It provides SI (selective inference) p-value, AU (approximately unbiased) p-value and BP (bootstrap probability) value for each cluster in a dendrogram. |
Authors: | Ryota Suzuki <[email protected]>, Yoshikazu Terada <[email protected]>, Hidetoshi Shimodaira <[email protected]> |
Maintainer: | Ryota Suzuki <[email protected]> |
License: | GPL (>= 2) |
Version: | 2.2-0 |
Built: | 2025-01-10 07:07:40 UTC |
Source: | CRAN |
DNA Microarray data of 73 lung tissues including 67 lung tumors. There are 916 observations of genes for each lung tissue.
data(lung)
data(lung)
data frame of size .
This dataset has been modified from original data. Each one
observation of duplicate genes has been removed. See source
section in this help for original data source.
http://genome-www.stanford.edu/lung_cancer/adeno/
Garber, M. E. et al. (2001) "Diversity of gene expression in adenocarcinoma of the lung", Proceedings of the National Academy of Sciences, 98, 13784-13789.
## Reading the data data(lung) ## Multiscale Bootstrap Resampling lung.pv <- pvclust(lung, nboot=100) ## CAUTION: nboot=100 may be too small for actual use. ## We suggest nboot=1000 or larger. ## plot/print functions will be useful for diagnostics. ## Plot the result plot(lung.pv, cex=0.8, cex.pv=0.7) ask.bak <- par()$ask par(ask=TRUE) pvrect(lung.pv, alpha=0.9) msplot(lung.pv, edges=c(51,62,68,71)) par(ask=ask.bak) ## Print a cluster with high p-value lung.pp <- pvpick(lung.pv, alpha=0.9) lung.pp$clusters[[2]] ## Print its edge number lung.pp$edges[2] ## We recommend parallel computing for large dataset as this one ## Not run: library(snow) cl <- makeCluster(10, type="MPI") lung.pv <- parPvclust(cl, lung, nboot=1000) ## End(Not run)
## Reading the data data(lung) ## Multiscale Bootstrap Resampling lung.pv <- pvclust(lung, nboot=100) ## CAUTION: nboot=100 may be too small for actual use. ## We suggest nboot=1000 or larger. ## plot/print functions will be useful for diagnostics. ## Plot the result plot(lung.pv, cex=0.8, cex.pv=0.7) ask.bak <- par()$ask par(ask=TRUE) pvrect(lung.pv, alpha=0.9) msplot(lung.pv, edges=c(51,62,68,71)) par(ask=ask.bak) ## Print a cluster with high p-value lung.pp <- pvpick(lung.pv, alpha=0.9) lung.pp$clusters[[2]] ## Print its edge number lung.pp$edges[2] ## We recommend parallel computing for large dataset as this one ## Not run: library(snow) cl <- makeCluster(10, type="MPI") lung.pv <- parPvclust(cl, lung, nboot=1000) ## End(Not run)
msfit
performs curve fitting for multiscale
bootstrap resampling. It generates an object of class
msfit
. Several generic methods are available.
msfit(bp, r, nboot) ## S3 method for class 'msfit' plot(x, curve=TRUE, main=NULL, sub=NULL, xlab=NULL, ylab=NULL, ...) ## S3 method for class 'msfit' lines(x, col=2, lty=1, ...) ## S3 method for class 'msfit' summary(object, digits=3, ...)
msfit(bp, r, nboot) ## S3 method for class 'msfit' plot(x, curve=TRUE, main=NULL, sub=NULL, xlab=NULL, ylab=NULL, ...) ## S3 method for class 'msfit' lines(x, col=2, lty=1, ...) ## S3 method for class 'msfit' summary(object, digits=3, ...)
bp |
numeric vector of bootstrap probability values. |
r |
numeric vector of relative sample size of bootstrap samples
defined as |
nboot |
numeric value (vector) of the number of bootstrap replications. |
x |
object of class |
curve |
logical. If |
main , sub , xlab , ylab , col , lty
|
generic graphic parameters. |
object |
object of class |
digits |
integer indicating the precision to be used in rounding. |
... |
other parameters to be used in the functions. |
function msfit
performs the curve fitting for multiscale
bootstrap resampling. In package pvclust
this function is only
called from the function pvclust
(or parPvclust
), and
may never be called from users. However one can access a list of
msfit
objects by x$msfit
, where x
is an object of
class pvclust
.
msfit
returns an object of class msfit
. It contains
the following objects:
p |
numeric vector of |
se |
numeric vector of estimated standard errors of |
coef |
numeric vector related to geometric aspects of
hypotheses. |
df |
numeric value of the degree of freedom in curve fitting. |
rss |
residual sum of squares. |
pchi |
|
Ryota Suzuki [email protected]
Shimodaira, H. (2004) "Approximately unbiased tests of regions using multistep-multiscale bootstrap resampling", Annals of Statistics, 32, 2616-2641.
Shimodaira, H. (2002) "An approximately unbiased test of phylogenetic tree selection", Systematic Biology, 51, 492-508.
draws the results of curve fitting for pvclust object.
msplot(x, edges=NULL, ...)
msplot(x, edges=NULL, ...)
x |
object of class |
edges |
numeric vector of edge numbers to be plotted. |
... |
other parameters to be used in the function. |
Ryota Suzuki [email protected]
plot dendrogram for a pvclust
object and add -values for
clusters.
## S3 method for class 'pvclust' plot(x, print.pv=TRUE, print.num=TRUE, float=0.01, col.pv=c(si=4, au=2, bp=3, edge=8), cex.pv=0.8, font.pv=NULL, col=NULL, cex=NULL, font=NULL, lty=NULL, lwd=NULL, main=NULL, sub=NULL, xlab=NULL, ...) ## S3 method for class 'pvclust' text(x, col=c(au=2, bp=3, edge=8), print.num=TRUE, float=0.01, cex=NULL, font=NULL, ...)
## S3 method for class 'pvclust' plot(x, print.pv=TRUE, print.num=TRUE, float=0.01, col.pv=c(si=4, au=2, bp=3, edge=8), cex.pv=0.8, font.pv=NULL, col=NULL, cex=NULL, font=NULL, lty=NULL, lwd=NULL, main=NULL, sub=NULL, xlab=NULL, ...) ## S3 method for class 'pvclust' text(x, col=c(au=2, bp=3, edge=8), print.num=TRUE, float=0.01, cex=NULL, font=NULL, ...)
x |
object of class |
print.pv |
logical flag to specify whether print |
print.num |
logical flag to specify whether print edge numbers below clusters. |
float |
numeric value to adjust the height of |
col.pv |
named numeric vector to specify the colors for
|
cex.pv |
numeric value which specifies the size of characters for
|
font.pv |
numeric value which specifies the font of characters
for |
col , cex , font
|
in |
lty , lwd , main , sub , xlab , ...
|
generic graphic parameters. See |
This function plots a dendrogram with -values for given object
of class
pvclust
.
SI -value (printed in blue color in default) is the approximately unbiased
-value for selective inference, and
AU
-value (printed in red color in default) is also the approximately unbiased
-value but for non-selective inference. They ared calculated by multiscale bootstrap
resampling. BP value (printed in green color in default) is "bootstrap
probability" value, which is less accurate than AU value as
-value. One can consider that clusters (edges) with high SI or AU
values (e.g. 95%) are strongly supported by data.
SI value is newly introduced in Terada and Shimodaira (2017) for selective inference,
which is more appropriate for testing clusters identified by looking at the tree.
AU value has been used since Shimodaira (2002), which is not designed for selective inference.
AU is valid when you know the clusters before looking at the data.
See also documatation (Multiscale Bootstrap using Scaleboot Package, verison 0.4-0 or higher) in
scaleboot
package.
Ryota Suzuki [email protected]
Terada, Y. and Shimodaira, H. (2007) "Selective inference for the problem of regions via multiscale bootstrap", arXiv:1711.00949.
Shimodaira, H. (2004) "Approximately unbiased tests of regions using multistep-multiscale bootstrap resampling", Annals of Statistics, 32, 2616-2641.
Shimodaira, H. (2002) "An approximately unbiased test of phylogenetic tree selection", Systematic Biology, 51, 492-508.
print clustering method and distance measure used in
hierarchical clustering, -values and related statistics for
a
pvclust
object.
## S3 method for class 'pvclust' print(x, which=NULL, digits=3, ...)
## S3 method for class 'pvclust' print(x, which=NULL, digits=3, ...)
x |
object of class |
which |
numeric vector which specifies the numbers of edges
(clusters) of which the values are printed. If |
digits |
integer indicating the precision to be used in rounding. |
... |
other parameters used in the function. |
this function prints -values and some related
statistics.
au |
AU (Approximately Unbiased) |
bp |
BP (Bootstrap Probability) value, which is a simple
statistic computed by bootstrap resampling. This value tends to be
biased as |
se.au , se.bp
|
estimated standard errors for |
v , c
|
values related to geometric aspects of
hypotheses. |
pchi |
|
Ryota Suzuki [email protected]
calculates -values for hierarchical clustering via
multiscale bootstrap resampling. Hierarchical clustering is done for
given data and
-values are computed for each of the clusters.
pvclust(data, method.hclust="average", method.dist="correlation", use.cor="pairwise.complete.obs", nboot=1000, parallel=FALSE, r=seq(.5,1.4,by=.1), store=FALSE, weight=FALSE, iseed=NULL, quiet=FALSE) parPvclust(cl=NULL, data, method.hclust="average", method.dist="correlation", use.cor="pairwise.complete.obs", nboot=1000, r=seq(.5,1.4,by=.1), store=FALSE, weight=FALSE, init.rand=NULL, iseed=NULL, quiet=FALSE)
pvclust(data, method.hclust="average", method.dist="correlation", use.cor="pairwise.complete.obs", nboot=1000, parallel=FALSE, r=seq(.5,1.4,by=.1), store=FALSE, weight=FALSE, iseed=NULL, quiet=FALSE) parPvclust(cl=NULL, data, method.hclust="average", method.dist="correlation", use.cor="pairwise.complete.obs", nboot=1000, r=seq(.5,1.4,by=.1), store=FALSE, weight=FALSE, init.rand=NULL, iseed=NULL, quiet=FALSE)
data |
numeric data matrix or data frame. |
method.hclust |
the agglomerative method used in hierarchical clustering. This
should be (an abbreviation of) one of |
method.dist |
the distance measure to be used. This should be
a character string, or a function which returns a |
use.cor |
character string which specifies the method for
computing correlation with data including missing values. This
should be (an abbreviation of) one of |
nboot |
the number of bootstrap replications. The default is
|
parallel |
switch for parallel computation.
If |
r |
numeric vector which specifies the relative sample sizes of
bootstrap replications. For original sample size |
store |
locical. If |
cl |
a cluster object created by package parallel or snow. If NULL, use the registered default cluster. |
weight |
logical. If |
init.rand |
logical. If |
iseed |
An integer. If non- |
quiet |
logical. If |
Function pvclust
conducts multiscale bootstrap resampling to calculate
-values for each cluster in the result of hierarchical
clustering.
parPvclust
is the parallel version of this
procedure which depends on package parallel for parallel computation.
For data expressed as matrix or data frame, we
assume that the data is
observations of
objects, which
are to be clustered. The
'th row vector corresponds to the
'th observation of these objects and the
'th column
vector corresponds to a sample of
'th object with size
.
There are several methods to measure the dissimilarities between
objects. For data matrix ,
"correlation"
method takes
for dissimilarity between 'th and
'th object, where
.
"uncentered"
takes uncentered sample correlation
and "abscor"
takes the absolute value of sample correlation
hclust |
hierarchical clustering for original data generated by
function |
edges |
data frame object which contains |
count |
data frame object which contains primitive information about the result of multiscale bootstrap resampling. |
msfit |
list whose elements are results of curve fitting for
multiscale bootstrap resampling, of class |
nboot |
numeric vector of number of bootstrap replications. |
r |
numeric vector of the relative sample size for bootstrap replications. |
store |
list contains bootstrap replications if |
version |
|
Ryota Suzuki [email protected]
Suzuki, R. and Shimodaira, H. (2006) "Pvclust: an R package for assessing the uncertainty in hierarchical clustering", Bioinformatics, 22 (12): 1540-1542.
Shimodaira, H. (2004) "Approximately unbiased tests of regions using multistep-multiscale bootstrap resampling", Annals of Statistics, 32, 2616-2641.
Shimodaira, H. (2002) "An approximately unbiased test of phylogenetic tree selection", Systematic Biology, 51, 492-508.
Suzuki, R. and Shimodaira, H. (2004) "An application of multiscale bootstrap resampling to hierarchical clustering of microarray data: How accurate are these clusters?", The Fifteenth International Conference on Genome Informatics 2004, P034.
http://www.sigmath.es.osaka-u.ac.jp/shimo-lab/prog/pvclust/
lines.pvclust
, print.pvclust
,
msfit
, plot.pvclust
,
text.pvclust
, pvrect
and
pvpick
.
### example using Boston data in package MASS data(Boston, package = "MASS") ## multiscale bootstrap resampling (non-parallel) boston.pv <- pvclust(Boston, nboot=100, parallel=FALSE) ## CAUTION: nboot=100 may be too small for actual use. ## We suggest nboot=1000 or larger. ## plot/print functions will be useful for diagnostics. ## plot dendrogram with p-values plot(boston.pv) ask.bak <- par()$ask par(ask=TRUE) ## highlight clusters with high au p-values pvrect(boston.pv) ## print the result of multiscale bootstrap resampling print(boston.pv, digits=3) ## plot diagnostic for curve fitting msplot(boston.pv, edges=c(2,4,6,7)) par(ask=ask.bak) ## print clusters with high p-values boston.pp <- pvpick(boston.pv) boston.pp ### Using a custom distance measure ## Define a distance function which returns an object of class "dist". ## The function must have only one argument "x" (data matrix or data.frame). cosine <- function(x) { x <- as.matrix(x) y <- t(x) %*% x res <- 1 - y / (sqrt(diag(y)) %*% t(sqrt(diag(y)))) res <- as.dist(res) attr(res, "method") <- "cosine" return(res) } result <- pvclust(Boston, method.dist=cosine, nboot=100) plot(result) ## Not run: ### parallel computation result.par <- pvclust(Boston, nboot=1000, parallel=TRUE) plot(result.par) ## End(Not run)
### example using Boston data in package MASS data(Boston, package = "MASS") ## multiscale bootstrap resampling (non-parallel) boston.pv <- pvclust(Boston, nboot=100, parallel=FALSE) ## CAUTION: nboot=100 may be too small for actual use. ## We suggest nboot=1000 or larger. ## plot/print functions will be useful for diagnostics. ## plot dendrogram with p-values plot(boston.pv) ask.bak <- par()$ask par(ask=TRUE) ## highlight clusters with high au p-values pvrect(boston.pv) ## print the result of multiscale bootstrap resampling print(boston.pv, digits=3) ## plot diagnostic for curve fitting msplot(boston.pv, edges=c(2,4,6,7)) par(ask=ask.bak) ## print clusters with high p-values boston.pp <- pvpick(boston.pv) boston.pp ### Using a custom distance measure ## Define a distance function which returns an object of class "dist". ## The function must have only one argument "x" (data matrix or data.frame). cosine <- function(x) { x <- as.matrix(x) y <- t(x) %*% x res <- 1 - y / (sqrt(diag(y)) %*% t(sqrt(diag(y)))) res <- as.dist(res) attr(res, "method") <- "cosine" return(res) } result <- pvclust(Boston, method.dist=cosine, nboot=100) plot(result) ## Not run: ### parallel computation result.par <- pvclust(Boston, nboot=1000, parallel=TRUE) plot(result.par) ## End(Not run)
find clusters with relatively high/low
-values.
pvrect
and lines
(S3 method for class
pvclust
) highlight such clusters in existing plot, and
pvpick
returns a list of such clusters.
pvpick(x, alpha=0.95, pv="au", type="geq", max.only=TRUE) pvrect(x, alpha=0.95, pv="au", type="geq", max.only=TRUE, border=NULL, ...) ## S3 method for class 'pvclust' lines(x, alpha=0.95, pv="au", type="geq", col=2, lwd=2, ...)
pvpick(x, alpha=0.95, pv="au", type="geq", max.only=TRUE) pvrect(x, alpha=0.95, pv="au", type="geq", max.only=TRUE, border=NULL, ...) ## S3 method for class 'pvclust' lines(x, alpha=0.95, pv="au", type="geq", col=2, lwd=2, ...)
x |
object of class |
alpha |
threshold value for |
pv |
character string which specifies the |
type |
one of |
max.only |
logical. If some of clusters with high/low
|
border |
numeric value which specifies the color of borders of rectangles. |
col |
numeric value which specifies the color of lines. |
lwd |
numeric value which specifies the width of lines. |
... |
other graphic parameters to be used. |
pvpick
returns a list which contains the following values.
clusters |
a list of character string vectors. Each vector corresponds to the names of objects in each cluster. |
edges |
numeric vector of edge numbers. The |
Ryota Suzuki [email protected]
draws diagnostic plot for standard error of -value
for pvclust object.
seplot(object, type=c("au", "si", "bp"), identify=FALSE, main=NULL, xlab=NULL, ylab=NULL, ...)
seplot(object, type=c("au", "si", "bp"), identify=FALSE, main=NULL, xlab=NULL, ylab=NULL, ...)
object |
object of class |
type |
the type of |
identify |
logical. If |
main , xlab , ylab
|
generic graphic parameters. See
|
... |
other graphical parameters to be passed to generic
|
Ryota Suzuki [email protected]