Title: | Classification and Clustering of Sequencing Data Based on a Poisson Model |
---|---|
Description: | Implements the methods described in the paper, Witten (2011) Classification and Clustering of Sequencing Data using a Poisson Model, Annals of Applied Statistics 5(4) 2493-2518. |
Authors: | Daniela Witten |
Maintainer: | Daniela Witten <[email protected]> |
License: | GPL-2 |
Version: | 1.0.2.1 |
Built: | 2024-12-16 06:42:40 UTC |
Source: | CRAN |
A simple approach for performing classification and clustering of samples for which RNA sequencing data is available. Based upon a simple Poisson model proposed by a number of authors (e.g. Marioni et al Genome Research 2008, Bullard et al BMC Bioinformatics 2010, and others).
Package: | PoiClaClu |
Type: | Package |
Version: | 1.0.2 |
Date: | 2013-01-02 |
License: | GPL-2 |
LazyLoad: | yes |
Daniela Witten
Maintainer: Daniela Witten <[email protected]>
D. Witten (2011) Classification and clustering of sequencing data using a Poisson model. Annals of Applied Statistics 5(4): 2493-2518.
# Poisson clustering # set.seed(1) dat <- CountDataSet(n=20,p=100,sdsignal=.5,K=4,param=10) dd <- PoissonDistance(dat$x, type="mle") print(dd) ColorDendrogram(hclust(dd$dd), y=dat$y) # Poisson classification # set.seed(1) dat <- CountDataSet(n=20,p=100,sdsignal=.1,K=4,param=10) out <- Classify(x=dat$x,y=dat$y,xte=dat$xte,rhos=c(0,5,10)) print(out)
# Poisson clustering # set.seed(1) dat <- CountDataSet(n=20,p=100,sdsignal=.5,K=4,param=10) dd <- PoissonDistance(dat$x, type="mle") print(dd) ColorDendrogram(hclust(dd$dd), y=dat$y) # Poisson classification # set.seed(1) dat <- CountDataSet(n=20,p=100,sdsignal=.1,K=4,param=10) out <- Classify(x=dat$x,y=dat$y,xte=dat$xte,rhos=c(0,5,10)) print(out)
Classify observations using a simple Poisson model. This function implements the "sparse Poisson linear discriminant analysis classifier", which is similar to linear discriminant analysis but assumes a Poisson model rather than a Gaussian model for the data. The classifier soft-thresholds the estimated effect of each feature in order to achieve sparsity.
Classify(x, y, xte=NULL, rho = 0, beta = 1, rhos = NULL, type=c("mle","deseq","quantile"), prior = NULL, transform=TRUE, alpha=NULL)
Classify(x, y, xte=NULL, rho = 0, beta = 1, rhos = NULL, type=c("mle","deseq","quantile"), prior = NULL, transform=TRUE, alpha=NULL)
x |
A n-by-p training data matrix; n observations and p features. Used to train the classifier. |
y |
A numeric vector of class labels of length n: 1, 2, ...., K if there are K classes. Each element of y corresponds to a row of x; i.e. these are the class labels for the observations in x. |
xte |
A m-by-p data matrix: m test observations and p features. The classifier fit on the training data set x will be tested on this data set. If NULL, then testing will be performed on the training set. |
rho |
Tuning parameter controlling the amount of soft thresholding performed, i.e. the level of sparsity, i.e. number of nonzero features in classifier. Rho=0 means that there is no soft-thresolding, i.e. all features used in classifier. Larger rho means that fewer features will be used. |
beta |
A smoothing term. A Gamma(beta,beta) prior is used to fit the Poisson model. Recommendation is to just leave it at 1, the default value. |
rhos |
A vector of tuning parameters that control the amount of soft thresholding performed. If "rhos" is provided then a number of models will be fit (one for each element of "rhos"), and a number of predicted class labels will be output (one for each element of "rhos"). |
type |
How should the observations be normalized within the Poisson model, i.e. how should the size factors be estimated? Options are "quantile" or "deseq" (more robust) or "mle" (less robust). In greater detail: "quantile" is quantile normalization approach of Bullard et al 2010 BMC Bioinformatics, "deseq" is median of the ratio of an observation to a pseudoreference obtained by taking the geometric mean, described in Anders and Huber 2010 Genome Biology and implemented in Bioconductor package "DESeq", and "mle" is the sum of counts for each sample; this is the maximum likelihood estimate under a simple Poisson model. |
prior |
Vector of length equal to the number of classes, representing prior probabilities for each class. If NULL then uniform priors are used (i.e. each class is equally likely). |
transform |
Should data matrices x and xte first be power transformed so that it more closely fits the Poisson model? TRUE or FALSE. Power transformation is especially useful if the data are overdispersed relative to the Poisson model. |
alpha |
If transform=TRUE, this determines the power to which the data matrices x and xte are transformed. If alpha=NULL then the transformation that makes the Poisson model best fit the data matrix x is computed. (Note that alpha is computed based on x, not based on xte). Or a value of alpha, 0<alpha<=1, can be entered by the user. |
ytehat |
The predicted class labels for each of the test observations (rows of xte). |
discriminant |
A m-by-K matrix, where K is the number of classes. The (i,k) element is large if the ith element of xte belongs to class k. |
ds |
A K-by-p matrix indicating the extent to which each feature is under- or over-expressed in each class. The (k,j) element is >1 if feature j is over-expressed in class k, and is <1 if feature j is under-expressed in class k. When rho is large then many of the elemtns of this matrix are shrunken towards 1 (no over- or under-expression). |
alpha |
Power transformation used (if transform=TRUE). |
Daniela Witten
D Witten (2011) Classification and clustering of sequencing data using a Poisson model. To appear in Annals of Applied Statistics.
set.seed(1) dat <- CountDataSet(n=40,p=500,sdsignal=.1,K=3,param=10) cv.out <- Classify.cv(dat$x,dat$y) print(cv.out) out <- Classify(dat$x,dat$y,dat$xte,rho=cv.out$bestrho) print(out) cat("Confusion matrix for predicted and true test class labels:", fill=TRUE) print(table(out$ytehat,dat$yte))
set.seed(1) dat <- CountDataSet(n=40,p=500,sdsignal=.1,K=3,param=10) cv.out <- Classify.cv(dat$x,dat$y) print(cv.out) out <- Classify(dat$x,dat$y,dat$xte,rho=cv.out$bestrho) print(out) cat("Confusion matrix for predicted and true test class labels:", fill=TRUE) print(table(out$ytehat,dat$yte))
Perform cross-validation for the function that implements the "sparse Poisson linear discriminant analysis classifier", which is similar to linear discriminant analysis but assumes a Poisson model rather than a Gaussian model for the data. The classifier soft-thresholds the estimated effect of each feature in order to achieve sparsity. This cross-validation function selects the proper value of the tuning parameter that controls the level of soft-thresholding.
Classify.cv(x, y, rhos = NULL, beta = 1, nfolds = 5, type=c("mle","deseq","quantile"), folds = NULL, transform=TRUE, alpha=NULL, prior=NULL)
Classify.cv(x, y, rhos = NULL, beta = 1, nfolds = 5, type=c("mle","deseq","quantile"), folds = NULL, transform=TRUE, alpha=NULL, prior=NULL)
x |
A n-by-p training data matrix; n observations and p features. |
y |
A numeric vector of class labels of length n: 1, 2, ...., K if there are K classes. Each element of y corresponds to a row of x; i.e. these are the class labels for the observations in x. |
rhos |
A vector of tuning parameters to try out in cross-validation. Rho controls the level of shrinkage performed, i.e. the number of features that are not involved in the classifier. When rho=0 then all features are involved in the classifier, and when rho is very large no features are involved. If rhos=NULL then a vector of rho values will be chosen automatically. |
beta |
A smoothing term. A Gamma(beta,beta) prior is used to fit the Poisson model. Recommendation is to leave it at 1, the default value. |
nfolds |
The number of folds in the cross-validation; default is 5-fold cross-validation. |
type |
How should the observations be normalized within the Poisson model, i.e. how should the size factors be estimated? Options are "quantile" or "deseq" (more robust) or "mle" (less robust). In greater detail: "quantile" is quantile normalization approach of Bullard et al 2010 BMC Bioinformatics, "deseq" is median of the ratio of an observation to a pseudoreference obtained by taking the geometric mean, described in Anders and Huber 2010 Genome Biology and implemented in Bioconductor package "DESeq", and "mle" is the sum of counts for each sample; this is the maximum likelihood estimate under a simple Poisson model. |
prior |
Vector of length equal to the number of classes, representing prior probabilities for each class. If NULL then uniform priors are used (i.e. each class is equally likely). |
transform |
Should data matrices x and xte first be power transformed so that it more closely fits the Poisson model? TRUE or FALSE. Power transformation is especially useful if the data are overdispersed relative to the Poisson model. |
alpha |
If transform=TRUE, this determines the power to which the data matrices x and xte are transformed. If alpha=NULL then the transformation that makes the Poisson model best fit the data matrix x is computed. (Note that alpha is computed based on x, not based on xte). Or a value of alpha, 0<alpha<=1, can be entered by the user. |
folds |
Instead of specifying the number of folds in cross-validation, one can explicitly specify the folds. To do this, input a list of length r (to perform r-fold cross-validation). The rth element of the list should be a vector containing the indices of the test observations in the rth fold. |
errs |
A matrix of dimension (number of folds)-by-(length of rhos). The (i,j) element is the number of errors occurring in the ith cross-validation fold for the jth value of the tuning parameter, i.e. rhos[j]. |
bestrho |
The tuning parameter value resulting in the lowest overall cross-validation error rate. |
rhos |
The vector of rho values used in cross-validation. |
nnonzero |
A matrix of dimension (number of folds)-by-(length of rhos). The (i,j) element is the number of features included in the classifier occurring in the ith cross-validation fold for the jth value of the tuning paramer. |
folds |
Cross-validation folds used. |
alpha |
Power transformation used (if transform=TRUE). |
Daniela Witten
D Witten (2011) Classification and clustering of sequencing data using a Poisson model. To appear in Annals of Applied Statistics.
set.seed(1) dat <- CountDataSet(n=40,p=500,sdsignal=.1,K=3,param=10) cv.out <- Classify.cv(dat$x,dat$y) print(cv.out) out <- Classify(dat$x,dat$y,dat$xte,rho=cv.out$bestrho) print(out) cat("Confusion matrix comparing predicted class labels to true class labels for training observations:", fill=TRUE) print(table(out$ytehat,dat$yte))
set.seed(1) dat <- CountDataSet(n=40,p=500,sdsignal=.1,K=3,param=10) cv.out <- Classify.cv(dat$x,dat$y) print(cv.out) out <- Classify(dat$x,dat$y,dat$xte,rho=cv.out$bestrho) print(out) cat("Confusion matrix comparing predicted class labels to true class labels for training observations:", fill=TRUE) print(table(out$ytehat,dat$yte))
Pass in the output of "hclust" and a class label for each observation. A colored dendrogram will result, with the leaf colors indicating the classes.
ColorDendrogram(hc, y, main = "", branchlength = 0.7, labels = NULL, xlab = NULL, sub = NULL, ylab = "", cex.main = NULL)
ColorDendrogram(hc, y, main = "", branchlength = 0.7, labels = NULL, xlab = NULL, sub = NULL, ylab = "", cex.main = NULL)
hc |
The output of running "hclust" on a nxn dissimilarity matrix |
y |
A vector of n class labels for the observations that were clustered using "hclust". If labels are numeric from 1 to K, then colors will be determine automatically. Otherwise the labels can take the form of colors (e.g. c("red", "red", "orange", "orange")). |
main |
The main title for the dendrogram. |
branchlength |
How long to make the colored part of the branches. Adjustment will be needed for each dissimilarity matrix |
labels |
The labels for the n observations. |
xlab |
X-axis label. |
sub |
Sub-x-axis label. |
ylab |
Y-axis label. |
cex.main |
The amount by which to enlarge the main title for the figure. |
Daniela Witten
set.seed(1) dat <- CountDataSet(n=20,p=100,sdsignal=2,K=4,param=10) dd <- PoissonDistance(dat$x,type="mle") ColorDendrogram(hclust(dd$dd), y=dat$y, branchlength=10)
set.seed(1) dat <- CountDataSet(n=20,p=100,sdsignal=2,K=4,param=10) dd <- PoissonDistance(dat$x,type="mle") ColorDendrogram(hclust(dd$dd), y=dat$y, branchlength=10)
Generate two nxp data sets: a training set and a test set, as well as outcome vectors y and yte of length n indicating the class labels of the training and test observations.
CountDataSet(n, p, K, param, sdsignal)
CountDataSet(n, p, K, param, sdsignal)
n |
Number of observations desired. |
p |
Number of features desired. Note that 30% of the features will differ between classes, though some of those differences may be small. |
K |
Number of classes desired. Note that the function requires that n be at least equal to 4K – i.e. there must be at least 4 observations per class on average. |
param |
The dispersion parameter for the negative binomial distribution. The negative binomial distribution is parameterized using "mu" and "size" in the R function "rnbinom". That is, Y ~ NB(mu, param) means that E(Y)=mu and Var(Y) = mu+mu^2/param. So when param is very large this is essentially a Poisson distribution, and when param is smaller then there is a lot of overdispersion relative to the Poisson distribution. |
sdsignal |
The extent to which the classes are different. If this equals zero then there are no class differences and if this is large then the classes are very different. |
This is based in part on a function in the DESeq Bioconductor package (Anders and Huber 2010 Genome Biology) for generating a simulated RNA sequencing data set.
x |
nxq data matrix. May have q<p because features with 0 total counts are removed. |
y |
class labels for the n observations in x. |
xte |
nxq data matrix of test observations; the q features are those with >0 total counts in x. So q<=p. |
yte |
class labels for the n observation in xte. |
Daniela Witten, based on software written by Anders and Huber in the DESeq Bioconductor package.
set.seed(1) dat <- CountDataSet(n=20,p=100,sdsignal=2,K=4,param=10) dd <- PoissonDistance(dat$x,type="mle", transform=TRUE)
set.seed(1) dat <- CountDataSet(n=20,p=100,sdsignal=2,K=4,param=10) dd <- PoissonDistance(dat$x,type="mle", transform=TRUE)
Find a constant alpha, 0<alpha<=1, such that x raised to the power alpha approximately follows the simple Poisson log linear model that says that the (i,j) element of x is Poisson with mean si times gj, where si is a sample-specific term and gj is a feature-specific term. Alpha is selected via a grid search.
FindBestTransform(x)
FindBestTransform(x)
x |
A n-by-p matrix of sequencing data, with n observations and p features. |
Returns alpha, the power to which x should be raised.
Daniela Witten
D Witten (2011) Classification and clustering of sequencing data using a Poisson model. To appear in Annals of Applied Statistics.
set.seed(1) dat <- CountDataSet(n=20,p=100,sdsignal=2,K=4,param=10) alpha <- FindBestTransform(dat$x) # This is the best transformation! dd <- PoissonDistance(dat$x^alpha,type="mle", transform=FALSE) # OR we could get the samething automatically: dd2 <- PoissonDistance(dat$x,type="mle",transform=TRUE) # or like this: dd3 <- PoissonDistance(dat$x,type="mle",transform=TRUE,alpha=alpha) ColorDendrogram(hclust(dd$dd), y=dat$y, branchlength=10)
set.seed(1) dat <- CountDataSet(n=20,p=100,sdsignal=2,K=4,param=10) alpha <- FindBestTransform(dat$x) # This is the best transformation! dd <- PoissonDistance(dat$x^alpha,type="mle", transform=FALSE) # OR we could get the samething automatically: dd2 <- PoissonDistance(dat$x,type="mle",transform=TRUE) # or like this: dd3 <- PoissonDistance(dat$x,type="mle",transform=TRUE,alpha=alpha) ColorDendrogram(hclust(dd$dd), y=dat$y, branchlength=10)
This function computes a Poisson dissimilarity matrix as described in the paper referenced below, and is intended to be applied to a data matrix of counts resulting from a sequencing experiment. The (i,k) element of the Poisson dissimilarity matrix is the dissimilarity between observations i and k of the data matrix x: that is, the log likelihood ratio statistic under a simple Poisson model.
PoissonDistance(x, beta = 1, type=c("mle","deseq","quantile"), transform=TRUE, alpha=NULL, perfeature=FALSE)
PoissonDistance(x, beta = 1, type=c("mle","deseq","quantile"), transform=TRUE, alpha=NULL, perfeature=FALSE)
x |
A n-by-p data matrix with observations on the rows, and p features on the columns. The (i,j) element of x is the number of reads in observation i that mapped to feature (e.g. gene or exon) j. |
beta |
A smoothing term; essentially the parameter beta in a Gamma(beta,beta) prior used to estimate the log likelihood ratio statistic for computing the dissimilarity between a pair of observations. Recommended to leave it at 1, the default value. |
type |
How should the observations be normalized within the Poisson model, i.e. how should the size factors be estimated? Options are "quantile" or "deseq" (more robust) or "mle" (less robust). In greater detail: "quantile" is quantile normalization approach of Bullard et al 2010 BMC Bioinformatics, "deseq" is median of the ratio of an observation to a pseudoreference obtained by taking the geometric mean, described in Anders and Huber 2010 Genome Biology and implemented in Bioconductor package "DESeq", and "mle" is the sum of counts for each sample; this is the maximum likelihood estimate under a simple Poisson model. |
transform |
Should data matrix x first be power transformed so that it more closely fits the Poisson model? TRUE or FALSE. Power transformation is especially useful if the data are overdispersed relative to the Poisson model. |
alpha |
If transform=TRUE, this determines the power to which the data matrix x is transformed. If alpha=NULL then the transformation that makes the Poisson model best fit the data is computed. Or a value of alpha, 0<alpha<=1, can be entered by the user. |
perfeature |
If perfeature=TRUE, then in addition to the nxn dissimilarity matrix, a nxnxp array will be returned. Its elements will be the contributions of each of the p features to the nxn dissimilarity matrix; summing over the 3rd index will simply give back the nxn dissimilarity matrix. |
More details can be found in the paper referenced below.
dd |
A nxn Poisson dissimilarity matrix, containing pairwise dissimilarities between observations based on the original nxp data matrix x input by the user. |
alpha |
Power to which data was transformed before computing dissimilarity matrix, if transform was TRUE. This was either input by the user, or computed automatically if not specified. |
x |
Data used to compute dissimilarity matrix, this will be x raised to the power alpha. |
ddd |
If perfeature=TRUE, then this is the nxnxp array containing the contribution of each feature to the nxn dissimilarity matrix. |
Daniela Witten
D Witten (2011) Classification and clustering of sequencing data using a Poisson model. To appear in Annals of Applied Statistics.
set.seed(1) dat <- CountDataSet(n=20,p=100,sdsignal=2,K=4,param=10) dd <- PoissonDistance(dat$x,type="mle") print(dd) ColorDendrogram(hclust(dd$dd), y=dat$y, branchlength=10)
set.seed(1) dat <- CountDataSet(n=20,p=100,sdsignal=2,K=4,param=10) dd <- PoissonDistance(dat$x,type="mle") print(dd) ColorDendrogram(hclust(dd$dd), y=dat$y, branchlength=10)