Title: | Statistical Approach to Outlier Detection in RNA-Seq and Related Data |
---|---|
Description: | An approach to outlier detection in RNA-seq and related data based on five statistics. 'OutSeekR' implements an outlier test by comparing the distributions of these statistics in observed data with those of simulated null data. |
Authors: | Jee Yun Han [aut], John Sahrmann [aut], Jaron Arbet [ctb], Paul Boutros [aut, cre, cph] |
Maintainer: | Paul Boutros <[email protected]> |
License: | GPL-2 |
Version: | 1.0.0 |
Built: | 2024-11-20 08:01:55 UTC |
Source: | CRAN |
Calculate p-values for each sample of a single transcript.
calculate.p.values( x, x.distribution, x.zrange.mean, x.zrange.median, x.zrange.trimmean, x.fraction.kmeans, x.cosine.similarity, null.zrange.mean, null.zrange.median, null.zrange.trimmean, null.fraction.kmeans, null.cosine.similarity, kmeans.nstart = 1 )
calculate.p.values( x, x.distribution, x.zrange.mean, x.zrange.median, x.zrange.trimmean, x.fraction.kmeans, x.cosine.similarity, null.zrange.mean, null.zrange.median, null.zrange.trimmean, null.fraction.kmeans, null.cosine.similarity, kmeans.nstart = 1 )
x |
A numeric vector of values for an observed transcript. |
x.distribution |
A numeric code corresponding to the optimal distribution of |
x.zrange.mean |
A number, the range of the z-scores calculated using the mean and standard deviation of |
x.zrange.median |
A number, the range of the z-scores calculated using the median and median absolute deviation of |
x.zrange.trimmean |
A number, the range of the z-scores calculated using the trimmed mean and trimmed standard deviation of |
x.fraction.kmeans |
A number, the k-means fraction of |
x.cosine.similarity |
A number, the cosine similarity of |
null.zrange.mean |
A numeric vector, the ranges of the z-scores calculated using the mean and standard deviation of each transcript in the null data. |
null.zrange.median |
A numeric vector, the ranges of the z-scores calculated using the median and median absolute deviation of each transcript in the null data. |
null.zrange.trimmean |
A numeric vector, the ranges of the z-scores calculated using the trimmed mean and trimmed standard deviation of each transcript in the null data. |
null.fraction.kmeans |
A numeric vector, the k-means fraction of each transcript in the null data. |
null.cosine.similarity |
A numeric vector, the cosine similarity of each transcript in the null data. |
kmeans.nstart |
The number of random starts when computing k-means fraction; default is 1. See |
A list consisting of the following entries:
p.values
: a vector of p-values for the outlier test run on each sample (up until the p-value exceeds p.value.threshold
); and
outlier.statistics.list
, a list of vectors containing the values of the outlier statistics calculated from the remaining samples. The list will be of length equal to one plus the total number of outliers (i.e., the number of samples with an outlier test p-value less than p.value.threshold
) and will contain entries outlier.statistics.N
, where N
is between zero and the total number of outliers. outlier.statistics.N
is the vector of outlier statistics after excluding the N
th outlier sample, with outlier.statistics.0
being for the complete transcript.
data(example.data.for.calculate.p.values); i <- 1; # row index of transcript to test calculate.p.values( x = example.data.for.calculate.p.values$data[i,], x.distribution = example.data.for.calculate.p.values$x.distribution[i], x.zrange.mean = example.data.for.calculate.p.values$x.zrange.mean[i], x.zrange.median = example.data.for.calculate.p.values$x.zrange.median[i], x.zrange.trimmean = example.data.for.calculate.p.values$x.zrange.trimmean[i], x.fraction.kmeans = example.data.for.calculate.p.values$x.fraction.kmeans[i], x.cosine.similarity = example.data.for.calculate.p.values$x.cosine.similarity[i], null.zrange.mean = example.data.for.calculate.p.values$null.zrange.mean, null.zrange.median = example.data.for.calculate.p.values$null.zrange.median, null.zrange.trimmean = example.data.for.calculate.p.values$null.zrange.trimmean, null.fraction.kmeans = example.data.for.calculate.p.values$null.fraction.kmeans, null.cosine.similarity = example.data.for.calculate.p.values$null.cosine.similarity, kmeans.nstart = example.data.for.calculate.p.values$kmeans.nstart );
data(example.data.for.calculate.p.values); i <- 1; # row index of transcript to test calculate.p.values( x = example.data.for.calculate.p.values$data[i,], x.distribution = example.data.for.calculate.p.values$x.distribution[i], x.zrange.mean = example.data.for.calculate.p.values$x.zrange.mean[i], x.zrange.median = example.data.for.calculate.p.values$x.zrange.median[i], x.zrange.trimmean = example.data.for.calculate.p.values$x.zrange.trimmean[i], x.fraction.kmeans = example.data.for.calculate.p.values$x.fraction.kmeans[i], x.cosine.similarity = example.data.for.calculate.p.values$x.cosine.similarity[i], null.zrange.mean = example.data.for.calculate.p.values$null.zrange.mean, null.zrange.median = example.data.for.calculate.p.values$null.zrange.median, null.zrange.trimmean = example.data.for.calculate.p.values$null.zrange.trimmean, null.fraction.kmeans = example.data.for.calculate.p.values$null.fraction.kmeans, null.cosine.similarity = example.data.for.calculate.p.values$null.cosine.similarity, kmeans.nstart = example.data.for.calculate.p.values$kmeans.nstart );
Calculate residuals between quantiles of the input and quantiles of one of four distributions: normal, log-normal, exponential, or gamma.
calculate.residuals(x, distribution)
calculate.residuals(x, distribution)
x |
A numeric vector. |
distribution |
A number corresponding to the optimal distribution of
|
A numeric vector of the same length as x
. Names are not retained.
# Generate fake data. set.seed(1234); x <- rgamma( n = 20, shape = 2, scale = 2 ); names(x) <- paste( 'Sample', seq_along(x), sep = '.' ); calculate.residuals( x = x, distribution = 4 );
# Generate fake data. set.seed(1234); x <- rgamma( n = 20, shape = 2, scale = 2 ); names(x) <- paste( 'Sample', seq_along(x), sep = '.' ); calculate.residuals( x = x, distribution = 4 );
Detect outliers in normalized RNA-seq data.
detect.outliers( data, num.null = 1000, initial.screen.method = c("fdr", "p.value"), p.value.threshold = 0.05, fdr.threshold = 0.01, kmeans.nstart = 1 )
detect.outliers( data, num.null = 1000, initial.screen.method = c("fdr", "p.value"), p.value.threshold = 0.05, fdr.threshold = 0.01, kmeans.nstart = 1 )
data |
A matrix or data frame of normalized RNA-seq data, organized with transcripts on rows and samples on columns. Transcript identifiers should be stored as |
num.null |
The number of transcripts to generate when simulating from null distributions; default is 1000. We recommend using at least 10,000 iterations for publication-level results, with 100,000 or even one million iterations providing more robust estimates. |
initial.screen.method |
The statistical criterion for initial gene selection; valid options are 'FDR' and 'p-value'. |
p.value.threshold |
The p-value threshold for the outlier test; default is 0.05. Once the p-value for a sample exceeds |
fdr.threshold |
The false discovery rate (FDR)-adjusted p-value threshold for determining the final count of outliers; default is 0.01. |
kmeans.nstart |
The number of random starts when computing k-means fraction; default is 1. See |
A list consisting of the following entries:
p.values
: a matrix of unadjusted p-values for the outlier test run on each transcript in data
.
fdr
: a matrix of FDR-adjusted p-values for the outlier test run on each transcript in data
.
num.outliers
: a vector giving the number of outliers detected for each transcript based on the threshold.
outlier.test.results.list
: a list of length max(num.outliers) + 1
containing entries roundN
, where N
is between one and max(num.outliers) + 1
. roundN
is the data frame of results for the outlier test after excluding the (N-1)th outlier sample, with round1
being for the original data set (i.e., before excluding any outlier samples).
distributions
: a numeric vector indicating the optimal distribution for each transcript. Possible values are 1 (normal), 2 (log-normal), 3 (exponential), and 4 (gamma).
initial.screen.method
: Specifies the statistical criterion for initial feature selection. Valid options are 'p-value' and 'FDR' (p-value used by default).
data(outliers); outliers.subset <- outliers[1:10,]; results <- detect.outliers( data = outliers.subset, num.null = 10 );
data(outliers); outliers.subset <- outliers[1:10,]; results <- detect.outliers( data = outliers.subset, num.null = 10 );
Example data (list object) for testing calculate.p.values()
.
example.data.for.calculate.p.values
example.data.for.calculate.p.values
An object of class list
of length 13.
Identify which of four distributions—normal, log-normal, exponential, or gamma—best fits the given data according to BIC.
## S3 method for class 'bic.optimal.data.distribution' identify(x)
## S3 method for class 'bic.optimal.data.distribution' identify(x)
x |
A numeric vector. |
A numeric code representing which distribution optimally fits x
. Possible values are
1 = normal,
2 = log-normal,
3 = exponential, and
4 = gamma.
# Generate fake data. set.seed(1234); x <- rgamma( n = 20, shape = 2, scale = 2 ); identify.bic.optimal.data.distribution( x = x );
# Generate fake data. set.seed(1234); x <- rgamma( n = 20, shape = 2, scale = 2 ); identify.bic.optimal.data.distribution( x = x );
Identify which of four distributions—normal, log-normal, exponential, or gamma—best fits the given vector of residuals according to BIC.
## S3 method for class 'bic.optimal.residuals.distribution' identify(x)
## S3 method for class 'bic.optimal.residuals.distribution' identify(x)
x |
A numeric vector. |
A numeric code representing which distribution optimally fits x
. Possible values are
1 = normal,
2 = log-normal,
3 = exponential, and
4 = gamma.
# Generate fake data. set.seed(1234); x <- rgamma( n = 20, shape = 2, scale = 2 ); identify.bic.optimal.residuals.distribution( x = x );
# Generate fake data. set.seed(1234); x <- rgamma( n = 20, shape = 2, scale = 2 ); identify.bic.optimal.residuals.distribution( x = x );
Given a vector of cluster assigments from quantify.outliers()
run with method = 'kmeans'
, compute the fraction of observations belonging to the smaller of the two clusters.
kmeans.fraction(x)
kmeans.fraction(x)
x |
A numeric vector. |
This function only considers clusters 1 and 2 even if quantify.outliers()
was run with exclude.zero = TRUE
. In that case, zeros are effectively excluded from the counts used to define the k-means fraction. See examples.
A number.
x <- c(1, 1, 2, 2, 2, 2, 2, 2, 2, 2); names(x) <- letters[1:length(x)]; kmeans.fraction(x);
x <- c(1, 1, 2, 2, 2, 2, 2, 2, 2, 2); names(x) <- letters[1:length(x)]; kmeans.fraction(x);
Compute cosine similarity for detection of outliers. Generate theoretical quantiles based on the optimal distribution of the data, and compute cosine similarity between a point made up of the largest observed quantile and the largest theoretical quantile and a point on the line y = x. .
outlier.detection.cosine(x, distribution)
outlier.detection.cosine(x, distribution)
x |
A numeric vector. |
distribution |
A numeric code corresponding to the optimal distribution of |
A number.
# Generate fake data. set.seed(1234); x <- rgamma( n = 20, shape = 2, scale = 2 ); outlier.detection.cosine( x = x, distribution = 4 );
# Generate fake data. set.seed(1234); x <- rgamma( n = 20, shape = 2, scale = 2 ); outlier.detection.cosine( x = x, distribution = 4 );
Example data set for outlier testing
outliers
outliers
A data frame with 500 rows and 50 columns:
simulated fragments per kilobase of transcript per million fragments mapped (FPKM) values for sample 1
simulated FPKM values for sample 2
simulated FPKM values for sample 3
simulated FPKM values for sample 4
simulated FPKM values for sample 5
simulated FPKM values for sample 6
simulated FPKM values for sample 7
simulated FPKM values for sample 8
simulated FPKM values for sample 9
simulated FPKM values for sample 10
simulated FPKM values for sample 11
simulated FPKM values for sample 12
simulated FPKM values for sample 13
simulated FPKM values for sample 14
simulated FPKM values for sample 15
simulated FPKM values for sample 16
simulated FPKM values for sample 17
simulated FPKM values for sample 18
simulated FPKM values for sample 19
simulated FPKM values for sample 20
simulated FPKM values for sample 21
simulated FPKM values for sample 22
simulated FPKM values for sample 23
simulated FPKM values for sample 24
simulated FPKM values for sample 25
simulated FPKM values for sample 26
simulated FPKM values for sample 27
simulated FPKM values for sample 28
simulated FPKM values for sample 29
simulated FPKM values for sample 30
simulated FPKM values for sample 31
simulated FPKM values for sample 32
simulated FPKM values for sample 33
simulated FPKM values for sample 34
simulated FPKM values for sample 35
simulated FPKM values for sample 36
simulated FPKM values for sample 37
simulated FPKM values for sample 38
simulated FPKM values for sample 39
simulated FPKM values for sample 40
simulated FPKM values for sample 41
simulated FPKM values for sample 42
simulated FPKM values for sample 43
simulated FPKM values for sample 44
simulated FPKM values for sample 45
simulated FPKM values for sample 46
simulated FPKM values for sample 47
simulated FPKM values for sample 48
simulated FPKM values for sample 49
simulated FPKM values for sample 50
Compute quantities for use in the detection of outliers. Specifically, compute z-scores based on the mean / standard deviation, the trimmed mean / trimmed standard deviation, or the median / median absolute deviation, or the cluster assignment from k-means with two clusters.
quantify.outliers( x, method = "mean", trim = 0, nstart = 1, exclude.zero = FALSE )
quantify.outliers( x, method = "mean", trim = 0, nstart = 1, exclude.zero = FALSE )
x |
A numeric vector. |
method |
A string indicating the quantities to be computed. Possible values are
|
trim |
A number, the fraction of observations to be trimmed from each end of |
nstart |
A number, for k-means clustering, the number of random initial centers for the clusters. Default is |
exclude.zero |
A logical, whether zeros should be excluded ( |
A numeric vector the same size as x
whose values are the requested quantities computed on the corresponding elements of x
.
# Generate fake data. set.seed(1234); x <- rgamma( n = 20, shape = 2, scale = 2 ); # Add missing values and zeros for demonstration. Missing values are # ignored, and zeros can be ignored with `exclude.zeros = TRUE`. x[1:5] <- NA; x[6:10] <- 0; # Compute z-scores based on mean and standard deviation. quantify.outliers( x = x, method = 'mean', trim = 0 ); # Exclude zeros from the calculation of the mean and standard # deviation. quantify.outliers( x = x, method = 'mean', trim = 0, exclude.zero = TRUE ); # Compute z-scores based on the 5% trimmed mean and 5% trimmed # standard deviation. quantify.outliers( x = x, method = 'mean', trim = 0.05 ); # Compute z-scores based on the median and median absolute deviation. quantify.outliers( x = x, method = 'median' ); # Compute cluster assignments using k-means with k = 2. quantify.outliers( x = x, method = 'kmeans' ); # Try different initial cluster assignments. quantify.outliers( x = x, method = 'kmeans', nstart = 10 ); # Assign zeros to their own cluster. quantify.outliers( x = x, method = 'kmeans', exclude.zero = TRUE );
# Generate fake data. set.seed(1234); x <- rgamma( n = 20, shape = 2, scale = 2 ); # Add missing values and zeros for demonstration. Missing values are # ignored, and zeros can be ignored with `exclude.zeros = TRUE`. x[1:5] <- NA; x[6:10] <- 0; # Compute z-scores based on mean and standard deviation. quantify.outliers( x = x, method = 'mean', trim = 0 ); # Exclude zeros from the calculation of the mean and standard # deviation. quantify.outliers( x = x, method = 'mean', trim = 0, exclude.zero = TRUE ); # Compute z-scores based on the 5% trimmed mean and 5% trimmed # standard deviation. quantify.outliers( x = x, method = 'mean', trim = 0.05 ); # Compute z-scores based on the median and median absolute deviation. quantify.outliers( x = x, method = 'median' ); # Compute cluster assignments using k-means with k = 2. quantify.outliers( x = x, method = 'kmeans' ); # Try different initial cluster assignments. quantify.outliers( x = x, method = 'kmeans', nstart = 10 ); # Assign zeros to their own cluster. quantify.outliers( x = x, method = 'kmeans', exclude.zero = TRUE );
Simulate transcripts from a specified null distribution.
## S3 method for class 'null' simulate(x, x.distribution, r, r.distribution)
## S3 method for class 'null' simulate(x, x.distribution, r, r.distribution)
x |
A numeric vector of transcripts. |
x.distribution |
A numeric code corresponding to the optimal distribution of
|
r |
A numeric vector of residuals calculated for this transcript. |
r.distribution |
A numeric code corresponding to the optimal distribution of |
A numeric vector of the same length as x
. Names are not retained.
# Prepare fake data. set.seed(1234); x <- rgamma( n = 20, shape = 2, scale = 2 ); names(x) <- paste('Sample', seq_along(x), sep = '.'); x.dist <- identify.bic.optimal.data.distribution( x = x ); r <- calculate.residuals( x = x, distribution = x.dist ); r.trimmed <- trim.sample( x = r ); r.dist <- identify.bic.optimal.residuals.distribution( x = r.trimmed ); null <- simulate.null( x = x, x.distribution = x.dist, r = r.trimmed, r.distribution = r.dist );
# Prepare fake data. set.seed(1234); x <- rgamma( n = 20, shape = 2, scale = 2 ); names(x) <- paste('Sample', seq_along(x), sep = '.'); x.dist <- identify.bic.optimal.data.distribution( x = x ); r <- calculate.residuals( x = x, distribution = x.dist ); r.trimmed <- trim.sample( x = r ); r.dist <- identify.bic.optimal.residuals.distribution( x = r.trimmed ); null <- simulate.null( x = x, x.distribution = x.dist, r = r.trimmed, r.distribution = r.dist );
Symmetrically trim a vector of numbers after sorting it.
trim.sample(x, trim = 0.05)
trim.sample(x, trim = 0.05)
x |
A numeric vector. |
trim |
A number, the fraction of observations to be trimmed from each end of |
If length(x) <= 10
, the function returns x[2:(length(x) - 1)]
.
A sorted, trimmed copy of x
.
trim.sample( x = 1:20, trim = 0.05 );
trim.sample( x = 1:20, trim = 0.05 );
Compute the range of a vector of z-scores.
zrange(x)
zrange(x)
x |
A numeric vector |
A number.
set.seed(1234); x <- rnorm( n = 10 ); zrange( x = x );
set.seed(1234); x <- rnorm( n = 10 ); zrange( x = x );