--- title: "The Beta-Binomial Test for Count Data" author: "Thang V Pham" date: "2021-01-24" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{countdata: The Beta-Binomial Test for Count Data} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, eval = FALSE, comment = "#>" ) ``` A number of technology platforms in proteomics and genomics produce count data for quantitative analysis. In proteomics, the number of MS/MS events observed for a protein in a mass spectrometry experiment has been shown to correlate strongly with the protein's abundance. In genomics, next-generation sequencing technologies use read counts as a measure of the abundance of the target transcripts. The R package _countdata_ contains functions for statistical significance analysis of count data for both paired and unpaired designs. ## Quick start The user needs to install the package once, most likely by entering `install.packages("countdata")` in the R console, and load the package before use. ```{r setup} library(countdata) ``` The following are three complete examples, from reading input data in a tab-deliminated text format to writing the result to a local text file `output.txt`. The output column `pval` contains the p-values of the test and the column `pval.BH` contains the p-values adjusted for multiple testing using the Benjamini-Hochberg method (Benjamini \& Hochberg, 1995). **Two-group beta-binomial test** (Pham et al., Bioinformatics 2010) ```{r} d <- read.delim("https://tvpham.github.io/data/example-3groups.txt") head(d) #> a1 a2 a3 b1 b2 b3 c1 c2 #> 1 624 496 509 414 394 375 325 288 #> 2 615 854 930 341 523 360 359 329 #> 3 553 560 745 819 490 481 480 500 #> 4 525 412 401 354 321 310 258 228 #> 5 484 284 315 268 282 307 270 298 #> 6 482 348 400 242 365 367 81 118 # compare the first 3 samples against the next three samples out <- countdata::bb.test(d[, 1:6], colSums(d[, 1:6]), c(rep("a", 3), rep("b", 3))) #> Using 11 thread(s) ... #> No. of data rows = 1786, no. of groups = 2, no. of samples = 6... #> 5% #> 11% #> 18% #> 23% #> 30% #> 35% #> 41% #> 49% #> 55% #> 64% #> 81% #> 87% #> 93% #> 98% #> Done. d.norm <- countdata::normalize(d[, 1:8]) write.table(cbind(d, d.norm, fc = countdata::fold.change(d.norm[, 1:3], d.norm[, 4:6]), pval = out$p.value, pval.BH = p.adjust(out$p.value, method = "BH")), file = "output.txt", row.names = FALSE, sep = "\t") ``` **Three-group beta-binomial test** (Pham et al., Bioinformatics 2010) ```{r} d <- read.delim("https://tvpham.github.io/data/example-3groups.txt") head(d) #> a1 a2 a3 b1 b2 b3 c1 c2 #> 1 624 496 509 414 394 375 325 288 #> 2 615 854 930 341 523 360 359 329 #> 3 553 560 745 819 490 481 480 500 #> 4 525 412 401 354 321 310 258 228 #> 5 484 284 315 268 282 307 270 298 #> 6 482 348 400 242 365 367 81 118 # compare the first 3 samples, the next three samples, and the last two samples. out <- countdata::bb.test(d[, 1:8], colSums(d[, 1:8]), c(rep("a", 3), rep("b", 3), rep("c", 2))) #> Using 11 thread(s) ... #> No. of data rows = 1786, no. of groups = 3, no. of samples = 8... #> 0% #> 5% #> 10% #> 16% #> 23% #> 33% #> 40% #> 45% #> 51% #> 58% #> 63% #> 70% #> 76% #> 81% #> 87% #> 93% #> 98% #> Done. d.norm <- countdata::normalize(d[, 1:8]) write.table(cbind(d, d.norm, pval = out$p.value, pval.BH = p.adjust(out$p.value, method = "BH")), file = "output.txt", row.names = FALSE, sep = "\t") ``` **Two-group paired beta-binomial test** (Pham \& Jimenez, Bioinformatics 2012) ```{r} d <- read.delim("https://tvpham.github.io/data/example-paired.txt") head(d) #> pre.1 pre.2 pre.3 post.1 post.2 post.3 #> 1 575 179 335 505 172 204 #> 2 294 245 256 396 390 265 #> 3 293 282 320 372 240 204 #> 4 303 282 250 307 243 227 #> 5 396 271 171 327 216 103 #> 6 238 261 271 245 234 215 out <- countdata::ibb.test(d[, 1:6], colSums(d[, 1:6]), c(rep("pre_treatment", 3), rep("post_treatment", 3))) #> Using 11 thread(s) ... #> No. of data rows = 2919, no. of pair(s) = 3... #> 0% #> 5% #> 11% #> 16% #> 21% #> 26% #> 31% #> 36% #> 41% #> 46% #> 51% #> 57% #> 63% #> 68% #> 73% #> 78% #> 83% #> 88% #> 94% #> 99% #> Done. d.norm <- countdata::normalize(d[, 1:6]) write.table(cbind(d, d.norm, fc = out$fc, pval = out$p.value, pval.BH = p.adjust(out$p.value, method = "BH")), file = "output.txt", row.names = FALSE, sep = "\t") ``` ## Unpaired test The beta-binomial test (Pham et al., Bioinformatics 2010) is used for significance analysis of independent samples for two or more groups. Suppose that a vector `x` contains the count numbers and a vector `tx` the corresponding normalization sizes, for example the total spectral counts per sample in a proteomics experiment. In addition, the group information is specified in a vector `group`. We perform a beta-binomial test as follows. ```{r} x <- c(1, 5, 1, 10, 9, 11, 2, 8) tx <- c(19609, 19053, 19235, 19374, 18868, 19018, 18844, 19271) group <- c(rep("cancer", 3), rep("normal", 5)) countdata::bb.test(x, tx, group) #> Using a single CPU core ... #> No. of data rows = 1, no. of groups = 2, no. of samples = 8... #> 100% #> Done. #> $p.value #> [1] 0.01568598 ``` The test can be applied to a data matrix row by row. The following example compares three groups, the first group consisting of columns 1 to 3 of a data matrix, the second columns 4 to 6, and the third columns 7 and 8. ```{r} d <- read.delim("https://tvpham.github.io/data/example-3groups.txt") # compare 3 groups, using all available CPU cores out <- countdata::bb.test(d[, 1:8], colSums(d[, 1:8]), c(rep("a", 3), rep("b", 3), rep("c", 2))) #> Using 11 thread(s) ... #> No. of data rows = 1786, no. of groups = 3, no. of samples = 8... #> 0% #> 5% #> 11% #> 16% #> 21% #> 28% #> 34% #> 40% #> 46% #> 52% #> 58% #> 77% #> 83% #> 88% #> 93% #> 98% #> Done. ``` The output `out$p.value` contains p-values for each row of `d`. For multiple testing correction, use the `p.adjust` function in R. For example, to apply the Benjamini & Hochberg method for p-value adjustment ```{r} pval.BH <- p.adjust(out$p.value, method = "BH") ``` ### Data normalization The beta-binomial test takes **raw counts as input** and normalizes the values internally. To obtain normalized values, we multiply each sample by a scaling factor so that the total counts are equal across samples. This procedure is implemented in the `normalize` function. ```{r} d.norm <- countdata::normalize(d[, 1:8]) # check -- all values should be equal colSums(d.norm) #> a1 a2 a3 b1 b2 b3 c1 c2 #> 19159 19159 19159 19159 19159 19159 19159 19159 ``` The matrix `d.norm` contains the normalized data. We can combine the normalized data and the result of the beta-binomial test by `cbind(d.norm, out$p.value)`, and subsequently save or view the resulting matrix ```{r} head(cbind(d.norm, out$p.value)) #> a1 a2 a3 b1 b2 b3 c1 c2 #> 1 609.6800 498.7595 506.9889 409.4057 400.0766 377.7803 330.43276 286.3262 #> 2 600.8866 858.7512 926.3254 337.2158 531.0662 362.6691 365.00111 327.0879 #> 3 540.3094 563.1155 742.0564 809.9113 497.5572 484.5661 488.02377 497.0941 #> 4 512.9520 414.2921 399.4156 350.0715 325.9508 312.2983 262.31278 226.6749 #> 5 472.8929 285.5800 313.7554 265.0259 286.3493 309.2761 274.51337 296.2681 #> 6 470.9388 349.9361 398.4195 239.3144 370.6294 369.7209 82.35401 117.3142 #> out$p.value #> 1 0.0001164498 #> 2 0.0016150240 #> 3 0.4501729545 #> 4 0.0003105657 #> 5 0.2521494212 #> 6 0.0001057092 ``` ### Fold change calculation The package provides a convenient function to calculate the ratio between the averages of two sample groups row by row. For example, to calculate the fold change between all group pairs ```{r} fc12 <- countdata::fold.change(d.norm[, 1:3], d.norm[, 4:6]) fc13 <- countdata::fold.change(d.norm[, 1:3], d.norm[, 7:8]) fc23 <- countdata::fold.change(d.norm[, 4:6], d.norm[, 7:8], BIG = 100) head(cbind(d.norm, out$p.value, fc12, fc13, fc23)) #> a1 a2 a3 b1 b2 b3 c1 c2 #> 1 609.6800 498.7595 506.9889 409.4057 400.0766 377.7803 330.43276 286.3262 #> 2 600.8866 858.7512 926.3254 337.2158 531.0662 362.6691 365.00111 327.0879 #> 3 540.3094 563.1155 742.0564 809.9113 497.5572 484.5661 488.02377 497.0941 #> 4 512.9520 414.2921 399.4156 350.0715 325.9508 312.2983 262.31278 226.6749 #> 5 472.8929 285.5800 313.7554 265.0259 286.3493 309.2761 274.51337 296.2681 #> 6 470.9388 349.9361 398.4195 239.3144 370.6294 369.7209 82.35401 117.3142 #> out$p.value fc12 fc13 fc23 #> 1 0.0001164498 -1.360633 -1.746148 -1.283335 #> 2 0.0016150240 -1.938309 -2.298320 -1.185735 #> 3 0.4501729545 -1.029825 -1.248907 -1.212738 #> 4 0.0003105657 -1.342337 -1.808716 -1.347438 #> 5 0.2521494212 -1.245834 -1.252351 -1.005232 #> 6 0.0001057092 -1.244604 -4.071068 -3.270976 ``` Note that a positive fold change value means up-regulation where the average of the second group is higher than that of the first group. Conversely, a negative value means down-regulation where the average of the first group is higher than that of the second. If one group contains all zeros, a positive or negative big value is returned (default BIG = 10000). ## Paired test The inverted beta-binomial test (Pham \& Jimenez, Bioinformatics 2012) is used for paired sample testing, for example between pre-treatment and post-treatment data. The following is an example of a paired test. ```{r} x <- c(33, 32, 86, 51, 52, 149) tx <- c(7742608, 15581382, 20933491, 7126839, 13842297, 14760103) group <- c(rep("cancer", 3), rep("normal", 3)) countdata::ibb.test(x, tx, group) #> Using a single CPU core ... #> No. of data rows = 1, no. of pair(s) = 3... #> 100% #> Done. #> $p.value #> [1] 0.004103636 #> #> $fc #> [1] 2.137632 ``` Analogously to the unpaired situation, the test can be performed for a data matrix row by row. The following example compares two groups where columns 1 to 3 are respectively paired with columns 4 to 6. ```{r} d <- read.delim("https://tvpham.github.io/data/example-paired.txt") # perform a paired test for all rows out <- countdata::ibb.test(d[, 1:6], colSums(d[, 1:6]), c(rep("pre_treatment", 3), rep("post_treatment", 3))) #> Using 11 thread(s) ... #> No. of data rows = 2919, no. of pair(s) = 3... #> 0% #> 5% #> 10% #> 16% #> 21% #> 26% #> 31% #> 37% #> 42% #> 47% #> 52% #> 57% #> 62% #> 67% #> 73% #> 78% #> 83% #> 88% #> 93% #> 98% #> Done. ``` The result `out` is a list of two elements where `p.value` is the p-value of the test and `fc` an estimation of the common fold change. We can calculate the normalized data and write out the result as follows. ```{r} d.norm <- countdata::normalize(d[, 1:6]) head(cbind(d.norm, out$p.value, out$fc)) #> pre.1 pre.2 pre.3 post.1 post.2 post.3 out$p.value out$fc #> 1 594.2861 176.8823 347.1786 490.0689 164.2960 208.5462 0.064856368 -1.297663 #> 2 303.8611 242.1015 265.3067 384.2916 372.5316 270.9056 0.072525346 1.259570 #> 3 302.8275 278.6637 331.6333 361.0012 229.2502 208.5462 0.334902070 -1.168634 #> 4 313.1629 278.6637 259.0885 297.9231 232.1159 232.0588 0.045979358 -1.117328 #> 5 409.2823 267.7939 177.2166 317.3317 206.3252 105.2954 0.006665643 -1.356184 #> 6 245.9828 257.9122 280.8520 237.7562 223.5190 219.7913 0.048216659 -1.151104 ``` ## References 1. Benjamini Y, Hochberg Y (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. _Journal of the Royal Statistical Society Series B_ 57, 289–300. 1. Pham TV, Piersma SR, Warmoes M, Jimenez CR (2010) On the beta binomial model for analysis of spectral count data in label-free tandem mass spectrometry-based proteomics. _Bioinformatics_, 26(3):363-369. 1. Pham TV, Jimenez CR (2012) An accurate paired sample test for count data. _Bioinformatics_, 28(18):i596-i602.