ZicoSeq
is a linear model and permutation-based method
for dfferential abundance analysis of zero-inflated compositional
sequencing data such as microbiome sequencing data. It has the following
components:
Currently ZicoSeq
supports:
count data or proportion data. For both count and proportion data, a reference-based ratio approach is used to account for compositional effects. When a count matrix is provided, it provides an option to draw posterior samples of the underlying proportions to account for the sampling variability during the sequencing process. The test results are aggregated over these posterior samples.
other data types. As a general methodology,
ZicoSeq
can be used to differential analysis of other
high-dimensional datasets such as transcriptomics, epigenomics,
metabolomics, and proteomics data.
Install the package.
Load the package.
ZicoSeq
This example data set contains the OTU count data from the throat
microbiome of the left body side, which is available in
GUniFrac
package. It contains 60 subjects consisting of 32
nonsmokers and 28 smokers (Charlson,
Emily S., et al., 2010).. In this example, we will identify
smoking-associated OTUs while adjusting the sex since sex is a potential
confounder (we see more males in smokers).
ZicoSeq
functionZicoSeq.obj <- ZicoSeq(meta.dat = meta.dat, feature.dat = comm,
grp.name = 'SmokingStatus', adj.name = 'Sex', feature.dat.type = "count",
# Filter to remove rare taxa
prev.filter = 0.2, mean.abund.filter = 0,
max.abund.filter = 0.002, min.prop = 0,
# Winsorization to replace outliers
is.winsor = TRUE, outlier.pct = 0.03, winsor.end = 'top',
# Posterior sampling
is.post.sample = TRUE, post.sample.no = 25,
# Use the square-root transformation
link.func = list(function (x) x^0.5), stats.combine.func = max,
# Permutation-based multiple testing correction
perm.no = 99, strata = NULL,
# Reference-based multiple stage normalization
ref.pct = 0.5, stage.no = 6, excl.pct = 0.2,
# Family-wise error rate control
is.fwer = TRUE, verbose = TRUE, return.feature.dat = TRUE)
ZicoSeq
Microbiome data may contain outliers with extremely large counts.
These outliers could reduce the statistical power, or more seriously
increase the type I error rate. In ZicoSeq
, we use
winsorization to replace those extremely large counts with a certain
quantile (e.g. 97%). We recommend always setting
is.winsor = TRUE
for real data analysis. The specific
quantile used depends on the expected proportion of outliers.
Posterior sampling can be enabled by setting
is.post.sample = TRUE
. It can only be used when the data
type is “count”. For microbiome data, the majority of the taxa are rare,
resulting in a large number of zeros in the data given a limited
sequencing depth. As the probability of zero strongly depends on the
sequencing depth, zeros from different samples are not comparable. In
ZicoSeq
, we provide a method to infer/impute the underlying
true proportions using an empirical Bayes approach by pooling
information across samples. The imputed proportions depend on both the
sequencing depth and the estimated (prior) distribution of the
proportions across samples. To better model the prior distribution, we
used beta mixtures instead a single-component beta distribution.
Posterior sampling effectively reduces the type I error when the
sequencing depth is associated with the covariate of interest
(e.g. cases and controls differ in sequencing depth) and is more
powerful than rarefaction. When the sequencing depth is not a
confounding factor, is.post.sample
slightly improves the
power and could be disabled for analyzing large datasets.
The relationship between the taxa abundance and the covariate of
interest may vary across taxa, a single transformation function may not
be powerful enough to capture diverse relationships.
ZicoSeq
allows specifying multiple possible transformation
functions (link.func
) and performs omnibus
testing. Although the log transformation is commonly used for microbiome
due to its interpretability, we found that power transformations may be
more powerful. The default is the square-root transformation.
ZicoSeq
uses a reference-based approach to address
compositional effects. The default uses 30% of the least
variable/significant taxa as the reference. It starts with selecting 50%
of the least variable taxa as the reference and differential abundance
testing is then run multiple times excluding 20% taxa with the smallest
p-values in each iteration. This procedure is to make sure the remaining
30% are more likely to be non-differential.
ZicoSeq
uses a permutation-based approach to control
false discovery rate (FDR). The permutation-based FDR control preserves
the correlation structure in the abundance data and is more robust and
powerful than the traditional BH-based FDR control based on raw p
values. ZicoSeq
also allows performing permutation-based
family-wise error rate control. Users can determine the appropriate
error control method to suit their needs.
ZicoSeq
outputThe output of ZicoSeq
consists of mainly:
p.raw
: raw p-values based on permutations, it will not
be accurate if perm.no
is small.p.adj.fdr
: permutation-based FDR-adjusted
p-values.p.adj.fwer
: permutation-based FWER-adjusted
(West-Young), only if is.fwer = T
in ZicoSeq
.
perm.no
is recommened to set at least 999 for accurate
FWER-adjusted p-values.A matrix of R^2 values (number of features by number of transformation functions) will be provided. R^2 is an effect size measure and can be used to assess the association strength between the taxa abundance and the covariate of interest while adjusting for other covariates. When the omnibus testing is in action, i.e., multiple transformations are used, there will be multiple R^2 s, each corresponding to a specific transformation.
ZicoSeq
output visualizationZicoSeq.plot
function produces a volcano plot with the
y-axis being the log10 (adjusted) p-value and the x-axis being the
signed R^2 with the sign indicating the association direction determined
based on the sign of the regression coefficient (for mutli-categorical
variables, sign is not applicable). When multiple transformation
functions are used, the largest R^2 is used. The names of differential
taxa passing a specific significance cutoff will be printed on the
figure. When data types are counts and proportions, the mean abundance
and prevalence will be visualized; when the data type is
other
, mean and standard deviation of the features will be
visualized. Users need to set return.feature.dat = T
when
using the plot function.
For some bioinformatics pipelines, the output could be proportion
data. ZicoSeq
can be applied to proportion data by
specifying feature.dat.type = "proportion"
. Posterior
sampling will not be applied when analyzing the proportions and other
parameter settings are similar to the count case.
comm.p <- t(t(comm) / colSums(comm))
ZicoSeq.obj.p <- ZicoSeq(meta.dat = meta.dat, feature.dat = comm.p,
grp.name = 'SmokingStatus', adj.name = 'Sex', feature.dat.type = "proportion",
# Filter to remove rare taxa
prev.filter = 0.2, mean.abund.filter = 0, max.abund.filter = 0.002, min.prop = 0,
# Winsorization to replace outliers
is.winsor = TRUE, outlier.pct = 0.03, winsor.end = 'top',
# Posterior sampling will be automatically disabled
is.post.sample = FALSE, post.sample.no = 25,
# Use the square-root transformation
link.func = list(function (x) x^0.5, function (x) x^0.25), stats.combine.func = max,
# Permutation-based multiple testing correction
perm.no = 99, strata = NULL,
# Reference-based multiple stage normalization
ref.pct = 0.5, stage.no = 6, excl.pct = 0.2,
# Family-wise error rate control
is.fwer = TRUE, verbose = TRUE, return.feature.dat = T)
suppressWarnings(ZicoSeq.plot(ZicoSeq.obj = ZicoSeq.obj.p, pvalue.type = 'p.adj.fdr',
cutoff = 0.1, text.size = 10, out.dir = NULL, width = 10, height = 6))
ZicoSeq
as a general linear model-based permutation test
can be applied to association analyses of other high-dimensional
datasets such as transcriptome-, methylome-, metabolome- and
proteome-wide association testing, where the omics features are treated
as the outcomes. In this case, posterior sampling will be automatically
disabled. The user should be responsible for properly normalizing,
transforming, and filtering the data before applying
ZicoSeq
. The user should also decide whether to winsorize
the top, bottom or both ends of the distribution. The following code is
just for demonstration purposes and does not mean to be applied for
differential abundance analysis.
comm.o <- comm[rowMeans(comm != 0) >= 0.2, ] + 1
comm.o <- log(t(t(comm.o) / colSums(comm.o)))
ZicoSeq.obj.o <- ZicoSeq(meta.dat = meta.dat, feature.dat = comm.o,
grp.name = 'SmokingStatus', adj.name = 'Sex', feature.dat.type = "other",
# Filter will not be applied
prev.filter = 0, mean.abund.filter = 0, max.abund.filter = 0, min.prop = 0,
# Winsorization the top end
is.winsor = TRUE, outlier.pct = 0.03, winsor.end = 'top',
# Posterior sampling will be automatically disabled
is.post.sample = FALSE, post.sample.no = 25,
# Identity function is used
link.func = list(function (x) x), stats.combine.func = max,
# Permutation-based multiple testing correction
perm.no = 99, strata = NULL,
# Reference-based multiple-stage normalization will not be performed
ref.pct = 0.5, stage.no = 6, excl.pct = 0.2,
# Family-wise error rate control
is.fwer = TRUE, verbose = TRUE, return.feature.dat = T)
ZicoSeq.plot(ZicoSeq.obj = ZicoSeq.obj.o, pvalue.type = 'p.adj.fdr',
cutoff = 0.1, text.size = 10, out.dir = NULL, width = 10, height = 6)
ZicoSeq
can be applied to perform within-subject
comparisons (e.g. before treatment vs. after treatment) by using the
strata
parameter. For example, if the subject variable is
“subject_id” in the meta data file, the user can set
strata = 'subject_id'
.
Lu Yang & Jun Chen. 2022. A comprehensive evaluation of differential abundance analysis methods: current status and potential solutions. Microbiome. In revision.
Charlson ES, Chen J, Custers-Allen R, Bittinger K, Li H, Sinha R, Hwang J, Bushman FD, Collman RG. Disordered microbial communities in the upper respiratory tract of cigarette smokers. PLoS One. 2010 Dec 20;5(12):e15216. doi: 10.1371/journal.pone.0015216. PMID: 21188149; PMCID: PMC3004851.
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] GUniFrac_1.8 rmarkdown_2.29
##
## loaded via a namespace (and not attached):
## [1] sass_0.4.9 generics_0.1.3 lattice_0.22-6
## [4] statip_0.2.3 digest_0.6.37 magrittr_2.0.3
## [7] evaluate_1.0.1 grid_4.4.2 iterators_1.0.14
## [10] fastmap_1.2.0 foreach_1.5.2 jsonlite_1.8.9
## [13] Matrix_1.7-1 ggrepel_0.9.6 ape_5.8-1
## [16] modeest_2.4.0 mgcv_1.9-1 scales_1.3.0
## [19] stabledist_0.7-2 permute_0.9-7 codetools_0.2-20
## [22] jquerylib_0.1.4 timeSeries_4041.111 cli_3.6.3
## [25] rlang_1.1.4 munsell_0.5.1 splines_4.4.2
## [28] withr_3.0.2 cachem_1.1.0 yaml_2.3.10
## [31] vegan_2.6-8 tools_4.4.2 inline_0.3.20
## [34] rmutil_1.1.10 parallel_4.4.2 stable_1.1.6
## [37] dplyr_1.1.4 colorspace_2.1-1 ggplot2_3.5.1
## [40] rpart_4.1.23 buildtools_1.0.0 vctrs_0.6.5
## [43] R6_2.5.1 matrixStats_1.4.1 lifecycle_1.0.4
## [46] clue_0.3-66 MASS_7.3-61 cluster_2.1.8
## [49] pkgconfig_2.0.3 bslib_0.8.0 pillar_1.10.0
## [52] gtable_0.3.6 glue_1.8.0 Rcpp_1.0.13-1
## [55] statmod_1.5.0 xfun_0.49 tibble_3.2.1
## [58] tidyselect_1.2.1 sys_3.4.3 knitr_1.49
## [61] farver_2.1.2 spatial_7.3-17 fBasics_4041.97
## [64] htmltools_0.5.8.1 nlme_3.1-166 labeling_0.4.3
## [67] maketools_1.3.1 timeDate_4041.110 compiler_4.4.2