Title: | Statistical Methods for Microbiome Compositional Data |
---|---|
Description: | A suite of methods for powerful and robust microbiome data analysis addressing zero-inflation, phylogenetic structure and compositional effects (Zhou et al. (2022)<doi:10.1186/s13059-022-02655-5>). The methods can be applied to the analysis of other (high-dimensional) compositional data arising from sequencing experiments. |
Authors: | Xianyang Zhang [aut], Jun Chen [aut, cre], Huijuan Zhou [ctb] |
Maintainer: | Jun Chen <[email protected]> |
License: | GPL-3 |
Version: | 1.2 |
Built: | 2024-10-29 06:20:44 UTC |
Source: | CRAN |
The function implements a simple, robust and highly scalable approach to tackle the compositional effects in differential abundance analysis of high-dimensional compositional data. It fits linear regression models on the centered log2-ratio transformed data, identifies a bias term due to the transformation and compositional effect, and corrects the bias using the mode of the regression coefficients. It could fit mixed-effect models for analysis of correlated data.
linda( feature.dat, meta.dat, formula, feature.dat.type = c('count', 'proportion'), prev.filter = 0, mean.abund.filter = 0, max.abund.filter = 0, is.winsor = TRUE, outlier.pct = 0.03, adaptive = TRUE, zero.handling = c('pseudo-count', 'imputation'), pseudo.cnt = 0.5, corr.cut = 0.1, p.adj.method = "BH", alpha = 0.05, n.cores = 1, verbose = TRUE )
linda( feature.dat, meta.dat, formula, feature.dat.type = c('count', 'proportion'), prev.filter = 0, mean.abund.filter = 0, max.abund.filter = 0, is.winsor = TRUE, outlier.pct = 0.03, adaptive = TRUE, zero.handling = c('pseudo-count', 'imputation'), pseudo.cnt = 0.5, corr.cut = 0.1, p.adj.method = "BH", alpha = 0.05, n.cores = 1, verbose = TRUE )
feature.dat |
a matrix of counts/proportions, row - features (OTUs, genes, etc) , column - samples. |
meta.dat |
a data frame containing the sample meta data. If there are NAs, the corresponding samples will be removed in the analysis. |
formula |
a character string for the formula. The formula should conform to that used by |
feature.dat.type |
the type of the feature data. It could be "count" or "proportion". |
prev.filter |
the prevalence (percentage of non-zeros) cutoff, under which the features will be filtered. The default is 0. |
mean.abund.filter |
the mean relative abundance cutoff, under which the features will be filtered. The default is 0. |
max.abund.filter |
the max relative abundance cutoff, under which the features will be filtered. The default is 0. |
is.winsor |
a logical value indicating whether winsorization should be performed to replace outliers (high values). The default is TRUE. |
outlier.pct |
the expected percentage of outliers. These outliers will be winsorized. The default is 0.03. |
adaptive |
a logical value indicating whether the approach to handle zeros (pseudo-count or imputation)
will be determined based on the correlations between the log(sequencing depth) and the explanatory variables
in |
zero.handling |
a character string of 'pseudo-count' or 'imputation' indicating the zero handling method
used when |
pseudo.cnt |
a positive numeric value for the pseudo-count to be added if |
corr.cut |
a numerical value between 0 and 1, indicating the significance level used for determining
the zero-handling approach when |
p.adj.method |
a character string indicating the p-value adjustment approach for
addressing multiple testing. See R function |
alpha |
a numerical value between 0 and 1 indicating the significance level for declaring differential features. Default is 0.05. |
n.cores |
a positive integer. If |
verbose |
a logical value indicating whether the trace information should be printed out. |
A list with the elements
variables |
A vector of variable names of all fixed effects in |
bias |
numeric vector; each element corresponds to one variable in |
output |
a list of data frames with columns 'baseMean', 'log2FoldChange', 'lfcSE', 'stat', 'pvalue', 'padj', 'reject',
'df';
|
covariance |
a list of data frames; the data frame records the covariances between a regression coefficient with other coefficients;
|
otu.tab.use |
the OTU table used in the abundance analysis (the |
meta.use |
the meta data used in the abundance analysis (only variables in |
wald |
a list for use in Wald test. If the fitting model is a linear model, then it includes
If the fitting model is a linear mixed-effect model, then it includes
|
Huijuan Zhou, Jun Chen, Xianyang Zhang
Zhou, H., He, K., Chen, J., Zhang, X. (2022). LinDA: linear models for differential abundance analysis of microbiome compositional data. Genome biology, 23(1), 95.
data(smokers) ind <- smokers$meta$AIRWAYSITE == 'Throat' otu.tab <- as.data.frame(smokers$otu[, ind]) depth <- colSums(otu.tab) meta <- cbind.data.frame(Smoke = factor(smokers$meta$SMOKER[ind]), Sex = factor(smokers$meta$SEX[ind]), Site = factor(smokers$meta$SIDEOFBODY[ind]), SubjectID = factor(smokers$meta$HOST_SUBJECT_ID[ind])) # Differential abundance analysis using the left throat data ind1 <- meta$Site == 'Left' & depth >= 1000 linda.obj <- linda(otu.tab[, ind1], meta[ind1, ], formula = '~Smoke+Sex', feature.dat.type = 'count', prev.filter = 0.1, is.winsor = TRUE, outlier.pct = 0.03, p.adj.method = "BH", alpha = 0.1) linda.plot(linda.obj, c('Smokey', 'Sexmale'), titles = c('Smoke: n v.s. y', 'Sex: female v.s. male'), alpha = 0.1, lfc.cut = 1, legend = TRUE, directory = NULL, width = 11, height = 8) rownames(linda.obj $output[[1]])[which(linda.obj $output[[1]]$reject)] # Differential abundance analysis pooling both the left and right throat data # Mixed effects model is used ind <- depth >= 1000 linda.obj <- linda(otu.tab[, ind], meta[ind, ], formula = '~Smoke+Sex+(1|SubjectID)', feature.dat.type = 'count', prev.filter = 0.1, is.winsor = TRUE, outlier.pct = 0.03, p.adj.method = "BH", alpha = 0.1) # For proportion data otu.tab.p <- t(t(otu.tab) / colSums(otu.tab)) ind1 <- meta$Site == 'Left' & depth >= 1000 lind.obj <- linda(otu.tab[, ind1], meta[ind1, ], formula = '~Smoke+Sex', feature.dat.type = 'proportion', prev.filter = 0.1, is.winsor = TRUE, outlier.pct = 0.03, p.adj.method = "BH", alpha = 0.1)
data(smokers) ind <- smokers$meta$AIRWAYSITE == 'Throat' otu.tab <- as.data.frame(smokers$otu[, ind]) depth <- colSums(otu.tab) meta <- cbind.data.frame(Smoke = factor(smokers$meta$SMOKER[ind]), Sex = factor(smokers$meta$SEX[ind]), Site = factor(smokers$meta$SIDEOFBODY[ind]), SubjectID = factor(smokers$meta$HOST_SUBJECT_ID[ind])) # Differential abundance analysis using the left throat data ind1 <- meta$Site == 'Left' & depth >= 1000 linda.obj <- linda(otu.tab[, ind1], meta[ind1, ], formula = '~Smoke+Sex', feature.dat.type = 'count', prev.filter = 0.1, is.winsor = TRUE, outlier.pct = 0.03, p.adj.method = "BH", alpha = 0.1) linda.plot(linda.obj, c('Smokey', 'Sexmale'), titles = c('Smoke: n v.s. y', 'Sex: female v.s. male'), alpha = 0.1, lfc.cut = 1, legend = TRUE, directory = NULL, width = 11, height = 8) rownames(linda.obj $output[[1]])[which(linda.obj $output[[1]]$reject)] # Differential abundance analysis pooling both the left and right throat data # Mixed effects model is used ind <- depth >= 1000 linda.obj <- linda(otu.tab[, ind], meta[ind, ], formula = '~Smoke+Sex+(1|SubjectID)', feature.dat.type = 'count', prev.filter = 0.1, is.winsor = TRUE, outlier.pct = 0.03, p.adj.method = "BH", alpha = 0.1) # For proportion data otu.tab.p <- t(t(otu.tab) / colSums(otu.tab)) ind1 <- meta$Site == 'Left' & depth >= 1000 lind.obj <- linda(otu.tab[, ind1], meta[ind1, ], formula = '~Smoke+Sex', feature.dat.type = 'proportion', prev.filter = 0.1, is.winsor = TRUE, outlier.pct = 0.03, p.adj.method = "BH", alpha = 0.1)
The function produces the effect size plot of the differential features and volcano plot based on the output from linda
.
linda.plot( linda.obj, variables.plot, titles = NULL, alpha = 0.05, lfc.cut = 1, legend = FALSE, directory = NULL, width = 11, height = 8 )
linda.plot( linda.obj, variables.plot, titles = NULL, alpha = 0.05, lfc.cut = 1, legend = FALSE, directory = NULL, width = 11, height = 8 )
linda.obj |
return from function |
variables.plot |
vector; variables whose results are to be plotted. For example, suppose the return
value |
titles |
vector; titles of the effect size plot and volcano plot for each variable in |
alpha |
a numerical value between 0 and 1; cutoff for |
lfc.cut |
a positive numerical value; cutoff for |
legend |
TRUE or FALSE; whether to show the legends of the effect size plot and volcano plot. |
directory |
character; the directory to save the figures, e.g., |
width |
the width of the graphics region in inches. See R function |
height |
the height of the graphics region in inches. See R function |
A list of ggplot2
objects.
plot.lfc |
a list of effect size plots. Each plot corresponds to one variable in |
plot.volcano |
a list of volcano plots. Each plot corresponds to one variable in |
Huijuan Zhou, Jun Chen, Xianyang Zhang
Zhou, H., He, K., Chen, J., Zhang, X. (2022). LinDA: linear models for differential abundance analysis of microbiome compositional data. Genome biology, 23(1), 95.
data(smokers) ind <- smokers$meta$AIRWAYSITE == 'Throat' & smokers$meta$SIDEOFBODY == 'Left' otu.tab <- as.data.frame(smokers$otu[, ind]) depth <- colSums(otu.tab) meta <- cbind.data.frame(Smoke = factor(smokers$meta$SMOKER[ind]), Sex = factor(smokers$meta$SEX[ind])) ind <- depth >= 1000 linda.obj <- linda(otu.tab[, ind], meta[ind, ], formula = '~Smoke+Sex', feature.dat.type = 'count', prev.filter = 0.1, is.winsor = TRUE, outlier.pct = 0.03, p.adj.method = "BH", alpha = 0.1) linda.plot(linda.obj, c('Smokey', 'Sexmale'), titles = c('Smoke: n v.s. y', 'Sex: female v.s. male'), alpha = 0.1, lfc.cut = 1, legend = TRUE, directory = NULL, width = 11, height = 8)
data(smokers) ind <- smokers$meta$AIRWAYSITE == 'Throat' & smokers$meta$SIDEOFBODY == 'Left' otu.tab <- as.data.frame(smokers$otu[, ind]) depth <- colSums(otu.tab) meta <- cbind.data.frame(Smoke = factor(smokers$meta$SMOKER[ind]), Sex = factor(smokers$meta$SEX[ind])) ind <- depth >= 1000 linda.obj <- linda(otu.tab[, ind], meta[ind, ], formula = '~Smoke+Sex', feature.dat.type = 'count', prev.filter = 0.1, is.winsor = TRUE, outlier.pct = 0.03, p.adj.method = "BH", alpha = 0.1) linda.plot(linda.obj, c('Smokey', 'Sexmale'), titles = c('Smoke: n v.s. y', 'Sex: female v.s. male'), alpha = 0.1, lfc.cut = 1, legend = TRUE, directory = NULL, width = 11, height = 8)
The function implements Wald test for bias-corrected regression coefficients generated from the linda
function.
One can utilize the function to perform ANOVA-type analyses. For example, if you have a cateogrical variable with three levels, you can test whether all levels have the same effect.
linda.wald.test( linda.obj, L, model = c("LM", "LMM"), alpha = 0.05, p.adj.method = "BH" )
linda.wald.test( linda.obj, L, model = c("LM", "LMM"), alpha = 0.05, p.adj.method = "BH" )
linda.obj |
return from the |
L |
A matrix for testing |
model |
|
alpha |
significance level for testing |
p.adj.method |
P-value adjustment approach. See R function |
A data frame with columns
Fstat |
Wald statistics for each taxon |
df1 |
The numerator degrees of freedom |
df2 |
The denominator degrees of freedom |
pvalue |
|
padj |
|
reject |
|
Huijuan Zhou [email protected] Jun Chen [email protected] Xianyang Zhang [email protected]
Zhou, H., He, K., Chen, J., Zhang, X. (2022). LinDA: linear models for differential abundance analysis of microbiome compositional data. Genome biology, 23(1), 95.
data(smokers) ind <- smokers$meta$AIRWAYSITE == 'Throat' otu.tab <- as.data.frame(smokers$otu[, ind]) depth <- colSums(otu.tab) meta <- cbind.data.frame(Smoke = factor(smokers$meta$SMOKER[ind]), Sex = factor(smokers$meta$SEX[ind]), Site = factor(smokers$meta$SIDEOFBODY[ind]), SubjectID = factor(smokers$meta$HOST_SUBJECT_ID[ind])) ind <- depth >= 1000 linda.obj <- linda(otu.tab[, ind], meta[ind, ], formula = '~Smoke+Sex+(1|SubjectID)', feature.dat.type = 'count', prev.filter = 0.1, is.winsor = TRUE, outlier.pct = 0.03, p.adj.method = "BH", alpha = 0.1) # L matrix (2x3) is designed to test the second (Smoke) and the third (Sex) coefficient to be 0. # For a categorical variable > two levels, similar L can be designed to do ANOVA-type test. L <- matrix(c(0, 1, 0, 0, 0, 1), nrow = 2, byrow = TRUE) result <- linda.wald.test(linda.obj, L, 'LMM', alpha = 0.1)
data(smokers) ind <- smokers$meta$AIRWAYSITE == 'Throat' otu.tab <- as.data.frame(smokers$otu[, ind]) depth <- colSums(otu.tab) meta <- cbind.data.frame(Smoke = factor(smokers$meta$SMOKER[ind]), Sex = factor(smokers$meta$SEX[ind]), Site = factor(smokers$meta$SIDEOFBODY[ind]), SubjectID = factor(smokers$meta$HOST_SUBJECT_ID[ind])) ind <- depth >= 1000 linda.obj <- linda(otu.tab[, ind], meta[ind, ], formula = '~Smoke+Sex+(1|SubjectID)', feature.dat.type = 'count', prev.filter = 0.1, is.winsor = TRUE, outlier.pct = 0.03, p.adj.method = "BH", alpha = 0.1) # L matrix (2x3) is designed to test the second (Smoke) and the third (Sex) coefficient to be 0. # For a categorical variable > two levels, similar L can be designed to do ANOVA-type test. L <- matrix(c(0, 1, 0, 0, 0, 1), nrow = 2, byrow = TRUE) result <- linda.wald.test(linda.obj, L, 'LMM', alpha = 0.1)
A dataset containing "otu", "tax", meta", "genus", family"
data(smokers)
data(smokers)
A list with elements
otu table, 2156 taxa by 290 samples
taxonomy table, 2156 taxa by 7 taxonomic ranks
meta table, 290 samples by 53 sample variables
304 by 290
113 by 290
https://qiita.ucsd.edu/ study ID:524 Reference: Charlson ES, Chen J, Custers-Allen R, Bittinger K, Li H, et al. (2010) Disordered Microbial Communities in the Upper Respiratory Tract of Cigarette Smokers. PLoS ONE 5(12): e15216.