--- title: "Plotting two chemical metrics" author: "Jeffrey M. Dick" output: html_document: mathjax: null toc: true css: style.css vignette: > %\VignetteIndexEntry{Plotting two chemical metrics} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} bibliography: chem16S.bib csl: elementa.csl link-citations: true --- ```{r HTML, include = FALSE} Zc <- "Z_C" nH2O <- "n_H₂O" ``` ```{r setup, include = FALSE} oldopt <- options(width = 80) ## Use pngquant to optimize PNG images library(knitr) knit_hooks$set(pngquant = hook_pngquant) pngquant <- "--speed=1 --quality=0-25" if (!nzchar(Sys.which("pngquant"))) pngquant <- NULL # To make warnings appear in text box 20230619 # https://selbydavid.com/2017/06/18/rmarkdown-alerts/ knitr::knit_hooks$set( error = function(x, options) { paste('\n\n

', gsub('##', '\n', gsub('^##\ Error', '**Error:**', x)), '

', sep = '\n') }, warning = function(x, options) { paste('\n\n

', gsub('##', '\n', gsub('^##\ Warning:', '**Warning:**', x)), '

', sep = '\n') }, message = function(x, options) { paste('\n\n

', gsub('##', '\n', x), '

', sep = '\n') } ) ## Don't evaluate chunks if phyloseq is not available 20230619 #if(!requireNamespace("phyloseq", quietly = TRUE)) { # knitr::opts_chunk$set(eval = FALSE) # warning("The **phyloseq** package is not available, so this vignette shows only the code without the results.") #} ``` Community-level chemical metrics are computed from the elemental compositions of community reference proteomes, which in turn are derived from genomic protein sequences weighted by taxonomic abundances. Unlike the synthetic variables in ordination methods (e.g. the principal components in PCA), chemical metrics are analogous to thermodynamic components that are defined independently of the data, so they can be compared across datasets. Their theoretical foundation and cross-dataset applicability facilitates the use of chemical metrics to explore hypotheses about genomic adaptation to multiple physicochemical variables at a global scale. In this vignette we'll analyze **phyloseq**'s `GlobalPatterns` dataset [based on data from @CLW+11] to visualize chemical variation of community reference proteomes across environments. Then, we'll explore specific hypotheses about the effects of redox conditions and salinity on genomic adaptation by analyzing datasets for microbial communities in marine sediment [@FEN+22] and geothermal waters [@ZFZ+23]. ## Load required packages ```{r load_packages, message = FALSE} require(chem16S) require(phyloseq) require(ggplot2) theme_set(theme_bw()) # For composing plots and making a common legend (plot_layout()) require(patchwork) # For annotating plots with regression coefficients (stat_poly_line()) require(ggpmisc) ``` This vignette was compiled on `r Sys.Date()` with **chem16S** version `r sessionInfo()$otherPkgs$chem16S$Version` and **phyloseq** version `r sessionInfo()$otherPkgs$phyloseq$Version`. ## GlobalPatterns We will use the `GlobalPatterns` dataset 'as-is', without the preprocessing described in **phyloseq**'s [Ordination Plots](https://joey711.github.io/phyloseq/plot_ordination-examples) tutorial. There, less-abundant OTUs and phyla were removed in order to show high-level trends and shorten computing time. One step we do take from that tutorial is the addition of a categorical variable that identifies whether the samples are human-associated: ```{r load_GlobalPatterns} data(GlobalPatterns) Human = get_variable(GlobalPatterns, "SampleType") %in% c("Feces", "Mock", "Skin", "Tongue") sample_data(GlobalPatterns)$Human <- factor(Human) ``` Taxonomic assignments were made using the RDP Classifier without any additional training set specified [@CLW+11], so we use the `refdb = "RefSeq_206"` argument to apply the manual mappings between RDP and RefSeq described by @DT23. ```{r plot_metrics2.GlobalPatterns, collapse = TRUE} p2 <- plot_ps_metrics2(GlobalPatterns, color = "SampleType", shape = "Human", refdb = "RefSeq_206") ``` ```{r plot_metrics2.GlobalPatterns_geom, fig.width = 6, fig.height = 4, fig.align = "center", pngquant = pngquant} p2 + geom_polygon(aes(fill = SampleType), alpha = 0.5) + geom_point(size = 3) + guides(colour = guide_legend(override.aes = list(shape = c(17, 19, 19, 17, 19, 19, 17, 19, 17)))) ``` At the extremes of carbon oxidation state (`r Zc`), soil communities are the most oxidized and skin and tongue communities are the most reduced. At the extremes of stoichiometric hydration state (`r nH2O`), skin communities are the most hydrated and some fecal communities are the least hydrated. In more detail, there are environmental microbiomes that show similar ranges of chemical metrics (e.g., Freshwater (creek) and Sediment (estuary)) and others that are different. Freshwater -- described as "lake" by @CLW+11 -- has lower `r Zc` than Freshwater (creek), and some ocean samples have lower `r nH2O` than either freshwater group. These patterns could suggest an influence of greater oxygenation in flowing water compared to lakes (this is a distinction between lotic and lentic systems), and dehydration in communities adapted to life in salty water compared to freshwater. ## Humboldt Sulfuretum (Marine Sediment) ::: {.infobox .note} In the remainder of this vignette, we will use the GTDB reference database (which is the default in **chem16S**) because the taxonomic classifications for the datasets analyzed below were made using the GTDB training set provided by @Ali23. ::: @FEN+22 reported 16S rRNA gene sequences for sediment samples from the oxygen minimum zone of the Pacific Ocean along the coast of Chile, known as the Humboldt Sulfuretum. The sample data include dissolved oxygen, redox potential in sediment and overlying water, and organic matter (OM) content. This is a useful dataset for exploring the hypothesis that `r Zc` of proteins is shaped by redox conditions. Here we read the `phyloseq-class` object created by using DADA2 [@CMR+16] to identify amplicon sequence variants (ASVs) in this dataset and to classify them using the GTDB training set. A sample taken from 50 m depth at the Valparaiso location on 2012-05-12 is available in the Sequence Read Archive (SRA) but was not included in the analysis described by @FEN+22. The taxonomic composition of this sample is highly different from the all the others (see the ordination plots in the `extdata` directory where the `ps_FEN+22.rds` file is located), so we exclude it to avoid anomalous results. ```{r read_Humboldt} psfile <- system.file("extdata/DADA2-GTDB_214/FEN+22/ps_FEN+22.rds", package = "chem16S") ps <- readRDS(psfile) ps <- prune_samples(sample_names(ps) != "SRR1346095", ps) ps ``` Then, we plot `r Zc` and `r nH2O` for the community reference proteomes using different colors for sample groups. ```{r plot_metrics2.Humboldt, fig.width = 6, fig.height = 4, fig.align = "center", pngquant = pngquant} plot_ps_metrics2(ps, refdb = "GTDB_214", color = "Location") + geom_polygon(aes(fill = Location), alpha = 0.5) + geom_point(size = 3) ``` Among the areas with more than one sample, the community reference proteomes for Iquique are more reduced (i.e., have lower `r Zc`) than those for Concepcion and Valparaiso. Two of the communities at Valparaiso are also characterized by higher `r nH2O`, suggesting the influence of a hydrating factor. Let's take a step toward more quantitative tests of these hypotheses about genomic adaptation to environments. The color scales in the next two plots reflect sediment redox potential and concentration of organic matter. The rationale for choosing these environmental measurements is described below. ```{r check_patchwork, echo = FALSE} # Don't evaluate remaining chunks if patchwork is not available 20230627 if(!requireNamespace("patchwork", quietly = TRUE)) { knitr::opts_chunk$set(eval = FALSE) warning("The **patchwork** package is not available, so the remaining plots are not shown.") } ``` ```{r plot_metrics2.Humboldt_redox_OM, fig.width = 6, fig.height = 5, fig.align = "center", pngquant = pngquant} p2 <- plot_ps_metrics2(ps, refdb = "GTDB_214", color = "Sediment_redox") + geom_point(size = 4) + labs(color = "Sediment redox (Eh)") p3 <- plot_ps_metrics2(ps, refdb = "GTDB_214", color = "Organic_matter") + geom_point(size = 4) + labs(color = "Organic matter (%)") p2 / p3 ``` Thermodynamic considerations predict a positive correlation between redox potential and `r Zc` [@DM23]. It has also been predicted that salinity has a dehydrating effect that favors proteins with `r nH2O` [@DYT20]. However, these samples have no documented salinity gradient. Previous observations that protein expression in cells is shifted toward lower `r nH2O` under hyperglycemic (high-glucose) conditions [@DYT20] suggest another hypothesis: a higher content of organic matter may be a proxy for dehydrating conditions. We can use correlations between two environmental variables (redox potential or OM) and two chemical metrics for communities (`r Zc` or `r nH2O`) in order to test these hypotheses. To make the plots, let's construct a single data frame containing the sample data and chemical metrics. ```{r sample.data.and.chemical.metrics.for.communities} sample.data.and.chemical.metrics.for.communities <- cbind(sample_data(ps), ps_metrics(ps, refdb = "GTDB_214")) ``` Now let's write a function to create a scatter plot for two variables and add a regression line. We use this function to make a plot for each combination of environmental variable and chemical metric. ```{r check_ggpmisc, echo = FALSE} # Don't evaluate remaining chunks if ggpmisc is not available 20230627 if(!requireNamespace("ggpmisc", quietly = TRUE)) { knitr::opts_chunk$set(eval = FALSE) warning("The **ggpmisc** package is not available, so the remaining plots are not shown.") } ``` ```{r scatter_plot, echo = FALSE, fig.width = 7, fig.height = 6, fig.align = "center", pngquant = pngquant} ``` We find that carbon oxidation state is positively correlated with redox potential, and stoichiometric hydration state is negatively correlated with organic matter content. Taken alone, each of these correlations supports our initial hypotheses. However, in part because of strong covariation of the environmental variables, `r Zc` is also negatively correlated with OM content, and `r nH2O` is positively correlated with redox potential. The covariation of environmental variables makes it difficult to identify primary factors that drive the observed differences between communities. However, the chemical nature of these variables provides additional clues. The covariation of environmental variables (higher OM content with lower redox potential) makes sense if greater availability of organic compounds drives respiration and ensuing depletion of oxygen. This interaction among environmental variables yields a mechanistic hypothesis for the positive association between `r Zc` and `r nH2O` (see first plot above), which could not be explained by our initial hypotheses about the effects of single variables. ## Qinghai-Tibet Plateau (Hot Springs) @ZFZ+23 reported 16S rRNA gene sequences for mildly alkaline hot spring reservoirs in the Qinghai-Tibet Plateau. The following general predictions can be made about these data: * A positive correlation between `r Zc` of community reference proteomes and oxidation-reduction potential (ORP). * Because of input of reducing fluids, lower `r Zc` in this dataset compared to marine sediments of the Humboldt Sulfuretum. * Assuming that these samples have relatively low salinity, higher `r nH2O` than marine sediment communities. Let's test these predictions by doing some calculations. The following commands load the data and plot two environmental variables (ORP and temperature (T)) against two chemical metrics. ```{r load_Qinghai.Tibet} psfile2 <- system.file("extdata/DADA2-GTDB_214/ZFZ+23/ps_ZFZ+23.rds", package = "chem16S") ps2 <- readRDS(psfile2) data.and.metrics <- cbind(sample_data(ps2), ps_metrics(ps2, refdb = "GTDB_214")) ps2 ``` ```{r scatter_plot_2, echo = FALSE, fig.width = 7, fig.height = 6, fig.align = "center", pngquant = pngquant} ``` The correlation between ORP and `r Zc` isn't as strong as might be predicted. Moreover, neither of the chemical metrics is strongly associated with temperature. Therefore, this dataset seems to be an exception to the notion that particular chemical metrics of community reference proteomes are shaped by the environment at a local scale. But let's not forget about the global-scale predictions! How do communities in hot springs compare to those in ocean sediments? In order to make a plot, we can merge both datasets into a new `phyloseq-class` object. The sequence processing pipeline assigned the same taxon names to both datasets (`ASV1`, `ASV2`, etc.). Therefore, let's append a letter to one set of names so that distinct taxa are not mistakenly combined. ```{r merge_datasets} taxa_names(ps2) <- paste0(taxa_names(ps2), "b") ps_merged <- merge_phyloseq(ps, ps2) ps_merged ``` Let's add a column to the sample data to indicate the type of environment for each dataset, then plot `r nH2O` against `r Zc`. ```{r plot_metrics2.merged, fig.width = 6, fig.height = 4, fig.align = "center", pngquant = pngquant} sample_data(ps_merged)$Environment <- ifelse(is.na(sample_data(ps_merged)$Depth), "Hot spring", "Marine sediment") plot_ps_metrics2(ps_merged, refdb = "GTDB_214", color = "Environment", shape = "Environment") + geom_point(size = 3) ``` In most cases, the communities from hot springs in the Qinghai-Tibet Plateau have lower `r Zc` and higher `r nH2O` compared to those in marine sediments of the Humboldt Sulfuretum. This outcome is consistent with predictions about genomic adaptation to relatively more reducing and less saline conditions of the hot springs. The positive association between `r Zc` and `r nH2O` for the sediment communities does not extend to the comparison between two datasets. This suggests different influences of environmental factors on community-level elemental composition at local and global scales. ## Comparing more datasets Let's add more data to the previous comparison, specifically freshwater and marine samples from the Baltic Sea salinity gradient [@HLA+16]. The first part of the following code chunk is adapted from the help page for `plot_metrics()`. A few lines are added to remove brackish samples and adjust the plotting symbols for the remaining samples. Then, the values for hot springs and sediments are appended to the data frame. Finally, new colors are assigned and the plot is made. *Note: this chunk does not depend on running the previous code, except for `library(chem16S)` and `library(phyloseq)`.*

```{r more_datasets, eval = FALSE} # Get data for the Baltic Sea salinity gradient from Herlemann et al., 2016 RDPfile <- system.file("extdata/RDP/HLA+16.tab.xz", package = "chem16S") RDP <- read_RDP(RDPfile) map <- map_taxa(RDP, refdb = "RefSeq_206") metrics <- get_metrics(RDP, map, refdb = "RefSeq_206") mdatfile <- system.file("extdata/metadata/HLA+16.csv", package = "chem16S") mdat <- get_metadata(mdatfile, metrics) # Take out brackish samples (6-20 PSU) ibrackish <- mdat$metadata$pch == 20 mdat$metadata <- mdat$metadata[!ibrackish, ] mdat$metrics <- mdat$metrics[!ibrackish, ] # Keep the values used for the plot mdat$metadata <- mdat$metadata[, c("name", "pch", "col")] mdat$metrics <- mdat$metrics[, c("Zc", "nH2O")] # Change symbols isalt <- mdat$metadata$pch == 21 ifresh <- mdat$metadata$pch == 24 mdat$metadata$pch[isalt] <- 24 mdat$metadata$pch[ifresh] <- 21 mdat$metadata$col[isalt] <- 3 mdat$metadata$col[ifresh] <- 4 # Append hot spring and sediment values FEN22 <- readRDS(system.file("extdata/DADA2-GTDB_214/FEN+22/ps_FEN+22.rds", package = "chem16S")) FEN22 <- prune_samples(sample_names(FEN22) != "SRR1346095", FEN22) FEN22_metrics <- ps_metrics(FEN22)[, c("Zc", "nH2O")] ZFZ23 <- readRDS(system.file("extdata/DADA2-GTDB_214/ZFZ+23/ps_ZFZ+23.rds", package = "chem16S")) ZFZ23_metrics <- ps_metrics(ZFZ23)[, c("Zc", "nH2O")] mdat$metadata <- rbind(mdat$metadata, data.frame(name = "sediment", pch = 25, col = rep(7, nrow(FEN22_metrics)))) mdat$metadata <- rbind(mdat$metadata, data.frame(name = "hot spring", pch = 22, col = rep(2, nrow(ZFZ23_metrics)))) mdat$metrics <- rbind(mdat$metrics, FEN22_metrics, ZFZ23_metrics) # Change colors mdat$metadata$col[mdat$metadata$col == 2] <- (red <- "#db2828") mdat$metadata$col[mdat$metadata$col == 3] <- (green <- "#21ba45") mdat$metadata$col[mdat$metadata$col == 4] <- (blue <- "#2185d0") mdat$metadata$col[mdat$metadata$col == 7] <- (yellow <- "#fbbd08") # Create bold axis labels Zclab <- quote(bolditalic(Z)[bold(C)]) nH2Olab <- quote(bolditalic(n)[bold(H[2]*O)]) # Make the plot par(mar = c(4, 4, 1, 1)) par(cex.lab = 1.2, mgp = c(2.8, 1, 0)) pm <- plot_metrics(mdat, title = FALSE, xlab = Zclab, ylab = nH2Olab) # Add a legend legend <- c("Hot spring", "Freshwater", "Marine water", "Marine sediment") pch <- c(22, 21, 24, 25) pt.bg <- c(red, blue, green, yellow) legend("bottomleft", legend, pch = pch, col = 1, pt.bg = pt.bg, bg = "white", bty = "n") ```

```{r more_datasets, message = FALSE, results = "hide", echo = FALSE, fig.width = 5, fig.height = 4, fig.align = "center", pngquant = pngquant} ``` Notice how more reducing samples (hot spring and sediment) have relatively low `r Zc` and more saline samples (marine water and sediment) have relatively low `r nH2O`. ## Take-home messages * Chemical metrics are defined independently of the data and can be compared across datasets. * In some cases, covariation between `r Zc` and `r nH2O` suggests interactions between environmental variables. For instance, if lower redox potential is associated with greater organic matter content in sediments, this could drive a positive association between `r Zc` and `r nH2O`. * Variation of `r nH2O` may reflect the influence of diverse factors including salinity, organic matter content, and others. An open question is: What explains the very low `r nH2O` for communities in some fecal samples? * Multiple datasets can be used to examine global-scale influences of redox potential and salinity on genomic adaptation at the community level. More comprehensive tests of the prediction of a positive correlation between `r Zc` and redox potential at local and global scales have been reported by @DM23. ## References ```{r cleanup, include = FALSE} options(oldopt) ```