---
title: "assign.ensembleTax tutorial"
author: "D Catlett"
date: "5/18/2021"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{assign.ensembleTax tutorial}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

# Introduction

This vignette provides detailed examples to demonstrate the functionality of the *assign.ensembleTax* algorithm included with the ensembleTax package. For a more general demonstration of the ensembleTax package functionality/workflow, see: https://github.com/dcat4/ensembleTax

## The assign.ensembleTax algorithm

The purpose of *assign.ensembleTax* is to synthesize taxonomic assignments made by any number of unique taxonomic assignment methods to determine an "ensemble" assignment for each ASV in a data set. *assign.ensembleTax* requires that each method's taxonomic assignments follow the same taxonomic nomenclature (naming and ranking conventions). If yours don't, use the *taxmapper* algorithm included with the ensembleTax package (see the separate vignette included in this package).

By default, *assign.ensembleTax* determines the ensemble taxonomic assignment for each ASV by finding the highest-frequency taxonomic assignment across the input taxonomic assignments (presumably determined with different methods). *assign.ensembleTax* includes several user-tuneable parameters that balance obtaining assignments for a larger number of ASVs at lower taxonomic ranks (at the expense of a likely increase in false-positive annotations) against obtaining more robust assignments, supported by multiple methods, for fewer ASVs. Here we'll step through some examples to demonstrate how the algorithm works and how its behavior can be modified by users.

### Examples

To demonstrate the functionality of *assign.ensembleTax*, we'll first create an artificial set of ASVs and corresponding taxonomic assignments obtained with 2 different artificial "methods".
Note these "methods" follow the same naming and ranking conventions. So first, load the ensembleTax package and create the artificial data:

```{r}
library("ensembleTax")
packageVersion("ensembleTax")

# create a fake taxonomy table of ASVs and taxonomic assignments
taxtab1 <- data.frame(ASV = c("sv1", "sv2", "sv3", "sv4"),
                      kingdom = c("Eukaryota", "Eukaryota", "Eukaryota", "Eukaryota"),
                      supergroup = c("Stramenopile", "Stramenopile", "Alveolata", "Rhizaria"),
                      division = c("Ochrophyta", NA, "Dinoflagellata", NA),
                      class = c("Bacillariophyta", NA, NA, NA),
                      genus = c("Pseudo-nitzschia", NA, NA, NA))
taxtab2 <- data.frame(ASV = c("sv1", "sv2", "sv3", "sv4"),
                      kingdom = c("Eukaryota", "Eukaryota", "Eukaryota", "Eukaryota"),
                      supergroup = c("Stramenopile", "Alveolata", "Alveolata", "Stramenopile"),
                      division = c("Ochrophyta", "Dinoflagellata", "Dinoflagellata", NA),
                      class = c("Bacillariophyta", NA, "Syndiniales", NA),
                      genus = c("Pseudo-nitzschia", NA, NA, NA))

# look at your artificial data:
taxtab1
taxtab2
```

Across our 4 ASVs, some have identical assignments in the two tables (sv1), while others differ in the number of ranks with assigned names (sv3), in the names that were assigned (sv4), or both (sv2). In what follows we'll see how one might obtain different ensemble assignments from these data for various scientific questions and/or based on assumptions about the underlying methods employed to obtain each collection of taxonomic assignments.

Now would be a good time to review the *assign.ensembleTax* documentation to get a sense of the parameters available. Here we'll try to demonstrate what these different parameters are doing.

#### Example 1: Simply obtain the highest frequency assignments

In this example we'll run *assign.ensembleTax* with our two taxonomy tables.
*assign.ensembleTax* expects a named list of data frames with each element corresponding to a uniquely-named taxonomy table, so we'll create that first and then compute ensemble taxonomic assignments with the default parameters.

```{r}
xx <- list(taxtab1, taxtab2)
names(xx) <- c("tab1", "tab2")
eTax.def <- assign.ensembleTax(xx,
                               tablenames = names(xx),
                               ranknames = colnames(taxtab1)[2:ncol(taxtab1)],
                               weights = rep(1, length(xx)),
                               tiebreakz = NULL,
                               count.na = TRUE,
                               assign.threshold = 0)

# show the initials and ensemble for ease-of-interpretation:
taxtab1
taxtab2
eTax.def
```

Let's break this down. We saw that for sv1, the assignments were in perfect agreement across the two tables, and this assignment has been retained in the ensemble. sv2 and sv4 disagreed at the supergroup rank, and so have been left unassigned at the supergroup rank in the ensemble (there is no "highest-frequency" assignment where only two tables are provided and they disagree). For sv3, the assignments were identical down to the division rank, but one table was unassigned (assigned NA) at the class rank while the other was assigned to *Syndiniales*. When *count.na = TRUE*, NA values are counted as assignments, so again there was no highest-frequency assignment at the class rank. The ensemble is therefore assigned only down to division, where "Dinoflagellata" was assigned in both input tables. More on *count.na* in Example 2.
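
The voting logic described above can be sketched for a single ASV at a single rank in a few lines of base R. This is a simplified illustration under stated assumptions, not the ensembleTax implementation: `vote.one.rank` is a hypothetical helper name, all inputs are weighted equally, and any tie at the highest frequency is left unassigned (as with *tiebreakz = NULL*).

```{r}
# Simplified illustration only; NOT the ensembleTax implementation.
# vote.one.rank is a hypothetical helper name.
vote.one.rank <- function(assignments, count.na = TRUE) {
  # optionally drop NA's so they are not counted as assignments
  if (!count.na) {
    assignments <- assignments[!is.na(assignments)]
  }
  if (length(assignments) == 0) {
    return(NA_character_)
  }
  # tabulate assignments, keeping NA as its own category when present
  counts <- table(assignments, useNA = "ifany")
  winners <- names(counts)[as.vector(counts) == max(counts)]
  # more than one assignment at the highest frequency: leave unassigned
  if (length(winners) > 1) {
    return(NA_character_)
  }
  winners
}

# sv2 at the supergroup rank: the two tables disagree, so no winner
vote.one.rank(c("Stramenopile", "Alveolata"))   # NA
# sv3 at the class rank: NA counts as an assignment and ties with Syndiniales
vote.one.rank(c(NA, "Syndiniales"))             # NA
# sv1 at the division rank: both tables agree
vote.one.rank(c("Ochrophyta", "Ochrophyta"))    # "Ochrophyta"
```
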
Next, we'll add in a third taxonomy table that is identical to taxtab1 and compute ensembles with all 3:

```{r}
# create a 3rd fake taxonomy table of ASVs and taxonomic assignments
taxtab3 <- taxtab1
xx.with3 <- list(taxtab1, taxtab2, taxtab3)
names(xx.with3) <- c("tab1", "tab2", "tab3")
eTax.def <- assign.ensembleTax(xx.with3,
                               tablenames = names(xx.with3),
                               ranknames = colnames(taxtab1)[2:ncol(taxtab1)],
                               weights = rep(1, length(xx.with3)),
                               tiebreakz = NULL,
                               count.na = TRUE,
                               assign.threshold = 0)

# show the initials and ensemble for ease-of-interpretation:
taxtab1 # (remember taxtab3 is identical to this, so count 2x)
taxtab2
eTax.def
```

With 3 taxonomy tables, we see sv2 and sv4 are now assigned at the supergroup rank in the ensemble. For each of these, taxtab1 and taxtab3 agreed in their assignments, meaning these assignments were found at a higher frequency than the conflicting supergroup assignments in taxtab2.

#### Example 2: The "count.na" argument

Here we'll again compute two ensembles, one from the 2-table list and one from the 3-table list, but we'll set *count.na = FALSE*. This setting is meant for users who want to increase the number of annotated ASVs, but it likely comes at the expense of an increase in false positive annotations.

```{r}
eTax.nona2 <- assign.ensembleTax(xx,
                                 tablenames = names(xx),
                                 ranknames = colnames(taxtab1)[2:ncol(taxtab1)],
                                 weights = rep(1, length(xx)),
                                 tiebreakz = NULL,
                                 count.na = FALSE,
                                 assign.threshold = 0)

# show the initials and ensemble for ease-of-interpretation:
taxtab1 # (remember taxtab3 is identical to this, so count 2x)
taxtab2
eTax.nona2

eTax.nona3 <- assign.ensembleTax(xx.with3,
                                 tablenames = names(xx.with3),
                                 ranknames = colnames(taxtab1)[2:ncol(taxtab1)],
                                 weights = rep(1, length(xx.with3)),
                                 tiebreakz = NULL,
                                 count.na = FALSE,
                                 assign.threshold = 0)

# ensemble with 3 tables:
eTax.nona3
```

Using the 2 taxonomy tables, we see the ensemble assignments for sv1, sv2, and sv4 are identical to the first example.
Where there are conflicting assignments that are not NA, the *count.na* argument has no impact on ensemble determinations, as you can see in the first ensemble we computed. However, for sv3, we see that we now have an ensemble assignment at the class rank ("Syndiniales") because the NA assignment wasn't counted (in other words, there was 1 Syndiniales and 0 other assignments that were not NA, so the highest frequency assignment was Syndiniales).

You might be somewhat surprised to see that sv3's ensemble class assignment is still Syndiniales when we considered all 3 taxonomy tables. In this case NA was the highest frequency assignment, but by setting *count.na = FALSE* we ignored the NA assignments, and so the highest frequency assignment in the absence of NA's was assigned as the ensemble.

#### Example 3: Breaking ties with the "tiebreakz" argument

In this example we'll count NA's again, but we'll specify that we'd like to prioritize particular taxonomy tables in the event that multiple disagreeing assignments are found at the highest frequency. We'll again use the same 2-table and 3-table lists as above.

```{r}
eTax.tb2 <- assign.ensembleTax(xx,
                               tablenames = names(xx),
                               ranknames = colnames(taxtab1)[2:ncol(taxtab1)],
                               weights = rep(1, length(xx)),
                               tiebreakz = c("tab2"),
                               count.na = TRUE,
                               assign.threshold = 0)

# show the initials and ensemble for ease-of-interpretation:
taxtab1 # (remember taxtab3 is identical to this, so count 2x)
taxtab2
eTax.tb2

eTax.tb3 <- assign.ensembleTax(xx.with3,
                               tablenames = names(xx.with3),
                               ranknames = colnames(taxtab1)[2:ncol(taxtab1)],
                               weights = rep(1, length(xx.with3)),
                               tiebreakz = c("tab2"),
                               count.na = TRUE,
                               assign.threshold = 0)

# ensemble with 3 tables:
eTax.tb3
```

In both ensembles we prioritized assignments in taxtab2 in the event of ties. When only 2 taxonomy tables were used to compute the ensemble, the ensemble assignments are identical to taxtab2 (the table we chose to break ties with). Makes sense.
The ensemble computed with 3 tables was not impacted by tie-breaking because taxtab1 and taxtab3 are identical, so the assignments they share will always be found at the highest frequency and there will be no ties.

One more example: computing an ensemble with taxtab1 and taxtab2, but prioritizing taxtab1 to break ties and NOT counting NA's:

```{r}
eTax.tb2 <- assign.ensembleTax(xx,
                               tablenames = names(xx),
                               ranknames = colnames(taxtab1)[2:ncol(taxtab1)],
                               weights = rep(1, length(xx)),
                               tiebreakz = c("tab1"),
                               count.na = FALSE,
                               assign.threshold = 0)

# show the initials and ensemble for ease-of-interpretation:
taxtab1
taxtab2
eTax.tb2
```

If you'd like to rely primarily on assignments from one assignment method, but use a second method to fill in annotations where your favorite method is not assigned (assigned NA), you can specify a tiebreaker and ignore NA's in your ensemble assignment determinations.

#### Example 4: Prioritizing methods with the "weights" argument

Another way you can prioritize assignments from one (or more) particular assignment method is by using the *weights* argument. Weights are specified as integers in the order corresponding to the order of taxonomy tables in the list you supply to *assign.ensembleTax*. In ensembles determined from only two taxonomy tables, weighting one table more highly than the other results in the same behavior as tie-breaking.
Below we've weighted assignments in taxtab1 double those found in taxtab2:

```{r}
# counting NA's:
eTax.wt2 <- assign.ensembleTax(xx,
                               tablenames = names(xx),
                               ranknames = colnames(taxtab1)[2:ncol(taxtab1)],
                               weights = c(2, 1),
                               tiebreakz = NULL,
                               count.na = TRUE,
                               assign.threshold = 0)

# show the initials and ensemble for ease-of-interpretation:
taxtab1
taxtab2
eTax.wt2

# NOT counting NA's:
eTax.wt2 <- assign.ensembleTax(xx,
                               tablenames = names(xx),
                               ranknames = colnames(taxtab1)[2:ncol(taxtab1)],
                               weights = c(2, 1),
                               tiebreakz = NULL,
                               count.na = FALSE,
                               assign.threshold = 0)

# show the ensemble:
eTax.wt2
```

Just as expected. The ensemble assignments are identical to taxtab1 when we count NA's, but when we ignore NA's the Syndiniales assignment is filled in by taxtab2.

What happens when we compute a 3-table ensemble but weight the table that disagrees with the other two (taxtab2) 2x? Let's see:

```{r}
eTax.wt3 <- assign.ensembleTax(xx.with3,
                               tablenames = names(xx.with3),
                               ranknames = colnames(taxtab1)[2:ncol(taxtab1)],
                               weights = c(1, 2, 1),
                               tiebreakz = NULL,
                               count.na = TRUE,
                               assign.threshold = 0)

taxtab1 # (remember taxtab3 is identical to this, so count 2x)
taxtab2
eTax.wt3
```

We see that wherever taxonomic assignments were in disagreement, they are not assigned in the ensemble. This is because taxtab1 and taxtab3 are identical to one another (a combined weight of 2), and we've weighted taxtab2 double (also a weight of 2). So where assignments disagree, there are multiple assignments with the highest frequency. That means we need to specify a tiebreaker if we want to avoid the above scenario.
Let's try:

```{r}
eTax.wttb3 <- assign.ensembleTax(xx.with3,
                                 tablenames = names(xx.with3),
                                 ranknames = colnames(taxtab1)[2:ncol(taxtab1)],
                                 weights = c(1, 2, 1),
                                 tiebreakz = c("tab1"),
                                 count.na = TRUE,
                                 assign.threshold = 0)

taxtab1 # (remember taxtab3 is identical to this, so count 2x)
taxtab2
eTax.wttb3
```

And now our ensemble determinations match taxtab1 (and taxtab3) where there were disagreements with taxtab2. We can prioritize taxtab2 and see the opposite:

```{r}
eTax.wttb3 <- assign.ensembleTax(xx.with3,
                                 tablenames = names(xx.with3),
                                 ranknames = colnames(taxtab1)[2:ncol(taxtab1)],
                                 weights = c(1, 2, 1),
                                 tiebreakz = c("tab2"),
                                 count.na = TRUE,
                                 assign.threshold = 0)

taxtab1 # (remember taxtab3 is identical to this, so count 2x)
taxtab2
eTax.wttb3
```

Just as we expected.

#### Example 5: the "assign.threshold" argument

We have one last argument to address: *assign.threshold*. This argument interacts with some of the other arguments we've looked at above in ways that may be counter-intuitive, so make sure you understand what each of those arguments is doing before spending too much time with *assign.threshold*. Let's take a look at some different thresholds alongside our default parameters first.
We'll start by seeing how changing *assign.threshold* can impact tie-breaking and weighting:

```{r}
# tie-breaking to prioritize table 1, but with assign.threshold = 60%
eTax.at <- assign.ensembleTax(xx,
                              tablenames = names(xx),
                              ranknames = colnames(taxtab1)[2:ncol(taxtab1)],
                              weights = rep(1, length(xx)),
                              tiebreakz = c("tab1"),
                              count.na = TRUE,
                              assign.threshold = 0.6)

# show the initials and ensemble for ease-of-interpretation:
taxtab1
taxtab2
eTax.at

# take away the tiebreaker and weight table 1 2x:
eTax.at <- assign.ensembleTax(xx,
                              tablenames = names(xx),
                              ranknames = colnames(taxtab1)[2:ncol(taxtab1)],
                              weights = c(2, 1),
                              tiebreakz = NULL,
                              count.na = TRUE,
                              assign.threshold = 0.6)
eTax.at
```

Here we see that the *assign.threshold* argument "over-rules" tie-breaking if the threshold is not satisfied. Although we specified taxtab1 as the tie-breaker, by designating *assign.threshold = 0.6* we've required ensemble assignments to be found in at least 60% of the (weighted) assignments. Where the two tables disagreed, each assignment accounted for only 50%, so those disagreements were left unassigned.

If we take a look at the 2nd ensemble we computed (where tie-breaking was omitted and we instead weighted taxtab1 2x), we see that our ensemble mirrors taxtab1. This is because the *assign.threshold* argument operates on weighted assignment frequencies. In this case, where taxtab1 and taxtab2 disagreed, the assignments in taxtab1 comprised 2/3 (about 67%) of the weighted assignments, which was larger than our *assign.threshold*.
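
The arithmetic behind those two ensembles can be checked by hand. Below is a minimal base-R sketch of weighted assignment frequencies for a single ASV at a single rank. `weighted.shares` is a hypothetical helper name, not part of ensembleTax, and the sketch makes no claim about whether the package compares against the threshold with `>=` or `>`. It only shows the shares that the threshold is compared to.

```{r}
# Simplified illustration only; NOT the ensembleTax implementation.
# weighted.shares is a hypothetical helper name.
weighted.shares <- function(assignments, weights) {
  stopifnot(length(assignments) == length(weights))
  # represent NA as its own label so its weight is summed like any name
  labels <- ifelse(is.na(assignments), "<NA>", assignments)
  # total weight per distinct assignment, as a named numeric vector
  totals <- vapply(split(weights, labels), sum, numeric(1))
  totals / sum(weights)
}

# sv2 at the supergroup rank: tab1 says Stramenopile, tab2 says Alveolata
# equal weights: each assignment holds 0.5, below a 0.6 threshold
weighted.shares(c("Stramenopile", "Alveolata"), weights = c(1, 1))
# tab1 weighted 2x: Stramenopile holds 2/3, above a 0.6 threshold
weighted.shares(c("Stramenopile", "Alveolata"), weights = c(2, 1))
```
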
Let's try a couple of thresholds applied to 3-table ensemble determinations:

```{r}
# a low threshold:
eTax.at3 <- assign.ensembleTax(xx.with3,
                               tablenames = names(xx.with3),
                               ranknames = colnames(taxtab1)[2:ncol(taxtab1)],
                               weights = rep(1, length(xx.with3)),
                               tiebreakz = NULL,
                               count.na = TRUE,
                               assign.threshold = 0.5)
eTax.at3

# a high threshold (need all 3 to agree here for ensemble assignment):
eTax.at3 <- assign.ensembleTax(xx.with3,
                               tablenames = names(xx.with3),
                               ranknames = colnames(taxtab1)[2:ncol(taxtab1)],
                               weights = rep(1, length(xx.with3)),
                               tiebreakz = NULL,
                               count.na = TRUE,
                               assign.threshold = 0.9)
eTax.at3
```

In our first example, the threshold we applied (0.5) has no impact on the ensemble assignments because taxtab1 and taxtab3 are identical, and therefore their assignments will always comprise more than 50% of the assignments for any ASV (at least 2 of 3). In our second example, the threshold of 0.9 (and in this case, any threshold above 2/3) means that the ensemble is only assigned where all three input taxonomy tables are in agreement.

Finally, we'll take a look at how *assign.threshold* behaves when *count.na = FALSE*:

```{r}
# a low threshold with count.na = FALSE:
eTax.at3 <- assign.ensembleTax(xx.with3,
                               tablenames = names(xx.with3),
                               ranknames = colnames(taxtab1)[2:ncol(taxtab1)],
                               weights = rep(1, length(xx.with3)),
                               tiebreakz = NULL,
                               count.na = FALSE,
                               assign.threshold = 0.5)
eTax.at3

# a high threshold with count.na = FALSE (need all 3 to agree here for ensemble assignment):
eTax.at3 <- assign.ensembleTax(xx.with3,
                               tablenames = names(xx.with3),
                               ranknames = colnames(taxtab1)[2:ncol(taxtab1)],
                               weights = rep(1, length(xx.with3)),
                               tiebreakz = NULL,
                               count.na = FALSE,
                               assign.threshold = 0.9)
eTax.at3
```

We again see that when the threshold is low (0.5), it has little impact on our ensemble assignments. However, implementing a high threshold (0.9) in conjunction with *count.na = FALSE* notably alters the ensemble assignments.
For sv2 and sv4, where taxonomic names (and not NA's) were assigned and disagreed across the 3 tables, the ensemble remains unassigned because no single assignment was found at a frequency greater than 90%. However, for sv3, where 2 of the 3 input class assignments were NA and 1 was Syndiniales, the ensemble is assigned to the class Syndiniales. This is because *assign.threshold* does not consider NA assignments when *count.na = FALSE*. In other words, because NA's were ignored, in this example the Syndiniales assignment comprised 100% of the input assignments and thus surpassed the *assign.threshold*.

### Parameter summary and recommendations for different scientific objectives

As you can see in the above, the *assign.ensembleTax* algorithm allows for flexible computations of ensemble taxonomic assignments to suit different scientific questions and applications. There are trade-offs in adjusting each of the above parameters, and here we'll discuss what those are and how you might implement *assign.ensembleTax* for your own objectives.

The big trade-off one should consider when implementing *assign.ensembleTax* is how to balance annotating more ASVs at lower ranks vs. only annotating ASVs where annotations are supported by multiple methods (and thus are likely to be quite robust). The former likely comes at the expense of increased false positive annotations (ASVs assigned to a lineage where there really is not enough information to make this determination), while the latter likely comes at the expense of increased false negative annotations (assigning NA where there IS enough information to assign the ASV to a lineage). Adjusting the parameters in *assign.ensembleTax* allows you to decide where you fall on this spectrum.

First, we'll consider parameters that promote robust annotations wherever ASVs are assigned to a taxonomic group, at the expense of assigning taxonomy to a smaller number of ASVs/ranks.
Arguments that would favor this strategy are:

(1) *count.na = TRUE*

(2) *assign.threshold = [a high value, say > 0.5]*

(3) *tiebreakz = NULL*

One could use all of the above settings, or select a few. Broadly, all of these parameters require a taxonomic assignment for a particular ASV to be found in multiple input taxonomy tables in order for it to be assigned to the ensemble. Implementing all of these simultaneously will result in extremely conservative, but extremely well-supported, taxonomic assignments. These settings are probably most appropriate for studies that require precise identification of particular ASVs. While lower-rank assignments should in general be interpreted with caution, those supported by multiple methods are likely less error-prone than those determined with a single method.

Let's say you think (or know) that assignments made by one particular method are more accurate than the other methods you've considered. Perhaps the reference database uses a more robust annotation procedure, or benchmarking exercises show a particular classifier has very low error rates. You can prioritize assignments made by this method with the *tiebreakz* or *weights* argument. If you wanted to use secondary methods to "fill in" assignments where your preferred method determined that taxonomy could not be assigned to an ASV, you could additionally set *count.na = FALSE*. These changes bring you closer to the other end of the spectrum.

At the other end of the spectrum, we'll consider parameters that promote obtaining annotations at lower ranks for a greater number of ASVs at the expense of potentially increasing the number of spurious classifications (either assigning ASVs to an incorrect lineage, or assigning an ASV to a lineage when there is not sufficient phylogenetic resolution to do so).
Arguments that would favor this strategy are:

(1) *count.na = FALSE*

(2) *assign.threshold = [a low value, say 0]*

(3) *tiebreakz = [specify the names of ALL input tables in order of priority]*

Broadly, all of these parameters favor an increase in the number of ASVs assigned at lower ranks, but these assignments may only be supported by a single method and so may be less robust. These settings are probably most appropriate for studies focused on very broad taxonomic groupings; lower-rank assignments should in general be interpreted with caution, and this is particularly true when ensemble assignments are computed with these parameters.

That brings us to the end of this tutorial. We include some code for comparing ensemble taxonomic assignments in our ensembleTax package overview vignette, and encourage you to perform such comparisons to test out different settings and optimize ensemble assignments for your particular scientific objectives.
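
As one quick way to see where a given set of parameters lands on this spectrum, you could count how many ASVs are assigned at each rank in an ensemble. The base-R sketch below is an illustration only: `count.assigned` is a hypothetical helper name, not part of ensembleTax, `demo.tab` is made-up data, and the sketch assumes the taxonomy table is a data frame with one column per rank and NA where no name is assigned (as in the tables used throughout this vignette).

```{r}
# count.assigned is a hypothetical helper name, not part of ensembleTax.
count.assigned <- function(taxtab, ranknames) {
  # number of ASVs with a non-NA name at each rank
  colSums(!is.na(taxtab[, ranknames, drop = FALSE]))
}

# a small made-up table for demonstration
demo.tab <- data.frame(ASV = c("sv1", "sv2", "sv3"),
                       kingdom = c("Eukaryota", "Eukaryota", "Eukaryota"),
                       supergroup = c("Stramenopile", NA, "Alveolata"),
                       division = c("Ochrophyta", NA, NA))

# kingdom: 3, supergroup: 2, division: 1
count.assigned(demo.tab, c("kingdom", "supergroup", "division"))
```

Applying the same count to ensembles computed with different settings gives a rough sense of how much each setting changes the number of annotated ASVs per rank. It says nothing about whether those annotations are correct.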