--- title: "MonoPhy Tutorial" author: "[Orlando Schwery](https://oschwery.github.io/)" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{MonoPhy Tutorial} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ## 1 Introduction ## This is a tutorial vignette to provide a quick and easy introduction to the use of MonoPhy. It will make use of the inbuilt example files and lead through the most important functions of the package. In order to be aware of discrepancies between a newly generated phylogeny and the current taxonomy, it is necessary to assess monophyly of the taxa in the tree. MonoPhy allows doing this in a quick and simple way, both using the phylogeny only and including a file assigning tips to taxonomic groups. Notably, such a file can contain any type of information for which to test monophyly in a tree ( _e.g._, traits or geographic occurrence), as long as attention is paid to that the assessment makes sense biologically. ## 2 Installation ## ### 2.1 CRAN ### In order to install the stable CRAN version of the package: ```{r, eval=FALSE} install.packages("MonoPhy") ``` ### 2.2 Development Version from GitHub ### If - for any reason - you would want to install the development version from GitHub, I suggest doing so by installing it temporarily with the help of the package `devtools`: ```{r, eval=FALSE} # 1. Install package 'devtools' (if not already installed): install.packages("devtools") # 2. Load 'devtools' and temporarily install and load 'MonoPhy' in developers mode: library(devtools) dev_mode(on=T) install_github("oschwery/MonoPhy") # install the package from GitHub library(MonoPhy) # load the package #3. Leave developers mode after done trying out 'MonoPhy': dev_mode(on=F) # the package will not stay on your system permanently like that, which make sense in case of the # development version, as opposed to the stable version ``` ## 3 Use ## When we load the package, all its functions and the other required packages are imported. ``` {r, results="hide", warning=FALSE, message=FALSE} library(MonoPhy) ``` ### 3.1 Load Data ### #### 3.1.1 Example Files #### The package comes with example files for you to try out its functions, which will be used throughout this tutorial. The phylogeny is of the plant family Ericaceae, as published in Schwery _et al._ (2015) pruned down to 77 taxa. Two taxon files assign tribes or both tribes and subfamilies to the tips. ```{r, results="hide"} #load data data(Ericactree) data(Ericactribes) data(Ericacsubfams) ``` It is generally a good idea (especially with own data) to check whether the format of the data is correct: ```{r} #check data Ericactree head(Ericactribes) head(Ericacsubfams) ``` #### 3.1.2 Own Files #### If you want to analyze your own data (which should ultimately be the goal), you can load your phylogeny like this: ```{r, eval=FALSE} phy <- read.tree(file="/your_path/your_phylogeny.tre") ``` At this point, the package requires the input tree to be rooted, though it may be multifurcating; also, it can only handle single input trees so far. This may change in the future. Polytomies will be dealt with in a conservative way: if different taxa occur in a polytomy, it will be assumed that they are non-monophyletic. The taxonomy file should be a data frame with or without header and one row per tip in the phylogeny. The first column should be the tip names, any further column should assign the respective tips to taxonomic groups, in the format of one column per group (testing monophyly on several taxonomic levels is possible). If the file has a header, the names therein will be used to name the taxlevels within the function. If it is unknown or not applicable what group a tip belongs to on a certain taxonomic level, it should be encoded as 'NA'. You can load the taxonomy file like this: ```{r, eval=FALSE} your_clades <- read.csv(file="/your_path/your_taxonomy_file.csv", header=FALSE) ``` To check again whether the format of the data is correct: ```{r, eval=FALSE} #check data phy head(your_clades) ``` ### 3.2 Run Main Analysis ### The command for the main analysis-step is `AssessMonophyly`. This will check the monophyly of all taxa in the tree, identify which intruders and outliers are responsible for the non-monophyly of a taxon and determine their status for latter plotting. Depending on the number of tips in the tree and the number of monophyly-issues, this step will take more or less time. For the example files, this step should only take seconds, for a few thousand tips a couple of minutes, but for a conflict-rich tree with several 10k tips it could easily end up being a few hours. Accessing the information afterwards, however, is fairly quick. #### 3.2.1 Taxonomy Based on Tip Labels #### If we run the analysis using only the tree, the function will check whether the tip labels are in the format *Genus_species*, and if so, will extract the genus name from each tip label and use it as the taxonomic group associated to this tip. ```{r} solution0 <- AssessMonophyly(Ericactree) ``` The resulting output object will be a list, which will contain all the output information for all taxon levels in a nested fashion. We will explore in the following how to access this information effectively. #### 3.2.2 Taxonomy Based on File #### If the tip labels have a format different from *Genus_species*, or if we want to assess a different taxonomic level, we can do that by adding a taxonomy file to the command. The taxonomy file should be a simple table (data frame, maybe easiest based off a .csv file) with or without header, at least two columns and one row per tip. The first column contains the tip labels and all further columns (one or more) the names of the taxa associated to the respective tip, whereas every column stands for a different taxon level (the file has a header, its entries will be used to name the levels). The order of tip names in the file can be different from the order of the tip labels in the tree, but they have to contain the exact same names. If taxonomic levels are unknown for certain tips, they can be coded as NAs (but they cannot stay empty). Those tips will be considered when assessing monophyly of other groups, but the monophyly of the NA group will not be assessed. ```{r} solution1 <- AssessMonophyly(Ericactree, Ericacsubfams) ``` #### 3.2.3 Taxonomy Downloaded from Web #### If a taxonomy file cannot be generated easily, specifying taxonomy as `taxize` allows to use the package with the same name to download taxonomic names of groups indicated under `taxizelevel` from the online databases `'ncbi'`, `'itis'` or `'both'`, as set under `taxizedb` and `taxizepref`. The downloaded records will then be used like a taxonomy file. **Note:** While the function is implemented in a way to remove duplicates from both databases and only retain the ones that yielded a record, specifying `'both'` can lead to errors by the `taxize` package. Also be aware that - compared to the rest of the function - downloading taxonomic information can take a fair amount of time for medium sized trees already. #### 3.2.4 Other Options #### There are a few remaining specifications for this function. `verbosity` allows to customize the maximal number of intruders/outliers to be displayed in the results table. `outliercheck`, if set to `TRUE`, will let the function consider taxon members which are 'too far away' in the tree from the rest of their taxon as outliers, instead of all tips 'in between them' as intruders, thereby hopefully making the result biologically more meaningful. It will start to look for outliers if the ratio of taxon members in a clade is below the value specified as `outlierlevel` and will try and find a 'core clade' with a ratio higher than that. The arguments `taxask` and `taxverbose` allow to adjust the settings of the called `taxize` function `tax_name`: `taxverbose` allows to set whether the function should display which taxon is currently queried and whether an entry is found; `taxask` allows to set whether the function prompts the user to decide among several potential matches in the database, or to just pick the most likely one instead. The default values for this command are as follows: ```{r, eval=FALSE} AssessMonophyly <- function(tree, taxonomy=NULL, verbosity=5, outliercheck=TRUE, outlierlevel=0.5, taxizelevel= NULL, taxizedb='ncbi', taxizepref='ncbi', taxask=FALSE, taxverbose=FALSE) ``` ### 3.3 Accessing the Results ### #### 3.3.1 First Impression - Summary and Results Tables #### To get a first look on the results, we can pull the summary table out of the output object: ```{r} GetSummaryMonophyly(solution0) # pull out summary table ``` We can now see in the first column how many taxa are either monophyletic, non-monophyletic or monotypic (consisting of only one representative), as well as how many have outliers or are intruders to another taxon. In the second column, we can see how many of the tips fall in those respective categories. If we used a taxonomy file and had several taxonomic levels, the summary and result functions will by default return all of them. If we only want to look at a specific one, we can specify the number of the one we want under `taxlevels`. ```{r} GetResultMonophyly(solution1, taxlevels=2) # pull out summary only for taxlevel 2, subfamilies in this case ``` The result table shows us the analysis outcomes for each taxonomic unit considered. The row 'Monophyly' states for each taxon whether it is monophyletic, 'MRCA' gives the node number of the most recent common ancestor of all its members, '#Tips' gives the number of tips assigned to this taxon, and 'Delta-Tips' gives the number of tips that do not belong to that taxon, but which are also descendants of its MRCA. The columns '#Intruders' and '#Outliers' tell us how many other taxa or tips are intruders or outliers for this taxon respectively. 'Intruders' and 'Outliers' give us their names. #### 3.3.2 Further Look - Customized Access to Detailed Results #### In order to really dig into the analysis outputs, we can access information on which taxa and tips cause the monophyly issues as either intruders or outliers and which nodes have been inferred as MRCA of each taxon. When exploring intruders, we can either list them for all taxa or limit the list to taxa we're interested in, _e.g._ only for the genus _Erica_: ```{r} GetIntruderTaxa(solution0, taxa="Erica") # get list of genera that cause monophyly-issues for Erica ``` We learn that the genus _Bruckenthalia_ invades _Erica_, which turns out not to be problematic, as _Bruckenthalia_ is now classified to belong to _Erica_ as well. Similarly, if we have multiple taxonomic levels, we can limit the output to the one of interest: ```{r} GetIntruderTips(solution1, taxa="Ericoideae", taxlevels=2) # get list of species causing monophyly-issues for the subfamily Ericoideae ``` This works similarly for MRCA-nodes, outlier tips and outlier taxa (note that the later cannot be constrained to focal taxa, because a tip can only be an outlier to its own taxon). Instead of numbers or 'ALL', we could also select taxlevels using their names. ```{r} GetAncNodes(solution1, taxa = NULL, taxlevels='ALL') ``` **Note:** The `NA` for Oxydendreae reflects that it is monotypic and thus doesn't have an MRCA. ```{r} GetOutlierTaxa(solution0, taxlevels='ALL') GetOutlierTips(solution1, taxa = NULL, taxlevels='ALL') ``` **Note:** `list()` of course means that there aren't any outliers in Taxlevel_2. ### 3.4 Plotting ### Needless to say, the most intuitive way to get an impression of the issues in the monophyly within a phylogeny is to visually inspect it. Thus, there are a number of ways to look at our results as well. #### 3.4.1 Monophyly Plot #### The monophyly plot is the standard plot for this package. It will colour tips (and clades) according to their monophyly status, allowing to quickly spot areas of the tree where problems occur. The __standard colours__ are: - green: monophyletic - light purple: non-monophyletic - dark purple: intruder/outlier A number of other __ColorBrewer palettes__ are available (as listed in helpfile) and using __customized colours__ is possible as well. By default, the tree will be __ladderized__. The simplest way of plotting monophyly is like this: ```{r, fig.width=7, fig.height=14} PlotMonophyly(solution0, Ericactree, plot.type='monophyly', ladderize=TRUE, cex=0.5) ``` This shows us the monophyly situation on the genus level. It is easy to see how _Orthilia_ invades _Pyrola_, _Kalmiopsis_ invades _Phyllodoce_, _Bruckenthalia_ invades _Erica_ and how _Dimorphanthera_ and _Paphia_ reciprocally invade each other. We further see how _Vaccinium_ is invaded by _Agapetes_ and _Gaylussacia_, while having two outlier species of its own as well. Note how a more conservative ( _i.e._ higher) `outlierlevel` would tell us that only _Gaylussacia_ invades _Vaccinium_, which would then have _Vaccinium bracteatum_ as additional outlier instead. #### 3.4.2 Taxonomy Plot #### This rather auxiliary plot type allows viewing clades coloured by the taxon they belong to. Here we want to look at a different __taxonomic level__, the subfamilies, hence we can specify in the command to display `taxlevels=2` (as subfamilies was the second taxonomy column in our input file) and pick the output object `solution1`, which contains the results of the analysis incorporating the taxonomy file. The default colours are `rainbow`, which means that every taxon will get a different colour from the rainbow spectrum. ```{r, fig.width=7, fig.height=14} PlotMonophyly(solution1, Ericactree, taxlevels=2, plot.type='taxonomy', cex=0.5) ``` This plot shows us nicely how the subfamilies are mostly monophyletic, apart from one rogue in red, which is _Rhodothamnus chamaecystus_, which notably was mislabeled as a member of the Vaccinioideae (whereas in reality - belonging to Ericoideae - it is in the right place). #### 3.4.3 Monophyly-Taxonomy Mirror Plot #### This plot type simply combines the two previous ones and displays them on two mirrored trees. We use the subfamilies again in this case. ```{r, fig.width=7, fig.height=14} PlotMonophyly(solution1, Ericactree, taxlevels=2, plot.type='monoVStax', cex=0.4, label.offset=18) ``` Seeing the previous plot mirrored with the corresponding monophyly plot lets us understand how the whole Ericoideae and Cassiopoideae come out as intruders to Vaccinioideae: The intruder _Rhodothamnus chamaecystus_ pulls the MRCA node of the Vaccinioideae down to include both other subfamilies. Once again, a more conservative `outlierlevel` would have led to the biologically more meaningful result that only _Rhodothamnus_ is an outlier to Vaccinioideae (and an intruder to Ericoideae), while everything else is well. #### 3.4.4 Intruder Plot #### For a more condensed view on the monophyly issues in your tree and who is causing them, you might want to choose the intruder plot, which will especially highlight intruders and outliers and which taxon they belong to. In order to focus the plot on the issues even more, we set `monocoll=TRUE`, to __collapse monophyletic taxa__ to be represented by one tip each. This time, we display tribes. The default colouring is: - gray: monophyletic - black: non-monophyletic - rainbow: Intruders/Outliers ```{r, fig.width=7, fig.height=7} PlotMonophyly(solution1, Ericactree, taxlevels=1, plot.type='intruders', monocoll=TRUE, cex=0.5, label.offset=5) ``` Collapsing all monophyletic tribes slimmed this plot down and increased the overview. We immediately see that the issues are confined to the Vaccinieae and their sister clade, the Andromedeae, and the colouration reveals that _Andromeda polifolia_, _Vaccinium vitis-ideae_ and _Zenobia pulverulenta_ are causing it. Apparently, _Andromeda_ and the _Vaccinium_ are supposed to belong to the same tribe (Andromedeae), whereas _Zenobia_ should belong to a different one (being an intruder to Andromedeae from Vaccinieae). The solution behind this issue is that the tribe names assigned to _Vaccinium vitis-ideae_ and _Zenobia pulverulenta_ where (intentionally) switched. __Note__: Intruders are coloured using `rainbow` as well, but the palette will be newly created and the colours will not correspond to the ones in the taxonomy plot. #### 3.4.5 Solutions for Large Trees #### A difficulty for viewing phylogenies in R is that trees can be too large to see much detail in an R plotting window. To this end, we can choose to print the plot directly to a PDF file instead, in which we can then easily search and zoom into the desired portions of the tree. To achieve this, we simply set `PDF=TRUE` and set our desired filename in `PDF_filename='ourdesiredfilename.pdf'`. Space can additionally be saved by using `monocoll=TRUE`. The size of the PDF can be specified using `PDF_width` and `PDF_height`, the default for which is `"auto"`. **Note:** Be aware that plotting very large trees can take quite some time, in some cases longer than running the main analysis. Here two examples that are printed to PDF, first a monophyly plot with collapsed monophyletic clades on genus level: ```{r, eval=FALSE} PlotMonophyly(solution0, Ericactree, taxlevels=1, plot.type='monophyly', monocoll=TRUE, ladderize=TRUE, PDF=TRUE, PDF_filename='Monophylyplot_Ericac_Example.pdf') ``` ```{r, echo=FALSE, fig.width=7, fig.height=9} PlotMonophyly(solution0, Ericactree, taxlevels=1, plot.type='monophyly', monocoll=TRUE, ladderize=TRUE, cex=0.5, label.offset=5) ``` and second an intruder plot on subfamily level. A good idea for large trees is, to display them in circular shape using `type="fan"`: ```{r, eval=FALSE} PlotMonophyly(solution1, Ericactree, taxlevels=2, plot.type='intruders', ladderize=TRUE, PDF=TRUE, PDF_filename='Intruderplot_Ericac_Example.pdf', type="fan") ``` ```{r, echo=FALSE, fig.width=7.5, fig.height=7.5} PlotMonophyly(solution1, Ericactree, taxlevels=2, plot.type='intruders', ladderize=TRUE, type="fan", cex=0.4) ``` __Note__: If `PDF_height` and `PDF_width` are set to `"auto"`, the command will automatically scale the PDF document to be created according to the number of taxa in the tree. However, PDF files seem to be constrained to a maximum size of 200 inches, and the function will thus cap the size to that (both for automatic and manual sizes). If your tree is very large, the resulting PDF might be difficult to inspect, and you might want to consider chopping your tree up into subtrees to plot ( _e.g._ using `treeSlice` from the package `phytools`). Also, if the automatic scaling moves tip labels etc. in a way you don't like, you can adjust most parameters by specifying the respective arguments yourself. More details on that can be found in the helpfiles. ## 4 Final Notes ## Further information on all the functions of the package and their use can be found in the helpfiles `help(package=MonoPhy)`. For any further issues and questions send an email with subject 'MonoPhy support' to `schwery.macroevo@pm.me`. The data for the examples is based on: **Schwery, O.**, Onstein, R. E., Bouchenak-Khelladi, Y., Xing, Y., Carter, R. J. and Linder, H. P. (2015), As old as the mountains: the radiations of the Ericaceae. *New Phytologist*. doi: 10.1111/nph.13234 [Link to PDF](https://nph.onlinelibrary.wiley.com/doi/full/10.1111/nph.13234) __________________________________________