Phylogenetic trees from morphological data

In this vignette, we will show how to work with morphological data in phangorn (Schliep 2011). In most cases the different morphological characters or character states are encoded with the numbers 0:9 (or less, if there are less differences). Morphological data can come in different formats. The most common ones are .csv and .nexus.

Load packages

We start by loading the phangorn package and setting a random seed:

library(phangorn)
set.seed(9)

Load data

The dataset we’re using contains morphological data for 12 mite species, with 79 encoded characters (Schäffer et al. 2010). When reading in the .csv file, row.names = 1 uses the first column (species) as row names. To get a phyDat object, we have to convert the dataframe into a matrix with as.matrix.

fdir <- system.file("extdata", package = "phangorn")
mm <- read.csv(file.path(fdir, "mites.csv"), row.names = 1)
mm_pd <- phyDat(as.matrix(mm), type = "USER", levels = 0:7)

The data can then be written into a nexus file:

write.phyDat(mm_pd, file.path(fdir, "mites.nex"), format = "nexus")

Reading in a nexus file is even easier than reading in a csv file:

mm_pd <- read.phyDat(file.path(fdir, "mites.nex"), format = "nexus", type = "STANDARD")

After reading in the nexus file, we have the states 0:9, but the data only has the states 0:7. Here is one possibility to change the contrast matrix:

contrast <- matrix(data = c(1,0,0,0,0,0,0,0,0,
    0,1,0,0,0,0,0,0,0,
    0,0,1,0,0,0,0,0,0,
    0,0,0,1,0,0,0,0,0,
    0,0,0,0,1,0,0,0,0,
    0,0,0,0,0,1,0,0,0,
    0,0,0,0,0,0,1,0,0,
    0,0,0,0,0,0,0,1,0,
    0,0,0,0,0,0,0,0,1,
    1,1,1,1,1,1,1,1,1),
    ncol = 9, byrow = TRUE)
dimnames(contrast) <- list(c(0:7,"-","?"),
    c(0:7, "-"))
contrast
##   0 1 2 3 4 5 6 7 -
## 0 1 0 0 0 0 0 0 0 0
## 1 0 1 0 0 0 0 0 0 0
## 2 0 0 1 0 0 0 0 0 0
## 3 0 0 0 1 0 0 0 0 0
## 4 0 0 0 0 1 0 0 0 0
## 5 0 0 0 0 0 1 0 0 0
## 6 0 0 0 0 0 0 1 0 0
## 7 0 0 0 0 0 0 0 1 0
## - 0 0 0 0 0 0 0 0 1
## ? 1 1 1 1 1 1 1 1 1
mm_pd <- phyDat(mm_pd, type="USER", contrast=contrast)

Now that we have our data, we can start the analyses.

Parsimony

For morphological data, one of the most frequently used approaches to conduct phylogenetic trees is maximum parsimony (MP). pratchet (as already described in Estimating phylogenetic trees with phangorn) implements the parsimony ratchet (Nixon 1999). To create a starting tree, we can use the function random.addition:

mm_start <- random.addition(mm_pd)

This tree can then be given to pratchet:

mm_tree <- pratchet(mm_pd, start = mm_start, minit = 1000, maxit = 10000,
                    all = TRUE, trace = 0)
mm_tree
## 19 phylogenetic trees

With all=TRUE we get all (in this case 19) trees with lowest parsimony score in a multiPhylo object. Since we we did a minimum of 1000 iterations, we already have some edge support. Now we can assign the edge lengths.

mm_tree <- acctran(mm_tree, mm_pd)

Branch and bound

In the case of our mites-dataset with 12 sequences, it’s also possible to use the branch and bound algorithm (Hendy and Penny 1982) to find all most parsimonious trees. With bigger datasets it is definitely recommended to use pratchet.

mm_bab <- bab(mm_pd, trace = 0)
mm_bab
## 37 phylogenetic trees

Root trees

If we want our unrooted trees to be rooted, we have the possibility to use midpoint to perform midpoint rooting. Rooting the trees with a specific species (we chose C. cymba here) can be done with the function root from the ape package (Paradis and Schliep 2019). To save the correct node labels (edge support), it’s important to set edgelabel=TRUE.

mm_tree_rooted <- root(mm_tree, outgroup = "C._cymba", resolve.root = TRUE,
                       edgelabel = TRUE)

Plot trees

With plotBS we plot a tree with their respective edge support. It is also possible to save the plots as .pdf (or various other formats, e.g. svg, png, tiff) file. digits is an argument to determine the number of digits shown for the bootstrap values.

# subsetting for tree nr. 9
plotBS(mm_tree_rooted[[9]], digits = 2)

# save plot as pdf
pdf(file = "mm_rooted.pdf")
plotBS(mm_tree_rooted[[9]], digits = 2)
dev.off()

Consensus tree

To look at the consensus tree of our 19 trees from pratchet, or of our 37 most parsimonious trees from bab, we can use the consensus function from ape.

# unrooted pratchet tree
mm_cons <- consensus(mm_tree)

# rooted pratchet tree
mm_cons_root <- consensus(mm_tree_rooted, rooted = TRUE)

# branch and bound, we root the consensus tree in the same step
mm_bab_cons <- root(consensus(mm_bab), outgroup = "C._cymba",
                    resolve.root = TRUE, edgelabel = TRUE)
plot(mm_cons, main="Unrooted pratchet consensus tree")
plot(mm_cons_root, main="Rooted pratchet consensus tree")
plot(mm_bab_cons, main="Rooted bab consensus tree")
Unrooted and rooted consensus trees of the mites dataset with MP.Unrooted and rooted consensus trees of the mites dataset with MP.Unrooted and rooted consensus trees of the mites dataset with MP.

Unrooted and rooted consensus trees of the mites dataset with MP.

We can clearly see that, as expected, the two rooted trees have the same topology.

Session info

## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] phangorn_2.12.1 ape_5.8         rmarkdown_2.29 
## 
## loaded via a namespace (and not attached):
##  [1] Matrix_1.7-1      gtable_0.3.6      jsonlite_1.8.9    dplyr_1.1.4      
##  [5] compiler_4.4.2    tidyselect_1.2.1  Rcpp_1.0.13-1     parallel_4.4.2   
##  [9] jquerylib_0.1.4   scales_1.3.0      yaml_2.3.10       fastmap_1.2.0    
## [13] lattice_0.22-6    ggplot2_3.5.1     R6_2.5.1          labeling_0.4.3   
## [17] generics_0.1.3    igraph_2.1.1      knitr_1.49        tibble_3.2.1     
## [21] maketools_1.3.1   munsell_0.5.1     pillar_1.9.0      bslib_0.8.0      
## [25] rlang_1.1.4       utf8_1.2.4        fastmatch_1.1-4   cachem_1.1.0     
## [29] xfun_0.49         quadprog_1.5-8    sass_0.4.9        sys_3.4.3        
## [33] cli_3.6.3         withr_3.0.2       magrittr_2.0.3    digest_0.6.37    
## [37] grid_4.4.2        lifecycle_1.0.4   nlme_3.1-166      vctrs_0.6.5      
## [41] evaluate_1.0.1    glue_1.8.0        farver_2.1.2      codetools_0.2-20 
## [45] ggseqlogo_0.2     buildtools_1.0.0  fansi_1.0.6       colorspace_2.1-1 
## [49] tools_4.4.2       pkgconfig_2.0.3   htmltools_0.5.8.1

References

Hendy, M. D., and D. Penny. 1982. “Branch and Bound Algorithms to Determine Minimal Evolutionary Trees.” Math. Biosc. 59: 277–90.
Nixon, K. 1999. “The Parsimony Ratchet, a New Method for Rapid Rarsimony Analysis.” Cladistics 15: 407–14.
Paradis, Emmanuel, and Klaus Schliep. 2019. “Ape 5.0: An Environment for Modern Phylogenetics and Evolutionary Analyses in r.” Bioinformatics 35 (3): 526–28. https://doi.org/10.1093/bioinformatics/bty633.
Schäffer, Sylvia, Tobias Pfingstl, Stephan Koblmüller, Kathrin A Winkler, Christian Sturmbauer, and Günther Krisper. 2010. “Phylogenetic Analysis of European Scutovertex Mites (Acari, Oribatida, Scutoverticidae) Reveals Paraphyly and Cryptic Diversity: A Molecular Genetic and Morphological Approach.” Molecular Phylogenetics and Evolution 55 (2): 677–88.
Schliep, Klaus Peter. 2011. “Phangorn: Phylogenetic Analysis in R.” Bioinformatics 27 (4): 592–93. https://doi.org/10.1093/bioinformatics/btq706.