--- title: "From species names to a phylogenetic posterior — prepR4pcm + pigauto" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{From species names to a phylogenetic posterior — prepR4pcm + pigauto} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", eval = FALSE # most of this requires datelife / pigauto / network ) ``` Phylogenetic comparative methods (PCMs) propagate uncertainty imperfectly. The standard pipeline takes one tree, fits one model, and reports one set of confidence intervals. But the tree itself is typically a point estimate from a posterior, the trait data typically have missing values, and each of those is its own source of uncertainty. This vignette walks through a pipeline that handles both: 1. **prepR4pcm** reconciles species names, fetches a *posterior of trees* (not just one), and prepares the data for downstream modelling. 2. **[pigauto](https://itchyshin.github.io/pigauto/)** imputes the missing trait values, fits the model on each posterior tree, and pools the results via Rubin's rules so the reported standard errors reflect both the tree and imputation uncertainty. The two packages compose: prepR4pcm produces `multiPhylo`, pigauto consumes `multiPhylo`. Once the contract is set up, swapping the backend (rtrees vs datelife vs fishtree) doesn't change the rest of the pipeline. ## What we'll build A model-ready dataset for fish, with: - A posterior of 50 fish chronograms from the Fish Tree of Life (Rabosky et al. 2018). - 5 imputations of the missing trait values per tree (so 5 × 50 = 250 model fits — adjust `m_per_tree` upward for a real analysis). - Pooled estimates with valid standard errors that account for both axes of uncertainty. ```{r setup} library(prepR4pcm) # Suggests packages we'll lean on: # install.packages("fishtree") # pak::pak("phylotastic/datelife") # Optional, for the pooling step at the end: # pak::pak("itchyshin/pigauto") ``` ## Step 1 — Retrieve a posterior of trees Start with whatever names appear in your dataset. They never quite match a tree's tip labels — formatting drifts, synonyms creep in. `pr_get_tree()` copes: it normalises every name (underscores, whitespace, authority strings) and, for the `fishtree` backend, resolves synonyms through Open Tree of Life's Taxonomic Name Resolution Service before querying the backend. ```{r data} # Your trait data — typically loaded from disk. Names need not be # tidy: `Esox_lucius` carries an underscore, which pr_get_tree() # normalises away. my_data <- data.frame( species = c("Salmo salar", "Esox_lucius", "Oncorhynchus mykiss"), body_size = c(98, NA, 50) # NA is fine; pigauto will impute it ) ``` Every backend that *can* return multiple trees does so via `n_tree > 1`. Here we ask `fishtree` for 50 stochastic resolutions (Chang, Rabosky, Alfaro 2019, *Syst. Biol.*). ```{r retrieve, eval = FALSE} trees <- pr_get_tree(my_data, species_col = "species", source = "fishtree", n_tree = 50) class(trees$tree) # "multiPhylo" length(trees$tree) # 50 trees$matched # the 3 species, in their original input form trees$unmatched # character(0) — every species was placed # Per-tree provenance: one list entry per returned tree, each # recording source_index, citation, calibration_method, and n_tips. length(trees$backend_meta$tree_provenance) # 50 trees$backend_meta$tree_provenance[[1]] # the provenance of tree 1 ``` `pr_get_tree()` records what happened to every name in `trees$mapping` (one row per input species), so a name that quietly fell out of the tree cannot go unnoticed. For the universal-but-slower option, use `source = "datelife"` — the chronogram-database backend that returns one calibrated tree per source (Hedges, Bininda-Emonds, Upham, etc.). ```{r datelife, eval = FALSE} trees <- pr_get_tree(my_data, species_col = "species", source = "datelife", n_tree = 50) # trees$tree is multiPhylo; per-source citations are in # trees$backend_meta$tree_provenance[[i]]$citation ``` If you already have a topology and want to *date* it (rather than fetch a new one), use `pr_date_tree()`: ```{r date, eval = FALSE} my_topology <- ape::read.tree("my_topology.nex") dated_trees <- pr_date_tree(my_topology, n_dated = 50) class(dated_trees$tree) # "multiPhylo" length(dated_trees$tree) # up to 50 ``` **What `n_dated = 50` actually returns**: 50 chronograms that all share `my_topology` but have *different branch lengths*. Each variant comes from a different source paper in DateLife's database (e.g. variant 1 dated against Hedges et al. 2015, variant 2 against Bininda-Emonds et al. 2007, etc.). The topology is fixed; the dating varies. To get 50 different *topologies*, fetch them separately first (e.g. `pr_get_tree(species, source = "rtrees", taxon = "mammal")` returns 100 mammal topologies sampled from VertLife / Upham et al. 2019), then pass that `multiPhylo` to `pr_date_tree()` to date each one. That gives you the topology × source cross-product. ## Step 2 — Reconcile the data to the posterior `pr_get_tree()` resolves names well enough to *retrieve* a tree, but the modelling step needs a data frame whose rows line up one-to-one with the tree's tips. `reconcile_tree()` matches your data against the retrieved topology, and `reconcile_apply()` returns the aligned data–tree pair. The 50 posterior trees share one tip set, so reconciling against the first tree aligns the data to all of them. ```{r reconcile, eval = FALSE} rec <- reconcile_tree(my_data, trees$tree[[1]], x_species = "species", authority = NULL) # tree tips are plain binomials rec # match summary: 1 exact, 2 normalized, 0 unresolved aligned <- reconcile_apply(rec, data = my_data, tree = trees$tree[[1]], species_col = "species", drop_unresolved = TRUE) aligned$data # one row per tree tip — the model-ready frame #> species body_size #> 1 Salmo salar 98 #> 2 Esox_lucius NA #> 3 Oncorhynchus mykiss 50 ``` `aligned$data` carries one row per species in the tree, with `Esox_lucius` keeping the `NA` that pigauto will impute; `aligned$tree` is the matching pruned tree. ## Step 3 — Format citations (do this *before* you submit) Reviewers ask. Save yourself the round-trip. ```{r cite, eval = FALSE} # Plain text for an email or a methods paragraph pr_cite_tree(trees) # Markdown for a GitHub issue or PR description cat(pr_cite_tree(trees, format = "markdown")) # BibTeX for a manuscript bibliography (sanity-check before submitting) cat(pr_cite_tree(trees, format = "bibtex")) ``` ## Step 4 — Hand the posterior to pigauto This is the contract. `pigauto::multi_impute_trees(df, trees, m_per_tree)` takes `df` (data with NAs) and `trees` (`multiPhylo`) and returns `m_per_tree` complete-data imputations *per tree*. The output knows which imputation came from which tree, and pigauto's pooling step later uses that index. ```{r pigauto, eval = FALSE} library(pigauto) # 5 imputations per tree × 50 trees = 250 complete datasets mi <- multi_impute_trees(aligned$data, trees = trees$tree, m_per_tree = 5) # Fit your model to each completed dataset. Replace the formula # with whatever your hypothesis is. fits <- with_imputations(mi, function(df) { glmmTMB::glmmTMB(body_size ~ 1 + (1 | species), data = df) }) # Pool via Rubin's rules — the standard errors reflect BOTH the # imputation uncertainty AND the tree-posterior uncertainty. pooled <- pool_mi(fits, coef_fun = function(fit) fixef(fit)$cond, vcov_fun = function(fit) vcov(fit)$cond) pooled ``` The reported intervals are wider than the equivalent single-tree-single-imputation pipeline — and *correctly* so. The narrower intervals were always overconfident. ## Choosing a backend Verified on a clean macOS R 4.4 install on 2026-05-01. See the [Comparing tree backends](comparing-tree-backends.html) vignette for the full status table. | If your taxon is... | And you want... | Use | Status | |---|---|---|---| | Universal | A single best-guess tree | `pr_get_tree(source = "rotl")` | ✅ verified — returns the Open Tree of Life synthesis tree | | Universal | A posterior of dated trees | `pr_get_tree(source = "datelife", n_tree = 50)` | likely ✅ — `datelife` is in `Enhances`; install separately with `pak::pak("phylotastic/datelife")` | | Birds | A single tree, current Clements | `pr_get_tree(source = "clootl", n_tree = 1)` | ✅ verified — works out of the box (uses the v1.6/2025 taxonomy bundled with `clootl`) | | Birds | A posterior of trees, current Clements | `pr_get_tree(source = "clootl", n_tree = 50)` | ⚠️ requires `clootl::get_avesdata_repo(".")` once before this works; without it `sampleTrees()` errors with `AvesData repo not found`. Capped at 100 upstream. | | Fish | Time-calibrated, posterior | `pr_get_tree(source = "fishtree", n_tree = 50)` | ✅ verified — returns `multiPhylo[50]` | | Bird, mammal, fish, amphibian, reptile, plant, shark, bee, butterfly | Taxon-specific mega-tree, possibly grafted | `pr_get_tree(source = "rtrees", taxon = "")` | ✅ works; ⚠️ `n_tree` is **informational only** — `rtrees::get_tree()` has no `n_tree` argument, so the count is fixed by the chosen mega-tree (`taxon = "bird"` → 100, `taxon = "mammal"` → 100, `taxon = "fish"` → 1, etc.). | For the dating-only case (you already have a topology): | If your topology is... | Use | Status | |---|---|---| | Universal taxa | `pr_date_tree(my_tree, n_dated = 50)` | likely ✅ — untested in this run; same dependency story as `datelife` | ## What's *not* in this pipeline - **Tree comparison across backends.** prepR4pcm provides `pr_tree_compare()` for quick pairwise Jaccard / Robinson-Foulds / branch-length comparisons (see the [comparing tree backends vignette](comparing-tree-backends.html)). For richer visual comparison use `phytools::cophyloplot()` or `phangorn::densiTree()`. - **Trait imputation alone.** That's pigauto's `impute()` / `multi_impute()`. This vignette skips straight to the trees-aware variant because that's the integration point. For repeated runs of the same retrieval (e.g. while iterating on a manuscript), pass `cache = TRUE` to `pr_get_tree()`. The cache lives under `tools::R_user_dir()` by default; redirect with `pr_tree_cache_dir()` if you want it inside the project. ## See also - `?pr_get_tree`, `?pr_date_tree`, `?pr_cite_tree`, `?reconcile_augment` --- the four entry points for prepR4pcm's tree-handling. - [pigauto's full PCM workflow vignette](https://itchyshin.github.io/pigauto/articles/mixed-types.html) --- the downstream half, in detail. - Sanchez Reyes et al. (2024). *DateLife: Leveraging databases and analytical tools to reveal the dated Tree of Life.* Systematic Biology 73(2): 470–485. DOI 10.1093/sysbio/syae015. - Rabosky et al. (2018). *An inverse latitudinal gradient in speciation rate for marine fishes.* Nature 559: 392–395. DOI 10.1038/s41586-018-0273-1.