---
title: "Assembling mammal trait databases for phylogenetic comparative models"
output: rmarkdown::html_vignette
bibliography: refs_mammals.bib
vignette: >
  %\VignetteIndexEntry{Assembling mammal trait databases for phylogenetic comparative models}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
# This vignette uses dplyr / readr / stringr (and ape, which is in
# Imports). The first three are in Suggests because they're only used
# here, not by the R/ code itself. Set a single eval gate so every
# chunk skips cleanly if any are absent -- the vignette still knits
# and the rest of the package is unaffected.
have_vignette_deps <-
  requireNamespace("dplyr", quietly = TRUE) &&
  requireNamespace("readr", quietly = TRUE) &&
  requireNamespace("stringr", quietly = TRUE) &&
  requireNamespace("ape", quietly = TRUE)

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  eval = have_vignette_deps
)

if (!have_vignette_deps) {
  message(
    "This vignette requires dplyr, readr, stringr, and ape. ",
    "Skipping all code chunks; install the missing package(s) to ",
    "see the worked example."
  )
}
```

Comparative analyses often begin well before model fitting. Trait data are commonly assembled from multiple databases, each with its own species names, column names, measurement conventions, and taxonomic coverage. A phylogenetic tree adds one more requirement: the final trait table must contain species that can be matched to the tree, and the table and tree must be aligned before they can be used in a phylogenetic model.

This vignette focuses on that database-assembly step. We combine three mammal trait datasets, reconcile their species names with a phylogenetic tree, and collapse the matched records into a species-level table. The goal is not to fit a model here, but to show how several databases and a tree can be brought into a single, clean object ready for downstream comparative analyses.

## Setup

The worked example draws on five packages: **dplyr** for the table manipulations, **readr** for tolerant number parsing, **stringr** for tidying raw species strings, **ape** for the phylogeny, and **prepR4pcm** for reconciling species names against the tree.

```{r setup, message = FALSE}
library(dplyr)
library(readr)
library(stringr)
library(ape)
library(prepR4pcm)
```

```{r helper-functions, include = FALSE}
# Parse a column as numbers; missing columns and sentinel codes become NA.
pull_number_or_na <- function(df, col) {
  if (is.null(col) || !col %in% names(df)) {
    return(rep(NA_real_, nrow(df)))
  }
  readr::parse_number(
    as.character(df[[col]]),
    na = c("", "NA", "NaN", "-999", "-9999")
  )
}

# Map one source table onto the shared long-format schema.
prep_source <- function(df,
                        source_name,
                        species_col,
                        female_mass_col = NULL,
                        adult_mass_col = NULL,
                        litter_size_col = NULL,
                        litter_y_col = NULL) {
  tibble::tibble(
    source = source_name,
    row_in_source = seq_len(nrow(df)),
    species = as.character(df[[species_col]]),
    female_mass_g = pull_number_or_na(df, female_mass_col),
    adult_mass_g = pull_number_or_na(df, adult_mass_col),
    litter_size_n = pull_number_or_na(df, litter_size_col),
    litters_per_year_n = pull_number_or_na(df, litter_y_col)
  ) |>
    mutate(
      species = stringr::str_squish(species),
      # Keep only finite, strictly positive trait values.
      across(
        c(female_mass_g, adult_mass_g, litter_size_n, litters_per_year_n),
        ~ ifelse(is.finite(.x) & .x > 0, .x, NA_real_)
      )
    ) |>
    filter(!is.na(species), species != "")
}

# Collapse the distinct, non-missing source labels into one string.
safe_sources <- function(x) {
  paste(sort(unique(stats::na.omit(x))), collapse = "; ")
}

# Median that returns NA (rather than warning) when every value is missing.
safe_median <- function(x) {
  if (all(is.na(x))) {
    NA_real_
  } else {
    stats::median(x, na.rm = TRUE)
  }
}
```

## Load the example sources

The package bundles three small source tables and one small tree. The source tables were sampled from real database structures, using only the columns needed for this example. They mirror three common inputs for mammal comparative work: the Amniote life-history database [@Myhrvold2015], PanTHERIA [@Jones2009], and TetrapodTraits [@Moura2024].
The bundled tree is a subset of the VertLife mammal phylogeny [@Upham2019]; the next code chunk reports its exact tip count. For analysis-grade trees, download the full credible set from the VertLife project.

```{r load-example-objects}
data(mammal_amniote_example)
data(mammal_pantheria_example)
data(mammal_tetrapodtraits_example)
data(mammal_tree_example)

cat(sprintf("Amniote-like source: %d rows\n", nrow(mammal_amniote_example)))
cat(sprintf("PanTHERIA-like source: %d rows\n", nrow(mammal_pantheria_example)))
cat(sprintf("TetrapodTraits-like source: %d rows\n", nrow(mammal_tetrapodtraits_example)))
cat(sprintf("Tree: %d tips\n", ape::Ntip(mammal_tree_example)))
```

## Step 1: Compare the source tables

The three sources contain related information, but they do not use the same column names. This is typical when assembling trait data from independent databases.

```{r inspect-inputs}
source_columns <- tibble::tibble(
  source = c("Amniote", "PanTHERIA", "TetrapodTraits"),
  n_rows = c(
    nrow(mammal_amniote_example),
    nrow(mammal_pantheria_example),
    nrow(mammal_tetrapodtraits_example)
  ),
  n_columns = c(
    ncol(mammal_amniote_example),
    ncol(mammal_pantheria_example),
    ncol(mammal_tetrapodtraits_example)
  ),
  species_column = c("name", "MSW05_Binomial", "Scientific.Name")
)

knitr::kable(source_columns)
```

## Step 2: Standardise the sources

We first convert each source into the same long-format structure. The result keeps one row per source record, plus a `source` column so the provenance of each value is retained.

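
To make the target structure concrete, here is a minimal base-R sketch on two invented tables. The helper `toy_standardise()`, the toy values, and the source labels `"A"` and `"B"` are hypothetical and are not part of **prepR4pcm**; the sketch only illustrates mapping differently named columns onto one schema while treating sentinel codes such as `-999` as missing.

```{r toy-standardise}
# Two invented sources that store the same trait under different names.
toy_a <- data.frame(
  name = c("Mus musculus", " Rattus  rattus "),
  adult_body_mass_g = c("20.5", "-999"),  # -999 stands in for "missing"
  stringsAsFactors = FALSE
)
toy_b <- data.frame(
  Scientific.Name = "Mus musculus",
  BodyMass_g = "19",
  stringsAsFactors = FALSE
)

# Hypothetical helper: rename to a shared schema and record the provenance.
toy_standardise <- function(df, source_name, species_col, mass_col) {
  mass <- suppressWarnings(as.numeric(df[[mass_col]]))
  mass[!is.finite(mass) | mass <= 0] <- NA_real_  # non-positive -> missing
  data.frame(
    source = source_name,
    species = gsub("\\s+", " ", trimws(df[[species_col]])),  # tidy whitespace
    adult_mass_g = mass,
    stringsAsFactors = FALSE
  )
}

# One row per source record, all sources stacked in the same columns.
toy_long <- rbind(
  toy_standardise(toy_a, "A", "name", "adult_body_mass_g"),
  toy_standardise(toy_b, "B", "Scientific.Name", "BodyMass_g")
)
toy_long
```

The chunk that follows applies the same idea to the three bundled sources, using the vignette's own `prep_source()` helper and four trait columns instead of one.
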
```{r standardise-sources}
amniote_std <- prep_source(
  mammal_amniote_example,
  source_name = "AMNIOTE",
  species_col = "name",
  female_mass_col = "female_body_mass_g",
  adult_mass_col = "adult_body_mass_g",
  litter_size_col = "litter_or_clutch_size_n",
  litter_y_col = "litters_or_clutches_per_y"
)

pantheria_std <- prep_source(
  mammal_pantheria_example,
  source_name = "PANTHERIA",
  species_col = "MSW05_Binomial",
  adult_mass_col = "5-1_AdultBodyMass_g",
  litter_size_col = "15-1_LitterSize",
  litter_y_col = "16-1_LittersPerYear"
)

tetrapodtraits_std <- prep_source(
  mammal_tetrapodtraits_example,
  source_name = "TETRAPODTRAITS",
  species_col = "Scientific.Name",
  adult_mass_col = "BodyMass_g",
  litter_size_col = "LitterSize"
)

db_long_raw <- bind_rows(
  amniote_std,
  pantheria_std,
  tetrapodtraits_std
)

knitr::kable(slice_head(db_long_raw, n = 10))
```

We can now check how much information each source contributes.

```{r source-coverage}
source_coverage <- db_long_raw |>
  group_by(source) |>
  summarise(
    n_records = n(),
    n_species = n_distinct(species),
    adult_mass_records = sum(!is.na(adult_mass_g)),
    female_mass_records = sum(!is.na(female_mass_g)),
    litter_size_records = sum(!is.na(litter_size_n)),
    litter_y_records = sum(!is.na(litters_per_year_n)),
    .groups = "drop"
  )

knitr::kable(source_coverage)
```

In this table `n_records` is the number of rows the source contributes and `n_species` the number of distinct species; each `*_records` column counts the rows where that trait has a non-missing, positive value. Those counts usually sit well below `n_records`: no single database measures every trait for every species, which is exactly why combining sources is worthwhile.

## Step 3: Reconcile species names with the tree

Name reconciliation is done on the unique source names, not on every row of the trait database. This creates one matching decision per raw species name.

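
The base-R sketch below, on invented records, shows why reconciling the unique names is enough: a decision made once in the lookup can be carried back to every row with `match()`. The tip labels assigned here are hypothetical placeholders, not output from `reconcile_tree()`.

```{r toy-unique-names}
# Three invented source records, two of which share a raw name.
toy_records <- data.frame(
  species = c("Mus musculus", "Mus musculus", "Rattus rattus"),
  adult_mass_g = c(20, 19, 150),
  stringsAsFactors = FALSE
)

# One row, and therefore one matching decision, per distinct raw name.
toy_lookup <- data.frame(
  species_raw = unique(toy_records$species),
  stringsAsFactors = FALSE
)
toy_lookup$species_tree <- c("Mus_musculus", "Rattus_rattus")  # hypothetical

# Carry each decision back to all records that use that raw name.
toy_records$species_tree <-
  toy_lookup$species_tree[match(toy_records$species, toy_lookup$species_raw)]
toy_records
```

The chunk that follows builds the real lookup from `db_long_raw` in the same way, using `distinct()`.
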
```{r species-lookup}
species_lookup <- db_long_raw |>
  distinct(species) |>
  rename(species_raw = species)

knitr::kable(slice_head(species_lookup, n = 10))
```

We now reconcile the source names against the tree. External synonym lookup is turned off here so the vignette remains fast and reproducible.

```{r reconcile-pass-0}
rec0 <- reconcile_tree(
  x = species_lookup,
  tree = mammal_tree_example,
  x_species = "species_raw",
  authority = NULL,
  fuzzy = FALSE,
  quiet = TRUE
)

reconcile_summary(rec0, detail = "brief")
```

The mapping table records which source names matched the tree and which need review. In it, `name_x` is the raw source name, `name_y` the tree tip it resolved to, `match_type` records how the two were linked, and `in_x` / `in_y` flag whether the name is present in the data and in the tree.

```{r mapping-pass-0}
mapping0 <- reconcile_mapping(rec0)

mapping_preview <- mapping0 |>
  select(any_of(c("name_x", "name_y", "match_type", "in_x", "in_y"))) |>
  arrange(match_type, name_x) |>
  slice_head(n = 15)

knitr::kable(mapping_preview)
```

Names that remain unresolved or flagged can be inspected separately. We also ask for suggested matches. These suggestions are not applied automatically; they are candidates for manual review.

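
To give a feel for what a score just below 1 can look like, the sketch below computes a normalised edit-distance similarity with base R's `utils::adist()`. This is an assumption made purely for illustration: the vignette does not state how `reconcile_suggest()` scores its candidates, and `toy_similarity()` is a hypothetical helper.

```{r toy-similarity}
# Hypothetical similarity: 1 minus the Levenshtein distance divided by the
# length of the longer string. Not necessarily what reconcile_suggest() uses.
toy_similarity <- function(a, b) {
  d <- utils::adist(a, b)[1, 1]
  1 - d / max(nchar(a), nchar(b))
}

toy_similarity("Mus muscullus", "Mus musculus")  # one extra letter
toy_similarity("Mus musculus", "Mus musculus")   # identical strings
```

Under this toy measure a single-letter misspelling scores a little above 0.9 and an identical string scores exactly 1, so a window such as `score >= 0.9, score < 1` picks out near-misses that deserve a manual look.
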
```{r review-and-suggestions}
review_names <- mapping0 |>
  filter(in_x, match_type %in% c("unresolved", "flagged")) |>
  arrange(match_type, name_x)

if (nrow(review_names) == 0) {
  cat("No unresolved or flagged names in this example.\n")
} else {
  # Report the number actually shown, which may be fewer than 10.
  cat(sprintf(
    "Showing %d of %d unresolved or flagged names.\n\n",
    min(10L, nrow(review_names)),
    nrow(review_names)
  ))
  knitr::kable(
    slice_head(review_names, n = 10) |>
      select(any_of(c("name_x", "name_y", "match_type", "in_x", "in_y")))
  )
}

suggestions0 <- reconcile_suggest(rec0, n = 3, threshold = 0.9)

# Keep near-misses only: high-scoring candidates that are not exact matches.
suggestions_to_review <- suggestions0 |>
  transmute(
    name_x = unresolved,
    name_y = suggestion,
    score = score
  ) |>
  filter(score >= 0.9, score < 1) |>
  arrange(desc(score))

if (nrow(suggestions_to_review) == 0) {
  cat("No high-confidence, non-perfect suggestions were found.\n")
} else {
  cat(
    "Showing up to 10 high-confidence suggested matches with score below 1.\n\n",
    sep = ""
  )
  knitr::kable(slice_head(suggestions_to_review, n = 10), digits = 3)
}
```

## Step 4: Add manual corrections

Automated reconciliation is useful, but some names still need human review. The suggestion table above helps identify likely matches.

Manual corrections are stored as a small editable table. The important rule is simple: `name_x` must be a name from the trait database, and `name_y` must be an exact tip label in the tree.

```{r manual-overrides}
manual_overrides <- suggestions_to_review |>
  slice_head(n = 2) |>
  mutate(user_note = "Accepted from high-confidence reconciliation suggestion") |>
  select(name_x, name_y, user_note)

if (nrow(manual_overrides) == 0) {
  cat("No manual corrections were added in this example.\n")
} else {
  knitr::kable(manual_overrides, digits = 3)
}
```

We then apply any manual corrections to the reconciliation table. If `manual_overrides` is empty, the automated mapping is kept unchanged.

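
Because the corrections are joined by `name_x`, it can be worth validating the table before applying it. The sketch below is a hypothetical base-R helper, `check_overrides()`, run on invented names; it is not a **prepR4pcm** function. It returns the rows whose `name_x` is not in the data, whose `name_y` is not a tip label, or whose `name_x` is repeated (a repeat would multiply rows in a join).

```{r toy-check-overrides}
# Hypothetical validator: returns the offending rows, or zero rows if all pass.
check_overrides <- function(overrides, data_names, tip_labels) {
  problem <- !(overrides$name_x %in% data_names) |
    !(overrides$name_y %in% tip_labels) |
    duplicated(overrides$name_x)
  overrides[problem, , drop = FALSE]
}

toy_overrides <- data.frame(
  name_x = c("Mus muscullus", "Rattus ratus"),
  name_y = c("Mus_musculus", "Rattus_rattis"),  # second target is misspelled
  stringsAsFactors = FALSE
)

toy_problems <- check_overrides(
  toy_overrides,
  data_names = c("Mus muscullus", "Rattus ratus"),
  tip_labels = c("Mus_musculus", "Rattus_rattus")
)
toy_problems
```

An empty result means every correction satisfies the rule. With the vignette's objects, the corresponding arguments would be `species_lookup$species_raw` and `mammal_tree_example$tip.label`.
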
```{r apply-manual-overrides}
mapping_final <- mapping0 |>
  left_join(
    manual_overrides |>
      rename(manual_name_y = name_y, manual_note = user_note),
    by = "name_x"
  ) |>
  mutate(
    # A manual correction, where present, takes precedence over name_y.
    species_tree = coalesce(manual_name_y, name_y),
    matched_to_tree = species_tree %in% mammal_tree_example$tip.label,
    match_type = if_else(!is.na(manual_name_y), "manual", match_type),
    notes = manual_note
  ) |>
  select(-manual_name_y, -manual_note)
```

```{r final-reconciliation-summary}
final_reconciliation_summary <- mapping_final |>
  filter(in_x) |>
  count(match_type, name = "n_names") |>
  arrange(desc(n_names), match_type)

knitr::kable(final_reconciliation_summary)
```

The corrected mapping can now be joined back to the full source-level trait table.

```{r name-map-and-full-database}
name_map <- mapping_final |>
  filter(in_x) |>
  transmute(
    species_raw = name_x,
    species_tree = species_tree,
    matched_to_tree = matched_to_tree,
    match_type = match_type,
    notes = notes
  )

db_full <- db_long_raw |>
  rename(species_raw = species) |>
  left_join(name_map, by = "species_raw") |>
  relocate(source, row_in_source, species_raw, species_tree,
           matched_to_tree, match_type)

db_tree_matched <- db_full |>
  filter(matched_to_tree, !is.na(species_tree))

knitr::kable(
  db_tree_matched |>
    select(source, species_raw, species_tree, match_type,
           adult_mass_g, female_mass_g, litter_size_n, litters_per_year_n) |>
    slice_head(n = 10),
  digits = 3
)
```

## Step 5: Collapse to one row per species

The source-level database can now be summarised to one record per matched species. Here we use the median trait value across available source records and keep simple provenance columns.

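
As a small illustration of the collapse, the base-R sketch below summarises invented records with `tapply()`. The helper `toy_median()` and all values are hypothetical. The record counts are computed from the raw values, so they describe how many source records stand behind each median.

```{r toy-collapse}
# Invented source-level records: two for one species, one (missing) for another.
toy_matched <- data.frame(
  species_tree = c("Mus_musculus", "Mus_musculus", "Rattus_rattus"),
  source = c("A", "B", "A"),
  adult_mass_g = c(20, 19, NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: median of the available values, NA if none exist.
toy_median <- function(x) {
  if (all(is.na(x))) NA_real_ else stats::median(x, na.rm = TRUE)
}

grp <- toy_matched$species_tree
mass_median <- tapply(toy_matched$adult_mass_g, grp, toy_median)

toy_species <- data.frame(
  species_tree = names(mass_median),
  adult_mass_g = as.vector(mass_median),
  # Number of non-missing source records behind each median.
  adult_mass_n_records = as.vector(
    tapply(!is.na(toy_matched$adult_mass_g), grp, sum)
  ),
  # Provenance: which sources contributed a row for the species.
  sources = as.vector(
    tapply(toy_matched$source, grp,
           function(s) paste(sort(unique(s)), collapse = "; "))
  ),
  stringsAsFactors = FALSE
)
toy_species
```

The chunk that follows performs the same collapse on `db_tree_matched` for all four traits.
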
```{r species-summary}
db_species_summary <- db_tree_matched |>
  group_by(species_tree) |>
  summarise(
    n_sources_total = n_distinct(source),
    sources = safe_sources(source),
    # Count records BEFORE overwriting the trait columns. summarise()
    # evaluates its arguments in order, so once adult_mass_g holds the
    # median, sum(!is.na(adult_mass_g)) could only return 0 or 1.
    adult_mass_n_records = sum(!is.na(adult_mass_g)),
    female_mass_n_records = sum(!is.na(female_mass_g)),
    litter_size_n_records = sum(!is.na(litter_size_n)),
    litter_y_n_records = sum(!is.na(litters_per_year_n)),
    adult_mass_g = safe_median(adult_mass_g),
    female_mass_g = safe_median(female_mass_g),
    litter_size_n = safe_median(litter_size_n),
    litters_per_year_n = safe_median(litters_per_year_n),
    .groups = "drop"
  ) |>
  # Put the medians first and the record counts after them.
  relocate(ends_with("_n_records"), .after = litters_per_year_n) |>
  mutate(annual_offspring_n = litter_size_n * litters_per_year_n)
```

```{r trait-coverage}
trait_coverage <- db_species_summary |>
  summarise(
    n_species = n(),
    adult_mass_species = sum(!is.na(adult_mass_g)),
    female_mass_species = sum(!is.na(female_mass_g)),
    litter_size_species = sum(!is.na(litter_size_n)),
    litters_per_year_species = sum(!is.na(litters_per_year_n)),
    annual_offspring_species = sum(!is.na(annual_offspring_n))
  )

knitr::kable(trait_coverage)
```

Each `*_species` column counts the species with a non-missing value for that trait. Some species still carry `NA`: even with three sources combined, not every species has every trait measured. Downstream comparative analyses usually handle these gaps through imputation or complete-case analysis.

## Step 6: Align the database and the tree

For phylogenetic comparative models, the data and tree must refer to the same species. Here we prune the tree and order the data rows to match the tree tips.

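
Before pruning, it can help to see what will be dropped on each side. The sketch below uses plain character vectors as stand-ins for tip labels and species names (all invented), so it runs without a tree object; it is an illustration of the bookkeeping, not part of the package workflow.

```{r toy-alignment}
# Stand-ins for tree$tip.label and the species-level table.
toy_tips <- c("Mus_musculus", "Rattus_rattus", "Homo_sapiens")
toy_data <- data.frame(
  species = c("Rattus_rattus", "Canis_lupus", "Mus_musculus"),
  adult_mass_g = c(150, 40000, 20),
  stringsAsFactors = FALSE
)

setdiff(toy_tips, toy_data$species)  # tips with no trait data
setdiff(toy_data$species, toy_tips)  # species absent from the tree

# Keep the shared species and put the rows in tip order.
shared <- intersect(toy_tips, toy_data$species)
toy_aligned <- toy_data[match(shared, toy_data$species), , drop = FALSE]
identical(toy_aligned$species, shared)
```

The chunk that follows does the same with `keep.tip()` on the bundled tree and `match()` on `db_species_summary`.
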
```{r pcm-objects}
matched_tips <- intersect(
  mammal_tree_example$tip.label,
  db_species_summary$species_tree
)

tree_pcm <- keep.tip(mammal_tree_example, matched_tips)

pcm_data <- db_species_summary |>
  filter(species_tree %in% tree_pcm$tip.label) |>
  mutate(species = species_tree) |>
  arrange(match(species, tree_pcm$tip.label)) |>
  relocate(species)

# Fail early if the row order and the tip order ever disagree.
stopifnot(identical(pcm_data$species, tree_pcm$tip.label))
```

```{r alignment-check}
is_aligned <- identical(pcm_data$species, tree_pcm$tip.label)

alignment_check <- tibble::tibble(
  object = c("pcm_data", "tree_pcm"),
  species_or_tips = c(nrow(pcm_data), ape::Ntip(tree_pcm)),
  aligned = is_aligned
)

knitr::kable(alignment_check)
```

## Final database ready for model fitting

The final objects are `pcm_data` and `tree_pcm`. The table below shows the first 10 rows of the assembled database. This is the object that would be passed to downstream phylogenetic comparative models.

```{r final-database-preview}
knitr::kable(
  pcm_data |>
    select(species, adult_mass_g, litter_size_n, litters_per_year_n,
           annual_offspring_n, n_sources_total, sources) |>
    slice_head(n = 10),
  digits = 3
)
```

The helper functions used in this vignette (`prep_source()`, `pull_number_or_na()`, `safe_sources()`, `safe_median()`) are local to this document: they are defined in the hidden `helper-functions` chunk of the vignette source. They are deliberately minimal so you can copy and adapt them for your own assembly script. If a generic version proves useful in practice, we can graduate them to exported package functions.

## References