---
title: "Hybrid name detection and parsing"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Hybrid name detection and parsing}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  eval = FALSE
)
```

## Hybrid names in taxonomy

Botanical nomenclature uses a dedicated marker for hybrids: the multiplication sign (×, U+00D7). This marker appears in three distinct positions, each signalling a different kind of hybrid.

A **nothogenus** places the marker before the genus name, signalling an intergeneric hybrid (a cross between species in two different genera). Leyland cypress is a well-known example:

> ×Cupressocyparis leylandii

A **nothospecies** places the marker before the specific epithet, with the genus the same on both sides of the cross:

> Mentha ×piperita

Peppermint (*Mentha ×piperita*, a cross of *M. aquatica* and *M. spicata*) is the classic case.

The third form, a **hybrid formula**, names both parent species explicitly, joined by the multiplication sign:

> Salix alba × Salix fragilis

In real-world data, the multiplication sign is frequently replaced by a lowercase or uppercase "x". Herbarium databases, spreadsheet exports, and OCR outputs rarely preserve the Unicode character. taxify accepts all three forms (`×`, `x`, `X`) and normalizes them internally. The detection logic distinguishes a standalone "x" used as a hybrid marker from an "x" that is part of a word (e.g., the genus *Saxifraga*) by requiring whitespace boundaries around the letter.

```{r}
library(taxify)
```

## How taxify detects hybrids

Detection happens early in the pipeline, during name cleaning and before any backbone matching. When `taxify()` receives an input vector, each name passes through `clean_names()`, which calls the internal `detect_hybrid()` function.
The function tokenizes the name, looks for the hybrid marker in specific positions, and classifies the result as nothogenus, nothospecies, formula, or non-hybrid.

The output of `taxify()` includes an `is_hybrid` column (logical) that records whether a hybrid marker was found in the original input. This column is always present regardless of whether the name ultimately matched a backbone record. The finer classification into nothogenus, nothospecies, or formula is not exposed directly in the main output; it becomes available through `add_hybrid_info()`, which we cover below after looking at how matched hybrids behave in the result table.

After detection, the hybrid marker is stripped from the name before matching. For a nothospecies like "Mentha ×piperita", the cleaned form becomes "Mentha piperita". For a hybrid formula like "Salix alba × Salix fragilis", only the first parent binomial ("Salix alba") is retained as the cleaned name, since formulas are not single taxon names and cannot match a backbone record directly.

For nothospecies, taxify also constructs a secondary search form with the multiplication sign reinserted ("Mentha × piperita") and attempts to match that against the backbone. Some backbones store nothospecies with the × character in the canonical name, so this secondary attempt can recover matches that the stripped form misses.

## Worked example: matching a mixed species list

Consider a list that includes ordinary species, a nothospecies, a nothogenus, and a hybrid formula. We pass them all to `taxify()` in a single call.


```{r}
names <- c(
  "Quercus robur",
  "Mentha x piperita",
  "x Cupressocyparis leylandii",
  "Salix alba x Salix fragilis",
  "Platanus x hispanica"
)

result <- taxify(names, backend = "wfo")
result[, c("input_name", "accepted_name", "is_hybrid", "match_type")]
```

The expected output looks roughly like this:

| input_name                   | accepted_name         | is_hybrid | match_type |
|:-----------------------------|:----------------------|:----------|:-----------|
| Quercus robur                | Quercus robur         | FALSE     | exact      |
| Mentha x piperita            | Mentha × piperita     | TRUE      | exact      |
| x Cupressocyparis leylandii  | NA                    | TRUE      | none       |
| Salix alba x Salix fragilis  | Salix alba            | TRUE      | exact      |
| Platanus x hispanica         | Platanus × hispanica  | TRUE      | exact      |

Several things are visible here. The two nothospecies (Mentha, Platanus) matched successfully because WFO stores these as accepted names with the × character in the canonical name. The nothogenus ×Cupressocyparis returned no match because intergeneric hybrid genera are less commonly included in backbone databases. The hybrid formula matched only the first parent (Salix alba), since the formula itself is not a single taxon name.

The `is_hybrid` column is TRUE for all four hybrid inputs, regardless of whether the name matched. This column records a property of the input, not of the match result.

## Extracting hybrid details with add_hybrid_info()

The `add_hybrid_info()` function takes a `taxify()` result and parses the `input_name` column to extract structured hybrid information. It adds three columns:

- `hybrid_parent_1`: the first parent binomial (for formulas) or NA
- `hybrid_parent_2`: the second parent binomial (for formulas, with abbreviated genera expanded) or NA
- `hybrid_type`: one of `"nothogenus"`, `"nothospecies"`, `"formula"`, or NA for non-hybrids

For nothogenus and nothospecies names, both parent columns are NA because the input names only the hybrid itself, not its parents.
The parent species of Mentha ×piperita (Mentha aquatica and Mentha spicata) are not encoded in the name string. Only hybrid formulas carry both parent names explicitly.

```{r}
result |> add_hybrid_info()
```

The three new columns for our five-name example:

| input_name                   | hybrid_type   | hybrid_parent_1  | hybrid_parent_2   |
|:-----------------------------|:--------------|:-----------------|:------------------|
| Quercus robur                | NA            | NA               | NA                |
| Mentha x piperita            | nothospecies  | NA               | NA                |
| x Cupressocyparis leylandii  | nothogenus    | NA               | NA                |
| Salix alba x Salix fragilis  | formula       | Salix alba       | Salix fragilis    |
| Platanus x hispanica         | nothospecies  | NA               | NA                |

## Worked example: parsing hybrid formulas

Hybrid formulas appear in botanical and horticultural datasets more often than one might expect. Field botanists record them when the parentage of a specimen is known or suspected. The formulas vary in notation: some spell out both genera in full, others abbreviate the second genus.

```{r}
formulas <- c(
  "Salix alba x Salix fragilis",
  "Quercus pyrenaica x Q. petraea",
  "Populus nigra x Populus deltoides",
  "Rosa canina x R. gallica"
)

formula_result <- taxify(formulas, backend = "wfo")
formula_result <- formula_result |> add_hybrid_info()
formula_result[, c("input_name", "hybrid_type", "hybrid_parent_1", "hybrid_parent_2")]
```

| input_name                         | hybrid_type | hybrid_parent_1    | hybrid_parent_2     |
|:-----------------------------------|:------------|:-------------------|:--------------------|
| Salix alba x Salix fragilis        | formula     | Salix alba         | Salix fragilis      |
| Quercus pyrenaica x Q. petraea     | formula     | Quercus pyrenaica  | Quercus petraea     |
| Populus nigra x Populus deltoides  | formula     | Populus nigra      | Populus deltoides   |
| Rosa canina x R. gallica           | formula     | Rosa canina        | Rosa gallica        |

The genus abbreviation "Q." in the second example was expanded to "Quercus" automatically. taxify infers the full genus from the first parent in the formula.
The same expansion happened for "R." to "Rosa" in the fourth row. This expansion is purely textual: the first token of the first parent is used as the genus for the second parent whenever the second parent's genus field matches the pattern of a single capital letter followed by a period.

## What matches and what does not

The three hybrid types have different matching profiles against backbone databases.

**Nothospecies** are the best-supported form. WFO and COL both store many nothospecies as accepted names, with the × character as part of the canonical name. Mentha ×piperita, Platanus ×hispanica, and Narcissus ×medioluteus are examples that appear in both backbones. taxify's matching logic handles the marker correctly: it first tries the stripped form ("Mentha piperita") and then the form with the × reinserted ("Mentha × piperita"). At least one of these typically matches.

**Nothogenera** have lower coverage. Intergeneric hybrids like ×Cupressocyparis, ×Triticosecale, and ×Festulolium exist in some backbones but are absent from others. WFO includes several nothogenera relevant to agriculture and horticulture. COL's coverage varies by taxonomic group. When a nothogenus does not match, the output row will have `match_type = "none"` and `accepted_name = NA`, but `is_hybrid` will still be TRUE.

**Hybrid formulas** will not match a backbone record directly, because the formula is not a taxon name. taxify extracts the first parent binomial as the cleaned name for matching, so the result row reflects the match status of the first parent. To resolve both parents, match them separately.

```{r}
# Match both parents of a hybrid formula separately
parents <- c("Salix alba", "Salix fragilis")
parent_result <- taxify(parents, backend = "wfo")
```

This approach gives a full match result (accepted name, synonym status, authorship) for each parent individually.
In a dataset with many hybrid formulas, we can extract the parent columns from `add_hybrid_info()` and feed them back through `taxify()` as a batch.

```{r}
# Batch-resolve all hybrid formula parents
info <- result |> add_hybrid_info()

# Keep only formula rows; the is.na() guard drops non-hybrids (hybrid_type is NA)
formula_rows <- info[info$hybrid_type == "formula" & !is.na(info$hybrid_type), ]

# Pool both parent columns, dropping NAs and duplicates
all_parents <- unique(na.omit(c(
  formula_rows$hybrid_parent_1,
  formula_rows$hybrid_parent_2
)))

parent_matches <- taxify(all_parents, backend = "wfo")
```

## The multiplication sign and its substitutes

The Unicode multiplication sign (U+00D7) is the correct character for hybrid notation under the International Code of Nomenclature for algae, fungi, and plants (ICN). In practice, data arrive with three common representations:

1. The Unicode character itself: `×` (common in well-curated databases)
2. A lowercase `x` surrounded by spaces (common in spreadsheets and field data)
3. An uppercase `X` surrounded by spaces (less common, but occurs in older databases and OCR output)

taxify normalizes all three forms internally. The `detect_hybrid()` function replaces every occurrence of U+00D7 with a space-padded "x" and then works with a uniform token stream, so the downstream logic only needs to handle one representation. The space-boundary requirement prevents false positives: "Saxifraga" does not trigger hybrid detection because the "x" sits within a word rather than standing alone between tokens.

A subtlety arises with mojibake. When UTF-8 text containing the × character is read with a Latin-1 or Windows-1252 encoding, the two-byte sequence can be misinterpreted as "\u00c3\u0097" or "\u00c3\u2014". The name cleaning pipeline detects and repairs both of these common misreadings before hybrid detection runs, so names corrupted by encoding errors are still handled correctly.

## Practical notes

**Which backbones have the most hybrids.** WFO has the broadest coverage of plant nothospecies and nothogenera, reflecting its focus on the world flora.
COL includes hybrids across all kingdoms, but coverage is uneven. GBIF aggregates data from many sources and includes hybrid names where the contributing checklists provide them. ITIS, NCBI, and OTT have minimal hybrid coverage.

**Hybrid detection is input-side only.** taxify detects hybrids in the names that you supply. It does not scan the backbone for hybrid records. If a backbone stores "Mentha × piperita" as an accepted name, taxify will match your input against it, but the backbone record's own hybrid status is not exposed as a separate field. The `is_hybrid` column reflects your input, not the backbone.

**Formulas with infraspecific ranks.** The parser expects binomials (genus plus epithet) on both sides of the × marker. Formulas that include subspecies or variety ranks (e.g., "Salix alba var. vitellina × Salix fragilis") will still be detected as formulas, but the parent extraction may include the rank and infraspecific epithet as part of the parent name. This is generally the desired behavior, since the full trinomial identifies the parent more precisely than the binomial alone.

**Authorship in hybrid names.** Hybrid names sometimes carry authorship strings (e.g., "Mentha ×piperita L."). The name cleaning pipeline strips authorship before matching, so the presence of an author string does not interfere with hybrid detection or matching.

```{r}
# Authorship is stripped; hybrid detection still works
taxify("Mentha x piperita L.", backend = "wfo")
```

**Adding hybrid info is lightweight.** `add_hybrid_info()` operates entirely on the `input_name` column via string parsing. It does not re-query any backbone or access any files on disk. On a result with 10,000 rows, the function completes in milliseconds.
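
To make the string-level rules described in this vignette concrete, here is a minimal base R sketch of marker normalization, position-based classification, and genus abbreviation expansion. It is an illustration under stated assumptions, not taxify's actual implementation: the helper names `normalize_marker()`, `classify_hybrid()`, and `parse_formula()` are hypothetical, and the sketch ignores authorship stripping and mojibake repair.

```{r}
# Hypothetical helpers for illustration only; not part of the taxify API.

# Replace U+00D7 with a space-padded "x", then collapse repeated whitespace.
normalize_marker <- function(name) {
  name <- gsub("\u00d7", " x ", name, fixed = TRUE)
  trimws(gsub("\\s+", " ", name))
}

# Classify one name by the position of the first standalone "x" / "X" token.
# An "x" inside a word (as in "Saxifraga") is never a token of its own.
classify_hybrid <- function(name) {
  tokens <- strsplit(normalize_marker(name), " ", fixed = TRUE)[[1]]
  pos <- which(tolower(tokens) == "x")
  if (length(pos) == 0) return(NA_character_)
  if (pos[1] == 1) return("nothogenus")
  if (pos[1] == 2) return("nothospecies")
  "formula"
}

# Split a formula into its two parents. If the second parent starts with an
# abbreviated genus (one capital letter plus a period), reuse the first
# token of the first parent as the genus.
parse_formula <- function(name) {
  stopifnot(identical(classify_hybrid(name), "formula"))
  tokens <- strsplit(normalize_marker(name), " ", fixed = TRUE)[[1]]
  pos <- which(tolower(tokens) == "x")[1]
  parent_1 <- tokens[seq_len(pos - 1)]
  parent_2 <- tokens[-seq_len(pos)]
  if (grepl("^[[:upper:]]\\.$", parent_2[1])) {
    parent_2[1] <- parent_1[1]
  }
  c(
    parent_1 = paste(parent_1, collapse = " "),
    parent_2 = paste(parent_2, collapse = " ")
  )
}

# Example calls
classify_hybrid("Saxifraga granulata")
classify_hybrid("Mentha \u00d7piperita")
parse_formula("Quercus pyrenaica x Q. petraea")
```

Because everything after the marker is kept as the second parent, a formula with an infraspecific rank on either side retains the rank and epithet in the parent string, which mirrors the behavior described under "Formulas with infraspecific ranks".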