---
title: "Fuzzy matching: methods, thresholds, and tuning"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Fuzzy matching: methods, thresholds, and tuning}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(eval = FALSE)
```

## What fuzzy matching does (and does not do)

When `taxify()` receives a name it cannot find by exact match, it falls back to fuzzy matching. Fuzzy matching computes a *string distance* between the input name and every candidate in the backbone, then returns the closest candidate whose distance falls below a threshold. The backbone is genus-blocked during this step: only names sharing the same genus are compared, which keeps the search fast even on backbones with millions of rows.

String distance is a purely mechanical measure. It counts character-level edits (insertions, deletions, substitutions, and optionally transpositions) required to transform one string into another. A fuzzy match tells us that two strings are *spelled similarly*. It does not tell us anything about whether two names refer to the same biological entity. Fuzzy matching catches typos, transliteration errors, and OCR artefacts. It does not resolve taxonomic disagreements, and it cannot bridge the gap between common names and Latin binomials.

This distinction matters for interpretation. A fuzzy match with a low distance (say 0.05) almost certainly corrects a minor typo. A fuzzy match with a high distance (say 0.18) might correct a larger OCR error, or it might have matched the wrong species entirely. The `fuzzy_dist` column in the output is there so we can tell these apart.

The matching pipeline in taxify runs in a strict sequence: name cleaning first, then exact matching (case-sensitive, case-insensitive, Latin orthographic normalization, infraspecific-to-species fallback), and only then fuzzy matching on the names that survived all exact passes without a hit.
Fuzzy matching never overrides an exact match. If the cleaned input matches a backbone name exactly, that result stands regardless of whether a closer fuzzy candidate might exist under a different spelling. This means the fuzzy matching step operates only on genuinely misspelled or garbled names.

## The three distance methods

taxify supports three string distance algorithms, selected via the `fuzzy_method` argument. All three are computed at the C level inside vectra's `fuzzy_join()`, which runs genus-blocked comparisons in parallel via OpenMP.

### Damerau-Levenshtein (default, `fuzzy_method = "dl"`)

Damerau-Levenshtein counts four edit operations, each costing 1:

- **Insertion**: add a character (*Querus* to *Quercus*)
- **Deletion**: remove a character (*Quercuss* to *Quercus*)
- **Substitution**: replace one character with another (*Quarcus* to *Quercus*)
- **Transposition**: swap two adjacent characters (*Qurecus* to *Quercus*)

The transposition operation is what distinguishes Damerau-Levenshtein from plain Levenshtein. Transpositions are among the most common typos in hand-entered data, so treating them as a single edit (rather than two: a deletion plus an insertion) produces tighter distances for real-world errors. This is the default for good reason: it handles the most common failure modes with the smallest distance penalty.

### Levenshtein (`fuzzy_method = "levenshtein"`)

Levenshtein supports only three operations: insertion, deletion, and substitution. A transposition like *Qurecus* to *Quercus* costs 2 edits (delete the *r*, insert *r* at the right position) instead of the 1 edit that Damerau-Levenshtein would assign.

In practice this means Levenshtein is *stricter* than Damerau-Levenshtein for transposition-heavy errors and *identical* for everything else. The same threshold value will reject candidates that Damerau-Levenshtein would accept.
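The difference between the two edit-distance methods can be checked by hand in base R. The sketch below is illustrative only: `osa_distance()` is a hypothetical helper written for this example (it is not part of taxify or vectra), and it implements the restricted "optimal string alignment" variant of Damerau-Levenshtein, which is an assumption about the variant in use. Plain Levenshtein comes from `utils::adist()`.

```r
# Hypothetical helper: Levenshtein plus adjacent transpositions
# (optimal string alignment variant), computed by dynamic programming.
osa_distance <- function(a, b) {
  x <- strsplit(a, "", fixed = TRUE)[[1]]
  y <- strsplit(b, "", fixed = TRUE)[[1]]
  n <- length(x)
  m <- length(y)
  # d[i + 1, j + 1] holds the edits needed to turn the first i
  # characters of a into the first j characters of b.
  d <- matrix(0L, n + 1L, m + 1L)
  d[, 1L] <- 0:n
  d[1L, ] <- 0:m
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      cost <- if (x[i] == y[j]) 0L else 1L
      best <- min(d[i, j + 1L] + 1L,  # deletion
                  d[i + 1L, j] + 1L,  # insertion
                  d[i, j] + cost)     # substitution or match
      if (i > 1L && j > 1L && x[i] == y[j - 1L] && x[i - 1L] == y[j]) {
        best <- min(best, d[i - 1L, j - 1L] + 1L)  # adjacent transposition
      }
      d[i + 1L, j + 1L] <- best
    }
  }
  as.integer(d[n + 1L, m + 1L])
}

typos <- c("Qurecus", "Querus", "Quercuss", "Quarcus")
cmp <- data.frame(
  typo        = typos,
  levenshtein = as.vector(utils::adist(typos, "Quercus")),
  dl          = vapply(typos, osa_distance, integer(1),
                       b = "Quercus", USE.NAMES = FALSE)
)
cmp
```

Only the transposed *Qurecus* differs between the two columns (2 edits under Levenshtein, 1 under the transposition-aware distance); the insertion, deletion, and substitution cases cost 1 edit either way.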
Levenshtein is a reasonable choice when the input data comes from a controlled source (database export, curated checklist) and transpositions are rare. For OCR or hand-typed data, Damerau-Levenshtein is almost always better.

### Jaro-Winkler (`fuzzy_method = "jw"`)

Jaro-Winkler is fundamentally different from the edit-distance methods. It computes a *similarity score* between 0 (completely different) and 1 (identical), then taxify converts this to a *distance* as `1 - similarity`. The algorithm gives extra weight to characters that match at the beginning of the string, which reflects a useful observation about taxonomic names: the genus is the most informative part, and prefix errors are rarer than epithet errors.

Because Jaro-Winkler operates on a 0-to-1 scale by definition, only fractional thresholds are supported. Passing an integer threshold (like `fuzzy_threshold = 2`) with `fuzzy_method = "jw"` raises an error immediately.

Jaro-Winkler can be useful for very short names (3-5 characters) where a single edit produces a large normalized distance under Damerau-Levenshtein, and for datasets where most errors are concentrated in the epithet rather than the genus. For general-purpose matching, Damerau-Levenshtein remains the safer default.

## How thresholds work

The `fuzzy_threshold` argument controls how different two strings can be before the match is rejected. It operates in two modes, depending on its value.

### Fractional mode (0 < threshold < 1)

The default threshold of `0.2` means: *normalized distance must not exceed 0.2*. Normalized distance is defined as:

```
normalized_distance = raw_edits / max(nchar(input), nchar(candidate))
```

This scales with name length. A 5-character name (*Abies*) gets at most 1 edit at threshold 0.2, because `1 / 5 = 0.2`. A 12-character name (*Quercus ilex*) gets at most 2 edits, because `2 / 12 = 0.167 < 0.2` but `3 / 12 = 0.25 > 0.2`. A 20-character name gets up to 4 edits.
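The same arithmetic can be written as two small functions. This is a minimal sketch, assuming the comparison is "less than or equal to the threshold" as the definition above states; `passes_fractional()` and `max_edits()` are hypothetical names used only for illustration and are not taxify functions.

```r
# Hypothetical helper: TRUE when the normalized distance is within the threshold.
passes_fractional <- function(edits, n_input, n_candidate, threshold = 0.2) {
  edits / pmax(n_input, n_candidate) <= threshold
}

# Hypothetical helper: largest whole number of edits that still passes
# for a given name length. The tiny offset guards against floating-point
# products that land just below a whole number.
max_edits <- function(n_chars, threshold = 0.2) {
  floor(n_chars * threshold + 1e-9)
}

max_edits(c(5, 12, 20))       # 1 2 4
passes_fractional(2, 12, 12)  # TRUE:  2 / 12 = 0.167
passes_fractional(3, 12, 12)  # FALSE: 3 / 12 = 0.25
```

Because the denominator is the longer of the two strings, a candidate that is longer than the input slightly relaxes the budget, which is why both lengths are arguments.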

Here is the concrete arithmetic for a few representative names:

| Input name              | Length | Max edits at 0.2 | Max edits at 0.1 | Max edits at 0.3 |
|:------------------------|-------:|-----------------:|-----------------:|-----------------:|
| Poa annua               |      9 |                1 |                0 |                2 |
| Quercus robur           |     13 |                2 |                1 |                3 |
| Taraxacum officinale    |     20 |                4 |                2 |                6 |
| Achillea millefolium    |     20 |                4 |                2 |                6 |
| Brachypodium sylvaticum |     23 |                4 |                2 |                6 |

The "max edits" column is `floor(length * threshold)`. The comparison itself uses the floating-point ratio rather than the floored count, with the same outcome: a 9-character name with 2 edits gives `2/9 = 0.222`, which exceeds 0.2 and is rejected.

### Integer mode (threshold >= 1)

When `fuzzy_threshold` is an integer (1, 2, 3, ...), it acts as an absolute cap on raw edit count, regardless of name length. `fuzzy_threshold = 2L` means: at most 2 edits, whether the name is 5 characters or 25 characters long.

This mode is useful when we know the kind of errors in our data. If the input comes from an OCR pipeline that occasionally drops or doubles a single character, `fuzzy_threshold = 1L` captures those errors without over-matching on longer names.

Integer thresholds are not supported for Jaro-Winkler, because that method does not count discrete edits.

```{r integer-threshold}
# Allow at most 1 edit, regardless of name length
result <- taxify(
  c("Qurecus robur", "Achillea milefolium", "Poa anua"),
  fuzzy_threshold = 1L
)
# "Qurecus robur" matches (1 transposition)
# "Achillea milefolium" matches (1 deletion: ll -> l)
# "Poa anua" matches (1 deletion: nn -> n)
```

## What happens before fuzzy matching

Before any distance computation, taxify runs a cleaning pipeline on the input names. This pipeline strips qualifiers (`cf.`, `aff.`, `s.l.`, `s.str.`), removes authorship strings (`L.`, `(Aiton) Sm.`), drops brackets and trailing numbers, collapses whitespace, and lowercases everything except the genus.
The backbone names are already clean, so this step brings user input into the same format.

Cleaning is aggressive enough that many names which look like they need fuzzy matching actually resolve by exact match once the noise is stripped.

```{r cleaning-before-matching}
# All three resolve to the same clean form: "Quercus robur"
result <- taxify(c(
  "Quercus robur L.",
  "Quercus robur (L.) Sm.",
  "  Quercus   robur  "
))
# match_type will be "exact" for all three (no fuzzy needed)
```

The pipeline also handles Latin orthographic normalization as a separate exact matching pass. Alternations like *ae/i* (*hirtaeformis* vs *hirtiformis*), *ph/f*, *rh/r*, *th/t*, and *ii/i* at word endings are normalized before comparison. These are not fuzzy matches; they appear as `exact_ci` in the output.

Hybrid markers (the multiplication sign or standalone "x") are detected and stripped during cleaning. A name like `Quercus × hispanica` is cleaned to `Quercus hispanica` for matching, with the `is_hybrid` column set to `TRUE`.

The upshot: fuzzy matching only runs on names that survived cleaning *and* failed all exact matching passes (case-sensitive, case-insensitive, Latin normalization, and infraspecific-to-species fallback). By the time fuzzy matching activates, the remaining names genuinely have character-level errors.

## Worked example 1: clean names that need no fuzzy matching

A well-curated species list, possibly with authorship strings attached, will typically resolve entirely by exact match. Fuzzy matching runs but finds nothing to do.

```{r clean-names}
clean_names <- c(
  "Quercus robur",
  "Pinus sylvestris",
  "Betula pendula",
  "Fagus sylvatica",
  "Acer pseudoplatanus"
)

result <- taxify(clean_names)

# All rows have match_type == "exact"
table(result$match_type)
# exact
#     5

# fuzzy_dist is NA for all rows
all(is.na(result$fuzzy_dist))
# TRUE
```

Adding authorship does not change the picture. The cleaning pipeline strips it before matching.

```{r clean-names-authorship}
with_authors <- c(
  "Quercus robur L.",
  "Pinus sylvestris L.",
  "Betula pendula Roth",
  "Fagus sylvatica L.",
  "Acer pseudoplatanus L."
)

result <- taxify(with_authors)
table(result$match_type)
# exact
#     5
```

The message here is straightforward: for curated data, fuzzy matching adds no value and can safely be disabled with `fuzzy = FALSE` to skip the step entirely. This saves a small amount of time on large lists.

## Worked example 2: OCR-degraded and hand-typed names

Real-world species lists often arrive with typos, especially when transcribed from handwritten field notes or extracted from scanned PDFs via OCR. These are the names fuzzy matching is designed to rescue.

```{r ocr-degraded}
messy_names <- c(
  "Qurecus robur",          # transposition: er -> re
  "Taraxacum officianle",   # transposition: na -> an
  "Plantago lanceoalata",   # insertion: extra a
  "Trifolium repnes",       # transposition: en -> ne
  "Dactylis gloemrata",     # transposition: me -> em
  "Lolium perrene",         # doubled r, dropped n (2 edits)
  "Achillea millefolum",    # deletion: i missing
  "Ranunculus acris"        # correct (should exact-match)
)

result <- taxify(messy_names)

# Check what matched and how
result[, c("input_name", "accepted_name", "match_type", "fuzzy_dist")]
```

The transposition errors (*Qurecus*, *officianle*, *repnes*, *gloemrata*) each cost 1 edit under Damerau-Levenshtein, producing `fuzzy_dist` values of roughly 0.05-0.08 for these 13-20 character names. The extra *a* in *lanceoalata* and the missing *i* in *millefolum* also cost 1 edit each. *Ranunculus acris* exact-matches and has `fuzzy_dist = NA`.

Consider the arithmetic for *Taraxacum officianle* (20 characters). The intended target is *Taraxacum officinale*, which differs by a transposition of *n* and *a* at positions 17-18. That is 1 edit, giving a normalized distance of `1 / 20 = 0.05`. This falls well within the 0.2 threshold. Even a conservative threshold of 0.1 would accept it.

The name *Lolium perrene* (14 characters) is a subtler case. Compared to *Lolium perenne* it doubles the *r* and drops one *n*, so it costs 2 edits (one insertion plus one deletion), for a normalized distance of `2 / 14 = 0.143`.

All of these fall within the default threshold of 0.2. For data with this error profile, the default settings work well out of the box. The single-edit `fuzzy_dist` values cluster tightly around 0.05-0.08, giving us high confidence that those matches are correct; *Lolium perrene* at 0.14 is the one worth a second look.

## Worked example 3: threshold too loose

A loose threshold can match names to the wrong species. This is the primary risk of fuzzy matching, and it tends to bite hardest with short names or names in species-dense genera.

```{r threshold-too-loose}
# Poa is a large genus with many similar epithets
poa_names <- c(
  "Poa anua",       # intended: Poa annua (1 edit)
  "Poa pratenss",   # intended: Poa pratensis (1 edit)
  "Poa trialis"     # intended: Poa trivialis (2 edits)
)

# With a loose threshold, some may match the wrong species
loose <- taxify(poa_names, fuzzy_threshold = 0.4)
loose[, c("input_name", "accepted_name", "fuzzy_dist")]
```

At threshold 0.4, *Poa trialis* (11 characters) is allowed 4 to 5 edits, depending on the length of the candidate. That is enough distance to reach not only *Poa trivialis* (the intended target, 2 edits) but potentially other Poa species that happen to be closer in string distance. With 500+ Poa species in WFO, the risk of a false match is real.

The fix is simple: tighten the threshold.

```{r threshold-tightened}
tight <- taxify(poa_names, fuzzy_threshold = 0.15)
tight[, c("input_name", "accepted_name", "match_type", "fuzzy_dist")]
# "Poa anua" still matches (1/9 = 0.11 < 0.15)
# "Poa pratenss" still matches (1/13 = 0.08 < 0.15)
# "Poa trialis" fails (2/13 = 0.154 > 0.15), safer to leave unmatched
```

Names that fail fuzzy matching get `match_type = "none"`. An unmatched name is always better than a wrong match, because we can review unmatched names manually. A wrong match is silent and propagates into downstream analyses.
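The edit counts for the three Poa names can be verified directly in base R. The sketch below is a hand check, not a call into taxify: it pairs each input with its intended target, uses `utils::adist()` (plain Levenshtein, which equals Damerau-Levenshtein here because no transpositions are involved), and normalizes by the longer string as defined earlier. It assumes the comparison is "less than or equal to the threshold".

```r
# Misspelled inputs and the targets they are meant to reach.
input  <- c("Poa anua", "Poa pratenss", "Poa trialis")
target <- c("Poa annua", "Poa pratensis", "Poa trivialis")

# adist() returns the full input-by-target matrix; the diagonal holds
# the distance of each input to its own intended target.
edits <- diag(utils::adist(input, target))
norm  <- edits / pmax(nchar(input), nchar(target))

poa_check <- data.frame(
  input, target, edits,
  norm       = round(norm, 3),
  pass_tight = norm <= 0.15,
  pass_loose = norm <= 0.4
)
poa_check
```

All three pass at 0.4, while only the two single-edit names pass at 0.15. What the check cannot show is the real danger of the loose setting: the other Poa epithets in a full backbone that also fall within 4 or 5 edits.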

This example also illustrates why short names in large genera are the hardest case for fuzzy matching. The genus *Poa* has over 500 accepted species in WFO, many with epithets of similar length and letter composition (*pratensis* vs *palustris*). The shorter the name, the fewer edits it takes to reach the threshold, and the more candidate species fall within range. For genera like Carex (2,000+ species), Astragalus (3,000+), or Euphorbia (2,000+), the same problem applies. When working with species-dense genera, tightening the threshold to 0.1-0.15 is almost always the right move.

## Worked example 4: comparing all three methods

The same input list can produce different results depending on which distance method is used. The differences are most visible when the errors include transpositions.

```{r compare-methods}
test_names <- c(
  "Qurecus robur",          # transposition in genus
  "Achillea milefolium",    # deletion (l dropped)
  "Plantago lanceoalata",   # insertion in epithet (extra a)
  "Betula pednula",         # transposition in epithet
  "Fagus sylvatcia"         # transposition in epithet
)

dl_result  <- taxify(test_names, fuzzy_method = "dl")
lev_result <- taxify(test_names, fuzzy_method = "levenshtein")
jw_result  <- taxify(test_names, fuzzy_method = "jw")

# Compare fuzzy_dist across methods
comparison <- data.frame(
  input     = test_names,
  dl_dist   = dl_result$fuzzy_dist,
  lev_dist  = lev_result$fuzzy_dist,
  jw_dist   = jw_result$fuzzy_dist,
  dl_match  = dl_result$match_type,
  lev_match = lev_result$match_type,
  jw_match  = jw_result$match_type
)
comparison
```

For a transposition like *Qurecus* to *Quercus*, Damerau-Levenshtein reports 1 edit (distance ~0.08 on a 13-character name). Levenshtein reports 2 edits (distance ~0.15). Both fall within the default 0.2 threshold, so both methods match it, but the Levenshtein distance is exactly double.

For deletions and insertions like *milefolium* to *millefolium*, both methods report the same distance (1 edit), because no transposition is involved.
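Jaro-Winkler, the third method in the comparison, does not count edits, so it helps to see how its score is assembled. The sketch below is a textbook implementation written for this vignette, under stated assumptions: `jw_distance()` is a hypothetical helper, the prefix scaling factor of 0.1 and the 4-character prefix cap are the conventional defaults, and the C implementation that taxify calls may differ in such details.

```r
# Hypothetical helper: textbook Jaro-Winkler distance (1 - similarity).
jw_distance <- function(a, b, p = 0.1) {
  x <- strsplit(a, "", fixed = TRUE)[[1]]
  y <- strsplit(b, "", fixed = TRUE)[[1]]
  n <- length(x)
  m <- length(y)
  if (n == 0L || m == 0L) return(as.numeric(n != m))
  # Characters count as matching only within this many positions of each other.
  window <- max(0L, max(n, m) %/% 2L - 1L)
  x_hit <- logical(n)
  y_hit <- logical(m)
  for (i in seq_len(n)) {
    lo <- max(1L, i - window)
    hi <- min(m, i + window)
    if (lo > hi) next
    for (j in lo:hi) {
      if (!y_hit[j] && x[i] == y[j]) {
        x_hit[i] <- TRUE
        y_hit[j] <- TRUE
        break
      }
    }
  }
  matches <- sum(x_hit)
  if (matches == 0L) return(1)
  # Matched characters that line up in a different order are half-transpositions.
  half_t <- sum(x[x_hit] != y[y_hit])
  jaro <- (matches / n + matches / m + (matches - half_t / 2) / matches) / 3
  # Winkler bonus: reward a common prefix of up to 4 characters.
  k <- min(4L, n, m)
  same <- x[seq_len(k)] == y[seq_len(k)]
  prefix <- if (all(same)) k else which(!same)[1] - 1L
  1 - (jaro + prefix * p * (1 - jaro))
}

jw_distance("Qurecus", "Quercus")  # about 0.038
jw_distance("Quercus", "Quercus")  # 0
```

The transposed genus scores about 0.038 here, far below the 1/7 = 0.14 that a single edit costs on the same 7-character string under a normalized edit distance. That gap is why a numeric threshold cannot be carried over unchanged between methods.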

Jaro-Winkler distances tend to be smaller overall because the algorithm rewards matching prefixes. A name that shares its entire genus prefix with the candidate starts with a high base similarity. The practical consequence is that Jaro-Winkler is more permissive at the same numeric threshold. A threshold of 0.2 under Jaro-Winkler is quite loose; 0.1 is a more comparable starting point.

The table below shows approximate distances for typical errors across all three methods. The first three rows assume full names of 13-20 characters; the last row compares the 8-character genus alone, to show how much more a single transposition weighs on a short string:

| Error type                         | Example                  | DL dist | Lev dist | JW dist |
|:-----------------------------------|:-------------------------|--------:|---------:|--------:|
| Single transposition               | Qurecus → Quercus        |    0.08 |     0.15 |    0.04 |
| Single deletion                    | millefolum → millefolium |    0.05 |     0.05 |    0.03 |
| Single substitution                | Quarcus → Quercus        |    0.08 |     0.08 |    0.05 |
| Single transposition, short string | Plantgao → Plantago      |    0.13 |     0.25 |    0.06 |

The Levenshtein column is always equal to or larger than the Damerau-Levenshtein column, because Levenshtein charges double for transpositions. Jaro-Winkler is consistently the smallest, because the shared prefix dominates the similarity calculation. These numbers explain why the same threshold value behaves differently across methods and why we need to recalibrate when switching.

## The `fuzzy_dist` column

Every row in the taxify output has a `fuzzy_dist` column. For exact matches (including case-insensitive and Latin normalization), this is `NA`. For fuzzy matches, it contains the normalized distance: a number between 0 and 1 where lower means closer.

This column is the primary tool for quality control after fuzzy matching. A simple filter separates high-confidence matches from questionable ones.

```{r fuzzy-dist-filter}
result <- taxify(my_species_list)

# High-confidence fuzzy matches (likely just typos)
good_fuzzy <- result[result$match_type == "fuzzy" & result$fuzzy_dist < 0.1, ]

# Questionable fuzzy matches (review manually)
check_fuzzy <- result[result$match_type == "fuzzy" & result$fuzzy_dist >= 0.1, ]
```

A `fuzzy_dist` below 0.1 on a name of 10-20 characters means a single edit. These are almost always correct. A `fuzzy_dist` between 0.1 and 0.2 means anywhere from 1 edit on a short name to 4 edits on a 20-character one, and warrants a glance. Anything above 0.15 on a short name (under 10 characters) deserves scrutiny.

For systematic review, sorting by `fuzzy_dist` in descending order puts the most suspect matches at the top.

```{r sort-by-dist}
fuzzy_rows <- result[result$match_type == "fuzzy", ]
fuzzy_rows <- fuzzy_rows[order(-fuzzy_rows$fuzzy_dist), ]
head(fuzzy_rows[, c("input_name", "accepted_name", "fuzzy_dist")], 20)
```

In practice, most datasets have a right-skewed distribution of `fuzzy_dist`: a peak near 0.05-0.08 (single typos on medium-length names) and a sparse tail above 0.12 (multiple errors or short names with one error). The tail is where false matches hide.

A useful rule of thumb: if more than 5% of fuzzy matches have `fuzzy_dist` above 0.15, the threshold is probably too loose for the dataset. Either tighten it, or keep the current threshold but flag all matches above 0.12 for manual review. The cost of reviewing a few dozen names is small compared to the cost of propagating a wrong species identity through a trait analysis or distribution model.

## Genus-blocked matching and misspelled genera

By default, fuzzy matching is *genus-blocked*: taxify extracts the genus from the input name and only compares against backbone entries with the same genus. This is fast (it avoids comparing every input against millions of candidates) and reduces false matches (a misspelled epithet cannot accidentally match a completely different genus).
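The idea of blocking on the genus can be sketched in a few lines of base R. This is an illustration under stated assumptions, not taxify's implementation: `match_in_genus()` is a hypothetical helper, the five-name backbone is a toy, and `utils::adist()` (plain Levenshtein) stands in for the package's distance function.

```r
# Toy backbone; a real one has millions of rows.
backbone <- c("Quercus robur", "Quercus rubra", "Quercus petraea",
              "Poa annua", "Poa pratensis")

# Hypothetical helper: closest candidate within the input's genus, or NA.
match_in_genus <- function(name, backbone, threshold = 0.2) {
  genus_of <- function(x) sub(" .*$", "", x)
  # Blocking step: keep only candidates that share the input's genus.
  block <- backbone[genus_of(backbone) == genus_of(name)]
  if (length(block) == 0L) return(NA_character_)
  d <- as.vector(utils::adist(name, block)) / pmax(nchar(name), nchar(block))
  if (min(d) <= threshold) block[which.min(d)] else NA_character_
}

match_in_genus("Quercus petrea", backbone)  # "Quercus petraea"
match_in_genus("Qurecus robur", backbone)   # NA: no entry has genus "Qurecus"
```

The first call compares against three candidates instead of five; on a full backbone the saving is several orders of magnitude. The second call returns `NA` because the block for the misspelled genus is empty.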
The downside is that a misspelled genus produces no match at all, because no backbone entries share the misspelled genus. taxify handles this with a second pass: after genus-blocked fuzzy matching, any names still unmatched are run through a *prefix-blocked* fuzzy join. This pass blocks on the first two characters of the name rather than the full genus. Most genus typos preserve the first two characters (*Qurecus* still starts with *Qu*, *Betual* still starts with *Be*), so the prefix block catches them while still pruning the search space substantially.

This two-pass strategy means that genus typos are handled automatically. There is nothing to configure. The only case it misses is a typo in the first two characters of the genus, which is rare enough in practice that we accept the trade-off.

One consequence worth knowing: a name with a misspelled genus will have a higher `fuzzy_dist` than a name with only an epithet typo, because the genus error adds edits on top of any epithet error. If the input is *Qurecus robru* (two errors: genus transposition + epithet transposition), the total edit count is 2, giving a normalized distance of `2 / 13 = 0.154`. This still falls within the default threshold, but it lands in the zone where manual review is advisable.

## Practical guidance

### When to disable fuzzy matching

For curated checklists, validated databases, or any input that has already been through a name-resolution service, fuzzy matching adds risk without benefit. Disable it.

```{r disable-fuzzy}
result <- taxify(curated_list, fuzzy = FALSE)
```

This also makes the call faster, because the fuzzy join step is skipped entirely. On a list of 100,000 names, the difference can be several seconds.

### When to tighten the threshold

Tighten below the default 0.2 when the input names are short (many 2-word names under 12 characters), when the genera are species-rich (Carex, Poa, Astragalus, Euphorbia), or when false matches would be costly (conservation assessments, regulatory lists). A threshold of 0.1 is a good conservative choice. It still catches single-character typos on names of 10+ characters but rejects any match that needs 2 or more edits unless the name is at least 20 characters long.

```{r tight-threshold}
result <- taxify(short_grass_list, fuzzy_threshold = 0.1)
```

### When to loosen the threshold

Loosen above the default 0.2 when the input comes from OCR on degraded documents, when names have been transliterated across character encodings, or when completeness matters more than precision (an initial screening pass where unmatched names are expensive to follow up). A threshold of 0.25-0.3 is reasonable for OCR data; going above 0.3 is rarely justified.

```{r loose-threshold}
result <- taxify(ocr_names, fuzzy_threshold = 0.25)

# Then filter questionable matches. which() drops the rows where
# fuzzy_dist is NA (exact matches and unmatched names), which would
# otherwise come back as all-NA rows.
suspect <- result[which(result$fuzzy_dist > 0.15), ]
```

### When to switch methods

Stick with Damerau-Levenshtein (`"dl"`) unless there is a specific reason to change. Switch to Levenshtein (`"levenshtein"`) for controlled data where transpositions are unlikely and we want the stricter distance. Switch to Jaro-Winkler (`"jw"`) for very short names (3-6 characters, e.g., matching at genus level) where the prefix weighting helps, but remember to lower the threshold to 0.1 or below.

### Using integer thresholds for uniform error budgets

When the error model is known (e.g., "our OCR pipeline drops or adds at most 1 character"), integer thresholds give direct control. `fuzzy_threshold = 1L` means exactly what it says: at most 1 edit, on any name of any length. This avoids the length-dependent behavior of fractional thresholds where a 5-character name gets 1 edit but a 25-character name gets 5 edits.

```{r integer-threshold-uniform}
# Uniform 2-edit budget, regardless of name length
result <- taxify(my_names, fuzzy_threshold = 2L)
```

Integer thresholds are not available for Jaro-Winkler. Passing an integer with `fuzzy_method = "jw"` raises an error.

### A two-pass workflow for messy data

For datasets with unknown error rates (historical collections, aggregated multi-source lists), a two-pass approach gives the best of both worlds. First pass: run with a tight threshold and `fuzzy = TRUE` to get high-confidence matches. Second pass: extract the unmatched names, run them again with a looser threshold, and review the additional fuzzy matches manually.

```{r two-pass}
# Pass 1: conservative
pass1 <- taxify(my_names, fuzzy_threshold = 0.1)
unmatched <- pass1$input_name[pass1$match_type == "none"]

# Pass 2: permissive, for manual review
pass2 <- taxify(unmatched, fuzzy_threshold = 0.25)
needs_review <- pass2[pass2$match_type == "fuzzy", ]
needs_review[, c("input_name", "accepted_name", "fuzzy_dist")]
```

This avoids the all-or-nothing choice between tight and loose thresholds. The bulk of the data gets matched at high confidence, and only the residual names get the looser treatment with explicit human oversight.

## Summary of output columns related to fuzzy matching

| Column       | Values | Meaning |
|:-------------|:-------|:--------|
| `match_type` | `"exact"`, `"exact_ci"`, `"fuzzy"`, `"none"`, `"out_of_scope"` | How the name was matched. `"exact"` is case-sensitive, `"exact_ci"` includes case-insensitive and Latin normalization matches. |
| `fuzzy_dist` | Numeric (0-1) or `NA` | Normalized string distance for fuzzy matches. `NA` for exact matches and unmatched names. |
| `backend`    | `"wfo"`, `"col"`, `"gbif"`, etc. | Which backbone provided the match. Useful in multi-backend fallback chains. |

The `match_type` and `fuzzy_dist` columns together give a complete picture of match quality.
Exact matches are definitive, and fuzzy matches with a distance of about 0.05 or less are near-certain corrections of minor typos. As distance climbs toward 0.15 and above, manual review becomes worthwhile because the matched name may belong to a different species entirely.
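A short triage pass over these two columns might look like the sketch below. It runs on a mock data frame so that it is self-contained: the rows and their `fuzzy_dist` values are invented for illustration, the `review` column and its labels are hypothetical, and the 0.1 and 0.15 cut-offs are simply the rules of thumb used throughout this vignette.

```r
# Mock output with the columns described above; values are invented.
result <- data.frame(
  input_name = c("Quercus robur", "Qurecus robur", "Poa anua",
                 "Poa trialis", "Xyz abc"),
  match_type = c("exact", "fuzzy", "fuzzy", "fuzzy", "none"),
  fuzzy_dist = c(NA, 0.077, 0.111, 0.154, NA)
)

# Hypothetical triage labels. Rows that are not fuzzy matches are
# settled by the outer test, so their NA distances never reach the output.
result$review <- with(result, ifelse(
  match_type != "fuzzy", "no",
  ifelse(fuzzy_dist < 0.1, "no",
         ifelse(fuzzy_dist < 0.15, "glance", "check"))
))

# Share of fuzzy matches above 0.15, for the 5% rule of thumb.
fuzzy_rows <- result[result$match_type == "fuzzy", ]
share_suspect <- mean(fuzzy_rows$fuzzy_dist > 0.15)

result[, c("input_name", "match_type", "fuzzy_dist", "review")]
share_suspect
```

On real output the same few lines give a quick read on whether the chosen threshold suits the dataset before any matched names are passed downstream.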