| Title: | Spatial and Taxonomic Data Utilities for Specimen Datasets |
|---|---|
| Description: | A management tool for specimen data ranging from public museum collections to private specimen repositories. The main types of data addressed are spatial (coordinates, longitude and latitude) and taxonomic data (ranking and nomenclature validity) with some additional options for user-determined dataset refinement. Combined or individual calls to the online repositories of the Global Biodiversity Information Facility (GBIF) via 'rgbif' and the Integrated Taxonomic Information System (ITIS) via 'taxize' enable built-in taxonomic checks. |
| Authors: | Bryson Y. Torgovitsky [aut, cre], Cheryl L. Ames [aut] |
| Maintainer: | Bryson Y. Torgovitsky <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.1.0 |
| Built: | 2026-05-12 21:35:02 UTC |
| Source: | https://github.com/cran/datamuseum |
Biological Information System for Marine Life (BISMAL) Octopodoidea
occurrence records filtered to the Japan bounding box (latitude 25–50,
longitude 125–150) and standardized to the common column set shared
across all datamuseum Japan data sets.
SciName is constructed by combining Genus and
specificEpithet as no combined name field is present in the source.
Rows with NA in Source, Family, Genus,
specificEpithet, or Year are removed.
BISMAL_JapanBISMAL_Japan
A data frame with 473 rows and 12 variables:
Scientific name constructed from Genus and
specificEpithet. Trailing NA strings removed.
Genus name.
Family name.
Year of occurrence record, from year.
Decimal latitude, filtered to [25, 50].
Decimal longitude, filtered to [125, 150].
Country name, from country.
State or province, from stateProvince.
Locality description, from locality.
Institution code, from institutionCode.
Museum lot identification code. Used for duplicate
detection in museum via deduplicate.
Specimen count per lot. Used for row expansion
in museum via duplicate.
Derived from the raw BISMAL occurrence download. Full source CSVs (raw, trimmed, and Japan-filtered) are available at https://github.com/btorgovitsky00/datamuseum.
Biological Information System for Marine Life (BISMAL). Downloaded 30 March 2026. https://www.godac.jamstec.go.jp/bismal/e/
The raw and trimmed intermediate versions of this data set are available as CSV files at https://github.com/btorgovitsky00/datamuseum.
museum for the combined data set including these records.
Removes duplicate rows from a data frame based on a specified ID column,
retaining the most complete row (fewest NA values) per ID group.
A record of all duplicate groups is attached to the result as an attribute.
deduplicate(data, id_col, drop_na = FALSE)deduplicate(data, id_col, drop_na = FALSE)
data |
A data frame. |
id_col |
Column name of the ID column to use for duplicate detection,
supplied either unquoted ( |
drop_na |
Logical. If |
Row completeness is computed as the count of non-NA values across
all columns using rowSums(!is.na(data)). When multiple rows tie on
completeness, which.max() retains the first occurrence.
Progress messages are printed to the console reporting the number of
NA ID rows removed (if drop_na = TRUE) and the total number
of duplicate rows removed.
A data frame with one row retained per unique value of id_col,
chosen by maximum row completeness (fewest NAs across all
columns). The original duplicate groups are accessible via
attr(result, "duplicates"), a data frame containing all rows that
were part of a duplicate group, with an additional logical column
.kept_row indicating which row was retained.
duplicate for the inverse operation of expanding rows by a
count column,
duplicated for simple duplicate detection,
distinct for dropping exact duplicate rows.
df <- data.frame( id = c(1, 2, 2, 3, 3), name = c("A", "B", NA, "C", "C"), score = c(90, 85, 85, 78, 78) ) # Retain the most complete row per ID deduplicate(df, id_col = id) # Inspect which rows were flagged as duplicates result <- deduplicate(df, id_col = id) attr(result, "duplicates") # Drop rows where the ID itself is NA before deduplication df_na <- data.frame( id = c(1, NA, 2, 2), value = c("a", "b", "c", "d") ) deduplicate(df_na, id_col = id, drop_na = TRUE)df <- data.frame( id = c(1, 2, 2, 3, 3), name = c("A", "B", NA, "C", "C"), score = c(90, 85, 85, 78, 78) ) # Retain the most complete row per ID deduplicate(df, id_col = id) # Inspect which rows were flagged as duplicates result <- deduplicate(df, id_col = id) attr(result, "duplicates") # Drop rows where the ID itself is NA before deduplication df_na <- data.frame( id = c(1, NA, 2, 2), value = c("a", "b", "c", "d") ) deduplicate(df_na, id_col = id, drop_na = TRUE)
Expands a data frame by repeating each row X number of times, specified by a count column. Useful for reconstructing individual-level data from aggregated or frequency-weighted data frames.
duplicate(data, n_col, drop_na = FALSE)duplicate(data, n_col, drop_na = FALSE)
data |
A data frame. |
n_col |
Column name of the column containing duplication counts,
supplied either unquoted ( |
drop_na |
Logical. If |
Expansion is performed via rep(seq_len(nrow(data)), times = n_col),
so the original row order is preserved within each group of duplicates.
NA counts are replaced with 1 prior to expansion when
drop_na = FALSE.
A console message reports the final row count of the expanded data frame.
A data frame with more rows than data, where each row i
appears n_col[i] times (or once if n_col[i] is NA
and drop_na = FALSE). Row names are not reset. The n_col
column is retained in the output.
deduplicate for the inverse operation,
rep for the underlying row repetition mechanism.
df <- data.frame( group = c("A", "B", "C"), value = c(10, 20, 30), n = c(3, 1, 2) ) # Expand so each row repeats n times duplicate(df, n_col = n) # NA counts default to 1 repetition df_na <- data.frame( group = c("A", "B", "C"), n = c(2, NA, 3) ) duplicate(df_na, n_col = n) # Drop rows with NA counts instead duplicate(df_na, n_col = n, drop_na = TRUE)df <- data.frame( group = c("A", "B", "C"), value = c(10, 20, 30), n = c(3, 1, 2) ) # Expand so each row repeats n times duplicate(df, n_col = n) # NA counts default to 1 repetition df_na <- data.frame( group = c("A", "B", "C"), n = c(2, NA, 3) ) duplicate(df_na, n_col = n) # Drop rows with NA counts instead duplicate(df_na, n_col = n, drop_na = TRUE)
Global Biodiversity Information Facility (GBIF) Octopodoidea occurrence
records filtered to the Japan bounding box (latitude 25–50,
longitude 125–150) and standardized to the common column set shared
across all datamuseum Japan data sets.
Rows with NA in Source, Family, Genus,
SciName,or Year are removed.
GBIF_JapanGBIF_Japan
A data frame with 798 rows and 12 variables:
Scientific name as recorded in GBIF, taken directly
from the species field. Trailing NA strings removed.
Genus name.
Family name.
Year of occurrence record, from year.
Decimal latitude, filtered to [25, 50].
Decimal longitude, filtered to [125, 150].
Country code, from countryCode.
State or province, from stateProvince.
Locality description, from locality.
Institution code, from institutionCode.
Museum lot identification code. Used for duplicate
detection in museum via deduplicate.
Specimen count per lot. Used for row expansion
in museum via duplicate.
The raw and trimmed intermediate versions of this data set are available as CSV files in the package data repository. Note that those files contain non-ASCII characters in locality and collector name fields, reflecting the international scope of GBIF occurrence records.
Derived from the raw GBIF occurrence download. Full source CSVs (raw, trimmed, and Japan-filtered) are available at https://github.com/btorgovitsky00/datamuseum.
Global Biodiversity Information Facility (GBIF). GBIF.org (30 March 2026) GBIF Occurrence Download. https://www.gbif.org doi:10.15468/dl.2379hj
The raw and trimmed intermediate versions of this data set are available as CSV files at https://github.com/btorgovitsky00/datamuseum.
museum for the combined data set including these records.
Invert-E-Base (InvBase) Octopodoidea occurrence records filtered
to the Japan bounding box (latitude 25–50, longitude 125–150) and
standardized to the common column set shared across all datamuseum
Japan data sets.
SciName is constructed by combining Genus and
specificEpithet as no combined name field is present in the source.
Rows with NA in Source, Family, Genus,
specificEpithet, or Year are removed.
InvBase_JapanInvBase_Japan
A data frame with 43 rows and 12 variables:
Scientific name constructed from Genus and
specificEpithet. Trailing NA strings removed.
Genus name.
Family name.
Year of occurrence record, from year.
Decimal latitude, filtered to [25, 50].
Decimal longitude, filtered to [125, 150].
Country name, from country.
State or province, from stateProvince.
County field, from county.
Institution code, from institutionCode.
Museum lot identification code. Used for duplicate
detection in museum via deduplicate.
Specimen count per lot. Used for row expansion
in museum via duplicate.
Derived from the raw InvBase occurrence download. Full source CSVs (raw, trimmed, and Japan-filtered) are available at https://github.com/btorgovitsky00/datamuseum.
Invert-E-Base. Downloaded 30 March 2026. https://invertebase.org
The raw and trimmed intermediate versions of this data set are available as CSV files at https://github.com/btorgovitsky00/datamuseum.
museum for the combined data set including these records.
Converts taxonomic names in one or more columns to plotmath italic
expressions suitable for use in ggplot2 axis labels or legends
via ggplot2::label_parsed. A new column named
<column>_italic is appended to the data frame for each
input column.
italicize(data, columns, drop_na = FALSE)italicize(data, columns, drop_na = FALSE)
data |
A data frame. |
columns |
Column name or |
drop_na |
Logical. If |
The _italic columns are intended to be mapped to a ggplot2
aesthetic (e.g. aes(x = Species_italic)) and rendered as parsed
expressions by passing label_parsed to the
labels argument of the corresponding scale. This keeps names as
plain character data until the plot is rendered, avoiding manual
expression() calls.
Authorship strings appended by taxon_cite in the format
"Genus species (Author, Year)" are automatically detected and
rendered in roman type alongside the italic canonical name, producing
expressions of the form italic("Genus species")~"(Author, Year)".
Spaces in names are replaced with ~ prior to wrapping in
italic(), which is required for plotmath to render multi-word
names (e.g. genus + species) correctly.
The input data frame with one additional character column appended per
entry in columns, named <column>_italic. Each new column
contains a plotmath expression of the form "italic(Genus~species)",
where spaces are replaced with ~ to preserve word spacing when
rendered. NA values in the source column remain NA in the
output column when drop_na = FALSE.
taxon_cleaner for standardising taxonomic name formatting
before italicising,
taxon_combine for merging genus and epithet into a binomial
name before italicising,
taxon_validate for validating taxonomic names before
italicising,
taxon_spellcheck for correcting misspellings before
italicising,
taxon_cite for appending authorship in the format detected
and rendered by this function,
label_parsed for rendering plotmath expressions in
ggplot2 scales.
df <- data.frame( SciName = c("Homo sapiens", "Panthera leo", "Canis lupus"), count = c(120, 45, 78) ) # Italicize a single column df <- italicize(df, SciName) df$SciName_italic # Use in a ggplot2 axis with parsed labels if (requireNamespace("ggplot2", quietly = TRUE)) { ggplot2::ggplot(df, ggplot2::aes(x = SciName_italic, y = count)) + ggplot2::geom_col() + ggplot2::scale_x_discrete(labels = ggplot2::label_parsed) } # Italicize multiple columns at once df2 <- data.frame( genus = c("Homo", "Panthera"), species = c("sapiens", "leo") ) italicize(df2, c(genus, species)) # Drop rows where the name column is NA df_na <- data.frame( SciName = c("Homo sapiens", NA, "Canis lupus"), count = c(10, 5, 8) ) italicize(df_na, SciName, drop_na = TRUE)df <- data.frame( SciName = c("Homo sapiens", "Panthera leo", "Canis lupus"), count = c(120, 45, 78) ) # Italicize a single column df <- italicize(df, SciName) df$SciName_italic # Use in a ggplot2 axis with parsed labels if (requireNamespace("ggplot2", quietly = TRUE)) { ggplot2::ggplot(df, ggplot2::aes(x = SciName_italic, y = count)) + ggplot2::geom_col() + ggplot2::scale_x_discrete(labels = ggplot2::label_parsed) } # Italicize multiple columns at once df2 <- data.frame( genus = c("Homo", "Panthera"), species = c("sapiens", "leo") ) italicize(df2, c(genus, species)) # Drop rows where the name column is NA df_na <- data.frame( SciName = c("Homo sapiens", NA, "Canis lupus"), count = c(10, 5, 8) ) italicize(df_na, SciName, drop_na = TRUE)
Detects latitude, longitude, and combined coordinate columns based on
value ranges and format matching. Supports decimal degree, DMS
(DDdeg MM'SS''}), and base-60 (\code{DDdeg MM') coordinate formats.
Returns a named list with elements combined, latitude, and
longitude, each containing the names and indices of detected columns.
latlong_column(data, sep = ",")latlong_column(data, sep = ",")
data |
A data frame to search for coordinate columns. |
sep |
Character. Separator used to split combined coordinate columns
(i.e. columns containing both latitude and longitude as a single string).
Default is |
Detection follows this priority order: combined columns are identified first, then latitude, then longitude. A column can only appear in one category — latitude candidates are excluded from longitude, and combined candidates are excluded from both.
The three recognised coordinate formats are:
Decimal degrees: "-12.345" or "51.5"
DMS: "12°34'56''N"
Base-60: "12°34'N"
Detection is based on value content, not column names, so columns with
non-standard names (e.g. "point_y") will still be detected provided
their values match a recognized coordinate format and pass the range check.
At least one valid coordinate value is required for a column to be
detected.
A named list with three elements:
combinedColumns containing both latitude and longitude as
a single delimited string (e.g. "51.5,-0.1").
latitudeColumns containing latitude values only. Numeric
range is constrained to [-90, 90].
longitudeColumns containing longitude values only. Numeric
range is constrained to [-180, 180]. Columns already assigned to
combined or latitude are excluded.
Each element is a named list mapping column name to column index in
data. An empty named list (list()) is returned for any
coordinate type not detected.
latlong_format for checking the coordinate format of detected
columns,
latlong_filter for removing invalid coordinates from detected
columns.
st_as_sf for converting detected coordinate columns to
an sf spatial object,
geocode for adding coordinates to a data frame
from address strings.
df <- data.frame( id = 1:6, lat = c(51.5, 48.8, 40.7, 35.6, -33.9, 55.8), lon = c(-0.1, 2.3, -74.0, 139.7, 151.2, 37.6) ) # Detect separate latitude and longitude columns latlong_column(df) # Access detected column names and indices result <- latlong_column(df) result$latitude result$longitude # Detect a combined coordinate column df2 <- data.frame( id = 1:6, coords = c("51.5,-0.1", "48.8,2.3", "40.7,-74.0", "35.6,139.7", "-33.9,151.2", "55.8,37.6") ) latlong_column(df2) # Use a custom separator for combined columns df3 <- data.frame( coords = c("51.5;-0.1", "48.8;2.3", "40.7;-74.0", "35.6;139.7", "-33.9;151.2", "55.8;37.6") ) latlong_column(df3, sep = ";")df <- data.frame( id = 1:6, lat = c(51.5, 48.8, 40.7, 35.6, -33.9, 55.8), lon = c(-0.1, 2.3, -74.0, 139.7, 151.2, 37.6) ) # Detect separate latitude and longitude columns latlong_column(df) # Access detected column names and indices result <- latlong_column(df) result$latitude result$longitude # Detect a combined coordinate column df2 <- data.frame( id = 1:6, coords = c("51.5,-0.1", "48.8,2.3", "40.7,-74.0", "35.6,139.7", "-33.9,151.2", "55.8,37.6") ) latlong_column(df2) # Use a custom separator for combined columns df3 <- data.frame( coords = c("51.5;-0.1", "48.8;2.3", "40.7;-74.0", "35.6;139.7", "-33.9;151.2", "55.8;37.6") ) latlong_column(df3, sep = ";")
Merges separate latitude and longitude columns into a single combined
coordinate column, appended to the data frame. The inverse of this
operation is latlong_split. Note that combined columns are
not accepted by the functions latlong_range and
latlong_region; use latlong_split to separate
columns again before filtering.
latlong_combine( data, latitude, longitude, new_column = "latlong", sep = ", ", drop_na = FALSE )latlong_combine( data, latitude, longitude, new_column = "latlong", sep = ", ", drop_na = FALSE )
data |
A data frame containing coordinate columns. |
latitude |
Column name of the latitude column, supplied either
unquoted ( |
longitude |
Column name of the longitude column, supplied either
unquoted ( |
new_column |
Column name for the new combined column, supplied either
unquoted ( |
sep |
Character. Separator inserted between latitude and longitude
values in the combined column. Default is |
drop_na |
Logical. If |
Both coordinate columns are coerced to character before concatenation via
paste0(), so numeric, integer, and character coordinate columns are
all accepted. No validation of coordinate ranges or formats is performed;
use latlong_format to check column formats before combining.
A console message reports the number of rows removed when
drop_na = TRUE.
The input data frame with one additional character column appended, named
according to new_column, containing values of the form
"<latitude><sep><longitude>". The original latitude and longitude
columns are retained. If drop_na = FALSE, rows with NA in
either coordinate column will produce "NA<sep>NA" or
"<value><sep>NA" in the combined column.
latlong_split for splitting a combined coordinate column back
into separate latitude and longitude columns,
latlong_format for checking coordinate formats before
combining.
df <- data.frame( id = 1:4, lat = c(51.5, 48.8, 40.7, 35.6), lon = c(-0.1, 2.3, -74.0, 139.7) ) # Combine with default separator and column name latlong_combine(df, latitude = lat, longitude = lon) # Use a custom separator and column name latlong_combine(df, latitude = lat, longitude = lon, new_column = coords, sep = ";") # Drop rows where either coordinate is NA df_na <- data.frame( lat = c(51.5, NA, 40.7), lon = c(-0.1, 2.3, NA) ) latlong_combine(df_na, latitude = lat, longitude = lon, drop_na = TRUE)df <- data.frame( id = 1:4, lat = c(51.5, 48.8, 40.7, 35.6), lon = c(-0.1, 2.3, -74.0, 139.7) ) # Combine with default separator and column name latlong_combine(df, latitude = lat, longitude = lon) # Use a custom separator and column name latlong_combine(df, latitude = lat, longitude = lon, new_column = coords, sep = ";") # Drop rows where either coordinate is NA df_na <- data.frame( lat = c(51.5, NA, 40.7), lon = c(-0.1, 2.3, NA) ) latlong_combine(df_na, latitude = lat, longitude = lon, drop_na = TRUE)
Converts coordinate columns between decimal degrees, DMS
(DDdeg MM'SS''}), and base-60 (\code{DDdeg MM') formats. Handles separate
latitude and longitude columns as well as combined coordinate columns.
Columns are converted in place; combined columns may optionally be split
into separate latitude and longitude columns.
latlong_convert( data, column, convert = "decimal", type = "auto", drop_na = FALSE, split_combined = FALSE )latlong_convert( data, column, convert = "decimal", type = "auto", drop_na = FALSE, split_combined = FALSE )
data |
A data frame containing coordinate columns. |
column |
Column name or |
convert |
Character. Target coordinate format. One of
|
type |
Character. Coordinate type used for direction suffix assignment
in DMS and base-60 output. One of |
drop_na |
Logical. If |
split_combined |
Logical. If |
All input formats are first parsed to decimal degrees internally before
conversion. The parser handles decimal, DMS, and base-60 formats,
inferring sign from cardinal direction suffixes (S, W) or
the sign of the degree value. Zero-width and BOM characters are stripped
before parsing.
Combined columns are detected automatically by the presence of two or more
numeric components in a single value. The separator used to split combined
columns is assumed to be ","; use latlong_split if a
custom separator is needed before converting.
Direction suffixes (N, S, E, W) are appended
to DMS and base-60 output when type can be determined. If
type = "auto" and the type cannot be inferred from the column name
or value range, no suffix is added.
Use latlong_format to check input formats before conversion,
and latlong_filter to remove out-of-range values beforehand.
The input data frame with converted coordinate columns. For non-combined
columns, the original column is overwritten with the converted values.
For combined columns, behavior depends on split_combined:
If FALSE, the combined column is overwritten with converted
values in the same delimited format.
If TRUE, two new columns are appended — <column>_lat
and <column>_lon — and the original column is retained.
When convert = "decimal", output columns are numeric. For
"dms" and "base60", output columns are character.
latlong_format for checking coordinate formats before
conversion,
latlong_filter for removing out-of-range coordinates before
conversion,
latlong_split for splitting combined columns with a custom
separator prior to conversion,
latlong_combine for merging separate coordinate columns after
conversion,
latlong_range for filtering to a bounding box after
converting to decimal degrees,
latlong_region for filtering to named geographic regions
after converting to decimal degrees.
df <- data.frame( id = 1:4, lat = c(51.5, 48.8, -33.9, 40.7), lon = c(-0.1, 2.3, 151.2, -74.0) ) # Convert decimal columns to DMS latlong_convert(df, c(lat, lon), convert = "dms") # Convert to base-60 latlong_convert(df, c(lat, lon), convert = "base60") # Convert a single column, specifying coordinate type explicitly latlong_convert(df, lat, convert = "dms", type = "lat") # Convert a combined column and keep as single column df_combined <- data.frame( coords = c("51.5,-0.1", "48.8,2.3", "-33.9,151.2") ) latlong_convert(df_combined, coords, convert = "dms") # Convert a combined column and split into separate lat/lon columns latlong_convert(df_combined, coords, convert = "decimal", split_combined = TRUE) # Drop rows that produce NA after conversion df_na <- data.frame( lat = c(51.5, NA, 40.7), lon = c(-0.1, 2.3, -74.0) ) latlong_convert(df_na, c(lat, lon), convert = "dms", drop_na = TRUE)df <- data.frame( id = 1:4, lat = c(51.5, 48.8, -33.9, 40.7), lon = c(-0.1, 2.3, 151.2, -74.0) ) # Convert decimal columns to DMS latlong_convert(df, c(lat, lon), convert = "dms") # Convert to base-60 latlong_convert(df, c(lat, lon), convert = "base60") # Convert a single column, specifying coordinate type explicitly latlong_convert(df, lat, convert = "dms", type = "lat") # Convert a combined column and keep as single column df_combined <- data.frame( coords = c("51.5,-0.1", "48.8,2.3", "-33.9,151.2") ) latlong_convert(df_combined, coords, convert = "dms") # Convert a combined column and split into separate lat/lon columns latlong_convert(df_combined, coords, convert = "decimal", split_combined = TRUE) # Drop rows that produce NA after conversion df_na <- data.frame( lat = c(51.5, NA, 40.7), lon = c(-0.1, 2.3, -74.0) ) latlong_convert(df_na, c(lat, lon), convert = "dms", drop_na = TRUE)
Removes rows where coordinates fall outside valid geographic ranges
([-90, 90] for latitude, [-180, 180] for longitude).
Accepts either separate latitude and longitude columns, or a single combined
coordinate column. Supports decimal degree, DMS (DDdeg MM'SS''}), and
base-60 (\code{DDdeg MM') coordinate formats.
latlong_filter( data, latitude = NULL, longitude = NULL, combined_col = NULL, sep = ",", drop_na = FALSE )latlong_filter( data, latitude = NULL, longitude = NULL, combined_col = NULL, sep = ",", drop_na = FALSE )
data |
A data frame containing coordinate columns. |
latitude |
Optional. Column name of the latitude column, supplied
either unquoted ( |
longitude |
Optional. Column name of the longitude column, supplied
either unquoted ( |
combined_col |
Optional. Column name of a combined coordinate column
containing latitude and longitude as a single delimited string
(e.g. |
sep |
Character. Separator used to split |
drop_na |
Logical. If |
All coordinate formats are parsed to decimal degrees internally before
range validation. The parser handles decimal, DMS, and base-60 formats,
inferring sign from cardinal direction suffixes (S, W) or
the sign of the degree value. Zero-width and BOM characters are stripped
before parsing.
Either combined_col or both latitude and longitude
must be provided; supplying neither raises an error. When drop_na =
FALSE (the default), rows with NA coordinates are still removed
as they cannot pass range validation, and are captured in
attr(result, "invalid").
Use latlong_format to check coordinate formats before
filtering, and latlong_column to identify coordinate columns
if their names are not known in advance.
A data frame containing only rows with valid coordinates, with the same
columns as data. Removed rows are attached as
attr(result, "invalid") for inspection. A console message reports
the total number of rows removed.
latlong_format for checking coordinate formats before
filtering,
latlong_column for detecting coordinate columns in a data
frame,
latlong_convert for converting DMS or base-60 columns to
decimal degrees before filtering,
latlong_range for filtering rows to a user-defined bounding
box,
latlong_region for filtering rows to named geographic regions.
df <- data.frame( id = 1:5, lat = c(51.5, 48.8, 91.0, -33.9, NA), lon = c(-0.1, 2.3, 139.7, 151.2, 37.6) ) # Filter using separate latitude and longitude columns latlong_filter(df, latitude = lat, longitude = lon) # Inspect rows that were removed result <- latlong_filter(df, latitude = lat, longitude = lon) attr(result, "invalid") # Also drop rows where either coordinate is NA latlong_filter(df, latitude = lat, longitude = lon, drop_na = TRUE) # Filter using a combined coordinate column df_combined <- data.frame( id = 1:4, coords = c("51.5,-0.1", "91.0,2.3", "-33.9,151.2", "48.8,181.0") ) latlong_filter(df_combined, combined_col = coords) # Combined column with a custom separator df_sep <- data.frame( coords = c("51.5;-0.1", "91.0;2.3", "-33.9;151.2") ) latlong_filter(df_sep, combined_col = coords, sep = ";")df <- data.frame( id = 1:5, lat = c(51.5, 48.8, 91.0, -33.9, NA), lon = c(-0.1, 2.3, 139.7, 151.2, 37.6) ) # Filter using separate latitude and longitude columns latlong_filter(df, latitude = lat, longitude = lon) # Inspect rows that were removed result <- latlong_filter(df, latitude = lat, longitude = lon) attr(result, "invalid") # Also drop rows where either coordinate is NA latlong_filter(df, latitude = lat, longitude = lon, drop_na = TRUE) # Filter using a combined coordinate column df_combined <- data.frame( id = 1:4, coords = c("51.5,-0.1", "91.0,2.3", "-33.9,151.2", "48.8,181.0") ) latlong_filter(df_combined, combined_col = coords) # Combined column with a custom separator df_sep <- data.frame( coords = c("51.5;-0.1", "91.0;2.3", "-33.9;151.2") ) latlong_filter(df_sep, combined_col = coords, sep = ";")
Detects and reports the coordinate format – decimal degrees, DMS
(DDdeg.MM'SS''}), or base-60 (\code{DDdeg.MM') – of values in one or more
columns. Combined columns (latitude and longitude stored as a single
delimited string) are split before format detection. Results are returned
as a named list, one element per column.
latlong_format(data, columns, sep = ",", drop_na = TRUE)latlong_format(data, columns, sep = ",", drop_na = TRUE)
data |
A data frame containing coordinate columns. |
columns |
Column name or |
sep |
Character. Separator used to split combined coordinate columns
before format detection. Default is |
drop_na |
Logical. If |
The three recognised coordinate formats are:
Decimal degrees: "-12.345" or "51.5"
DMS: "12deg.34'56''N"
Base-60: "12deg.34'N"
A column may return multiple formats if values are inconsistently
formatted – for example, a mix of decimal and DMS entries. This is
reported rather than resolved, allowing the user to decide how to
handle mixed formats before passing columns to
latlong_combine or latlong_split.
Combined columns (those containing sep) are detected automatically
and split before format checking, so the same sep used in
latlong_combine or latlong_split should be
passed here for consistent results.
A named list with one element per column in columns. Each element
is itself a named list with two components:
formatCharacter vector of detected format names present in
the column. One or more of "decimal", "dms",
"base60". Returns "unknown" if no values match any
recognised format (or if all values are excluded by drop_na).
countsInteger vector of the same length as format,
giving the number of values matching each detected format.
latlong_column for detecting which columns in a data frame
contain coordinates,
latlong_combine for merging separate coordinate columns into
one,
latlong_split for splitting a combined coordinate column into
separate latitude and longitude columns,
latlong_filter for removing invalid coordinates after checking
formats,
latlong_convert for converting coordinate formats after
checking.
df <- data.frame( id = 1:4, lat = c("51.5", "48.8", "40.7", "35.6"), lon = c("-0.1", "2.3", "-74.0", "139.7") ) # Check format of a single column latlong_format(df, lat) # Check multiple columns at once latlong_format(df, c(lat, lon)) # Mixed formats in one column df_mixed <- data.frame( coords = c("51.5", "48deg.52'N", "40.7", "35deg.36'00''N") ) latlong_format(df_mixed, coords) # Combined latitude-longitude column with custom separator df_combined <- data.frame( latlon = c("51.5;-0.1", "48.8;2.3", "40.7;-74.0") ) latlong_format(df_combined, latlon, sep = ";") # Include unknown-format values in counts df_dirty <- data.frame( lat = c("51.5", "not_a_coord", "40.7", NA) ) latlong_format(df_dirty, lat, drop_na = FALSE)df <- data.frame( id = 1:4, lat = c("51.5", "48.8", "40.7", "35.6"), lon = c("-0.1", "2.3", "-74.0", "139.7") ) # Check format of a single column latlong_format(df, lat) # Check multiple columns at once latlong_format(df, c(lat, lon)) # Mixed formats in one column df_mixed <- data.frame( coords = c("51.5", "48deg.52'N", "40.7", "35deg.36'00''N") ) latlong_format(df_mixed, coords) # Combined latitude-longitude column with custom separator df_combined <- data.frame( latlon = c("51.5;-0.1", "48.8;2.3", "40.7;-74.0") ) latlong_format(df_combined, latlon, sep = ";") # Include unknown-format values in counts df_dirty <- data.frame( lat = c("51.5", "not_a_coord", "40.7", NA) ) latlong_format(df_dirty, lat, drop_na = FALSE)
Appends NS_hemisphere and EW_hemisphere columns to a data
frame based on the sign of coordinate values. Accepts either separate
latitude and longitude columns or a single combined coordinate column.
Supports decimal degree, DMS (DDdeg MM'SS''}), and base-60
(\code{DDdeg MM') coordinate formats.
latlong_hemisphere( data, latitude = NULL, longitude = NULL, combined_col = NULL, drop_na = FALSE )latlong_hemisphere( data, latitude = NULL, longitude = NULL, combined_col = NULL, drop_na = FALSE )
data |
A data frame containing coordinate columns. |
latitude |
Optional. Column name of the latitude column, supplied
either unquoted ( |
longitude |
Optional. Column name of the longitude column, supplied
either unquoted ( |
combined_col |
Optional. Column name of a combined coordinate column
containing latitude and longitude as a single comma-delimited string
(e.g. |
drop_na |
Logical. If |
All coordinate formats are parsed to decimal degrees internally before
hemisphere assignment. The parser handles decimal, DMS, and base-60
formats, inferring sign from cardinal direction suffixes (S,
W) or the sign of the degree value. Zero-width and BOM characters
are stripped before parsing.
Either combined_col or both latitude and longitude
must be provided; supplying neither raises an error. When drop_na =
FALSE (the default), rows with unparseable or NA coordinates are
retained with NA in the hemisphere columns.
The combined column separator is assumed to be ",". Use
latlong_split to separate a combined column with a different
delimiter before calling this function.
Use latlong_filter to remove out-of-range coordinates before
assigning hemispheres, and latlong_format to verify
coordinate formats in advance.
The input data frame with two additional character columns appended:
NS_hemisphere"North" if latitude is greater than or
equal to zero, "South" if negative, NA if the coordinate
could not be parsed.
EW_hemisphere"East" if longitude is greater than or
equal to zero, "West" if negative, NA if the coordinate
could not be parsed.
Rows removed by drop_na = TRUE are attached as
attr(result, "removed_na") for inspection.
latlong_filter for removing invalid coordinates before
hemisphere assignment,
latlong_format for checking coordinate formats,
latlong_column for detecting coordinate columns in a data
frame,
latlong_convert for converting DMS or base-60 columns to
decimal degrees before hemisphere assignment.
df <- data.frame( id = 1:4, lat = c(51.5, -33.9, 48.8, -23.5), lon = c(-0.1, 151.2, 2.3, -46.6) ) # Assign hemispheres from separate latitude and longitude columns latlong_hemisphere(df, latitude = lat, longitude = lon) # Assign hemispheres from a combined coordinate column df_combined <- data.frame( id = 1:4, coords = c("51.5,-0.1", "-33.9,151.2", "48.8,2.3", "-23.5,-46.6") ) latlong_hemisphere(df_combined, combined_col = coords) # Drop rows where either coordinate is NA df_na <- data.frame( lat = c(51.5, NA, -33.9), lon = c(-0.1, 2.3, NA) ) latlong_hemisphere(df_na, latitude = lat, longitude = lon, drop_na = TRUE) # Inspect rows removed by drop_na result <- latlong_hemisphere(df_na, latitude = lat, longitude = lon, drop_na = TRUE) attr(result, "removed_na")df <- data.frame( id = 1:4, lat = c(51.5, -33.9, 48.8, -23.5), lon = c(-0.1, 151.2, 2.3, -46.6) ) # Assign hemispheres from separate latitude and longitude columns latlong_hemisphere(df, latitude = lat, longitude = lon) # Assign hemispheres from a combined coordinate column df_combined <- data.frame( id = 1:4, coords = c("51.5,-0.1", "-33.9,151.2", "48.8,2.3", "-23.5,-46.6") ) latlong_hemisphere(df_combined, combined_col = coords) # Drop rows where either coordinate is NA df_na <- data.frame( lat = c(51.5, NA, -33.9), lon = c(-0.1, 2.3, NA) ) latlong_hemisphere(df_na, latitude = lat, longitude = lon, drop_na = TRUE) # Inspect rows removed by drop_na result <- latlong_hemisphere(df_na, latitude = lat, longitude = lon, drop_na = TRUE) attr(result, "removed_na")
Identifies and reports the minimum and maximum latitude and longitude
values in a data frame. Accepts separate latitude and longitude columns,
a combined coordinate column, or auto-detects coordinate columns using
latlong_column when no columns are specified. Prints a
summary message and returns the data frame unchanged, making it safe to
use mid-pipeline.
latlong_limits( data, latitude = NULL, longitude = NULL, column = NULL, drop_na = FALSE )latlong_limits( data, latitude = NULL, longitude = NULL, column = NULL, drop_na = FALSE )
data |
A data frame containing coordinate columns. |
latitude |
Optional. Column name of the latitude column, supplied
either unquoted ( |
longitude |
Optional. Column name of the longitude column, supplied
either unquoted ( |
column |
Optional. Column name of a combined coordinate column
containing latitude and longitude as a single delimited string
(e.g. |
drop_na |
Logical. If |
When none of latitude, longitude, or column are
provided, coordinate columns are auto-detected via
latlong_column. The first detected latitude, longitude, and
combined column are used. An error is raised if no coordinate columns are
found.
Values outside valid geographic ranges ([-90, 90] for latitude,
[-180, 180] for longitude) are silently excluded from the limit
calculation. Use latlong_filter to remove such rows from the
data frame explicitly.
Combined columns are split on ",", ";", or whitespace before
parsing. Only numeric (decimal degree) values are extracted from combined
columns; DMS and base-60 formats in combined columns are not parsed.
Use latlong_convert to convert to decimal degrees first if
needed.
The original data frame, returned invisibly and unchanged. This
function is called for its side effect of printing latitude and longitude
range messages to the console, and is safe to use within a pipeline
(e.g. with |>) without altering the data.
latlong_column for detecting coordinate columns
automatically,
latlong_filter for removing out-of-range coordinates,
latlong_range for filtering rows to a user-defined bounding
box using the limits reported by this function,
latlong_convert for converting DMS or base-60 columns to
decimal before computing limits.
df <- data.frame( id = 1:4, lat = c(51.5, 48.8, -33.9, 40.7), lon = c(-0.1, 2.3, 151.2, -74.0) ) # Report limits from separate latitude and longitude columns latlong_limits(df, latitude = lat, longitude = lon) # Auto-detect coordinate columns latlong_limits(df) # Report limits from a combined coordinate column df_combined <- data.frame( coords = c("51.5,-0.1", "48.8,2.3", "-33.9,151.2", "40.7,-74.0") ) latlong_limits(df_combined, column = coords) # Exclude NA values from the limit calculation df_na <- data.frame( lat = c(51.5, NA, -33.9, 40.7), lon = c(-0.1, 2.3, 151.2, NA) ) latlong_limits(df_na, latitude = lat, longitude = lon, drop_na = TRUE) # Safe to use mid-pipeline — data is returned unchanged df |> latlong_limits(latitude = lat, longitude = lon) |> latlong_filter(latitude = lat, longitude = lon)df <- data.frame( id = 1:4, lat = c(51.5, 48.8, -33.9, 40.7), lon = c(-0.1, 2.3, 151.2, -74.0) ) # Report limits from separate latitude and longitude columns latlong_limits(df, latitude = lat, longitude = lon) # Auto-detect coordinate columns latlong_limits(df) # Report limits from a combined coordinate column df_combined <- data.frame( coords = c("51.5,-0.1", "48.8,2.3", "-33.9,151.2", "40.7,-74.0") ) latlong_limits(df_combined, column = coords) # Exclude NA values from the limit calculation df_na <- data.frame( lat = c(51.5, NA, -33.9, 40.7), lon = c(-0.1, 2.3, 151.2, NA) ) latlong_limits(df_na, latitude = lat, longitude = lon, drop_na = TRUE) # Safe to use mid-pipeline — data is returned unchanged df |> latlong_limits(latitude = lat, longitude = lon) |> latlong_filter(latitude = lat, longitude = lon)
Retains only rows where coordinates fall within specified latitude and
longitude bounds. Unlike latlong_filter, which validates
against absolute geographic limits, this function filters to a
user-defined bounding box. Use latlong_limits first to
inspect the coordinate extent of the data and inform suitable bound values.
latlong_range( data, latitude, longitude, lat_min, lat_max, lon_min, lon_max, drop_na = FALSE )latlong_range( data, latitude, longitude, lat_min, lat_max, lon_min, lon_max, drop_na = FALSE )
data |
A data frame containing coordinate columns. |
latitude |
Column name of the latitude column, supplied either
unquoted ( |
longitude |
Column name of the longitude column, supplied either
unquoted ( |
lat_min |
Numeric. Minimum latitude bound (inclusive). Must be in the
range |
lat_max |
Numeric. Maximum latitude bound (inclusive). Must be in the
range |
lon_min |
Numeric. Minimum longitude bound (inclusive). Must be in the
range |
lon_max |
Numeric. Maximum longitude bound (inclusive). Must be in the
range |
drop_na |
Logical. If |
Coordinate columns are coerced to numeric via as.numeric() before
filtering. Non-numeric values (including DMS or base-60 strings) will
produce NA after coercion and be treated as out-of-range. Use
latlong_convert to convert to decimal degrees before calling
this function if columns are not already numeric.
All bounds are inclusive. Rows with NA coordinates are excluded from
the retained set regardless of drop_na, as they cannot be evaluated
against the bounds. When drop_na = FALSE, NA rows contribute
to the out-of-range count in the console message rather than the NA
count.
A data frame containing only rows where latitude falls within
[lat_min, lat_max] and longitude falls within
[lon_min, lon_max], with the same columns as data. A console
message reports the total rows removed, broken down by NA rows and
out-of-range rows.
latlong_limits for inspecting the coordinate extent of a
data frame to inform bound selection,
latlong_filter for removing coordinates outside absolute
geographic validity ranges,
latlong_convert for converting DMS or base-60 columns to
decimal degrees before filtering,
latlong_split for separating a combined coordinate column
into distinct latitude and longitude columns before filtering as
latlong_range does not function with combined columns,
latlong_region for filtering to named geographic regions
rather than a numeric bounding box.
df <- data.frame( id = 1:6, lat = c(51.5, 48.8, -33.9, 40.7, 35.6, 55.8), lon = c(-0.1, 2.3, 151.2, -74.0, 139.7, 37.6) ) # Retain only rows within a European bounding box latlong_range(df, latitude = lat, longitude = lon, lat_min = 35, lat_max = 60, lon_min = -10, lon_max = 40) # Use latlong_limits first to inspect coordinate extent df |> latlong_limits(latitude = lat, longitude = lon) |> latlong_range(latitude = lat, longitude = lon, lat_min = 35, lat_max = 60, lon_min = -10, lon_max = 40) # Drop NA rows before filtering df_na <- data.frame( lat = c(51.5, NA, -33.9, 40.7), lon = c(-0.1, 2.3, 151.2, NA) ) latlong_range(df_na, latitude = lat, longitude = lon, lat_min = 0, lat_max = 60, lon_min = -10, lon_max = 40, drop_na = TRUE)df <- data.frame( id = 1:6, lat = c(51.5, 48.8, -33.9, 40.7, 35.6, 55.8), lon = c(-0.1, 2.3, 151.2, -74.0, 139.7, 37.6) ) # Retain only rows within a European bounding box latlong_range(df, latitude = lat, longitude = lon, lat_min = 35, lat_max = 60, lon_min = -10, lon_max = 40) # Use latlong_limits first to inspect coordinate extent df |> latlong_limits(latitude = lat, longitude = lon) |> latlong_range(latitude = lat, longitude = lon, lat_min = 35, lat_max = 60, lon_min = -10, lon_max = 40) # Drop NA rows before filtering df_na <- data.frame( lat = c(51.5, NA, -33.9, 40.7), lon = c(-0.1, 2.3, 151.2, NA) ) latlong_range(df_na, latitude = lat, longitude = lon, lat_min = 0, lat_max = 60, lon_min = -10, lon_max = 40, drop_na = TRUE)
Retains only rows whose coordinates fall within one or more named geographic regions. Supports countries, sovereign territories, broader geographic regions, and named marine areas including seas, bays, gulfs, straits, and ocean basins via Natural Earth data. Region terms are matched case-insensitively across all available name fields in both land and marine polygon layers.
latlong_region( data, latitude, longitude, region, dataset = "auto", drop_na = FALSE )latlong_region( data, latitude, longitude, region, dataset = "auto", drop_na = FALSE )
data |
A data frame containing coordinate columns. |
latitude |
Column name of the latitude column, supplied either
unquoted ( |
longitude |
Column name of the longitude column, supplied either
unquoted ( |
region |
Character vector of region names to match. Partial and
case-insensitive matching is supported across country names, sovereign
states, administrative regions, subregions, continents, and named marine
areas such as seas, gulfs, and straits. Multiple values are unioned
before filtering (e.g. |
dataset |
Character. Natural Earth dataset scale to use.
Default is |
drop_na |
Logical. If |
Two Natural Earth polygon layers are queried: country polygons covering
land territories, regions, subregions, and continents; and geographic
region polygons covering named marine and physical features. Each
region term is matched against all available name fields in both
layers, and all matched polygons are unioned into a single bounding
geometry before the spatial filter is applied.
Shapefiles are downloaded via rnaturalearth::ne_download on first
use and cached locally — subsequent calls with the same dataset are fast.
Requires the rnaturalearth and sf packages; an informative
error is raised if either is not installed.
Coordinate columns must be in decimal degrees. Use
latlong_convert to convert DMS or base-60 columns first.
Use latlong_limits to inspect the coordinate extent of the
data before choosing region terms, and latlong_filter to
remove invalid coordinates beforehand.
An error is raised if no polygons match any of the supplied
region terms across either layer.
A data frame containing only rows whose coordinates fall within the union
of all matched region polygons, with the same columns as data.
Console messages report the matched country and geographic region names,
the number of rows retained, and the number excluded outside the specified
regions.
latlong_limits for inspecting coordinate extent before
choosing region terms,
latlong_filter for removing invalid coordinates before
filtering,
latlong_range for filtering to a user-defined bounding box
rather than a named region,
latlong_split for separating a combined coordinate column
into distinct latitude and longitude columns before filtering as
latlong_range does not function with combined columns,
latlong_convert for converting DMS or base-60 columns to
decimal degrees before filtering.
if (requireNamespace("rnaturalearth", quietly = TRUE) && requireNamespace("sf", quietly = TRUE)) { df <- data.frame( id = 1:4, lat = c(35.6, 34.0, 51.5, 48.8), lon = c(139.7, 131.0, -0.1, 2.3) ) # Filter to a single country latlong_region(df, latitude = lat, longitude = lon, region = "Japan") # Filter to multiple regions — land and marine areas unioned latlong_region(df, latitude = lat, longitude = lon, region = c("Japan", "Sea of Japan", "East China Sea")) # Filter to a continent latlong_region(df, latitude = lat, longitude = lon, region = "Europe") # Drop NA rows before filtering latlong_region(df, latitude = lat, longitude = lon, region = "Japan", drop_na = TRUE) }if (requireNamespace("rnaturalearth", quietly = TRUE) && requireNamespace("sf", quietly = TRUE)) { df <- data.frame( id = 1:4, lat = c(35.6, 34.0, 51.5, 48.8), lon = c(139.7, 131.0, -0.1, 2.3) ) # Filter to a single country latlong_region(df, latitude = lat, longitude = lon, region = "Japan") # Filter to multiple regions — land and marine areas unioned latlong_region(df, latitude = lat, longitude = lon, region = c("Japan", "Sea of Japan", "East China Sea")) # Filter to a continent latlong_region(df, latitude = lat, longitude = lon, region = "Europe") # Drop NA rows before filtering latlong_region(df, latitude = lat, longitude = lon, region = "Japan", drop_na = TRUE) }
Splits a single combined coordinate column into separate latitude and
longitude columns, appended to the data frame. The inverse of this
operation is latlong_combine. Useful as a prerequisite for
functions that require separate coordinate columns, such as
latlong_range and latlong_region.
latlong_split( data, combined_col, latitude, longitude, sep = ",", drop_na = FALSE )latlong_split( data, combined_col, latitude, longitude, sep = ",", drop_na = FALSE )
data |
A data frame containing a combined coordinate column. |
combined_col |
Column name of the combined coordinate column
containing latitude and longitude as a single delimited string
(e.g. |
latitude |
Column name for the new latitude column to be appended,
supplied either unquoted ( |
longitude |
Column name for the new longitude column to be appended,
supplied either unquoted ( |
sep |
Character. Separator between latitude and longitude values in
|
drop_na |
Logical. If |
Splitting is performed by strsplit() on sep, with leading
and trailing whitespace trimmed from each part. Rows where
combined_col contains fewer than two parts after splitting produce
NA in the longitude column. Input strings are converted to UTF-8
before splitting to handle encoded coordinate values.
The new coordinate columns are character type regardless of input format.
Use latlong_format to verify the format of the split columns,
and latlong_convert to convert to a target format before
passing to other functions.
The input data frame with two additional character columns appended, named
according to latitude and longitude. The original
combined_col is retained. Values are returned as character strings;
use as.numeric() or latlong_convert if numeric decimal
degree values are required downstream. A console message reports the number
of rows removed when drop_na = TRUE.
latlong_combine for merging separate coordinate columns into
a single combined column,
latlong_format for checking the format of the split columns,
latlong_convert for converting split columns to a target
coordinate format,
latlong_range for filtering to a bounding box, which does
not accept combined columns,
latlong_region for filtering to named geographic regions,
which does not accept combined columns.
df <- data.frame( id = 1:4, coords = c("51.5,-0.1", "48.8,2.3", "-33.9,151.2", "40.7,-74.0") ) # Split into separate latitude and longitude columns latlong_split(df, combined_col = coords, latitude = lat, longitude = lon) # Use a custom separator df_sep <- data.frame( coords = c("51.5;-0.1", "48.8;2.3", "-33.9;151.2") ) latlong_split(df_sep, combined_col = coords, latitude = lat, longitude = lon, sep = ";") # Drop rows where splitting produces NA df_na <- data.frame( coords = c("51.5,-0.1", "48.8", NA, "40.7,-74.0") ) latlong_split(df_na, combined_col = coords, latitude = lat, longitude = lon, drop_na = TRUE) # Split then filter by bounding box df |> latlong_split(combined_col = coords, latitude = lat, longitude = lon) |> latlong_range(latitude = lat, longitude = lon, lat_min = 0, lat_max = 60, lon_min = -10, lon_max = 40)df <- data.frame( id = 1:4, coords = c("51.5,-0.1", "48.8,2.3", "-33.9,151.2", "40.7,-74.0") ) # Split into separate latitude and longitude columns latlong_split(df, combined_col = coords, latitude = lat, longitude = lon) # Use a custom separator df_sep <- data.frame( coords = c("51.5;-0.1", "48.8;2.3", "-33.9;151.2") ) latlong_split(df_sep, combined_col = coords, latitude = lat, longitude = lon, sep = ";") # Drop rows where splitting produces NA df_na <- data.frame( coords = c("51.5,-0.1", "48.8", NA, "40.7,-74.0") ) latlong_split(df_na, combined_col = coords, latitude = lat, longitude = lon, drop_na = TRUE) # Split then filter by bounding box df |> latlong_split(combined_col = coords, latitude = lat, longitude = lon) |> latlong_range(latitude = lat, longitude = lon, lat_min = 0, lat_max = 60, lon_min = -10, lon_max = 40)
Combined Octopodoidea occurrence records for Japan produced by merging the
five Japan-filtered source data sets (GBIF_Japan,
InvBase_Japan, BISMAL_Japan,
OBIS_Japan, and NSMT_Japan) via rbind.
Duplicate records are removed using deduplicate on the
catalogNumber field, and individual-level records are reconstructed
from aggregated specimen counts using duplicate on the
individualCount field. See museum_taxon for the
taxonomically validated and enriched version.
museummuseum
A data frame with 2,633 rows and 13 variables:
Scientific name as recorded in the source data set.
Genus name.
Family name.
Year of occurrence record.
Decimal latitude, filtered to [25, 50].
Decimal longitude, filtered to [125, 150].
Country name or code as recorded in the source data set.
State, province, or region as recorded in the source data set.
Locality description as recorded in the source data set.
Institution code or group abbreviation identifying the collecting institution.
Character. Identifies the source data set for each row.
One of "GBIF", "InvBase", "BISMAL",
"OBIS", or "NSMT".
Museum lot identification code used for duplicate
detection. Rows with NA in this field were removed during
deduplication.
Specimen count per lot. Used to expand rows via
duplicate to reconstruct individual-level records.
Processing proceeds in the following steps:
The five Japan-filtered data sets are combined via rbind
with a Data Frame column added to identify the source of each
row, producing 2,707 observations.
deduplicate is applied on catalogNumber with
drop_na = TRUE, removing 608 rows with missing
catalogNumber and 143 duplicate rows, leaving 1,956
observations. Duplicate records are accessible via
attr(museum, "duplicates").
duplicate is applied on individualCount to
expand aggregated specimen counts to individual-level records,
increasing the row count from 1,956 to 2,633.
Derived from GBIF_Japan, InvBase_Japan,
BISMAL_Japan, OBIS_Japan, and
NSMT_Japan. Full source CSVs (raw, trimmed, and
Japan-filtered) are available at
https://github.com/btorgovitsky00/datamuseum.
Original sources:
Global Biodiversity Information Facility (GBIF). GBIF.org (30 March 2026) GBIF Occurrence Download. https://www.gbif.org doi:10.15468/dl.2379hj
Invert-E-Base. Downloaded 30 March 2026. https://invertebase.org
Biological Information System for Marine Life (BISMAL). Downloaded 30 March 2026. https://www.godac.jamstec.go.jp/bismal/e/
Ocean Biodiversity Information System (OBIS). Downloaded 30 March 2026. https://obis.org
National Museum of Nature and Science, Japan (NSMT). Data obtained directly from the museum, early 2024. https://www.kahaku.go.jp/english/
GBIF_Japan, InvBase_Japan,
BISMAL_Japan, OBIS_Japan,
NSMT_Japan for the individual source data sets,
deduplicate for the deduplication function applied during
processing,
duplicate for the row expansion function applied during
processing,
museum_taxon for the taxonomically validated and enriched
version.
The combined Japan Octopodoidea data set (museum) after full
taxonomic cleaning, validation, synonym resolution, rank enrichment,
authorship appending, and italic formatting. Represents the final stage
of the datamuseum workflow and is intended for direct use in
analysis and visualisation.
museum_taxonmuseum_taxon
A data frame with 2,222 rows and 20 variables:
Validated scientific name in accepted nomenclature, canonical form without authorship.
Genus name, updated by taxon_validate where
the primary name changed.
Family name.
Taxonomic order, appended by taxon_add.
Taxonomic phylum, appended by taxon_add.
Year of occurrence record.
Decimal latitude, filtered to [25, 50].
Decimal longitude, filtered to [125, 150].
Country name or code as recorded in the source data set.
State, province, or region as recorded in the source data set.
Locality description as recorded in the source data set.
Institution code or group abbreviation identifying the collecting institution.
Character. Identifies the source data set for each row.
One of "GBIF", "InvBase", "BISMAL",
"OBIS", or "NSMT".
Museum lot identification code.
Specimen count per lot.
Family name with authorship appended by
taxon_cite. Enteroctopodidae authorship added
manually as it could not be resolved automatically.
Genus name with authorship appended by
taxon_cite.
Scientific name with authorship appended by
taxon_cite.
Plotmath italic expression for
Genus_cite, suitable for use in ggplot2 via
italicize.
Plotmath italic expression for
SciName_cite, suitable for use in ggplot2 via
italicize.
Processing proceeds in the following steps from museum:
taxon_cleaner applied to SciName in place
with drop_na = TRUE, removing uncertain names and reducing
the data set from 2,633 to 2,222 observations.
Octopus vulgaris manually corrected to Octopus sinensis to reflect current accepted taxonomy for the Pacific form.
taxon_validate applied to SciName with
update_related = TRUE to resolve synonyms and update related
taxonomic columns.
taxon_spellcheck applied with update = TRUE
using the pre-computed validation report.
Pinnoctopus manually corrected to Callistoctopus
across all columns — a generic synonym not resolved automatically by
taxon_validate.
taxon_add appends order and phylum
with sort = TRUE.
taxon_cite appends authorship to Family,
Genus, and SciName.
Muusoctopus small in mature removed as an informal morphospecies name not representing a valid taxon.
Enteroctopodidae authorship added manually as it could
not be resolved by taxon_cite.
italicize applied to Genus_cite and
SciName_cite.
Derived from museum. Full source CSVs (raw, trimmed, and
Japan-filtered) are available at
https://github.com/btorgovitsky00/datamuseum.
Original sources:
Global Biodiversity Information Facility (GBIF). GBIF.org (30 March 2026) GBIF Occurrence Download. https://www.gbif.org doi:10.15468/dl.2379hj
Invert-E-Base. Downloaded 30 March 2026. https://invertebase.org
Biological Information System for Marine Life (BISMAL). Downloaded 30 March 2026. https://www.godac.jamstec.go.jp/bismal/e/
Ocean Biodiversity Information System (OBIS). Downloaded 30 March 2026. https://obis.org
National Museum of Nature and Science, Japan (NSMT). Data obtained directly from the museum, early 2024. https://www.kahaku.go.jp/english/
museum for the combined pre-validation data set,
taxon_cleaner for the cleaning function applied during
processing,
taxon_validate for the validation function applied during
processing,
taxon_spellcheck for the spellcheck function applied during
processing,
taxon_add for the rank enrichment function applied during
processing,
taxon_cite for the authorship appending function applied
during processing,
italicize for the italic formatting function applied during
processing.
Japanese National Museum of Nature and Science (NSMT) Octopodoidea
occurrence records filtered to the Japan bounding box
(latitude 25–50, longitude 125–150) and standardized to the common
column set shared across all datamuseum Japan data sets.
Unlike other sources, coordinate columns were already named Latitude
and Longitude in the raw data and required no renaming. SciName
is constructed from Genus, Species, and Subspecies,
with trailing NA strings removed to handle records without a
subspecies. This is the only source to incorporate a subspecies component
in SciName. No rows were removed by the NA filter, giving
the highest retention rate of all five sources at 79.9% of raw records.
NSMT_JapanNSMT_Japan
A data frame with 695 rows and 12 variables:
Scientific name constructed from Genus,
Species, and Subspecies where present. Trailing
NA strings removed.
Genus name.
Family name.
Year of occurrence record.
Decimal latitude, filtered to [25, 50].
Already named Latitude in the raw data.
Decimal longitude, filtered to [125, 150].
Already named Longitude in the raw data.
Country name.
Region, from Region.
Locality description, from Previse.loc.
— note this reflects a typographic irregularity in the original NSMT
data.
Museum group abbreviation, from Group.Abb..
Museum lot identification code. Used for duplicate
detection in museum via deduplicate.
Specimen count per lot. Used for row expansion
in museum via duplicate.
Derived from data obtained directly from the National Museum of Nature and Science, Japan. Full source CSVs (raw, trimmed, and Japan-filtered) are available at https://github.com/btorgovitsky00/datamuseum.
National Museum of Nature and Science, Japan (NSMT). Data obtained directly from the museum, early 2024. https://www.kahaku.go.jp/english/
The raw and trimmed intermediate versions of this data set are available as CSV files at https://github.com/btorgovitsky00/datamuseum.
museum for the combined data set including these records.
Ocean Biodiversity Information System (OBIS) Octopodoidea occurrence
records filtered to the Japan bounding box (latitude 25–50,
longitude 125–150) and standardized to the common column set shared
across all datamuseum Japan data sets.
SciName is taken directly from the species field. Rows with
NA in Source, Family, Genus, SciName,
or Year are removed.
OBIS_JapanOBIS_Japan
A data frame with 668 rows and 12 variables:
Scientific name taken directly from the species
field. Trailing NA strings removed.
Genus name.
Family name.
Year of occurrence record, from date_year. Note
this field differs from all other sources which use year.
Decimal latitude, filtered to [25, 50].
Decimal longitude, filtered to [125, 150].
Country name, from country.
State or province, from stateProvince.
Locality description, from locality.
Institution code, from institutionCode.
Museum lot identification code. Used for duplicate
detection in museum via deduplicate.
Specimen count per lot. Used for row expansion
in museum via duplicate.
The raw and trimmed intermediate versions of this data set are available as CSV files in the package data repository. Note that those files contain non-ASCII characters in locality and collector name fields, reflecting the international scope of OBIS occurrence records.
Derived from the raw OBIS occurrence download. Full source CSVs (raw, trimmed, and Japan-filtered) are available at https://github.com/btorgovitsky00/datamuseum.
Ocean Biodiversity Information System (OBIS). Downloaded 30 March 2026. https://obis.org
The raw and trimmed intermediate versions of this data set are available as CSV files at https://github.com/btorgovitsky00/datamuseum.
museum for the combined data set including these records.
Looks up and appends one or more higher taxonomic rank columns to a data
frame using GBIF and/or ITIS as reference sources. Intended for use after
taxon_validate and taxon_spellcheck to append
ranks not already present in the data frame. Results are cached to disk
to speed up repeated calls.
taxon_add( data, column, ranks, source = "both", author_year = FALSE, sort = FALSE, drop_na = FALSE )taxon_add( data, column, ranks, source = "both", author_year = FALSE, sort = FALSE, drop_na = FALSE )
data |
A data frame. |
column |
Column name of the taxonomic column to look up from,
supplied either unquoted ( |
ranks |
Rank name or |
source |
Character. Taxonomic reference source. One of |
author_year |
Logical. If |
sort |
Logical. If |
drop_na |
Logical. If |
GBIF is queried via rgbif::name_backbone() and
rgbif::name_usage(); ITIS is queried via taxize::get_tsn()
and taxize::classification(). Results are cached to disk using
memoise and cachem in
tools::R_user_dir("taxon_add", "cache"), so repeated calls for
the same names are fast. Requires memoise and cachem;
rgbif and/or taxize are required depending on source.
Only unique non-NA values in column are looked up, so
performance scales with the number of distinct names rather than total
rows.
When author_year = TRUE, authorship is resolved via a separate
GBIF lookup on the canonical name returned for each rank. If the resolved
name with authorship is identical to the canonical name, or produces empty
parentheses, the canonical name is returned unchanged.
Use taxon_column to detect existing taxonomic rank columns
before adding new ones, and taxon_sort to reorder columns
into standard rank order independently of this function.
The input data frame with one new character column appended per entry in
ranks, named by rank (e.g. family, order). A report
tibble is attached as attr(result, "add_report") with columns:
columnName of the source column looked up from.
nameInput name for which the rank could not be resolved.
missing_rankThe rank that could not be resolved for that name.
nNumber of rows containing that name.
An empty tibble is attached when all ranks are resolved. A console message per rank reports the number of values resolved and lists unresolved names.
This function queries external web services (GBIF via rgbif and/or
ITIS via taxize) and requires an active internet connection with
reliable access to those servers. Performance on unstable or restricted
connections (e.g. public WiFi, VPN, or firewalled networks) may be slow
or produce incomplete results. Previously queried names are cached to disk
via memoise and cachem at
tools::R_user_dir("taxon_add", "cache"), so running on a stable
connection first will speed up subsequent calls regardless of connection
quality.
Connectivity can be tested before adding ranks:
# Test ITIS connectivity
taxize::get_tsn("Homo sapiens", accepted = FALSE, verbose = TRUE,
messages = TRUE, ask = FALSE)
# Test GBIF connectivity
rgbif::name_backbone(name = "Homo sapiens", strict = TRUE)
taxon_validate for validating and resolving synonyms before
adding ranks,
taxon_spellcheck for correcting misspellings before adding
ranks,
taxon_cite for appending authorship and year after adding
ranks,
taxon_sort for sorting columns into standard taxonomic rank
order,
taxon_column for detecting existing taxonomic rank columns
before adding new ones.
df <- data.frame( species = c("Homo sapiens", "Panthera leo", "Canis lupus") ) if (requireNamespace("rgbif", quietly = TRUE) && requireNamespace("taxize", quietly = TRUE)) { # Add a single rank taxon_add(df, column = species, ranks = family) # Add multiple ranks at once taxon_add(df, column = species, ranks = c(family, order, class)) # Use GBIF only as the source taxon_add(df, column = species, ranks = family, source = "gbif") # Append authorship to resolved rank names taxon_add(df, column = species, ranks = c(family, genus), author_year = TRUE) # Add ranks and sort into standard taxonomic order taxon_add(df, column = species, ranks = c(family, order, class), sort = TRUE) # Inspect names where ranks could not be resolved result <- taxon_add(df, column = species, ranks = c(family, order)) attr(result, "add_report") }df <- data.frame( species = c("Homo sapiens", "Panthera leo", "Canis lupus") ) if (requireNamespace("rgbif", quietly = TRUE) && requireNamespace("taxize", quietly = TRUE)) { # Add a single rank taxon_add(df, column = species, ranks = family) # Add multiple ranks at once taxon_add(df, column = species, ranks = c(family, order, class)) # Use GBIF only as the source taxon_add(df, column = species, ranks = family, source = "gbif") # Append authorship to resolved rank names taxon_add(df, column = species, ranks = c(family, genus), author_year = TRUE) # Add ranks and sort into standard taxonomic order taxon_add(df, column = species, ranks = c(family, order, class), sort = TRUE) # Inspect names where ranks could not be resolved result <- taxon_add(df, column = species, ranks = c(family, order)) attr(result, "add_report") }
Appends authorship and year to one or more taxonomic name columns using
GBIF (preferred) and ITIS (fallback) as reference sources. For each
specified column a new <column>_cite column is appended containing
names in the format "Genus species (Author, Year)". Intended as
the final step in the taxon_validate -> taxon_spellcheck
workflow.
taxon_cite(data, columns, source = "both", drop_na = FALSE)taxon_cite(data, columns, source = "both", drop_na = FALSE)
data |
A data frame. |
columns |
Column name or |
source |
Character. Taxonomic reference source. One of |
drop_na |
Logical. If |
Authorship look-up follows the same logic as pass 5 of
taxon_validate:
GBIF name_backbone is queried with strict = TRUE.
HIGHERRANK results are accepted only when the canonical
name matches the input exactly.
Malformed authorship strings starting with a comma or punctuation are rejected and treated as missing.
Synonym chains are followed up to three steps via
name_usage() to reach the accepted name authorship.
When GBIF has no valid authorship and source = "both",
ITIS itis_getrecord is queried as a fallback.
Authorship is stripped from input values before lookup (parenthetical
suffixes matching \s*\(.*\)\s*$ are removed), so columns
already containing authorship from a prior taxon_validate
call are handled correctly.
Results are memoised for the duration of the session. Only unique
non-NA values are looked up, so performance scales with the number
of distinct names rather than total rows.
Requires rgbif for GBIF lookups, taxize for ITIS lookups, and memoise. Informative errors are raised if required packages are not installed.
The input data frame with one additional character column appended per
entry in columns, named <column>_cite. Rows where
authorship cannot be found retain the original canonical name in the cite
column unchanged. A report tibble is attached as
attr(result, "cite_report") with columns:
columnName of the column processed.
nameCanonical name for which no authorship was found.
nNumber of rows containing that name.
An empty tibble is attached when authorship is found for all names. A console message per column reports the number of names resolved and lists any names without authorship.
This function queries external web services (GBIF via rgbif and/or ITIS via taxize) and requires an active internet connection with reliable access to those servers. Performance on unstable or restricted connections (e.g. public WiFi, VPN, or firewalled networks) may be slow or produce incomplete results. Results are memoised for the duration of the session; running on a stable connection first and retaining the session will avoid repeated API calls for the same names.
Connectivity can be tested before appending authorship:
# Test ITIS connectivity
taxize::get_tsn("Homo sapiens", accepted = FALSE, verbose = TRUE,
messages = TRUE, ask = FALSE)
# Test GBIF connectivity
rgbif::name_backbone(name = "Homo sapiens", strict = TRUE)
taxon_validate for resolving synonyms and validating names
before appending authorship,
taxon_spellcheck for correcting misspellings before
appending authorship,
taxon_add for appending higher taxonomic rank columns
alongside authorship,
italicize for formatting cited names for ggplot2
display as the final step in the workflow.
df <- data.frame( species = c("Homo sapiens", "Panthera leo", "Canis lupus") ) if (requireNamespace("rgbif", quietly = TRUE) && requireNamespace("taxize", quietly = TRUE)) { # Append authorship to a single column taxon_cite(df, species) # Append authorship to multiple columns df2 <- data.frame( genus = c("Homo", "Panthera"), species = c("Homo sapiens", "Panthera leo") ) taxon_cite(df2, c(genus, species)) # Use GBIF only taxon_cite(df, species, source = "gbif") # Inspect names where no authorship was found result <- taxon_cite(df, species) attr(result, "cite_report") # Full workflow df |> taxon_validate(column = species) |> taxon_spellcheck(column = species, update = TRUE) |> taxon_add(column = species, ranks = c(family, order)) |> taxon_cite(columns = species) }df <- data.frame( species = c("Homo sapiens", "Panthera leo", "Canis lupus") ) if (requireNamespace("rgbif", quietly = TRUE) && requireNamespace("taxize", quietly = TRUE)) { # Append authorship to a single column taxon_cite(df, species) # Append authorship to multiple columns df2 <- data.frame( genus = c("Homo", "Panthera"), species = c("Homo sapiens", "Panthera leo") ) taxon_cite(df2, c(genus, species)) # Use GBIF only taxon_cite(df, species, source = "gbif") # Inspect names where no authorship was found result <- taxon_cite(df, species) attr(result, "cite_report") # Full workflow df |> taxon_validate(column = species) |> taxon_spellcheck(column = species, update = TRUE) |> taxon_add(column = species, ranks = c(family, order)) |> taxon_cite(columns = species) }
Standardises taxonomic name formatting by removing extra whitespace,
stripping control characters, and flagging uncertain names containing
cf., sp., or ? as NA. Cleaned values are
either appended as a new <column>_clean column or used to replace
the original column in place.
taxon_cleaner(data, columns, in_place = FALSE, drop_na = FALSE)taxon_cleaner(data, columns, in_place = FALSE, drop_na = FALSE)
data |
A data frame. |
columns |
Column name or |
in_place |
Logical. If |
drop_na |
Logical. If |
Standardizes taxonomic name formatting by removing extra whitespace, fixing capitalization, and flagging uncertain names (cf., sp., ?).
Clean taxonomic name formatting
Cleaning applies the following steps in order to each column:
Leading and trailing whitespace is removed via
stringr::str_trim().
Internal runs of whitespace are collapsed to a single space.
Control characters are stripped.
Values matching cf., sp., or ? (
case-insensitive, whole-word) are replaced with NA.
Uncertain name detection is reported before flagging, so the console
message reflects the count in the original values rather than after
replacement. When multiple columns are supplied, drop_na is applied
independently to each column in sequence, so row counts may differ across
columns.
Capitalisation is not modified; names are returned with the same case as the input after whitespace normalisation.
The input data frame with cleaned taxonomic columns. When
in_place = FALSE, one new character column is inserted per entry in
columns, named <column>_clean and positioned immediately
after the source column. A console message per column reports the number
of NA values and the number of uncertain names flagged before
cleaning.
taxon_combine for merging genus and epithet columns after
cleaning,
taxon_split for splitting a binomial name column before
cleaning individual parts,
taxon_validate for validating cleaned names against ITIS
and GBIF,
taxon_spellcheck for identifying and correcting misspellings
after cleaning,
taxon_add for appending higher taxonomic rank columns,
italicize for formatting taxonomic names for
ggplot2 display.
df <- data.frame( species = c("Homo sapiens", "Panthera leo", "Canis cf. lupus", "Ursus sp.", NA) ) # Append a cleaned column (default) taxon_cleaner(df, species) # Clean in place taxon_cleaner(df, in_place = TRUE, columns = species) # Drop rows flagged as uncertain or NA after cleaning taxon_cleaner(df, species, drop_na = TRUE) # Clean multiple columns at once df2 <- data.frame( genus = c("Homo", "Panthera", "Canis cf.", NA), species = c("sapiens", "leo ", "lupus", "arctos") ) taxon_cleaner(df2, c(genus, species))df <- data.frame( species = c("Homo sapiens", "Panthera leo", "Canis cf. lupus", "Ursus sp.", NA) ) # Append a cleaned column (default) taxon_cleaner(df, species) # Clean in place taxon_cleaner(df, in_place = TRUE, columns = species) # Drop rows flagged as uncertain or NA after cleaning taxon_cleaner(df, species, drop_na = TRUE) # Clean multiple columns at once df2 <- data.frame( genus = c("Homo", "Panthera", "Canis cf.", NA), species = c("sapiens", "leo ", "lupus", "arctos") ) taxon_cleaner(df2, c(genus, species))
Detects columns in a data frame that contain taxonomic names based on
column name patterns and value content. Returns a summary of detected
columns and their value counts, a named list mapping taxonomic ranks to
column indices, or both. Useful as a precursor to taxon_add
and taxon_sort to identify existing rank columns before
modifying the data frame.
taxon_column(df, output = "tibble")taxon_column(df, output = "tibble")
df |
A data frame. |
output |
Character. Format of the return value. One of
|
Detection uses a three-tier matching system applied to column names:
Strong match — column name contains a full taxonomic
keyword (taxon, species, genus, family,
order, class, phylum, kingdom,
scientificname).
Weak match (3–5 chars) — column name contains a substring of length 3–5 derived from a taxonomic keyword.
Weak match (1–2 chars) — column name contains a very short substring; used only for candidate columns not already captured by stronger tiers.
Columns matching geographic, temporal, or location-related terms
(latitude, longitude, country, date, etc.)
are excluded at each tier. Columns where more than 70% of non-NA
values are numeric and fewer than 20% contain letters are also excluded
as non-taxonomic.
When multiple columns match the same rank, all are assigned to that rank
in the "list" output. Use output = "list" inside
taxon_add with sort = TRUE to check for duplicate
rank assignments before sorting.
Depends on output:
"tibble"A tibble with three columns — column (the
detected column name), value (each unique non-NA value),
and count (number of occurrences) — sorted by count descending
within each column. Only strongly detected columns are included.
"list"A named list where each element corresponds to a
taxonomic rank (e.g. species, family) and contains a
named list mapping column name to column index in df. Includes
both strongly and weakly detected columns.
"both"A list with two elements: counts (the tibble
described above) and candidates (the named list described
above).
taxon_rank for detecting the rank of specific columns
by name,
taxon_add for appending higher taxonomic rank columns,
taxon_sort for sorting columns into standard taxonomic rank
order,
taxon_validate for validating detected columns using
update_related.
df <- data.frame( id = 1:4, species = c("Homo sapiens", "Panthera leo", "Canis lupus", "Ursus arctos"), family = c("Hominidae", "Felidae", "Canidae", "Ursidae"), count = c(10, 5, 8, 3) ) # Return a tibble of detected columns and value counts (default) taxon_column(df) # Return a named list mapping ranks to column indices taxon_column(df, output = "list") # Return both formats taxon_column(df, output = "both") # Use list output to inspect rank assignments before taxon_add taxon_column(df, output = "list")df <- data.frame( id = 1:4, species = c("Homo sapiens", "Panthera leo", "Canis lupus", "Ursus arctos"), family = c("Hominidae", "Felidae", "Canidae", "Ursidae"), count = c(10, 5, 8, 3) ) # Return a tibble of detected columns and value counts (default) taxon_column(df) # Return a named list mapping ranks to column indices taxon_column(df, output = "list") # Return both formats taxon_column(df, output = "both") # Use list output to inspect rank assignments before taxon_add taxon_column(df, output = "list")
Merges separate genus and epithet columns into a single binomial scientific
name column, appended to the data frame. Both columns are coerced to
character and joined with a single space, following standard binomial
nomenclature formatting. The inverse of this operation is
taxon_split.
taxon_combine(data, genus, epithet, new_column = NULL)taxon_combine(data, genus, epithet, new_column = NULL)
data |
A data frame. |
genus |
Column name of the genus column, supplied either unquoted
( |
epithet |
Column name of the specific epithet column, supplied either
unquoted ( |
new_column |
Optional. Unquoted or quoted name for the combined output
column. Default is |
No validation of genus or epithet values is performed. Use
taxon_cleaner to standardize formatting and remove uncertain
names before combining. The resulting binomial column can be passed
directly to taxon_add for higher rank look-ups or to
italicize for formatted ggplot2 labels.
The input data frame with one additional character column appended, named
according to new_column, containing values of the form
"<genus> <epithet>". The original genus and epithet columns are
retained. Rows where either input column is NA will produce
"NA <epithet>" or "<genus> NA" in the output column.
taxon_split for splitting a binomial name column back into
separate genus and epithet columns,
taxon_cleaner for standardising genus and epithet columns
before combining,
taxon_validate for validating the combined binomial name
against ITIS and GBIF,
taxon_add for looking up higher taxonomic ranks from the
combined binomial name,
italicize for formatting the combined name for
ggplot2 display.
df <- data.frame( genus = c("Homo", "Panthera", "Canis"), epithet = c("sapiens", "leo", "lupus") ) # Combine with default output column name taxon_combine(df, genus = genus, epithet = epithet) # Use a custom output column name taxon_combine(df, genus = genus, epithet = epithet, new_column = "binomial") # NA in either column is propagated as a string df_na <- data.frame( genus = c("Homo", NA, "Canis"), epithet = c("sapiens", "leo", NA) ) taxon_combine(df_na, genus = genus, epithet = epithet)df <- data.frame( genus = c("Homo", "Panthera", "Canis"), epithet = c("sapiens", "leo", "lupus") ) # Combine with default output column name taxon_combine(df, genus = genus, epithet = epithet) # Use a custom output column name taxon_combine(df, genus = genus, epithet = epithet, new_column = "binomial") # NA in either column is propagated as a string df_na <- data.frame( genus = c("Homo", NA, "Canis"), epithet = c("sapiens", "leo", NA) ) taxon_combine(df_na, genus = genus, epithet = epithet)
Infers the taxonomic rank (e.g. species, genus,
family) of one or more columns based on column name pattern
matching. Returns a named character vector of detected ranks, one per
input column. Useful for verifying rank assignments before calling
taxon_sort or taxon_add.
taxon_rank(data, columns)taxon_rank(data, columns)
data |
A data frame. |
columns |
Column name or |
Detection uses a two-tier approach applied to the lowercased column name:
Strong match — the column name contains a full taxonomic
keyword: scientificname, species, genus,
family, order, class, phylum,
kingdom, or taxon. Strong patterns are checked first and
take priority.
Weak match — for columns not assigned by a strong match, substrings of length 3–5 derived from the strong keywords are checked. The first matching keyword is assigned.
Detection is based on column names only — column values are not inspected.
For content-based detection across all columns in a data frame, use
taxon_column instead.
A named character vector the same length as columns, where names
are the input column names and values are the detected rank as a lowercase
string (e.g. "family", "genus"). Returns NA for any
column whose name does not match a recognised taxonomic rank pattern.
taxon_column for detecting taxonomic columns across an
entire data frame using both name and content patterns,
taxon_sort for sorting columns into standard taxonomic rank
order,
taxon_add for appending higher taxonomic rank columns,
taxon_validate for validation, which uses this function
internally to detect column rank.
df <- data.frame( genus = character(), family_name = character(), my_order = character(), site = character() ) # Detect rank of a single column taxon_rank(df, genus) # Detect ranks of multiple columns taxon_rank(df, c(genus, family_name, my_order)) # NA returned for columns with no recognisable rank pattern taxon_rank(df, c(genus, site))df <- data.frame( genus = character(), family_name = character(), my_order = character(), site = character() ) # Detect rank of a single column taxon_rank(df, genus) # Detect ranks of multiple columns taxon_rank(df, c(genus, family_name, my_order)) # NA returned for columns with no recognisable rank pattern taxon_rank(df, c(genus, site))
Reorders data frame columns so that detected taxonomic rank columns appear
in standard hierarchical order, with non-taxonomic columns preserved in
their original relative positions. Taxonomic columns are detected
automatically via taxon_column, or a custom column order can
be specified manually via manual.
taxon_sort(data, manual = FALSE)taxon_sort(data, manual = FALSE)
data |
A data frame. |
manual |
Optional. A numeric vector of at least two column indices
specifying a custom sort order for those columns. The selected columns
are moved to the position of the lowest index in |
The standard rank hierarchy used for automatic sorting is, from broadest
to most specific: kingdom, subkingdom, infrakingdom,
superphylum, phylum, subphylum, infraphylum,
superclass, class, subclass, infraclass,
superorder, order, suborder, infraorder,
superfamily, family, subfamily, tribe,
subtribe, genus, subgenus, species,
subspecies, variety.
Taxonomic columns not matching a rank in the hierarchy are appended after
the sorted known-rank columns. If multiple columns are detected for the
same rank, a warning is issued and the data frame is returned unsorted –
use taxon_column with output = "list" to inspect
assignments and either rename columns or use manual to specify the
desired order explicitly.
If no taxonomic columns are detected, a message is printed and the original data frame is returned unchanged.
The input data frame with taxonomic columns reordered, all other columns retained in their original relative positions. A console message reports the number of columns sorted, the insertion position, and the resulting column order. If duplicate rank assignments are detected, a warning is issued and the data frame is returned unchanged.
taxon_column for detecting taxonomic columns and inspecting
rank assignments before sorting,
taxon_add for appending higher taxonomic rank columns before
sorting,
taxon_validate for validating taxonomic names before sorting.
df <- data.frame( id = 1:3, species = c("Homo sapiens", "Panthera leo", "Canis lupus"), kingdom = c("Animalia", "Animalia", "Animalia"), family = c("Hominidae", "Felidae", "Canidae"), order = c("Primates", "Carnivora", "Carnivora") ) # Automatic sort into standard rank order taxon_sort(df) # Manual sort by column index taxon_sort(df, manual = c(3, 5, 4, 2)) # Inspect rank assignments first if sort returns a warning taxon_column(df, output = "list")df <- data.frame( id = 1:3, species = c("Homo sapiens", "Panthera leo", "Canis lupus"), kingdom = c("Animalia", "Animalia", "Animalia"), family = c("Hominidae", "Felidae", "Canidae"), order = c("Primates", "Carnivora", "Carnivora") ) # Automatic sort into standard rank order taxon_sort(df) # Manual sort by column index taxon_sort(df, manual = c(3, 5, 4, 2)) # Inspect rank assignments first if sort returns a warning taxon_column(df, output = "list")
Identifies and optionally corrects misspelled taxonomic names using
suggestions from a prior taxon_validate report, or by
running taxon_validate internally if no report is provided.
Names flagged as misspellings or phantoms with available suggestions are
reported and optionally applied; names with no suggestion are flagged for
manual review. When corrections are applied, genus columns detected by
taxon_column are updated automatically from the corrected
binomial. A spellcheck report is attached to the result as an attribute.
taxon_spellcheck( data, column, source = "both", update = FALSE, parallel = FALSE, max_synonym_depth = 3, validation_report = NULL )taxon_spellcheck( data, column, source = "both", update = FALSE, parallel = FALSE, max_synonym_depth = 3, validation_report = NULL )
data |
A data frame. |
column |
Column name or |
source |
Character. Taxonomic reference source passed to the internal
|
update |
Logical. If |
parallel |
Logical. If |
max_synonym_depth |
Integer. Maximum synonym redirect steps passed to
the internal |
validation_report |
Optional. A validation report tibble from a prior
|
When validation_report is NULL, taxon_validate
is run internally on the first column in column only. Passing a
pre-computed report via attr(validated, "validation_report") avoids
redundant API calls when taxon_validate has already been run.
Corrections are matched using match() on canonical names
(authorship stripped from both the input column and the suggestion before
comparison). Corrected values are written as canonical names without
authorship; use taxon_cite to append authorship after
correction.
When update = TRUE, genus columns detected by
taxon_column are updated for corrected rows by extracting
the first word of the corrected binomial. This only fires for rows
containing valid binomial names and skips the source column itself.
Names with status "unmatched" or phantoms without a suggestion are
listed separately in the console output for manual review and appear in
the report with NA in the suggestion column.
The input data frame, with names in column corrected to their
canonical form (authorship stripped) where update = TRUE and
corrections were available. A spellcheck report tibble is attached as
attr(result, "spellcheck_report") with columns:
columnName of the column checked.
originalThe original name as it appeared in the data.
suggestionThe suggested canonical correction (authorship
stripped), or NA if no suggestion is available.
confidenceNA in the current implementation;
reserved for future use.
sourceSource of the suggestion ("taxon_validate"),
or NA for names requiring manual review.
nNumber of rows containing the original name.
statusOne of "misspelling" (suggestion available),
"phantom" (name lacks authorship or publication data with a
suggestion), or "unmatched" (no match found in any source).
Only names with issues appear in the report. Names confirmed as valid are not included.
When validation_report is NULL, this function calls
taxon_validate internally, which queries GBIF and/or ITIS
web services and requires an active internet connection with reliable
access to those servers. To avoid network dependency, run
taxon_validate separately on a stable connection first and
pass the result via attr(result, "validation_report") to avoid
repeated API calls.
Connectivity can be tested before running spellcheck:
# Test ITIS connectivity
taxize::get_tsn("Homo sapiens", accepted = FALSE, verbose = TRUE,
messages = TRUE, ask = FALSE)
# Test GBIF connectivity
rgbif::name_backbone(name = "Homo sapiens", strict = TRUE)
taxon_validate for the underlying validation and synonym
resolution used to generate correction suggestions,
taxon_cleaner for standardising name formatting before
spellchecking,
taxon_column for detecting genus columns updated
automatically when update = TRUE,
taxon_add for appending higher taxonomic rank columns after
spellchecking,
taxon_cite for appending authorship after corrections are
applied.
df <- data.frame( species = c("Homo sapiens", "Panthera leo", "Canis lupus") ) if (requireNamespace("rgbif", quietly = TRUE) && requireNamespace("taxize", quietly = TRUE)) { # Check spelling and report suggestions without applying corrections taxon_spellcheck(df, column = species) # Apply confirmed corrections to the column taxon_spellcheck(df, column = species, update = TRUE) # Pass a pre-computed validation report to avoid re-running taxon_validate validated <- taxon_validate(df, column = species) taxon_spellcheck(df, column = species, validation_report = attr(validated, "validation_report")) # Inspect the spellcheck report result <- taxon_spellcheck(df, column = species) attr(result, "spellcheck_report") # Check multiple columns at once df2 <- data.frame( species = c("Homo sapiens", "Panthera leo"), genus = c("Homo", "Panthara") ) taxon_spellcheck(df2, column = c(species, genus)) }df <- data.frame( species = c("Homo sapiens", "Panthera leo", "Canis lupus") ) if (requireNamespace("rgbif", quietly = TRUE) && requireNamespace("taxize", quietly = TRUE)) { # Check spelling and report suggestions without applying corrections taxon_spellcheck(df, column = species) # Apply confirmed corrections to the column taxon_spellcheck(df, column = species, update = TRUE) # Pass a pre-computed validation report to avoid re-running taxon_validate validated <- taxon_validate(df, column = species) taxon_spellcheck(df, column = species, validation_report = attr(validated, "validation_report")) # Inspect the spellcheck report result <- taxon_spellcheck(df, column = species) attr(result, "spellcheck_report") # Check multiple columns at once df2 <- data.frame( species = c("Homo sapiens", "Panthera leo"), genus = c("Homo", "Panthara") ) taxon_spellcheck(df2, column = c(species, genus)) }
Splits a binomial scientific name column into separate genus and epithet
columns, appended to the data frame. The inverse of this operation is
taxon_combine. Only values matching strict binomial format
("Genus epithet") are split; non-conforming values produce
NA in both output columns.
taxon_split(data, column, genus = NULL, epithet = NULL, drop_na = FALSE)taxon_split(data, column, genus = NULL, epithet = NULL, drop_na = FALSE)
data |
A data frame. |
column |
Column name of the binomial name column to split, supplied
either unquoted ( |
genus |
Optional. Name for the genus output column. Default is
|
epithet |
Optional. Name for the epithet output column. Default is
|
drop_na |
Logical. If |
Splitting is performed by splitting on the first space. A value is
considered a valid binomial if it matches the pattern
"^[A-Z][a-z]+ [a-z]+$" — an initial-capitalised genus followed by
a single lowercase epithet, separated by one space. Values with
authorship, infraspecific ranks, uncertainty markers (cf.,
sp.), or extra whitespace will not match and produce NA.
Use taxon_cleaner to standardise formatting before splitting.
The input data frame with two additional character columns appended, named
according to genus and epithet. The original column
is retained. Values that do not match the expected binomial format produce
NA in both output columns.
taxon_combine for merging separate genus and epithet columns
into a binomial name,
taxon_cleaner for standardising binomial name formatting
before splitting,
taxon_validate for validating split columns against ITIS
and GBIF,
taxon_add for appending higher taxonomic rank columns after
splitting.
df <- data.frame( scientific_name = c("Homo sapiens", "Panthera leo", "Canis lupus") ) # Split with default output column names taxon_split(df, column = scientific_name) # Use custom output column names taxon_split(df, column = scientific_name, genus = "gen", epithet = "sp") # Non-conforming values produce NA in both output columns df_mixed <- data.frame( scientific_name = c("Homo sapiens", "Canis cf. lupus", "Ursus sp.", "panthera leo") ) taxon_split(df_mixed, column = scientific_name) # Drop rows that fail to split taxon_split(df_mixed, column = scientific_name, drop_na = TRUE)df <- data.frame( scientific_name = c("Homo sapiens", "Panthera leo", "Canis lupus") ) # Split with default output column names taxon_split(df, column = scientific_name) # Use custom output column names taxon_split(df, column = scientific_name, genus = "gen", epithet = "sp") # Non-conforming values produce NA in both output columns df_mixed <- data.frame( scientific_name = c("Homo sapiens", "Canis cf. lupus", "Ursus sp.", "panthera leo") ) taxon_split(df_mixed, column = scientific_name) # Drop rows that fail to split taxon_split(df_mixed, column = scientific_name, drop_na = TRUE)
Validates taxonomic names in a column against ITIS and/or GBIF, updating
synonyms to accepted canonical names and resolving authorship for matched
names. Validation proceeds in up to five passes: ITIS strict and synonym
matching, ITIS substitution search, GBIF strict matching, GBIF fuzzy
matching, and authorship resolution. Authorship is stored internally for
use by taxon_cite but is not written to the column —
validated columns contain canonical names only. A validation report is
attached to the result as an attribute.
taxon_validate( data, column, source = "both", update_related = FALSE, parallel = FALSE, max_synonym_depth = 3, drop_na = FALSE )taxon_validate( data, column, source = "both", update_related = FALSE, parallel = FALSE, max_synonym_depth = 3, drop_na = FALSE )
data |
A data frame. |
column |
Column name of the taxonomic column to validate, supplied
either unquoted ( |
source |
Character. Taxonomic reference source. One of |
update_related |
Logical. If |
parallel |
Logical. If |
max_synonym_depth |
Integer. Maximum number of synonym redirect steps
to follow in GBIF before accepting the current name. Default is |
drop_na |
Logical. If |
Validation proceeds in five sequential passes per column:
ITIS strict and synonym — names are looked up directly
in ITIS via taxize::get_tsn() and taxize::itis_getrecord();
synonyms are resolved to their accepted name. Only names matching the
pattern "^[A-Z][a-z]+" without digits or special characters are
submitted to ITIS.
ITIS substitution search — for names unmatched in pass 1,
genus and epithet substrings are compared against known values in the
column using edit distance (adist(), threshold ) to
suggest corrections.
GBIF strict — remaining unmatched names are looked up via
rgbif::name_backbone(strict = TRUE). Names where the genus lacks
authorship or publication data in both GBIF and ITIS are flagged as
phantoms.
GBIF fuzzy — names still unmatched are looked up with
strict = FALSE. Fuzzy matches differing from the input are
reported as misspelling suggestions only and not applied automatically.
Authorship resolution — all resolved canonical names are
enriched with authorship via GBIF with ITIS as a fallback. Authorship
is stored internally and used by taxon_cite; it is not
written to the column.
Results are memoised to disk via memoise and cachem in
tools::R_user_dir("taxon_add", "cache"). Repeated calls for the
same names are fast. Only unique non-NA values are looked up.
Required packages vary by source: rgbif for GBIF,
taxize for ITIS, and furrr and future for parallel
execution. Informative errors are raised if required packages are not
installed.
Use taxon_cleaner to standardise name formatting before
validation, and taxon_column to inspect related taxonomic
columns that can be updated via update_related.
The input data frame with taxonomic names in column updated to
accepted canonical names where matches were found. Authorship is not
written to the column; use taxon_cite to append authorship
after validation. A validation report tibble is attached as
attr(result, "validation_report") with columns:
columnName of the column validated.
originalThe original name as it appeared in the data.
acceptedThe accepted or suggested name, or NA if
unresolved.
nNumber of rows containing the original name.
statusOne of "updated" (synonym resolved to
accepted name), "misspelling" (fuzzy match suggestion
available), "phantom" (name lacks authorship or publication
data), or "unmatched" (no match found in any source).
Only names that were updated, flagged as misspellings, identified as phantoms, or left unmatched appear in the report. Confirmed valid names are not reported.
This function queries external web services (GBIF via rgbif and/or
ITIS via taxize) and requires an active internet connection with
reliable access to those servers. Performance on unstable or restricted
connections (e.g. public WiFi, VPN, or firewalled networks) may be slow
or produce incomplete results. Previously queried names are cached to disk
via memoise at tools::R_user_dir("taxon_add", "cache"), so
running on a stable connection first will speed up subsequent calls
regardless of connection quality.
Connectivity can be tested before running validation:
# Test ITIS connectivity
taxize::get_tsn("Homo sapiens", accepted = FALSE, verbose = TRUE,
messages = TRUE, ask = FALSE)
# Test GBIF connectivity
rgbif::name_backbone(name = "Homo sapiens", strict = TRUE)
taxon_cleaner for standardising taxonomic name formatting
before validation,
taxon_column for detecting related taxonomic columns updated
by update_related,
taxon_spellcheck for correcting misspellings flagged in the
validation report,
taxon_add for appending higher rank columns after validation,
taxon_cite for appending authorship after validation.
df <- data.frame( species = c("Homo sapiens", "Panthera leo", "Canis lupus familiaris") ) if (requireNamespace("rgbif", quietly = TRUE) && requireNamespace("taxize", quietly = TRUE)) { # Validate against both ITIS and GBIF taxon_validate(df, column = species) # Validate using GBIF only taxon_validate(df, column = species, source = "gbif") # Update related taxonomic columns for rows where names changed df2 <- data.frame( species = c("Homo sapiens", "Panthera leo"), genus = c("Homo", "Panthera"), family = c("Hominidae", "Felidae") ) taxon_validate(df2, column = species, update_related = TRUE) # Inspect the validation report result <- taxon_validate(df, column = species) attr(result, "validation_report") # Pass the report to taxon_spellcheck to correct misspellings result |> taxon_spellcheck(column = species, validation_report = attr(result, "validation_report"), update = TRUE) # Enable parallel API calls taxon_validate(df, column = species, parallel = TRUE) }df <- data.frame( species = c("Homo sapiens", "Panthera leo", "Canis lupus familiaris") ) if (requireNamespace("rgbif", quietly = TRUE) && requireNamespace("taxize", quietly = TRUE)) { # Validate against both ITIS and GBIF taxon_validate(df, column = species) # Validate using GBIF only taxon_validate(df, column = species, source = "gbif") # Update related taxonomic columns for rows where names changed df2 <- data.frame( species = c("Homo sapiens", "Panthera leo"), genus = c("Homo", "Panthera"), family = c("Hominidae", "Felidae") ) taxon_validate(df2, column = species, update_related = TRUE) # Inspect the validation report result <- taxon_validate(df, column = species) attr(result, "validation_report") # Pass the report to taxon_spellcheck to correct misspellings result |> taxon_spellcheck(column = species, validation_report = attr(result, "validation_report"), update = TRUE) # Enable parallel API calls taxon_validate(df, column = species, parallel = TRUE) }