Title: | Occurrence Data Cleaning |
---|---|
Description: | Flags and checks occurrence data that are in Darwin Core format. The package includes generic functions and data as well as some that are specific to bees. This package is meant to build upon and be complementary to other excellent occurrence cleaning packages, including 'bdc' and 'CoordinateCleaner'. This package uses datasets from several sources and particularly from the Discover Life Website, created by Ascher and Pickering (2020). For further information, please see the original publication and package website. Publication - Dorey et al. (2023) <doi:10.1101/2023.06.30.547152> and package website - Dorey et al. (2023) <https://github.com/jbdorey/BeeBDC>. |
Authors: | James B. Dorey [aut, cre, cph] , Robert L. O'Reilly [aut] , Silas Bossert [aut] , Erica E. Fischer [aut] |
Maintainer: | James B. Dorey <[email protected]> |
License: | GPL (>= 3) |
Version: | 1.2.1 |
Built: | 2024-11-05 06:45:36 UTC |
Source: | CRAN |
Downloads ALA data and creates a new file in the path to put those data. This function can also request downloads from other atlases (see: http://galah.ala.org.au/articles/choosing_an_atlas.html). However, it will only send the download to your email and you must do the rest yourself at this point.
atlasDownloader( path, userEmail = NULL, ALA_taxon, DL_reason = 4, atlas = "ALA" )
path |
A character directory. The path to a folder where the download will be stored. |
userEmail |
A character string. The email associated with the user's ALA account; the user must create an ALA account to download data. |
ALA_taxon |
A character string. The taxon to download from ALA. Uses |
DL_reason |
Numeric. The reason for data download according to |
atlas |
Character. The atlas to download occurrence data from - see here https://galah.ala.org.au/R/articles/choosing_an_atlas.html for details. Note: the default is "ALA" and is probably the only atlas which will work seamlessly with the rest of the workflow. However, different atlases can still be downloaded and a doi will be sent to your email. |
Completes an ALA data download and saves those data to the path provided.
## Not run:
atlasDownloader(path = DataPath,
                userEmail = "InsertYourEmail",
                ALA_taxon = "Apiformes",
                DL_reason = 4)
## End(Not run)
A simple function to return information about a particular species, including name validity and country occurrences.
BeeBDCQuery( beeName = NULL, searchChecklist = TRUE, printAllSynonyms = FALSE, beesChecklist = NULL, beesTaxonomy = NULL )
beeName |
Character or character vector. A single or several bee species names to search for in the beesTaxonomy and beesChecklist tables. |
searchChecklist |
Logical. If TRUE (default), search the country checklist for each species. |
printAllSynonyms |
Logical. If TRUE, all synonyms will be printed out for each entered name. Default = FALSE. |
beesChecklist |
A tibble. The bee checklist file for BeeBDC. If is NULL then
|
beesTaxonomy |
A tibble. The bee taxonomy file for BeeBDC. If is NULL then
|
Returns a list with the elements 'taxonomyReport' and 'SynonymReport'. If searchChecklist is TRUE, then 'checklistReport' will also be returned.
# For the sake of these examples, we will use the example taxonomy and checklist
system.file("extdata", "testTaxonomy.rda", package="BeeBDC") |> load()
system.file("extdata", "testChecklist.rda", package="BeeBDC") |> load()

# Single entry example
testQuery <- BeeBDCQuery(
  beeName = "Lasioglossum bicingulatum",
  searchChecklist = TRUE,
  printAllSynonyms = TRUE,
  beesTaxonomy = testTaxonomy,
  beesChecklist = testChecklist)

# Multiple entry example
testQuery <- BeeBDCQuery(
  beeName = c("Lasioglossum bicingulatum",
              "Nomada flavopicta",
              "Lasioglossum fijiense (Perkins and Cheesman, 1928)"),
  searchChecklist = TRUE,
  printAllSynonyms = TRUE,
  beesTaxonomy = testTaxonomy,
  beesChecklist = testChecklist)

# Example way to examine a report from the output list
testQuery$checklistReport
This test dataset includes 105 random occurrence records from three bee species. The included species are: "Agapostemon tyleri Cockerell, 1917", "Centris rhodopus Cockerell, 1897", and "Perdita octomaculata (Say, 1824)".
data("bees3sp", package = "BeeBDC")
An object of class "tibble"
Occurrence code generated in bdc or BeeBDC
Full scientificName as shown on DiscoverLife
Family name
Subfamily name
Genus name
Subgenus name
Full scientific name with subspecies name - ALA column
The species name (specific epithet) only
The subspecies name (intraspecific epithet) only
The full scientific name, with authorship and date information if known, of the currently valid (zoological) or accepted (botanical) taxon.
The taxonomic rank of the most specific name in the scientificName column.
The authorship information for the scientificName column formatted according to the conventions of the applicable nomenclaturalCode.
A brief phrase or a standard term ("cf.", "aff.") to express the determiner's doubts about the identification.
A list (concatenated and separated) of taxon names terminating at the rank immediately superior to the taxon referenced in the taxon record.
A list (concatenated and separated) of references (e.g. publications, global unique identifier, URI, etc.) used in the identification of the occurrence.
A list (concatenated and separated) of nomenclatural types (e.g. type status, typified scientific name, publication) applied to the occurrence.
A list (concatenated and separated) of previous assignments of names to the occurrence.
This term is meant to allow the capture of an unaltered original identification/determination, including identification qualifiers, hybrid formulas, uncertainties, etc. This term is meant to be used in addition to scientificName (and identificationQualifier etc.), not instead of it.
A list (concatenated and separated) of names of people, groups, or organizations who assigned the Taxon to the subject.
The date on which the occurrence was identified as belonging to a taxon.
The geographic latitude (in decimal degrees, using the spatial reference system given in geodeticDatum) of the geographic center of a location. Positive values are north of the Equator, negative values are south of it, and valid values lie between -90 and 90, inclusive.
The geographic longitude (in decimal degrees, using the spatial reference system given in geodeticDatum) of the geographic center of a location. Positive values are east of the Greenwich Meridian, and negative values are west of it. Valid values lie between -180 and 180, inclusive.
The name of the next smaller administrative region than country (e.g. state, province, canton, department, region, etc.) in which the location for the occurrence is found.
The name of the continent in which the location for the occurrence is found.
A specific description of the place the occurrence was found.
The name of the island on or near which the location for the occurrence is found, if applicable.
The full, unabbreviated name of the next smaller administrative region than stateProvince (e.g. county, shire, department, etc.) in which the location for the occurrence is found.
The full, unabbreviated name of the next smaller administrative region than county (e.g. city, municipality, etc.) in which the location for the occurrence is found. Do not use this term for a nearby named place that does not contain the actual location for the occurrence.
A legal document giving official permission to do something with the resource.
A GBIF-defined issue.
The time or interval during which the Event occurred. For occurrences, this is the time or interval when the event was recorded.
The time or interval during which an Event occurred.
The integer day of the month on which the Event occurred. For occurrences, this is the day when the event was recorded.
The integer month in which the Event occurred. For occurrences, this is the month of when the event was recorded.
The four-digit year in which the Event occurred, according to the Common Era Calendar. For occurrences, this is the year when the event was recorded.
The specific nature of the data record. Recommended best practice is to use the standard label of one of the Darwin Core classes: PreservedSpecimen, FossilSpecimen, LivingSpecimen, MaterialSample, Event, HumanObservation, MachineObservation, Taxon, Occurrence, MaterialCitation.
The name of the country or major administrative unit in which the location for the occurrence is found.
The nature or genre of the resource. StillImage, MovingImage, Sound, PhysicalObject, Event, Text.
A statement about the presence or absence of a Taxon at a Location. present, absent.
An identifier given to the Occurrence at the time it was recorded. Often serves as a link between field notes and an Occurrence record, such as a specimen collector's number.
A list (concatenated and separated) of names of people, groups, or organizations responsible for recording the original Occurrence. The primary collector or observer, especially one who applies a personal identifier (recordNumber), should be listed first.
An identifier for the set of information associated with an Event (something that occurs at a place and time). May be a global unique identifier or an identifier specific to the data set.
A spatial region or named place.
The names of, references to, or descriptions of the methods or protocols used during an Event. Examples UV light trap, mist net, bottom trawl, ad hoc observation | point count, Penguins from space: faecal stains reveal the location of emperor penguin colonies, https://doi.org/10.1111/j.1466-8238.2009.00467.x, Takats et al. 2001.
The amount of effort expended during an Event. Examples 40 trap-nights, 10 observer-hours, 10 km by foot, 30 km by car.
The number of individuals present at the time of the Occurrence. Integer.
A number or enumeration value for the quantity of organisms. Examples 27 (organismQuantity) with individuals (organismQuantityType). 12.5 (organismQuantity) with percentage biomass (organismQuantityType). r (organismQuantity) with Braun Blanquet Scale (organismQuantityType). many (organismQuantity) with individuals (organismQuantityType).
A decimal representation of the precision of the coordinates given in the decimalLatitude and decimalLongitude.
The horizontal distance (in meters) from the given decimalLatitude and decimalLongitude describing the smallest circle containing the whole of the Location. Leave the value empty if the uncertainty is unknown, cannot be estimated, or is not applicable (because there are no coordinates). Zero is not a valid value for this term.
Occurrence records in the ALA can be filtered by using the spatially valid flag. This flag combines a set of tests applied to the record to assess how reliable its spatial data components are.
An identifier (preferably unique) for the record within the data set or collection.
The identifier assigned by GBIF for each record.
An identifier for the set of data. May be a global unique identifier or an identifier specific to a collection or institution.
The name (or acronym) in use by the institution having custody of the object(s) or information referred to in the record. Examples MVZ, FMNH, CLO, UCMP.
The name identifying the data set from which the record was derived.
A list (concatenated and separated) of previous or alternate fully qualified catalog numbers or other human-used identifiers for the same Occurrence, whether in the current or any other data set or collection.
An identifier for the Occurrence (as opposed to a particular digital record of the occurrence). In the absence of a persistent global unique identifier, construct one from a combination of identifiers in the record that will most closely make the occurrenceID globally unique.
The GBIF-assigned taxon identifier number.
An identifier for the collection or dataset from which the record was derived.
Scientific name as recorded on specimen label, not necessarily valid.
The verbatim original representation of the date and time information for an event. For occurrences, this is the date-time when the event was recorded as noted by the collector.
A list (concatenated and separated) of identifiers or names of taxa and the associations of this occurrence to each of them.
A list (concatenated and separated) of identifiers of other Organisms and the associations of this occurrence to each of them.
One of (a) an indicator of the existence of, (b) a reference to (publication, URI), or (c) the text of notes taken in the field about the Event.
The sex of the biological individual(s) represented in the Occurrence.
A description of the usage rights applicable to the record.
A person or organization owning or managing rights over the resource.
Information about who can access the resource or an indication of its security status.
A list (concatenated and separated) of identifiers (publication, bibliographic reference, global unique identifier, URI) of literature associated with the Occurrence.
A bibliographic reference for the resource as a statement indicating how this record should be cited (attributed) when used.
A related resource that is referenced, cited, or otherwise pointed to by the described resource.
Additional information that exists, but that has not been shared in the given record.
The code for another occurrence but for the same specimen.
Variable indicating presence/absence of location coordinates.
Variable indicating validity of geospatial data associated with record.
Year associated with Occurrence.
Variable with identifying value for the Occurrence.
Variable indicating whether the Occurrence is a duplicate or not.
A list (concatenated and separated) of identifiers of other occurrence records and their associations to this occurrence.
Comments or notes about the Location.
BeeBDC assigned source of the data. Often written when the data is formatted by a BeeBDC::xxx_readr function or similar.
The verbatim (originally-provided) scientific name
Flag produced by bdc::bdc_scientificName_empty()
where FALSE == no scientific name provided and TRUE means that there is text in that column.
Flag produced by bdc::bdc_coordinates_empty()
where FALSE == no coordinates provided.
Flag column produced by bdc::bdc_coordinates_outOfRange() where FALSE == coordinates represent a point off of the Earth. This is to say, the function identifies records with out-of-range coordinates (not between -90 and 90 for latitude; not between -180 and 180 for longitude).
Flag produced by bdc::bdc_basisOfRecords_notStandard()
where FALSE == an occurrence with a basisOfRecord not defined as acceptable by the user.
A country name suggested by the bdc::bdc_country_standardized()
function.
A country code suggested by the bdc::bdc_country_standardized()
function.
A column indicating if coordinates were identified as being transposed by the function jbd_Ctrans_chunker()
where FALSE == transposed.
A flag generated by jbd_coordCountryInconsistent()
where FALSE == an occurrence where the country name and coordinates did not match.
A flag generated by flagAbsent()
where FALSE == occurrences marked as "ABSENT" in the "occurrenceStatus" column
A flag generated by flagLicense()
where FALSE == those occurrences protected by a restrictive license.
A flag generated by GBIFissues()
where FALSE == an occurrence with user-specified GBIF issues to flag.
A flag generated by bdc::bdc_clean_names()
where FALSE == the presence of taxonomic uncertainty terms.
A column made by bdc::bdc_clean_names()
indicating the cleaned scientificName
A flag generated by harmoniseR()
where FALSE == occurrences whose scientificName did not match the Discover Life taxonomy.
A flag generated by CoordinateCleaner::clean_coordinates()
where FALSE == rounded (probably imprecise) coordinates.
A flag generated by CoordinateCleaner::clean_coordinates()
where FALSE == invalid coordinates.
A flag generated by CoordinateCleaner::clean_coordinates()
where FALSE == equal coordinates (e.g., 0.1, 0.1).
A flag generated by CoordinateCleaner::clean_coordinates()
where FALSE == zeros as coordinates
A flag generated by CoordinateCleaner::clean_coordinates()
where FALSE == records around country capital centroid.
A flag generated by CoordinateCleaner::clean_coordinates()
where FALSE == records around country or province centroids.
A flag generated by CoordinateCleaner::clean_coordinates()
where FALSE == records around the GBIF headquarters.
A flag generated by CoordinateCleaner::clean_coordinates()
where FALSE == records around biodiversity institutions.
A flag generated by diagonAlley()
where FALSE == records that are possibly the result of fill-down errors in sequence.
A flag generated by CoordinateCleaner::cd_round()
where FALSE == potential gridding in the longitude column within dataset.
A flag generated by CoordinateCleaner::cd_round()
where FALSE == potential gridding in the latitude column within dataset.
A flag generated by CoordinateCleaner::cd_round()
where FALSE == potential gridding in either the longitude or latitude columns within dataset.
A flag generated by coordUncerFlagR()
where FALSE == occurrences that did not pass a user-specified threshold in the "coordinateUncertaintyInMeters" column.
A column made by countryOutlieRs().
Summarises the occurrence-level result: the species is not known to occur in that country (noMatch), it is known from a bordering country (neighbour), or it is known to occur in that country (exact).
A flag generated by countryOutlieRs()
where FALSE == occurrences that do not occur in a country that concurs with the Discover Life country checklist OR an adjacent country.
A flag generated by countryOutlieRs()
where FALSE == occurrences that are in the ocean.
A flag generated by summaryFun()
where FALSE == occurrences flagged as FALSE in any of the .flag columns. In this example it excludes flags in the ".gridSummary", ".lonFlag", ".latFlag", and ".uncer_terms" columns.
A flag generated by bdc::bdc_eventDate_empty()
where FALSE == occurrences with no eventDate provided.
A flag column generated by bdc::bdc_year_outOfRange()
where FALSE == occurrences older than a threshold date. In the case of the bee dataset used in this package, the lower threshold is 1950.
A flag generated by dupeSummary()
where FALSE == occurrences identified as duplicates. There will be an associated kept duplicate (.duplicates == TRUE) for all duplicate clusters.
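The flag convention used throughout these columns (FALSE marks a flagged record, TRUE a record that passed) can be illustrated with a minimal base-R sketch. The toy data frame, the column names .flagA and .flagB, and the helper summariseFlags() are hypothetical and exist only for illustration. This is not the implementation of summaryFun(); it only approximates the behaviour described for the summary column, where a record is FALSE if any included flag column is FALSE. Treating NA as a pass is an assumption made here.

```r
# Toy flag columns (invented values); FALSE means the record was flagged.
toyFlags <- data.frame(
  .flagA       = c(TRUE, TRUE, FALSE, TRUE),
  .flagB       = c(TRUE, FALSE, TRUE, TRUE),
  .uncer_terms = c(TRUE, TRUE, TRUE, FALSE),
  check.names = FALSE
)

# Hypothetical helper: a record passes only if every included flag is TRUE.
summariseFlags <- function(data, exclude = character()) {
  # Flag columns are assumed to be those whose names start with "."
  flagCols <- grep("^\\.", names(data), value = TRUE)
  flagCols <- setdiff(flagCols, exclude)
  # Assumption: NA (test not run) counts as a pass
  passed <- lapply(data[flagCols], function(x) is.na(x) | x)
  # Combine all flag columns row-wise with a logical AND
  Reduce(`&`, passed, rep(TRUE, nrow(data)))
}

# Leave .uncer_terms out of the summary, as in the example described above
toyFlags$.summary <- summariseFlags(toyFlags, exclude = ".uncer_terms")
print(toyFlags)
```

Records can then be kept or dropped with an ordinary logical subset, for example toyFlags[toyFlags$.summary, ].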
A small bee occurrence dataset with flags generated by BeeBDC which can be used to run the example script and to test functions. For data types, see ColTypeR().
This data set was created by generating a random subset of 105 rows from the full BeeBDC dataset from the publication: Dorey, J.B., Fischer, E.E., Chesshire, P.R., Nava-Bolaños, A., O’Reilly, R.L., Bossert, S., Collins, S.M., Lichtenberg, E.M., Tucker, E., Smith-Pardo, A., Falcon-Brindis, A., Guevara, D.A., Ribeiro, B.R., de Pedro, D., Hung, J.K.-L., Parys, K.A., McCabe, L.M., Rogan, M.S., Minckley, R.L., Velazco, S.J.E., Griswold, T., Zarrillo, T.A., Jetz, W., Sica, Y.V., Orr, M.C., Guzman, L.M., Ascher, J., Hughes, A.C. & Cobb, N.S. (2023) A globally synthesised and flagged bee occurrence dataset and cleaning workflow. Scientific Data, 10, 1–17. https://www.doi.org/10.1038/S41597-023-02626-W
bees3sp <- BeeBDC::bees3sp
head(bees3sp)
Downloads the table that contains taxonomic and country information for the bees of the world, based on data collated on Discover Life. The data will be sourced from the BeeBDC article's Figshare.
Note that sometimes the download might not work without restarting R. In this case, you could alternatively download the dataset from the URL below and then read it in using base::readRDS("filePath.Rda").
beesChecklist(URL = "https://figshare.com/ndownloader/files/47092720", ...)
URL |
A character vector to the FigShare location of the dataset. The default will be to the most-recent version. |
... |
Extra variables that can be passed to |
A downloaded beesChecklist.Rda file in the outPath and the same tibble returned to the environment.
**Column details**
validName The valid scientificName as it should occur in the scientificName column.
DiscoverLife_name The full country name as it occurs on Discover Life.
rNaturalEarth_name Country name from rnaturalearth's name_long and type = "map_units".
shortName A short version of the country name.
continent The continent where that country is found.
DiscoverLife_ISO The ISO country name as it occurs on Discover Life.
Alpha-2 Alpha-2 from rnaturalearth.
iso_a3_eh iso_a3_eh from rnaturalearth.
official Official country name = "yes" or only a Discover Life name = "no".
Source A text string denoting the source or author of the name-country pair.
matchCertainty Quality of the name's match to the Discover Life checklist.
canonical The valid species name without scientificNameAuthority.
canonical_withFlags The validName without the scientificNameAuthority but with Discover Life flags.
family Bee family.
subfamily Bee subfamily.
genus Bee genus.
subgenus Bee subgenus.
infraspecies Bee infraSpecificEpithet.
species Bee specificEpithet.
scientificNameAuthorship Bee scientificNameAuthorship.
taxon_rank Rank of the taxon name.
Notes Discover Life country name notes.
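As a rough illustration of how the columns above relate to one another, the sketch below looks up the countries recorded for a name in a checklist-shaped table. The toy rows, the placeholder species names, and the helper countriesFor() are hypothetical; only the column names validName, rNaturalEarth_name, and continent are taken from the list above.

```r
# Toy checklist with placeholder names (not real taxa or real records)
toyChecklist <- data.frame(
  validName = c("Genus alpha Author, 1900",
                "Genus alpha Author, 1900",
                "Genus beta Author, 1950"),
  rNaturalEarth_name = c("Australia", "Fiji", "Fiji"),
  continent = c("Oceania", "Oceania", "Oceania"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: unique, sorted countries listed for one valid name
countriesFor <- function(checklist, name) {
  hits <- checklist[checklist$validName == name, , drop = FALSE]
  sort(unique(hits$rNaturalEarth_name))
}

countriesFor(toyChecklist, "Genus alpha Author, 1900")
```

A name that is absent from the table simply returns an empty character vector, which makes the result easy to test with length().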
This dataset was created using the Discover Life checklist and taxonomy. Dataset is from the publication: Dorey, J.B., Fischer, E.E., Chesshire, P.R., Nava-Bolaños, A., O’Reilly, R.L., Bossert, S., Collins, S.M., Lichtenberg, E.M., Tucker, E., Smith-Pardo, A., Falcon-Brindis, A., Guevara, D.A., Ribeiro, B.R., de Pedro, D., Hung, J.K.-L., Parys, K.A., McCabe, L.M., Rogan, M.S., Minckley, R.L., Velazco, S.J.E., Griswold, T., Zarrillo, T.A., Jetz, W., Sica, Y.V., Orr, M.C., Guzman, L.M., Ascher, J., Hughes, A.C. & Cobb, N.S. (2023) A globally synthesised and flagged bee occurrence dataset and cleaning workflow. Scientific Data, 10, 1–17. https://www.doi.org/10.1038/S41597-023-02626-W The checklist data are mostly compiled from Discover Life data, www.discoverlife.org: Ascher, J.S. & Pickering, J. (2020) Discover Life bee species guide and world checklist (Hymenoptera: Apoidea: Anthophila). http://www.discoverlife.org/mp/20q?guide=Apoidea_species
See beesTaxonomy() for further context.
## Not run:
beesChecklist <- BeeBDC::beesChecklist()
## End(Not run)
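The manual fallback mentioned in the description (download the file yourself, then read it with base::readRDS()) could be wrapped as in the sketch below. The helper name readLocalChecklist() and the file paths are hypothetical, and the sketch assumes that the downloaded file is in a format that base::readRDS() can read, as the description suggests.

```r
# Hypothetical helper: read a manually downloaded checklist file from disk
readLocalChecklist <- function(filePath) {
  if (!file.exists(filePath)) {
    stop("No file found at: ", filePath,
         ". Download the dataset from the URL first.")
  }
  base::readRDS(filePath)
}

# Demonstration with a stand-in file rather than the real checklist
tmpFile <- tempfile(fileext = ".Rda")
saveRDS(data.frame(validName = "placeholder"), tmpFile)
localChecklist <- readLocalChecklist(tmpFile)
str(localChecklist)
```

Checking for the file first gives a clearer error message than the one readRDS() raises when the path is wrong.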
A small bee occurrence dataset with flags generated by BeeBDC, used to run the example script and to test functions. For data types, see ColTypeR().
data("beesFlagged", package = "BeeBDC")
An object of class "tibble"
Occurrence code generated in bdc or BeeBDC
Full scientificName as shown on DiscoverLife
Family name
Subfamily name
Genus name
Subgenus name
Full name with subspecies name - ALA column
The species name only
The subspecies name only
The full name, with authorship and date information if known, of the currently valid (zoological) or accepted (botanical) taxon.
The taxonomic rank of the most specific name in the scientificName.
The authorship information for the scientificName formatted according to the conventions of the applicable nomenclaturalCode.
A brief phrase or a standard term ("cf.", "aff.") to express the determiner's doubts about the Identification.
A list (concatenated and separated) of taxon names terminating at the rank immediately superior to the taxon referenced in the taxon record.
A list (concatenated and separated) of references (publication, global unique identifier, URI) used in the Identification.
A list (concatenated and separated) of nomenclatural types (type status, typified scientific name, publication) applied to the subject.
A list (concatenated and separated) of previous assignments of names to the Organism.
This term is meant to allow the capture of an unaltered original identification/determination, including identification qualifiers, hybrid formulas, uncertainties, etc. This term is meant to be used in addition to scientificName (and identificationQualifier etc.), not instead of it.
A list (concatenated and separated) of names of people, groups, or organizations who assigned the Taxon to the subject.
The date on which the subject was determined as representing the Taxon.
The geographic latitude (in decimal degrees, using the spatial reference system given in geodeticDatum) of the geographic center of a Location. Positive values are north of the Equator, negative values are south of it. Legal values lie between -90 and 90, inclusive.
The geographic longitude (in decimal degrees, using the spatial reference system given in geodeticDatum) of the geographic center of a Location. Positive values are east of the Greenwich Meridian, negative values are west of it. Legal values lie between -180 and 180, inclusive.
The name of the next smaller administrative region than country (state, province, canton, department, region, etc.) in which the Location occurs.
The name of the continent in which the Location occurs.
The specific description of the place.
The name of the island on or near which the Location occurs.
The full, unabbreviated name of the next smaller administrative region than stateProvince (county, shire, department, etc.) in which the Location occurs.
The full, unabbreviated name of the next smaller administrative region than county (city, municipality, etc.) in which the Location occurs. Do not use this term for a nearby named place that does not contain the actual location.
A legal document giving official permission to do something with the resource.
A GBIF-defined issue.
The date-time or interval during which an Event occurred. For occurrences, this is the date-time when the event was recorded. Not suitable for a time in a geological context.
The time or interval during which an Event occurred.
The integer day of the month on which the Event occurred.
The integer month in which the Event occurred.
The four-digit year in which the Event occurred, according to the Common Era Calendar.
The specific nature of the data record. Recommended best practice is to use the standard label of one of the Darwin Core classes: PreservedSpecimen, FossilSpecimen, LivingSpecimen, MaterialSample, Event, HumanObservation, MachineObservation, Taxon, Occurrence, MaterialCitation.
The name of the country or major administrative unit in which the Location occurs.
The nature or genre of the resource. StillImage, MovingImage, Sound, PhysicalObject, Event, Text.
A statement about the presence or absence of a Taxon at a Location. present, absent.
An identifier given to the Occurrence at the time it was recorded. Often serves as a link between field notes and an Occurrence record, such as a specimen collector's number.
A list (concatenated and separated) of names of people, groups, or organizations responsible for recording the original Occurrence. The primary collector or observer, especially one who applies a personal identifier (recordNumber), should be listed first.
An identifier for the set of information associated with an Event (something that occurs at a place and time). May be a global unique identifier or an identifier specific to the data set.
A spatial region or named place.
The names of, references to, or descriptions of the methods or protocols used during an Event. Examples UV light trap, mist net, bottom trawl, ad hoc observation | point count, Penguins from space: faecal stains reveal the location of emperor penguin colonies, https://doi.org/10.1111/j.1466-8238.2009.00467.x, Takats et al. 2001.
The amount of effort expended during an Event. Examples 40 trap-nights, 10 observer-hours, 10 km by foot, 30 km by car.
The number of individuals present at the time of the Occurrence. Integer.
A number or enumeration value for the quantity of organisms. Examples 27 (organismQuantity) with individuals (organismQuantityType). 12.5 (organismQuantity) with percentage biomass (organismQuantityType). r (organismQuantity) with Braun Blanquet Scale (organismQuantityType). many (organismQuantity) with individuals (organismQuantityType).
A decimal representation of the precision of the coordinates given in the decimalLatitude and decimalLongitude.
The horizontal distance (in meters) from the given decimalLatitude and decimalLongitude describing the smallest circle containing the whole of the Location. Leave the value empty if the uncertainty is unknown, cannot be estimated, or is not applicable (because there are no coordinates). Zero is not a valid value for this term.
Occurrence records in the ALA can be filtered by using the spatially valid flag. This flag combines a set of tests applied to the record to assess how reliable its spatial data components are.
An identifier (preferably unique) for the record within the data set or collection.
The identifier assigned by GBIF for each record.
An identifier for the set of data. May be a global unique identifier or an identifier specific to a collection or institution.
The name (or acronym) in use by the institution having custody of the object(s) or information referred to in the record. Examples MVZ, FMNH, CLO, UCMP.
The name identifying the data set from which the record was derived.
A list (concatenated and separated) of previous or alternate fully qualified catalog numbers or other human-used identifiers for the same Occurrence, whether in the current or any other data set or collection.
An identifier for the Occurrence (as opposed to a particular digital record of the occurrence). In the absence of a persistent global unique identifier, construct one from a combination of identifiers in the record that will most closely make the occurrenceID globally unique.
The GBIF-assigned taxon identifier number.
An identifier for the collection or dataset from which the record was derived.
The verbatim (originally-provided) scientific name
The verbatim original representation of the date and time information for an Event.
A list (concatenated and separated) of identifiers or names of taxa and the associations of this Occurrence to each of them.
A list (concatenated and separated) of identifiers of other Organisms and the associations of this Organism to each of them.
One of a) an indicator of the existence of, b) a reference to (publication, URI), or c) the text of notes taken in the field about the Event.
The sex of the biological individual(s) represented in the Occurrence.
A description of the usage rights applicable to the record.
A person or organization owning or managing rights over the resource.
Information about who can access the resource or an indication of its security status.
A list (concatenated and separated) of identifiers (publication, bibliographic reference, global unique identifier, URI) of literature associated with the Occurrence.
A bibliographic reference for the resource as a statement indicating how this record should be cited (attributed) when used.
A related resource that is referenced, cited, or otherwise pointed to by the described resource.
Additional information that exists, but that has not been shared in the given record.
Additional information that exists, but that has not been shared in the given record.
Variable indicating presence/absence of location coordinates.
Variable indicating validity of geospatial data associated with record.
Year associated with Occurrence.
Variable with identifying value for the Occurrence.
Variable indicating whether the Occurrence is a duplicate or not.
A list (concatenated and separated) of identifiers of other Occurrence records and their associations to this Occurrence.
Comments or notes about the Location.
BeeBDC assigned source of the data. Often written when the data is formatted by a BeeBDC::xxx_readr function or similar.
The verbatim (originally-provided) scientific name
Flag produced by bdc::bdc_scientificName_empty() where FALSE == no scientific name provided and TRUE means that there is text in that column.
Flag produced by bdc::bdc_coordinates_empty() where FALSE == no coordinates provided.
Flag produced by bdc::bdc_coordinates_outOfRange() where FALSE == point off the earth. This function identifies records with out-of-range coordinates (not between -90 and 90 for latitude; between -180 and 180 for longitude).
Flag produced by bdc::bdc_basisOfRecords_notStandard() where FALSE == an occurrence with a basisOfRecord not defined as acceptable by the user.
A country name suggested by the bdc::bdc_country_standardized() function.
A country code suggested by the bdc::bdc_country_standardized() function.
A column indicating if coordinates were transposed by jbd_Ctrans_chunker() where FALSE == transposed.
A flag generated by jbd_coordCountryInconsistent() where FALSE == an occurrence where the country name and coordinates did not match.
A flag generated by flagAbsent() where FALSE == occurrences marked as "ABSENT" in the "occurrenceStatus" column.
A flag generated by flagLicense() where FALSE == those occurrences protected by a restrictive license.
A flag generated by GBIFissues() where FALSE == an occurrence with user-specified GBIF issues to flag.
A flag generated by bdc::bdc_clean_names() where FALSE == the presence of taxonomic uncertainty terms.
A column made by bdc::bdc_clean_names() indicating the cleaned scientificName.
A flag generated by harmoniseR() where FALSE == occurrences whose scientificName did not match the Discover Life taxonomy.
A flag generated by CoordinateCleaner::clean_coordinates() where FALSE == rounded (probably imprecise) coordinates.
A flag generated by CoordinateCleaner::clean_coordinates() where FALSE == invalid coordinates.
A flag generated by CoordinateCleaner::clean_coordinates() where FALSE == equal coordinates (e.g., 0.1, 0.1).
A flag generated by CoordinateCleaner::clean_coordinates() where FALSE == zeros as coordinates.
A flag generated by CoordinateCleaner::clean_coordinates() where FALSE == records around country capital centroid.
A flag generated by CoordinateCleaner::clean_coordinates() where FALSE == records around country or province centroids.
A flag generated by CoordinateCleaner::clean_coordinates() where FALSE == records around the GBIF headquarters.
A flag generated by CoordinateCleaner::clean_coordinates() where FALSE == records around biodiversity institutions.
A flag generated by diagonAlley() where FALSE == records that are possibly the result of fill-down errors in sequence.
A flag generated by CoordinateCleaner::cd_round() where FALSE == potential gridding in the longitude column within dataset.
A flag generated by CoordinateCleaner::cd_round() where FALSE == potential gridding in the latitude column within dataset.
A flag generated by CoordinateCleaner::cd_round() where FALSE == potential gridding in either the longitude or latitude columns within dataset.
A flag generated by coordUncerFlagR() where FALSE == occurrences that did not pass a user-specified threshold in the "coordinateUncertaintyInMeters" column.
A column made by countryOutlieRs(). Summarises the occurrence-level result: the species is not known to occur in that country (noMatch), is known from a bordering country (neighbour), or is known to occur in that country (exact).
A flag generated by countryOutlieRs() where FALSE == occurrences that do not occur in a country that concurs with the Discover Life country checklist OR an adjacent country.
A flag generated by countryOutlieRs() where FALSE == occurrences that are in the ocean.
A flag generated by summaryFun() where FALSE == occurrences flagged as FALSE in any of the .flag columns. In this example it excludes flags in the ".gridSummary", ".lonFlag", ".latFlag", and ".uncer_terms" columns.
A flag generated by bdc::bdc_eventDate_empty() where FALSE == occurrences with no eventDate provided.
A flag generated by bdc::bdc_year_outOfRange() where FALSE == occurrences older than a threshold date (1950 in this case).
A flag generated by dupeSummary() where FALSE == occurrences identified as duplicates. There will be an associated kept duplicate (.duplicates == TRUE) for all duplicate clusters.
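The summary flag described above can be read as a row-wise check over the flag columns, with some columns left out. The following is a minimal base R sketch of that idea on invented values; `summariseFlags()` and its `exclude` argument are hypothetical names for illustration and are not the package's summaryFun(). Treating NA as "not failed" is an assumption of this sketch.

```r
# Toy flag table; column names follow the flags described above, values are invented
flags <- data.frame(
  .coordinates_empty = c(TRUE, TRUE, FALSE, TRUE),
  .sea               = c(TRUE, NA,   TRUE,  TRUE),
  .lonFlag           = c(TRUE, TRUE, TRUE,  FALSE),
  check.names = FALSE
)

# Hypothetical helper (not BeeBDC::summaryFun()): a row passes unless any
# included flag column is explicitly FALSE. NA is treated as "not failed".
summariseFlags <- function(df, exclude = character()) {
  flagCols <- setdiff(grep("^\\.", names(df), value = TRUE), exclude)
  failed <- rowSums(!as.matrix(df[, flagCols, drop = FALSE]), na.rm = TRUE) > 0
  unname(!failed)
}

# Leave .lonFlag out of the summary, as the text above does for some columns
flags$.summary <- summariseFlags(flags, exclude = ".lonFlag")
print(flags)
```

Only the third row fails here, because its `.coordinates_empty` value is FALSE; the fourth row passes since its only FALSE sits in the excluded column.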
This data set was created by generating a random subset of 100 rows from the full BeeBDC dataset from the publication: Dorey, J.B., Fischer, E.E., Chesshire, P.R., Nava-Bolaños, A., O’Reilly, R.L., Bossert, S., Collins, S.M., Lichtenberg, E.M., Tucker, E., Smith-Pardo, A., Falcon-Brindis, A., Guevara, D.A., Ribeiro, B.R., de Pedro, D., Hung, J.K.-L., Parys, K.A., McCabe, L.M., Rogan, M.S., Minckley, R.L., Velazco, S.J.E., Griswold, T., Zarrillo, T.A., Jetz, W., Sica, Y.V., Orr, M.C., Guzman, L.M., Ascher, J., Hughes, A.C. & Cobb, N.S. (2023) A globally synthesised and flagged bee occurrence dataset and cleaning workflow. Scientific Data, 10, 1–17. https://www.doi.org/10.1038/S41597-023-02626-W
beesFlagged <- BeeBDC::beesFlagged
head(beesFlagged)
A small bee occurrence dataset with flags generated by BeeBDC used to run example script and test functions. For data types, see ColTypeR().
data("beesRaw", package = "BeeBDC")
An object of class "tibble"
Occurrence code generated in bdc or BeeBDC
Full scientificName as shown on DiscoverLife
Family name
Subfamily name
Genus name
Subgenus name
Full name with subspecies name - ALA column
The species name only
The subspecies name only
The full name, with authorship and date information if known, of the currently valid (zoological) or accepted (botanical) taxon.
The taxonomic rank of the most specific name in the scientificName.
The authorship information for the scientificName formatted according to the conventions of the applicable nomenclaturalCode.
A brief phrase or a standard term ("cf.", "aff.") to express the determiner's doubts about the Identification.
A list (concatenated and separated) of taxa names terminating at the rank immediately superior to the taxon referenced in the taxon record.
A list (concatenated and separated) of references (publication, global unique identifier, URI) used in the Identification.
A list (concatenated and separated) of nomenclatural types (type status, typified scientific name, publication) applied to the subject.
A list (concatenated and separated) of previous assignments of names to the Organism.
This term is meant to allow the capture of an unaltered original identification/determination, including identification qualifiers, hybrid formulas, uncertainties, etc. This term is meant to be used in addition to scientificName (and identificationQualifier etc.), not instead of it.
A list (concatenated and separated) of names of people, groups, or organizations who assigned the Taxon to the subject.
The date on which the subject was determined as representing the Taxon.
The geographic latitude (in decimal degrees, using the spatial reference system given in geodeticDatum) of the geographic center of a Location. Positive values are north of the Equator, negative values are south of it. Legal values lie between -90 and 90, inclusive.
The geographic longitude (in decimal degrees, using the spatial reference system given in geodeticDatum) of the geographic center of a Location. Positive values are east of the Greenwich Meridian, negative values are west of it. Legal values lie between -180 and 180, inclusive.
The name of the next smaller administrative region than country (state, province, canton, department, region, etc.) in which the Location occurs.
The name of the continent in which the Location occurs.
The specific description of the place.
The name of the island on or near which the Location occurs.
The full, unabbreviated name of the next smaller administrative region than stateProvince (county, shire, department, etc.) in which the Location occurs.
The full, unabbreviated name of the next smaller administrative region than county (city, municipality, etc.) in which the Location occurs. Do not use this term for a nearby named place that does not contain the actual location.
A legal document giving official permission to do something with the resource.
A GBIF-defined issue.
The date-time or interval during which an Event occurred. For occurrences, this is the date-time when the event was recorded. Not suitable for a time in a geological context.
The time or interval during which an Event occurred.
The integer day of the month on which the Event occurred.
The integer month in which the Event occurred.
The four-digit year in which the Event occurred, according to the Common Era Calendar.
The specific nature of the data record. Recommended best practice is to use the standard label of one of the Darwin Core classes. Examples: PreservedSpecimen, FossilSpecimen, LivingSpecimen, MaterialSample, Event, HumanObservation, MachineObservation, Taxon, Occurrence, MaterialCitation.
The name of the country or major administrative unit in which the Location occurs.
The nature or genre of the resource. Examples: StillImage, MovingImage, Sound, PhysicalObject, Event, Text.
A statement about the presence or absence of a Taxon at a Location. Examples: present, absent.
An identifier given to the Occurrence at the time it was recorded. Often serves as a link between field notes and an Occurrence record, such as a specimen collector's number.
A list (concatenated and separated) of names of people, groups, or organizations responsible for recording the original Occurrence. The primary collector or observer, especially one who applies a personal identifier (recordNumber), should be listed first.
An identifier for the set of information associated with an Event (something that occurs at a place and time). May be a global unique identifier or an identifier specific to the data set.
A spatial region or named place.
The names of, references to, or descriptions of the methods or protocols used during an Event. Examples: UV light trap, mist net, bottom trawl, ad hoc observation | point count, Penguins from space: faecal stains reveal the location of emperor penguin colonies, https://doi.org/10.1111/j.1466-8238.2009.00467.x, Takats et al. 2001.
The amount of effort expended during an Event. Examples: 40 trap-nights, 10 observer-hours, 10 km by foot, 30 km by car.
The number of individuals present at the time of the Occurrence. Integer.
A number or enumeration value for the quantity of organisms. Examples: 27 (organismQuantity) with individuals (organismQuantityType); 12.5 (organismQuantity) with percentage biomass (organismQuantityType); r (organismQuantity) with Braun Blanquet Scale (organismQuantityType); many (organismQuantity) with individuals (organismQuantityType).
A decimal representation of the precision of the coordinates given in the decimalLatitude and decimalLongitude.
The horizontal distance (in meters) from the given decimalLatitude and decimalLongitude describing the smallest circle containing the whole of the Location. Leave the value empty if the uncertainty is unknown, cannot be estimated, or is not applicable (because there are no coordinates). Zero is not a valid value for this term.
Occurrence records in the ALA can be filtered by using the spatially valid flag. This flag combines a set of tests applied to the record to see how reliable its spatial data components are.
An identifier (preferably unique) for the record within the data set or collection.
The identifier assigned by GBIF for each record.
An identifier for the set of data. May be a global unique identifier or an identifier specific to a collection or institution.
The name (or acronym) in use by the institution having custody of the object(s) or information referred to in the record. Examples: MVZ, FMNH, CLO, UCMP.
The name identifying the data set from which the record was derived.
A list (concatenated and separated) of previous or alternate fully qualified catalog numbers or other human-used identifiers for the same Occurrence, whether in the current or any other data set or collection.
An identifier for the Occurrence (as opposed to a particular digital record of the occurrence). In the absence of a persistent global unique identifier, construct one from a combination of identifiers in the record that will most closely make the occurrenceID globally unique.
The GBIF-assigned taxon identifier number.
An identifier for the collection or dataset from which the record was derived.
The verbatim (originally-provided) scientific name
The verbatim original representation of the date and time information for an Event.
A list (concatenated and separated) of identifiers or names of taxa and the associations of this Occurrence to each of them.
A list (concatenated and separated) of identifiers of other Organisms and the associations of this Organism to each of them.
One of a) an indicator of the existence of, b) a reference to (publication, URI), or c) the text of notes taken in the field about the Event.
The sex of the biological individual(s) represented in the Occurrence.
A description of the usage rights applicable to the record.
A person or organization owning or managing rights over the resource.
Information about who can access the resource or an indication of its security status.
A list (concatenated and separated) of identifiers (publication, bibliographic reference, global unique identifier, URI) of literature associated with the Occurrence.
A bibliographic reference for the resource as a statement indicating how this record should be cited (attributed) when used.
A related resource that is referenced, cited, or otherwise pointed to by the described resource.
Additional information that exists, but that has not been shared in the given record.
Additional information that exists, but that has not been shared in the given record.
Variable indicating presence/absence of location coordinates.
Variable indicating validity of geospatial data associated with record.
Year associated with Occurrence.
Variable with identifying value for the Occurrence.
Variable indicating whether the Occurrence is a duplicate or not.
A list (concatenated and separated) of identifiers of other Occurrence records and their associations to this Occurrence.
Comments or notes about the Location.
BeeBDC assigned source of the data. Often written when the data is formatted by a BeeBDC::xxx_readr function or similar.
The verbatim (originally-provided) scientific name
This data set was created by generating a random subset of 100 rows from the full, unfiltered and unflagged, BeeBDC dataset from the publication: Dorey, J.B., Fischer, E.E., Chesshire, P.R., Nava-Bolaños, A., O’Reilly, R.L., Bossert, S., Collins, S.M., Lichtenberg, E.M., Tucker, E., Smith-Pardo, A., Falcon-Brindis, A., Guevara, D.A., Ribeiro, B.R., de Pedro, D., Hung, J.K.-L., Parys, K.A., McCabe, L.M., Rogan, M.S., Minckley, R.L., Velazco, S.J.E., Griswold, T., Zarrillo, T.A., Jetz, W., Sica, Y.V., Orr, M.C., Guzman, L.M., Ascher, J., Hughes, A.C. & Cobb, N.S. (2023) A globally synthesised and flagged bee occurrence dataset and cleaning workflow. Scientific Data, 10, 1–17. https://www.doi.org/10.1038/S41597-023-02626-W
beesRaw <- BeeBDC::beesRaw
head(beesRaw)
Downloads the taxonomic information for the bees of the world. The source of each name is listed under "source", but most names are derived from the Discover Life website. The data will be sourced from the BeeBDC article's Figshare.
Note that sometimes the download might not work without restarting R. In this case, you could alternatively download the dataset from the URL below and then read it in using base::readRDS("filePath.Rda").
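If the in-session download fails, the manual route mentioned above comes down to reading a locally saved file with base::readRDS(). The sketch below shows that step on a stand-in file so that it runs offline; `readTaxonomyFile()` is a hypothetical wrapper, and the two-row table is invented for illustration rather than taken from the real dataset.

```r
# Hypothetical helper: read a manually downloaded taxonomy file and fail
# with a clear message if the path is wrong
readTaxonomyFile <- function(filePath) {
  if (!file.exists(filePath)) {
    stop("No file found at: ", filePath)
  }
  base::readRDS(filePath)
}

# Stand-in file so the sketch runs offline; in practice, point filePath at
# the file you downloaded from the URL given in the usage
standIn <- data.frame(
  id = 1:2,
  validName = c("Genus alpha Author, 1900", "Genus beta Author, 1910"),
  stringsAsFactors = FALSE
)
filePath <- file.path(tempdir(), "filePath.Rda")
saveRDS(standIn, filePath)

taxonomy <- readTaxonomyFile(filePath)
head(taxonomy)
```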
beesTaxonomy( URL = "https://open.flinders.edu.au/ndownloader/files/47089969", ... )
URL |
A character vector to the FigShare location of the dataset. The default will be to the most-recent version. |
... |
Extra variables that can be passed to |
Column details
flags Flags or comments about the taxon name.
taxonomic_status Taxonomic status. Values are "accepted" or "synonym"
source Source of the name.
accid The id of the accepted taxon name or "0" if taxonomic_status == accepted.
id The id number for the taxon name.
kingdom The biological kingdom the taxon belongs to. For bees, kingdom == Animalia.
phylum The biological phylum the taxon belongs to. For bees, phylum == Arthropoda.
class The biological class the taxon belongs to. For bees, class == Insecta.
order The biological order the taxon belongs to. For bees, order == Hymenoptera.
family The family of bee which the species belongs to.
subfamily The subfamily of bee which the species belongs to.
tribe The tribe of bee which the species belongs to.
subtribe The subtribe of bee which the species belongs to.
validName The valid scientific name as it should occur in the "scientificName" column in a Darwin Core file.
canonical The scientificName without the scientificNameAuthority.
canonical_withFlags The scientificName without the scientificNameAuthority and with Discover Life taxonomy flags.
genus The genus the bee species belongs to.
subgenus The subgenus the bee species belongs to.
species The specific epithet for the bee species.
infraspecies The infraspecific epithet for the bee addressed.
authorship The author who described the bee species.
taxon_rank Rank for the bee taxon addressed in the entry.
notes Additional notes about the name/taxon.
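The `id`, `accid`, and `taxonomic_status` columns described above are enough to map a synonym to its accepted name. Here is a minimal base R sketch on an invented three-row table; it assumes `id` and `accid` are stored as numbers (the real file may store them differently) and is not the matching logic used by harmoniseR().

```r
# Toy taxonomy using the columns described above; ids and names are invented
taxonomy <- data.frame(
  id = c(1, 2, 3),
  accid = c(0, 1, 0),
  taxonomic_status = c("accepted", "synonym", "accepted"),
  validName = c("Genus alpha Author, 1900",
                "Genus beta Author, 1910",
                "Genus gamma Author, 1920"),
  stringsAsFactors = FALSE
)

# For accepted names accid is 0, so the accepted id is the row's own id;
# for synonyms the accepted id is stored in accid
acceptedId <- ifelse(taxonomy$accid == 0, taxonomy$id, taxonomy$accid)

# Look up the validName that belongs to each accepted id
taxonomy$acceptedName <- taxonomy$validName[match(acceptedId, taxonomy$id)]
taxonomy[, c("validName", "taxonomic_status", "acceptedName")]
```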
A downloaded beesTaxonomy.Rda file in the tempdir() and the same tibble returned to the environment.
This dataset was created using the Discover Life taxonomy. Dataset is from the publication: Dorey, J.B., Fischer, E.E., Chesshire, P.R., Nava-Bolaños, A., O’Reilly, R.L., Bossert, S., Collins, S.M., Lichtenberg, E.M., Tucker, E., Smith-Pardo, A., Falcon-Brindis, A., Guevara, D.A., Ribeiro, B.R., de Pedro, D., Hung, J.K.-L., Parys, K.A., McCabe, L.M., Rogan, M.S., Minckley, R.L., Velazco, S.J.E., Griswold, T., Zarrillo, T.A., Jetz, W., Sica, Y.V., Orr, M.C., Guzman, L.M., Ascher, J., Hughes, A.C. & Cobb, N.S. (2023) A globally synthesised and flagged bee occurrence dataset and cleaning workflow. Scientific Data, 10, 1–17. https://www.doi.org/10.1038/S41597-023-02626-W The taxonomy data are mostly compiled from Discover Life data, www.discoverlife.org: Ascher, J.S. & Pickering, J. (2020) Discover Life bee species guide and world checklist (Hymenoptera: Apoidea: Anthophila). http://www.discoverlife.org/mp/20q?guide=Apoidea_species
taxadbToBeeBDC() to download any other taxonomy (of any taxa or of bees) and harmoniseR() for the taxon-cleaning function where these taxonomies are implemented. It may also be worth seeing beesChecklist().
## Not run:
beesTaxonomy <- BeeBDC::beesTaxonomy()
## End(Not run)
This function outputs a figure which shows the relative size and direction of occurrence points duplicated between data providers, such as SCAN, GBIF, ALA, etc. This function requires the outputs generated by dupeSummary().
chordDiagramR( dupeData = NULL, outPath = NULL, fileName = NULL, width = 7, height = 6, bg = "white", smallGrpThreshold = 3, title = "Duplicated record sources", palettes = c("cartography::blue.pal", "cartography::green.pal", "cartography::sand.pal", "cartography::orange.pal", "cartography::red.pal", "cartography::purple.pal", "cartography::brown.pal"), canvas.ylim = c(-1, 1), canvas.xlim = c(-0.6, 0.25), text.col = "black", legendX = grid::unit(6, "mm"), legendY = grid::unit(18, "mm"), legendJustify = c("left", "bottom"), niceFacing = TRUE, self.link = 2 )
dupeData |
A tibble or data frame. The duplicate file produced by |
outPath |
Character. The path to a directory (folder) in which the output should be saved. |
fileName |
Character. The name of the output file, ending in '.pdf'. |
width |
Numeric. The width of the figure to save (in inches). Default = 7. |
height |
Numeric. The height of the figure to save (in inches). Default = 6. |
bg |
The plot's background colour. Default = "white". |
smallGrpThreshold |
Numeric. The upper threshold of sub-dataSources to be listed as "other". Default = 3. |
title |
A character string. The figure title. Default = "Duplicated record sources". |
palettes |
A vector of the palettes to be used. One palette for each major dataSource and "other"
using the |
canvas.ylim |
Canvas limits from |
canvas.xlim |
Canvas limits from |
text.col |
A character string. Text colour |
legendX |
The x position of the legends, as measured in current viewport. Passed to ComplexHeatmap::draw(). Default = grid::unit(6, "mm"). |
legendY |
The y position of the legends, as measured in current viewport. Passed to ComplexHeatmap::draw(). Default = grid::unit(18, "mm"). |
legendJustify |
A character vector declaring the justification of the legends. Passed to ComplexHeatmap::draw(). Default = c("left", "bottom"). |
niceFacing |
TRUE/FALSE. The niceFacing option automatically adjusts the text facing
according to their positions in the circle. Passed to |
self.link |
1 or 2 (numeric). Passed to |
Saves a figure to the provided file path.
## Not run:
# Create a basic example dataset of duplicates to visualise
basicData <- dplyr::tribble(
  ~dataSource, ~dataSource_keep,
  "GBIF_Halictidae", "USGS_data",
  "GBIF_Halictidae", "USGS_data",
  "GBIF_Halictidae", "USGS_data",
  "GBIF_Halictidae", "USGS_data",
  "GBIF_Halictidae", "USGS_data",
  "GBIF_Halictidae", "USGS_data",
  "SCAN_Halictidae", "GBIF_Halictidae",
  "iDigBio_halictidae", "GBIF_Halictidae",
  "iDigBio_halictidae", "SCAN_Halictidae",
  "iDigBio_halictidae", "SCAN_Halictidae",
  "SCAN_Halictidae", "GBIF_Halictidae",
  "iDigBio_apidae", "SCAN_Apidae",
  "SCAN_Apidae", "Ecd_Anthophila",
  "iDigBio_apidae", "Ecd_Anthophila",
  "SCAN_Apidae", "Ecd_Anthophila",
  "iDigBio_apidae", "Ecd_Anthophila",
  "SCAN_Megachilidae", "SCAN_Megachilidae",
  "CAES_Anthophila", "CAES_Anthophila",
  "CAES_Anthophila", "CAES_Anthophila"
)

chordDiagramR(
  dupeData = basicData,
  outPath = tempdir(),
  fileName = "ChordDiagram.pdf",
  # These can be modified to help fit the final pdf that's exported.
  width = 9,
  height = 7.5,
  bg = "white",
  # How few distinct dataSources should a group have to be listed as "other"
  smallGrpThreshold = 3,
  title = "Duplicated record sources",
  # The default list of colour palettes to choose from using the paletteer package
  palettes = c("cartography::blue.pal", "cartography::green.pal",
               "cartography::sand.pal", "cartography::orange.pal",
               "cartography::red.pal", "cartography::purple.pal",
               "cartography::brown.pal"),
  canvas.ylim = c(-1.0, 1.0),
  canvas.xlim = c(-0.6, 0.25),
  text.col = "black",
  legendX = grid::unit(6, "mm"),
  legendY = grid::unit(18, "mm"),
  legendJustify = c("left", "bottom"),
  niceFacing = TRUE)
## End(Not run)
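Before plotting, it can help to see the link sizes that a two-column table like the one in the example implies. The base R sketch below tallies invented `dataSource` and `dataSource_keep` pairs; reading each pair as a link from the duplicate's source to the kept record's source is an assumption here, and this is not the internal code of chordDiagramR().

```r
# Toy duplicate pairs in the same two-column shape as the example data
dupePairs <- data.frame(
  dataSource = c("GBIF_Halictidae", "GBIF_Halictidae",
                 "SCAN_Halictidae", "iDigBio_apidae"),
  dataSource_keep = c("USGS_data", "USGS_data",
                      "GBIF_Halictidae", "SCAN_Apidae"),
  stringsAsFactors = FALSE
)

# Cross-tabulate the pairs, then keep only the combinations that occur
linkCounts <- as.data.frame(
  table(from = dupePairs$dataSource, to = dupePairs$dataSource_keep),
  stringsAsFactors = FALSE
)
linkCounts <- linkCounts[linkCounts$Freq > 0, ]

# Largest links first
linkCounts[order(-linkCounts$Freq), ]
```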
This function uses readr::cols_only() to assign a column name and the type of data (e.g., readr::col_character() and readr::col_integer()). To see the default columns simply run ColTypeR().
This is intended for use with readr::read_csv(). Columns that are not present will NOT be included in the resulting tibble unless they are specified using ....
ColTypeR(...)
... |
Additional arguments. These can be specified in addition to the ones default to the function. For example:
|
Returns an object of class col_spec.
See readr::as.col_spec() for additional context and explication.
# You can simply return the below for default values
library(dplyr)
BeeBDC::ColTypeR()

# To add new columns you can write
ColTypeR(newCharacterColumn = readr::col_character(),
         newNumericColumn = readr::col_integer(),
         newLogicalColumn = readr::col_logical())

# Try reading in one of the test datasets as an example:
beesFlagged %>% dplyr::as_tibble(col_types = BeeBDC::ColTypeR())
# OR
beesRaw %>% dplyr::as_tibble(col_types = BeeBDC::ColTypeR())
This function flags continent-level outliers using the checklist provided with this package. For additional context and column names, see beesChecklist().
continentOutlieRs( checklist = NULL, data = NULL, keepAdjacentContinent = FALSE, pointBuffer = NULL, scale = 50, stepSize = 1e+06, mc.cores = 1 )
checklist |
A data frame or tibble. The formatted checklist which was built based on the Discover Life website. |
data |
A data frame or tibble. A Darwin Core occurrence dataset. |
keepAdjacentContinent |
Logical. If TRUE, occurrences in continents that are adjacent to checklist continents will be kept. If FALSE, they will be flagged. Default = FALSE. |
pointBuffer |
Numeric. A buffer around points to help them align with a continent or coastline. This provides a good way to retain points that occur right along the coast or borders of the maps in rnaturalearth |
scale |
Numeric. The value fed into the map scale parameter for
|
stepSize |
Numeric. The number of occurrences to process in each chunk. Default = 1000000. |
mc.cores |
Numeric. If > 1, the function will run in parallel using mclapply using the number of cores specified. If = 1 then it will be run using a serial loop. NOTE: Windows machines must use a value of 1 (see ?parallel::mclapply). Additionally, be aware that each thread can use large chunks of memory. If the cores throw issues, consider setting mc.cores to 1. Default = 1. |
The input data with two new columns, .continentOutlier and .sea. There are three possible values for the new columns: TRUE == passed, FALSE == failed (not in continent or in the ocean), NA == did not overlap with rnaturalearth map.
countryOutlieRs() for implementation at the country level. Country-level implementation will be more data-hungry and, where data do not yet exist, difficult to implement.
Additionally, see beesChecklist() for input data. Note, not all columns are necessary if you are building your own dataset. At a minimum you will need validName and continent.
if(requireNamespace("rnaturalearthdata")){
  library(magrittr)
  # Load in the test dataset
  beesRaw <- BeeBDC::beesRaw
  # For the sake of this example, use the testChecklist
  system.file("extdata", "testChecklist.rda", package="BeeBDC") |> load()
  # For real examples, you might download the beesChecklist from FigShare using
  # [BeeBDC::beesChecklist()]

  beesRaw_out <- continentOutlieRs(checklist = testChecklist,
                                   data = beesRaw %>%
                                     dplyr::filter(dplyr::row_number() %in% 1:50),
                                   keepAdjacentContinent = FALSE,
                                   pointBuffer = 1,
                                   scale = 50,
                                   stepSize = 1000000,
                                   mc.cores = 1)
  table(beesRaw_out$.continentOutlier, useNA = "always")
} # END if require
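The three output values (TRUE, FALSE, NA) can be illustrated with a much simpler lookup than the function itself performs. The base R sketch below assumes each occurrence already carries a `continent` value, whereas the function derives this from coordinates and an rnaturalearth map; the names and rows are invented, and only the `validName` and `continent` checklist columns named above are used.

```r
# Toy checklist with the two minimum columns named above
checklist <- data.frame(
  validName = c("Genus alpha", "Genus alpha", "Genus beta"),
  continent = c("Europe", "Asia", "South America"),
  stringsAsFactors = FALSE
)

# Toy occurrences; the continent column is assumed to be known already
occ <- data.frame(
  scientificName = c("Genus alpha", "Genus alpha", "Genus beta", "Genus beta"),
  continent = c("Europe", "Africa", "South America", NA),
  stringsAsFactors = FALSE
)

# TRUE  == the species-continent pair is on the checklist
# FALSE == the pair is not on the checklist
# NA    == continent unknown (analogous to a point that did not overlap the map)
known <- paste(checklist$validName, checklist$continent, sep = "|")
observed <- paste(occ$scientificName, occ$continent, sep = "|")
occ$.continentOutlier <- ifelse(is.na(occ$continent), NA, observed %in% known)

table(occ$.continentOutlier, useNA = "always")
```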
To use this function, the user must choose a column (probably "coordinateUncertaintyInMeters") and a threshold at or above which occurrences will be flagged for geographic uncertainty.
coordUncerFlagR( data = NULL, uncerColumn = "coordinateUncertaintyInMeters", threshold = NULL )
data |
A data frame or tibble. Occurrence records as input. |
uncerColumn |
Character. The column to flag uncertainty in. |
threshold |
Numeric. The uncertainty threshold. Values equal to, or greater than, this threshold will be flagged. |
The input data with a new column, .uncertaintyThreshold.
# Run the function
beesRaw_out <- coordUncerFlagR(data = beesRaw,
                               uncerColumn = "coordinateUncertaintyInMeters",
                               threshold = 1000)
# View the output
table(beesRaw_out$.uncertaintyThreshold, useNA = "always")
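The threshold rule stated in the arguments (values equal to, or greater than, the threshold are flagged) is easy to check on a plain vector. This is a base R sketch on invented values, not the function's own code; leaving missing uncertainties as NA is an assumption of the sketch rather than confirmed package behaviour.

```r
# Invented uncertainty values in metres, including a boundary case and an NA
uncertainty <- c(10, 999, 1000, 25000, NA)
threshold <- 1000

# FALSE == at or above the threshold (flagged); TRUE == below it.
# NA stays NA here, which is an assumption of this sketch.
passes <- !(uncertainty >= threshold)

data.frame(uncertainty, passes)
```

Note that the value exactly equal to the threshold (1000) is flagged, which matches the "equal to, or greater than" wording.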
This is a basic function that lets a user manually fix some country name inconsistencies.
countryNameCleanR(data = NULL, ISO2_table = NULL, commonProblems = NULL)
data |
A data frame or tibble. Occurrence records as input. |
ISO2_table |
A data frame or tibble with the columns ISO2 and long names for country names. Default is a static version from Wikipedia. |
commonProblems |
A data frame or tibble. It must have two columns: one containing the user-identified problem and one with a user-defined fix. |
Returns the input data, but with countries occurring in the user-supplied problem column ("commonProblems") replaced with those in the user-supplied fix column.
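The problem-to-fix replacement can be pictured as a simple lookup. The sketch below uses base R only and is an illustration of the idea, not the function's internals; the `country` vector and the three-row `commonProblems` table are made up for the example.

```r
# Hypothetical country values as they might appear in occurrence data
country <- c("USA", "Brasil", "Mexico", NA, "MX")

# A user-supplied table with one problem column and one fix column
commonProblems <- data.frame(
  problem = c("USA", "Brasil", "MX"),
  fix = c("United States of America", "Brazil", "Mexico"),
  stringsAsFactors = FALSE
)

# Look up each country in the problem column; unmatched values are kept as-is
idx <- match(country, commonProblems$problem)
cleaned <- ifelse(is.na(idx), country, commonProblems$fix[idx])
cleaned
```

Values that do not appear in the problem column, including missing values, pass through unchanged in this sketch.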
beesFlagged_out <- countryNameCleanR(
  data = BeeBDC::beesFlagged,
  commonProblems = dplyr::tibble(
    problem = c('U.S.A.', 'US','USA','usa','UNITED STATES',
                'United States','U.S.A','MX','CA','Bras.','Braz.',
                'Brasil','CNMI','USA TERRITORY: PUERTO RICO'),
    fix = c('United States of America','United States of America',
            'United States of America','United States of America',
            'United States of America','United States of America',
            'United States of America','Mexico','Canada','Brazil',
            'Brazil','Brazil','Northern Mariana Islands','PUERTO.RICO')))
This function flags country-level outliers using the checklist provided with this package. For additional context and column names, see beesChecklist().
countryOutlieRs( checklist = NULL, data = NULL, keepAdjacentCountry = TRUE, pointBuffer = NULL, scale = 50, stepSize = 1e+06, mc.cores = 1 )
checklist |
A data frame or tibble. The formatted checklist which was built based on the Discover Life website. |
data |
A data frame or tibble. A Darwin Core occurrence dataset. |
keepAdjacentCountry |
Logical. If TRUE, occurrences in countries that are adjacent to checklist countries will be kept. If FALSE, they will be flagged. |
pointBuffer |
Numeric. A buffer around points to help them align with a country or coastline. This provides a good way to retain points that occur right along the coast or borders of the maps in rnaturalearth |
scale |
Numeric. The value fed into the map scale parameter for
|
stepSize |
Numeric. The number of occurrences to process in each chunk. Default = 1000000. |
mc.cores |
Numeric. If > 1, the function will run in parallel using mclapply using the number of cores specified. If = 1 then it will be run using a serial loop. NOTE: Windows machines must use a value of 1 (see ?parallel::mclapply). Additionally, be aware that each thread can use large chunks of memory. If the cores throw issues, consider setting mc.cores to 1. Default = 1. |
The input data with two new columns, .countryOutlier and .sea. There are three possible values for the new columns: TRUE == passed, FALSE == failed (not in country or in the ocean), NA == did not overlap with rnaturalearth map.
See continentOutlieRs() for implementation at the continent level. Implementation at the continent level may be lighter and more manageable on the data side of things where country-level checklists don't exist. Additionally, see beesChecklist() for input data.
Note, not all columns are necessary if you are building your own dataset.
At a minimum you will need validName, country, iso_a3_eh (to match rnaturalearth).
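As a rough illustration of those minimum columns, the sketch below builds a tiny custom checklist and checks species-by-country pairs against it. It is deliberately simplified: the species names are placeholders, the occurrence records are assumed to already carry an iso_a3_eh code (the real function derives the country from coordinates using rnaturalearth), and adjacency and buffering are ignored.

```r
# A minimal, hypothetical checklist with the three required columns
checklist <- data.frame(
  validName = c("Species a", "Species a", "Species b"),
  country = c("Australia", "New Zealand", "Brazil"),
  iso_a3_eh = c("AUS", "NZL", "BRA")
)

# Toy occurrences; ASSUMPTION: the country code has already been resolved
occ <- data.frame(
  scientificName = c("Species a", "Species a", "Species b"),
  iso_a3_eh = c("AUS", "BRA", "BRA")
)

# TRUE == the species is listed for that country, FALSE == country-level outlier
occ$.countryOutlier <- paste(occ$scientificName, occ$iso_a3_eh) %in%
  paste(checklist$validName, checklist$iso_a3_eh)
occ
```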
if(requireNamespace("rnaturalearthdata")){
library(magrittr)
# Load in the test dataset
beesRaw <- BeeBDC::beesRaw
# For the sake of this example, use the testChecklist
system.file("extdata", "testChecklist.rda", package="BeeBDC") |> load()
# For real examples, you might download the beesChecklist from FigShare using
# [BeeBDC::beesChecklist()]
beesRaw_out <- countryOutlieRs(checklist = testChecklist,
  data = beesRaw %>%
    dplyr::filter(dplyr::row_number() %in% 1:50),
  keepAdjacentCountry = TRUE,
  pointBuffer = 1,
  scale = 50,
  stepSize = 1000000,
  mc.cores = 1)
table(beesRaw_out$.countryOutlier, useNA = "always")
} # END if require
This function will attempt to find and build a table of data providers that have contributed to the input data, especially using the 'institutionCode' column. It will also look for a variety of other columns to find data providers using an internally set sequence of if-else statements. Hence, this function is quite specific for bee data, but should work for other taxa in similar institutions.
dataProvTables( data = NULL, runBeeDataChecks = FALSE, outPath = OutPath_Report, fileName = NULL )
data |
A data frame or tibble. Occurrence records as input. |
runBeeDataChecks |
Logical. If TRUE, will search in other columns for specific clues to determine the institution. |
outPath |
A character path. The path to the directory in which the output file will be saved. Default = OutPath_Report. |
fileName |
Character. The name of the file to be saved, ending in ".csv". |
Returns a table with the data providers, a specimen count, and a species count.
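The shape of such a provider table can be sketched with base R. This only illustrates the two counts per provider; it does not reproduce the function's institution-matching logic, and the toy data and the output column names (occurrenceCount, speciesCount) are assumptions made for the example.

```r
# Toy occurrences (hypothetical institution codes and species)
occ <- data.frame(
  institutionCode = c("AMNH", "AMNH", "USGS", "USGS", "USGS"),
  scientificName = c("Species a", "Species b", "Species a", "Species a", "Species a")
)

# Number of records per provider
specimens <- table(occ$institutionCode)
# Number of distinct species per provider
species <- tapply(occ$scientificName, occ$institutionCode,
                  function(x) length(unique(x)))

provTable <- data.frame(
  institutionCode = names(specimens),
  occurrenceCount = as.integer(specimens),
  speciesCount = as.integer(species[names(specimens)])
)
provTable
```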
data(beesFlagged)
testOut <- dataProvTables(
  data = beesFlagged,
  runBeeDataChecks = TRUE,
  outPath = tempdir(),
  fileName = "testFile.csv")
Used at the end of 1.x in the example workflow in order to save the occurrence dataset and its associated eml metadata.
dataSaver( path = NULL, save_type = NULL, occurrences = NULL, eml_files = NULL, file_prefix = NULL )
path |
Character. The main file path to look for data in. |
save_type |
Character. The file format in which to save occurrence and EML data. Either "R_file" or "CSV_file" |
occurrences |
The occurrences to save as a data frame or tibble. |
eml_files |
A list of the EML files. |
file_prefix |
Character. A prefix for the resulting output file. |
This function saves both occurrence and EML data as a list when save_type = "R_file" or as individual csv files when save_type = "CSV_file".
## Not run:
dataSaver(path = tempdir(), # The main path to look for data in
  save_type = "CSV_file", # "R_file" OR "CSV_file"
  occurrences = Complete_data$Data_WebDL, # The existing datasheet
  eml_files = Complete_data$eml_files, # The existing EML files
  file_prefix = "Fin_") # The prefix for the file name
## End(Not run)
A function made to search other columns for dates and add them to the eventDate column. The function searches the columns locality, fieldNotes, locationRemarks, and verbatimEventDate for the relevant information.
dateFindR(data = NULL, maxYear = lubridate::year(Sys.Date()), minYear = 1700)
data |
A data frame or tibble. Occurrence records as input. |
maxYear |
Numeric. The maximum year considered reasonable to find. Default = lubridate::year(Sys.Date()). |
minYear |
Numeric. The minimum year considered reasonable to find. Default = 1700. |
The function returns the input occurrence data, but with updated eventDate, year, month, and day columns for occurrences where these data were a) missing and b) located in one of the searched columns.
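The rescue logic can be illustrated with a small base R sketch. It is a simplified stand-in, not the package's code: it searches a single text column for ISO-style dates only (the real function handles several columns and, presumably, more date formats), and the toy values are invented.

```r
# Toy records: three with a missing eventDate, one already dated
eventDate <- as.Date(c(NA, NA, NA, "2001-05-05"))
locality <- c("Collected 2015-03-21 near creek", "no date here",
              "label reads 1650-01-01", "2010-01-01 roadside")
minYear <- 1700
maxYear <- as.integer(format(Sys.Date(), "%Y"))

# Pull out the first yyyy-mm-dd string in each text value, if there is one
pos <- regexpr("[0-9]{4}-[0-9]{2}-[0-9]{2}", locality)
found <- rep(NA_character_, length(locality))
found[pos > 0] <- regmatches(locality, pos)
recovered <- as.Date(found)

# Drop recovered dates that fall outside the plausible year range
yr <- as.integer(format(recovered, "%Y"))
recovered[!is.na(yr) & (yr < minYear | yr > maxYear)] <- NA

# Only fill records whose eventDate was missing to begin with
missing <- is.na(eventDate)
eventDate[missing] <- recovered[missing]
eventDate
```

The year limits apply to the recovered dates only, so the existing 2001 date is left alone and the implausible 1650 date is not used.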
# Using the example dataset, you may find that no missing eventDates are rescued (dependent on
# which version of the example dataset the user inputs).
beesRaw_out <- dateFindR(data = beesRaw,
  # Years above this are removed (from the recovered dates only)
  maxYear = lubridate::year(Sys.Date()),
  # Years below this are removed (from the recovered dates only)
  minYear = 1700)
A simple function that looks for potential latitude and longitude fill-down errors by identifying consecutive occurrences with coordinates at regular intervals. This is accomplished by using a sliding window with the length determined by minRepeats.
diagonAlley( data = NULL, minRepeats = NULL, groupingColumns = c("eventDate", "recordedBy", "datasetName"), ndec = 3, stepSize = 1e+06, mc.cores = 1 )
data |
A data frame or tibble. Occurrence records as input. |
minRepeats |
Numeric. The minimum number of lat or lon repeats needed to flag a record |
groupingColumns |
Character. The column(s) to group the analysis by and search for fill-down errors within. Default = c("eventDate", "recordedBy", "datasetName"). |
ndec |
Numeric. The number of decimal places below which records will not be considered
in the diagonAlley function. This is fed into |
stepSize |
Numeric. The number of occurrences to process in each chunk. Default = 1000000. |
mc.cores |
Numeric. If > 1, the function will run in parallel using mclapply using the number of cores specified. If = 1 then it will be run using a serial loop. NOTE: Windows machines must use a value of 1 (see ?parallel::mclapply). Additionally, be aware that each thread can use large chunks of memory. Default = 1. |
The sliding window (and hence fill-down errors) will only be examined within the user-defined groupingColumns; if any of those columns are empty, that record will be excluded.
The function returns the input data with a new column, .sequential, where FALSE = records that have consecutive latitudes or longitudes greater than or equal to the user-defined threshold.
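To make the idea of "coordinates at regular intervals" concrete, here is a simplified base R sketch for one group of records. It swaps the package's sliding window for run-length encoding of the step sizes, so it should be read as an illustration of the concept only; the helper name `flagFillDown` and the toy latitudes are hypothetical.

```r
# Flag runs of consecutive values that change by the same non-zero step.
# A run of k equal steps spans k + 1 records; it is flagged when that span
# reaches minRepeats. Returns TRUE for passed and FALSE for flagged records.
flagFillDown <- function(x, minRepeats, ndec = 3) {
  # Work with whole-number steps at ndec decimal places to avoid
  # floating-point noise when comparing differences
  steps <- round(diff(x) * 10^ndec)
  runs <- rle(steps)
  ends <- cumsum(runs$lengths)
  starts <- ends - runs$lengths + 1
  hit <- runs$values != 0 & (runs$lengths + 1) >= minRepeats
  passed <- rep(TRUE, length(x))
  for (i in which(hit)) {
    passed[starts[i]:(ends[i] + 1)] <- FALSE
  }
  passed
}

# Toy latitudes: the first four rise by 0.001 each, as in a spreadsheet fill-down
lat <- c(10.001, 10.002, 10.003, 10.004, 12.5, 12.5, 13.75)
sequentialFlag <- flagFillDown(lat, minRepeats = 4)
sequentialFlag
```

In this sketch identical repeated coordinates (a step of zero) are not treated as fill-down errors, which is an assumption rather than documented behaviour.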
# Read in the example data
data(beesRaw)
# Run the function
beesRaw_out <- diagonAlley(
  data = beesRaw,
  # The minimum number of repeats needed to find a sequence in for flagging
  minRepeats = 4,
  groupingColumns = c("eventDate", "recordedBy", "datasetName"),
  ndec = 3,
  stepSize = 1000000,
  mc.cores = 1)
This function sets up a directory for saving outputs (i.e. data, figures) generated through the use of the BeeBDC package, if the required folders do not already exist.
dirMaker( RootPath = RootPath, ScriptPath = NULL, DataPath = NULL, DataSubPath = "/Data_acquisition_workflow", DiscLifePath = NULL, OutPath = NULL, OutPathName = "Output", Report = TRUE, Check = TRUE, Figures = TRUE, Intermediate = TRUE, RDoc = NULL, useHere = TRUE )
RootPath |
A character String. The |
ScriptPath |
A character String. The |
DataPath |
A character string. The path to the folder containing bee occurrence data to be flagged and/or cleaned |
DataSubPath |
A character String. If a |
DiscLifePath |
A character String. The path to the folder which contains data from Ascher and Pickering's Discover Life website. |
OutPath |
A character String. The path to the folder where output data will be saved. |
OutPathName |
A character String. The name of the |
Report |
Logical. If TRUE, function creates a "Report" folder within the OutPath-defined folder. Default = TRUE. |
Check |
Logical. If TRUE, function creates a "Check" folder within the OutPath-defined folder. Default = TRUE. |
Figures |
Logical. If TRUE, function creates a "Figures" folder within the OutPath-defined folder. Default = TRUE. |
Intermediate |
Logical. If TRUE, function creates an "Intermediate" folder within the OutPath-defined folder in which to save intermediate datasets. Default = TRUE. |
RDoc |
A character String. The path to the current script or report, relative to the project
root. Passing an absolute path raises an error. This argument is used by |
useHere |
Logical. If TRUE, dirMaker will use |
Returns a list containing the BeeBDC-required directories, which can then be added to your global environment. This function should be run at the start of each session. Additionally, this function will create the BeeBDC-required folders if they do not already exist in the supplied directory.
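The folder layout described by the Report, Check, Figures, and Intermediate arguments can be sketched in base R as follows. This is an assumption-laden illustration rather than what dirMaker does internally: the folder "dirMakerSketch" is invented, and of the list names used here only OutPath_Report and OutPath_Figures appear elsewhere in this manual; the other two are formed by analogy.

```r
# Build an Output folder with the four sub-folders under a temporary root
outRoot <- file.path(tempdir(), "dirMakerSketch", "Output")
subFolders <- c("Report", "Check", "Figures", "Intermediate")
paths <- file.path(outRoot, subFolders)

# Create each folder only if it does not exist yet
for (p in paths) {
  if (!dir.exists(p)) {
    dir.create(p, recursive = TRUE)
  }
}

# Collect the paths in a named list, ready to hand to list2env()
pathList <- as.list(stats::setNames(paths, paste0("OutPath_", subFolders)))
names(pathList)
```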
# load dplyr
library(dplyr)
# Standard/basic usage:
RootPath <- tempdir()
dirMaker(
  RootPath = RootPath,
  # Input the location of the workflow script RELATIVE to the RootPath
  RDoc = NULL,
  useHere = FALSE) %>%
  # Add paths created by this function to the environment()
  list2env(envir = environment())
# Custom OutPathName provided
dirMaker(
  RootPath = RootPath,
  # Set some custom OutPath info
  OutPath = NULL,
  OutPathName = "T2T_Output",
  # Input the location of the workflow script RELATIVE to the RootPath
  RDoc = NULL,
  useHere = FALSE) %>%
  # Add paths created by this function to the environment()
  list2env(envir = environment())
# Set the working directory
# Further customisations are also possible
dirMaker(
  RootPath = RootPath,
  ScriptPath = "...path/Bee_SDM_paper/BDC_repo/BeeBDC/R",
  DiscLifePath = "...path/BDC_repo/DiscoverLife_Data",
  OutPathName = "AsianPerspective_Output",
  # Input the location of the workflow script RELATIVE to the RootPath
  RDoc = NULL,
  useHere = FALSE) %>%
  # Add paths created by this function to the environment()
  list2env(envir = environment())
Creates a plot with two bar graphs. One shows the absolute number of duplicate records for each data source while the other shows the proportion of records that are duplicated within each data source. This function requires a dataset that has been run through dupeSummary().
dupePlotR( data = NULL, outPath = NULL, fileName = NULL, legend.position = c(0.85, 0.8), base_height = 7, base_width = 7, ..., dupeColours = c("#F2D2A2", "#B9D6BC", "#349B90"), returnPlot = FALSE )
data |
A data frame or tibble. Occurrence records as input. |
outPath |
Character. The path to a directory (folder) in which the output should be saved. |
fileName |
Character. The name of the output file, ending in '.pdf'. |
legend.position |
The position of the legend as coordinates. Default = c(0.85, 0.8). |
base_height |
Numeric. The height of the plot in inches. Default = 7. |
base_width |
Numeric. The width of the plot in inches. Default = 7. |
... |
Other arguments to be used to change factor levels of data sources. |
dupeColours |
A vector of colours for the levels duplicate, kept duplicate, and unique. Default = c("#F2D2A2","#B9D6BC", "#349B90"). |
returnPlot |
Logical. If TRUE, return the plot to the environment. Default = FALSE. |
Outputs a .pdf figure.
# This example will show a warning for the factor levels that are not present in the specific
# test dataset
dupePlotR(
  data = beesFlagged,
  # The outPath to save the plot as
  # Should be something like:
  #paste0(OutPath_Figures, "/duplicatePlot_TEST.pdf"),
  outPath = tempdir(),
  fileName = "duplicatePlot_TEST.pdf",
  # Colours in order: duplicate, kept duplicate, unique
  dupeColours = c("#F2D2A2","#B9D6BC", "#349B90"),
  # Plot size and height
  base_height = 7, base_width = 7,
  legend.position = c(0.85, 0.8),
  # Extra variables can be fed into forcats::fct_recode() to change names on plot
  GBIF = "GBIF", SCAN = "SCAN", iDigBio = "iDigBio", USGS = "USGS", ALA = "ALA",
  ASP = "ASP", CAES = "CAES", 'B. Mont.' = "BMont", 'B. Minckley' = "BMin", Ecd = "Ecd",
  Gaiarsa = "Gai", EPEL = "EPEL", Lic = "Lic", Bal = "Bal", Arm = "Arm"
)
This function uses user-specified inputs and columns to identify duplicate occurrence records. Duplicates are identified iteratively and will be tallied up, duplicate pairs clustered, and sorted at the end of the function. The function is designed to work with Darwin Core data with a database_id column, but it is also modifiable to work with other columns.
dupeSummary( data = NULL, path = NULL, duplicatedBy = NULL, completeness_cols = NULL, idColumns = NULL, collectionCols = NULL, collectInfoColumns = NULL, CustomComparisonsRAW = NULL, CustomComparisons = NULL, sourceOrder = NULL, prefixOrder = NULL, dontFilterThese = c(".gridSummary", ".lonFlag", ".latFlag", ".uncer_terms", ".uncertaintyThreshold", ".unLicensed"), characterThreshold = 2, numberThreshold = 3, numberOnlyThreshold = 5, catalogSwitch = TRUE )
data |
A data frame or tibble. Occurrence records as input. |
path |
A character path to the location where the duplicateRun_ file will be saved. |
duplicatedBy |
A character vector. Options are c("ID", "collectionInfo", "both"). "ID" columns runs through a series of ID-only columns defined by idColumns. "collectionInfo" runs through a series of columns defined by collectInfoColumns, which are checked in combination with collectionCols. "both" runs both of the above. |
completeness_cols |
A character vector. A set of columns that are used to order and select
duplicates by. For each occurrence, this function will calculate the sum of |
idColumns |
A character vector. The columns to be checked individually for internal duplicates. Intended for use with ID columns only. |
collectionCols |
A character vector. The columns to be checked in combination with each of the collectInfoColumns. |
collectInfoColumns |
A character vector. The columns to be checked in combination with all of the collectionCols columns. |
CustomComparisonsRAW |
A list of character vectors. Custom comparisons - as a list of columns to iteratively compare for duplicates. These differ from the CustomComparisons in that they ignore the minimum number and character thresholds for IDs. |
CustomComparisons |
A list of character vectors. Custom comparisons - as a list of columns to iteratively compare for duplicates. These comparisons are made after character and number thresholds are accounted for in ID columns. |
sourceOrder |
A character vector. The order in which you want to KEEP duplicates based on the dataSource column (i.e. what order to prioritize data sources). NOTE: These dataSources are simplified to the string prior to the first "_". Hence, "GBIF_Anthophyla" becomes "GBIF". |
prefixOrder |
A character vector. Like sourceOrder, except based on the database_id prefix, rather than the dataSource. Additionally, this is only examined if prefixOrder != NULL. Default = NULL. |
dontFilterThese |
A character vector. This should contain the flag columns to be ignored
in the creation or updating of the .summary column. Passed to |
characterThreshold |
Numeric. The complexity threshold for ID letter length. This is the minimum number of characters that need to be present in ADDITION TO the numberThreshold for an ID number to be tested for duplicates. Ignored by CustomComparisonsRAW. The columns that are checked are occurrenceID, recordId, id, catalogNumber, and otherCatalogNumbers. Default = 2. |
numberThreshold |
Numeric. The complexity threshold for ID number length. This is the minimum number of numeric characters that need to be present in ADDITION TO the characterThreshold for an ID number to be tested for duplicates. Ignored by CustomComparisonsRAW. The columns that are checked are occurrenceID, recordId, id, catalogNumber, and otherCatalogNumbers. Default = 3. |
numberOnlyThreshold |
Numeric. As numberThreshold except the characterThreshold is ignored. Default = 5. |
catalogSwitch |
Logical. If TRUE, and the catalogNumber is empty, the function will copy over the otherCatalogNumbers into catalogNumber and vice versa. Hence, the function will attempt to match more catalog numbers, as both of these columns can be problematic. Default = TRUE. |
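One way to read the three complexity thresholds is sketched below in base R. The helper `idIsComplex` is hypothetical, and the rule it encodes (enough letters and enough digits, or enough digits alone) is an interpretation of the argument descriptions above rather than the function's confirmed logic.

```r
# TRUE when an ID is "complex" enough to be compared for duplicates:
# either it has at least characterThreshold letters AND numberThreshold digits,
# or it has at least numberOnlyThreshold digits regardless of letters.
idIsComplex <- function(id, characterThreshold = 2, numberThreshold = 3,
                        numberOnlyThreshold = 5) {
  nLetters <- nchar(gsub("[^A-Za-z]", "", id))
  nDigits <- nchar(gsub("[^0-9]", "", id))
  (nLetters >= characterThreshold & nDigits >= numberThreshold) |
    nDigits >= numberOnlyThreshold
}

# Hypothetical catalog numbers
ids <- c("AB123", "A12", "123456", "1234", "USNM-0042")
complexEnough <- idIsComplex(ids)
complexEnough
```

Short identifiers such as "A12" or "1234" are easily shared between unrelated records, which is the motivation for leaving them out of ID-based matching in this reading.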
Returns data with an additional column called .duplicates where FALSE occurrences are duplicates and TRUE occurrences are either kept duplicates or unique. Also exports a .csv to the user-specified location with information about duplicate matching. This file is used by other functions including manualOutlierFindeR() and chordDiagramR().
See chordDiagramR() for creating a chord diagram to visualise linkages between dataSources and dupePlotR() to visualise the numbers and proportions of duplicates in each dataSource.
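The sourceOrder note (data sources are simplified to the string before the first "_") can be illustrated with a short base R sketch. It only shows how one record might be chosen from a single cluster of duplicates by source priority; the real function also weighs completeness and other criteria, and the example dataSource values are made up apart from "GBIF_Anthophyla", which is taken from the argument description.

```r
# dataSource values for three records assumed to be duplicates of each other
dataSource <- c("GBIF_Anthophyla", "USGS_data", "ALA_Apiformes")

# Simplify each source to the text before the first underscore
simpleSource <- sub("_.*$", "", dataSource)

# Earlier entries in sourceOrder have higher priority
sourceOrder <- c("USGS", "ALA", "GBIF")
priority <- match(simpleSource, sourceOrder)

# Keep the record from the highest-priority source
kept <- dataSource[which.min(priority)]
kept
```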
beesFlagged_out <- dupeSummary(
  data = BeeBDC::beesFlagged,
  # Should start with paste0(DataPath, "/Output/Report/"), instead of tempdir():
  path = paste0(tempdir(), "/"),
  # options are "ID","collectionInfo", or "both"
  duplicatedBy = "collectionInfo",
  # I'm only running ID for the first lot because we might
  # recover other info later
  # The columns to generate completeness info from
  completeness_cols = c("decimalLatitude", "decimalLongitude",
                        "scientificName", "eventDate"),
  # idColumns = c("gbifID", "occurrenceID", "recordId","id"),
  # The columns to ADDITIONALLY consider when finding duplicates in collectionInfo
  collectionCols = c("decimalLatitude", "decimalLongitude", "scientificName", "eventDate",
                     "recordedBy"),
  # The columns to combine, one-by-one with the collectionCols
  collectInfoColumns = c("catalogNumber", "otherCatalogNumbers"),
  # Custom comparisons - as a list of columns to compare
  # RAW custom comparisons do not use the character and number thresholds
  CustomComparisonsRAW = dplyr::lst(c("catalogNumber", "institutionCode", "scientificName")),
  # Other custom comparisons use the character and number thresholds
  CustomComparisons = dplyr::lst(c("gbifID", "scientificName"),
                                 c("occurrenceID", "scientificName"),
                                 c("recordId", "scientificName"),
                                 c("id", "scientificName")),
  # The order in which you want to KEEP duplicated based on data source
  # try unique(check_time$dataSource)
  sourceOrder = c("CAES", "Gai", "Ecd","BMont", "BMin", "EPEL", "ASP", "KP",
                  "EcoS", "EaCO", "FSCA", "Bal", "SMC", "Lic", "Arm",
                  "USGS", "ALA", "GBIF","SCAN","iDigBio"),
  # !!!!!! BELS > GeoLocate
  # Set the complexity threshold for id letter and number length
  # minimum number of characters when WITH the numberThreshold
  characterThreshold = 2,
  # minimum number of numbers when WITH the characterThreshold
  numberThreshold = 3,
  # Minimum number of numbers WITHOUT any characters
  numberOnlyThreshold = 5)
A function which can be used to find files within a user-defined directory based on a user-provided character string.
fileFinder(path, fileName)
path |
A directory as character. The directory to recursively search. |
fileName |
A character/regex string. The file name to find. |
Returns the path to the most recent file that matches the provided file name. Using regex can greatly improve specificity. The function will also write the file that it has found to the console; it is worthwhile to check that this is the correct file to avoid complications down the line.
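A recursive, regex-based search for the newest matching file can be sketched with base R alone. This is not the function's source: the folder "fileFinderSketch" and both file names are invented, and "most recent" is assumed here to mean the latest modification time.

```r
# Set up a small folder tree with an older and a newer matching file
searchDir <- file.path(tempdir(), "fileFinderSketch")
dir.create(file.path(searchDir, "sub"), recursive = TRUE, showWarnings = FALSE)
oldFile <- file.path(searchDir, "beesRaw_old.csv")
newFile <- file.path(searchDir, "sub", "beesRaw_new.csv")
writeLines("placeholder", oldFile)
writeLines("placeholder", newFile)
# Make the first file look an hour older
Sys.setFileTime(oldFile, Sys.time() - 3600)

# Search recursively; a regex such as "beesRaw.*\\.csv$" is more specific than "beesRaw"
hits <- list.files(searchDir, pattern = "beesRaw.*\\.csv$",
                   recursive = TRUE, full.names = TRUE)

# ASSUMPTION: pick the match with the latest modification time
newest <- hits[which.max(as.numeric(file.mtime(hits)))]
basename(newest)
```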
# load dplyr
library(dplyr)
# Make the RootPath to the tempdir for this example
RootPath <- tempdir()
# Load the example data
data("beesRaw", package = "BeeBDC")
# Save an example dataset to the temp dir
readr::write_csv(beesRaw, file = paste0(RootPath, "/beesRaw.csv"))
# Now go find it!
fileFinder(path = RootPath, fileName = "beesRaw")
# more specifically the .csv version
fileFinder(path = RootPath, fileName = "beesRaw.csv")
Flags occurrences that are "ABSENT" for the occurrenceStatus (or some other user-specified) column.
flagAbsent(data = NULL, PresAbs = "occurrenceStatus")
data |
A data frame or tibble. Occurrence records as input. |
PresAbs |
Character. The column in which the function will find "ABSENT" or "PRESENT" records. Default = "occurrenceStatus" |
The input data with a new column called ".occurrenceAbsent" where FALSE == "ABSENT" records.
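In base R the flag amounts to a single comparison, sketched below on an invented vector. Two assumptions are made that the text above does not confirm: matching is exact and case-sensitive, and records with a missing status are treated as passing.

```r
# Hypothetical occurrenceStatus values
occurrenceStatus <- c("PRESENT", "ABSENT", NA, "PRESENT")

# FALSE marks records recorded as "ABSENT"; everything else passes
# ASSUMPTION: exact, case-sensitive match; NA is not flagged
occurrenceAbsentFlag <- !(occurrenceStatus %in% "ABSENT")
table(occurrenceAbsentFlag, useNA = "always")
```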
# Bring in the data
data(beesRaw)
# Run the function
beesRaw_out <- flagAbsent(data = beesRaw,
  PresAbs = "occurrenceStatus")
# See the result
table(beesRaw_out$.occurrenceAbsent, useNA = "always")
This function will search for strings that indicate a record is restricted in its use and will flag the restricted records.
flagLicense(data = NULL, strings_to_restrict = "all", excludeDataSource = NULL)
data |
A data frame or tibble. Occurrence records as input. |
strings_to_restrict |
A character vector. Should contain the strings used to detect protected records. The default, "all", uses c("All Rights Reserved", "All rights reserved", "All rights reserved.", "ND", "Not for public"). |
excludeDataSource |
Optional. A character vector. A vector of the data sources (dataSource) that will not be flagged as protected, even if they are. This is useful if you have a private dataset that should be listed as "All rights reserved" which you want to be ignored by this flag. |
Returns the data with a new column, .unLicensed, where FALSE = records that are protected by a license.
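The string search and the excludeDataSource exemption can be sketched in base R as follows. The sketch assumes the restriction strings are looked for in a single `license` column, which is a guess for illustration (the text above does not say which columns are searched), and the toy records are invented.

```r
# Toy records (hypothetical sources and license strings)
occ <- data.frame(
  dataSource = c("GBIF", "GBIF", "Private", "ALA"),
  license = c("CC BY 4.0", "CC BY-NC-ND 4.0", "All rights reserved", NA)
)
restrictStrings <- c("All Rights Reserved", "All rights reserved",
                     "All rights reserved.", "ND", "Not for public")
excludeDataSource <- "Private"

# TRUE where any restriction string occurs in the license text (literal match)
restricted <- Reduce(`|`, lapply(restrictStrings, function(s) {
  grepl(s, occ$license, fixed = TRUE)
}))

# Records from excluded data sources are never flagged
exempt <- occ$dataSource %in% excludeDataSource
occ$.unLicensed <- !(restricted & !exempt)
occ
```

The "ND" string catches no-derivatives licenses such as "CC BY-NC-ND 4.0", while the private dataset keeps its TRUE value despite being marked "All rights reserved".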
# Read in the example data
data("beesRaw")
# Run the function
beesRaw_out <- flagLicense(data = beesRaw,
  strings_to_restrict = "all",
  # DON'T flag if in the following dataSource(s)
  excludeDataSource = NULL)
This function is used to save the flag data for your occurrence data as you run the BeeBDC script. It will read and append existing files, if asked to. Your flags should also be saved in the occurrence file itself automatically.
flagRecorder( data = NULL, outPath = NULL, fileName = NULL, idColumns = c("database_id", "id", "catalogNumber", "occurrenceID", "dataSource"), append = NULL, printSummary = FALSE )
data |
A data frame or tibble. Occurrence records as input. |
outPath |
A character path. Where the file should be saved. |
fileName |
Character. The name of the file to be saved. |
idColumns |
A character vector. The names of the columns that are to be kept along with the flag columns. These columns should be useful for identifying unique records with flags. Default = c("database_id", "id", "catalogNumber", "occurrenceID", "dataSource"). |
append |
Logical. If TRUE, this will find and append an existing file generated by this function. |
printSummary |
Logical. If TRUE, print a |
Saves a file with id and flag columns and returns this as an object.
# Load the example data
data("beesFlagged")
# Run the function
OutPath_Report <- tempdir()
flagFile <- flagRecorder(
  data = beesFlagged,
  outPath = paste(OutPath_Report, sep = ""),
  fileName = paste0("flagsRecorded_", Sys.Date(), ".csv"),
  # These are the columns that will be kept along with the flags
  idColumns = c("database_id", "id", "catalogNumber", "occurrenceID", "dataSource"),
  # TRUE if you want to find a file from a previous part of the script to append to
  append = FALSE)
Takes a flagged dataset and returns the total number of fails (FALSE) per flag (columns starting with ".") and per species. It will ignore the .scientificName_empty and .invalidName columns as species are not assigned. Users may define the column to group the summary by. While it is intended to work with the scientificName column, users may select any grouping column (e.g., country).
flagSummaryTable( data = NULL, column = "scientificName", outPath = OutPath_Report, fileName = "flagTable.csv", percentImpacted = TRUE, percentThreshold = 0 )
data |
A data frame or tibble. The flagged dataset. |
column |
Character. The name of the column to group by and summarise the failed occurrences. Default = "scientificName". |
outPath |
A character path. The path to the directory in which the file will be saved. Default = OutPath_Report. If NULL, no file will be saved to disk. |
fileName |
Character. The name of the file to be saved, ending in ".csv". Default = "flagTable.csv". |
percentImpacted |
Logical. If TRUE (the default), the program will write the percentage of species impacted and over the percentThreshold for each flagging column. |
percentThreshold |
Numeric. A number between 0 and 100 giving the percentage of records within each species that must be exceeded (>) by a flag's failures for that species to count towards percentImpacted. Default = 0. |
A tibble with a column for each flag column (starting with ".") showing the number of failed (FALSE) occurrences per group. Also shows the (i) total number of records, (ii) total number of failed records, and (iii) the percentage of failed records.
# Load the toy flagged bee data
data("beesFlagged")
# Run the function and build the flag table
flagTibble <- flagSummaryTable(data = beesFlagged,
                               column = "scientificName",
                               outPath = paste0(tempdir()),
                               fileName = "flagTable.csv")
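To make the shape of the returned table more concrete, here is a minimal base R sketch that counts failed (FALSE) values per flag column and per group. It is only an approximation of the idea under stated assumptions: the toy data frame and the object names (`flagged`, `failsPerGroup`) are hypothetical, and the real function returns a tibble with additional summary columns.

```r
# Toy flagged data: flag columns start with "." and FALSE means "failed"
flagged <- data.frame(
  scientificName     = c("Apis mellifera", "Apis mellifera", "Bombus terrestris"),
  .coordinates_empty = c(TRUE, FALSE, TRUE),
  .rou               = c(FALSE, FALSE, TRUE),
  stringsAsFactors   = FALSE
)

# Find the flag columns by their leading "."
flagCols <- grep("^\\.", names(flagged), value = TRUE)

# Invert the flags so that TRUE marks a failure, then sum failures per group
failed <- as.data.frame(!as.matrix(flagged[flagCols]))
failsPerGroup <- aggregate(failed,
                           by = list(scientificName = flagged$scientificName),
                           FUN = sum)

# Add the total number of records in each group
failsPerGroup$totalRecords <-
  as.vector(table(flagged$scientificName)[failsPerGroup$scientificName])
print(failsPerGroup)
```

Swapping `flagged$scientificName` for another column (for example, country) changes the grouping in the same way that the column argument does.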
Merges the Darwin Core version of the USGS dataset that was created using USGS_formatter() with the main dataset.
formattedCombiner(path, strings, existingOccurrences, existingEMLs)
path |
A directory as character. The directory to look in for the formatted USGS data. |
strings |
A regex string. The string to find the most-recent formatted USGS dataset. |
existingOccurrences |
A data frame. The existing occurrence dataset. |
existingEMLs |
An EML file. The existing EML data file to be appended. |
A list with the combined occurrence dataset and the updated EML file.
## Not run:
DataPath <- tempdir()
strings = c("USGS_DRO_flat_27-Apr-2022")
# Combine the USGS data and the existing big dataset
Complete_data <- formattedCombiner(path = DataPath,
                                   strings = strings,
                                   # This should be the list-format with eml attached
                                   existingOccurrences = DataImp$Data_WebDL,
                                   existingEMLs = DataImp$eml_files)
## End(Not run)
This function will flag records which are subject to a user-specified vector of GBIF issues.
GBIFissues(data = NULL, issueColumn = "issue", GBIFflags = NULL)
data |
A data frame or tibble. Occurrence records as input. |
issueColumn |
Character. The column in which to look for GBIF issues. Default = "issue". |
GBIFflags |
Character vector. The GBIF issues to flag. Users may supply their own vector of issues or use one or more of the pre-set vectors: "allDates", "allMetadata", "allObservations", "allSpatial", "allTaxo", or "all". Default = c("COORDINATE_INVALID", "PRESUMED_NEGATED_LONGITUDE", "PRESUMED_NEGATED_LATITUDE", "COUNTRY_COORDINATE_MISMATCH", "ZERO_COORDINATE") |
Returns the data with a new column, ".GBIFflags", where FALSE = records with any of the provided GBIFflags.
# Import the example data
data(beesRaw)
# Run the function
beesRaw_Out <- GBIFissues(data = beesRaw,
                          issueColumn = "issue",
                          GBIFflags = c("COORDINATE_INVALID", "ZERO_COORDINATE"))
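The flagging idea can be sketched in base R as follows. This is an illustrative approximation rather than the package's code, and it assumes that multiple issues in the issue column are separated by semicolons; the toy `issues` vector is hypothetical.

```r
# Toy issue strings; the ";" separator is an assumption for this sketch
issues <- c("COORDINATE_ROUNDED;ZERO_COORDINATE",
            "",
            "COUNTRY_COORDINATE_MISMATCH",
            NA)
# The issues the user wants to flag
GBIFflags <- c("COORDINATE_INVALID", "ZERO_COORDINATE")

# Split each record's issue string into its individual issues (NA treated as none)
splitIssues <- strsplit(ifelse(is.na(issues), "", issues), ";", fixed = TRUE)

# TRUE where a record carries at least one of the requested issues
hasFlag <- vapply(splitIssues, function(x) any(x %in% GBIFflags), logical(1))

# FALSE = record has one or more of the provided GBIF issues
flagColumn <- !hasFlag
print(flagColumn)
```

Here only the first record is flagged (FALSE), because COUNTRY_COORDINATE_MISMATCH was not among the requested issues.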
Uses the Discover Life taxonomy to harmonise bee occurrences and flag those that do not match the checklist. harmoniseR() prefers to use the names_clean column that is generated by bdc::bdc_clean_names(). While this is not required, you may find better results by running that function on your dataset first.
This function could be repurposed to service other taxa if a user matched the format of the beesTaxonomy() file.
harmoniseR( data = NULL, path = NULL, taxonomy = BeeBDC::beesTaxonomy(), speciesColumn = "scientificName", rm_names_clean = TRUE, checkVerbatim = FALSE, stepSize = 1e+06, mc.cores = 1 )
data |
A data frame or tibble. Occurrence records as input. |
path |
A directory as character. The path to a folder in which the output can be saved. |
taxonomy |
A data frame or tibble. The bee taxonomy to use.
Default = |
speciesColumn |
Character. The name of the column containing species names. Default = "scientificName". |
rm_names_clean |
Logical. If TRUE then the names_clean column will be removed at the end of this function to help reduce confusion about this column later. Default = TRUE |
checkVerbatim |
Logical. If TRUE then the verbatimScientificName will be checked as well
for species matches. This matching will ONLY be done after harmoniseR has failed for the other
name columns. NOTE: this column is not first run through |
stepSize |
Numeric. The number of occurrences to process in each chunk. Default = 1000000. |
mc.cores |
Numeric. If > 1, the function will run in parallel using mclapply using the number of cores specified. If = 1 then it will be run using a serial loop. NOTE: Windows machines must use a value of 1 (see ?parallel::mclapply). Additionally, be aware that each thread can use large chunks of memory. Default = 1. |
The occurrences are returned with updated taxonomy columns, including: scientificName, species, family, subfamily, genus, subgenus, specificEpithet, infraspecificEpithet, and scientificNameAuthorship. A new column, .invalidName, is also added and is FALSE when the occurrence's name did not match the supplied taxonomy.
taxadbToBeeBDC() to download any taxonomy (of any taxa or of bees) and beesTaxonomy() for the bee taxonomy download.
# load in the test dataset
system.file("extdata", "testTaxonomy.rda", package="BeeBDC") |> load()
beesRaw_out <- BeeBDC::harmoniseR(
  # The path to a folder in which the output can be saved
  path = tempdir(),
  # The formatted taxonomy file
  taxonomy = testTaxonomy,
  data = BeeBDC::beesFlagged,
  speciesColumn = "scientificName")
table(beesRaw_out$.invalidName, useNA = "always")
This function attempts to match database_ids from a prior bdc or BeeBDC run in order to keep this column somewhat consistent between iterations. However, not all records contain sufficient information for this to work flawlessly.
idMatchR( currentData = NULL, priorData = NULL, matchBy = NULL, completeness_cols = NULL, excludeDataset = NULL )
currentData |
A data frame or tibble. The NEW occurrence records as input. |
priorData |
A data frame or tibble. The PRIOR occurrence records as input. |
matchBy |
A list of character vectors. Should contain the columns to iteratively compare. |
completeness_cols |
A character vector. The columns to check for completeness, arrange, and assign the relevant prior database_id. |
excludeDataset |
A character vector. The dataSources that are to be excluded from data matching. These should be static dataSources from minor providers. |
The input data frame returned with an updated database_id column that shows the database_ids as in priorData where they could be matched. Additionally, a column called idContinuity is returned, where TRUE indicates a match to a prior database_id and FALSE indicates that a new database_id was assigned.
# Get the example data
data("beesRaw", package = "BeeBDC")
# Which datasets are static and should be excluded from matching?
excludeDataset <- c("BMin", "BMont", "CAES", "EaCO", "Ecd", "EcoS", "Gai", "KP", "EPEL",
                    "USGS", "FSCA", "SMC", "Bal", "Lic", "Arm", "BBD", "MEPB")
# Match the data to itself just as an example of running the code.
beesRaw_out <- idMatchR(
  currentData = beesRaw,
  priorData = beesRaw,
  # First matches will be given preference over later ones
  matchBy = dplyr::lst(c("gbifID"),
                       c("catalogNumber", "institutionCode", "dataSource"),
                       c("occurrenceID", "dataSource"),
                       c("recordId", "dataSource"),
                       c("id"),
                       c("catalogNumber", "institutionCode")),
  # You can exclude datasets from prior by matching their prefixes - before first underscore:
  excludeDataset = excludeDataset)
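The iterative matching that matchBy describes can be pictured with the base R sketch below, in which earlier column sets take preference over later ones and records with missing key values are never matched. It is a simplified illustration and not the package's algorithm: the toy data, the `makeKey()` helper, and the "|" key separator are hypothetical, and it does not guard against one prior record being matched to several current records.

```r
# Toy current and prior datasets (hypothetical values)
current <- data.frame(
  database_id  = c("new_1", "new_2", "new_3"),
  gbifID       = c("10", NA, NA),
  occurrenceID = c("x", "y", "z"),
  dataSource   = c("GBIF", "SCAN", "SCAN"),
  stringsAsFactors = FALSE
)
prior <- data.frame(
  database_id  = c("old_7", "old_8"),
  gbifID       = c("10", NA),
  occurrenceID = c("q", "y"),
  dataSource   = c("GBIF", "SCAN"),
  stringsAsFactors = FALSE
)

# Column sets to compare, in order of preference
matchBy <- list(c("gbifID"), c("occurrenceID", "dataSource"))

# Hypothetical helper: paste the key columns together, NA if any part is missing
makeKey <- function(d, cols) {
  key <- do.call(paste, c(d[cols], sep = "|"))
  key[!stats::complete.cases(d[cols])] <- NA
  key
}

current$idContinuity <- FALSE
for (cols in matchBy) {
  # Position of each current record in the prior data (NA keys never match)
  hit <- match(makeKey(current, cols), makeKey(prior, cols), incomparables = NA)
  # Only fill records that were not already matched by an earlier column set
  use <- !is.na(hit) & !current$idContinuity
  current$database_id[use]  <- prior$database_id[hit[use]]
  current$idContinuity[use] <- TRUE
}
print(current)
```

The first record inherits "old_7" through gbifID, the second inherits "old_8" through occurrenceID plus dataSource, and the third keeps its new database_id.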
Looks for and imports the most-recent version of the occurrence data created by the repoMerge() function.
importOccurrences(path = path, fileName = "^BeeData_")
path |
A directory as a character. The directory to recursively look in for the above data. |
fileName |
Character. A String of text to look for the most-recent dataset.
Default = "^BeeData_". Find faults by modifying |
A list with a data frame of merged occurrence records, "Data_WebDL", and a list of EML files contained in "eml_files".
## Not run:
DataImp <- importOccurrences(path = DataPath)
## End(Not run)
Uses the occurrence data (preferably uncleaned) and outputs interactive .html maps, which can be opened in your browser, to a specific directory. The maps can highlight if an occurrence has passed all filtering (.summary == TRUE) or failed at least one filter (.summary == FALSE). This can be modified by first running summaryFun() to set the columns that you want to be highlighted. It can also highlight occurrences flagged as expert-identified or country outliers.
interactiveMapR( data = NULL, outPath = NULL, lon = "decimalLongitude", lat = "decimalLatitude", speciesColumn = "scientificName", speciesList = NULL, countryList = NULL, jitterValue = NULL, onlySummary = TRUE, overWrite = TRUE, TrueAlwaysTop = FALSE, excludeApis_mellifera = TRUE, pointColours = c("blue", "darkred", "#ff7f00", "black") )
data |
A data frame or tibble. Occurrence records to use as input. |
outPath |
A directory as character. Directory where to save output maps. |
lon |
Character. The name of the longitude column. Default = "decimalLongitude". |
lat |
Character. The name of the latitude column. Default = "decimalLatitude". |
speciesColumn |
Character. The name of the column containing species names (or another factor) to build individual maps from. Default = "scientificName". |
speciesList |
A character vector. Should contain species names as they appear in the speciesColumn to make maps of. User can also specify "ALL" in order to make maps of all species present in the data. Hence, a user may first filter their data and then use "ALL". |
countryList |
A character vector. Country names to map, or NULL to map ALL countries. |
jitterValue |
Numeric. The amount, in decimal degrees, to jitter the map points by - this is important for separating stacked points with the same coordinates. |
onlySummary |
Logical. If TRUE, the function will not look to plot country or expert-identified outliers in different colours. |
overWrite |
Logical. If TRUE, the function will overwrite existing files in the provided directory that have the same name. Default = TRUE. |
TrueAlwaysTop |
If TRUE, the quality (TRUE) points will always be displayed on top of other points. If FALSE, then whichever layer was turned on most-recently will be displayed on top. |
excludeApis_mellifera |
Logical. If TRUE, will not map records for Apis mellifera. Note: in most cases A. mellifera has too many points, and the resulting map will take a long time to make and be difficult to open. Default = TRUE. |
pointColours |
A character vector of colours. Provide colours, in order, for TRUE, FALSE, countryOutlier, and customOutlier. Default = c("blue", "darkred", "#ff7f00", "black"). |
Exports .html interactive maps of bee occurrences to the specified directory.
if(requireNamespace("leaflet")){
  OutPath_Figures <- tempdir()
  interactiveMapR(
    # occurrence data - start with entire dataset, filter down to these species
    data = BeeBDC::bees3sp, # %>%
      # Select only those species in the 100 randomly chosen
      # dplyr::filter(scientificName %in% beeData_interactive$scientificName),
      # Select only one species to map
      # dplyr::filter(scientificName %in% "Agapostemon sericeus (Forster, 1771)"),
    # Directory where to save files
    outPath = paste0(OutPath_Figures, "/interactiveMaps_TEST"),
    # lat long columns
    lon = "decimalLongitude",
    lat = "decimalLatitude",
    # Occurrence dataset column with species names
    speciesColumn = "scientificName",
    # Which species to map - a character vector of names or "ALL"
    # Note: "ALL" is defined AFTER filtering for country
    speciesList = "ALL",
    # studyArea
    countryList = NULL,
    # Point jitter to see stacked points - jitters an amount in decimal degrees
    jitterValue = 0.01,
    # If TRUE, it will only map the .summary column. Otherwise, it will map .summary
    # which will be over-written by countryOutliers and manualOutliers
    onlySummary = TRUE,
    excludeApis_mellifera = TRUE,
    overWrite = TRUE,
    # Colours for points which are flagged as TRUE, FALSE, countryOutlier, and customOutlier
    pointColours = c("blue", "darkred", "#ff7f00", "black")
  )
} # END if require
Because the bdc::bdc_country_from_coordinates() function is very RAM-intensive, this wrapper allows a user to specify chunk sizes and only analyse a small portion of the occurrence data at a time. The prefix jbd_ is used to highlight the difference between this function and the original bdc::bdc_country_from_coordinates().
jbd_CfC_chunker( data = NULL, lat = "decimalLatitude", lon = "decimalLongitude", country = "country", stepSize = 1e+06, chunkStart = 1, scale = "medium", path = tempdir(), mc.cores = 1 )
data |
A data frame or tibble. Occurrence records to use as input. |
lat |
Character. The name of the column to use as latitude. Default = "decimalLatitude". |
lon |
Character. The name of the column to use as longitude. Default = "decimalLongitude". |
country |
Character. The name of the column containing country names. Default = "country". |
stepSize |
Numeric. The number of occurrences to process in each chunk. Default = 1000000. |
chunkStart |
Numeric. The chunk number to start from. This can be > 1 when you need to restart the function from a certain chunk. For example, can be used if R failed unexpectedly. |
scale |
Passed to rnaturalearth's ne_countries(). Scale of map to return, one of 110, 50, 10 or 'small', 'medium', 'large'. Default = "medium". |
path |
Character. The directory path to a folder in which to save the running countrylist csv file. |
mc.cores |
Numeric. If > 1, the function will run in parallel using mclapply using the number of cores specified. If = 1 then it will be run using a serial loop. NOTE: Windows machines must use a value of 1 (see ?parallel::mclapply). Additionally, be aware that each thread can use large chunks of memory. Default = 1. |
A data frame containing database_ids and a country column that needs to be re-merged with the data input.
if(requireNamespace("rnaturalearthdata")){
  library("dplyr")
  data(beesFlagged)
  HomePath = tempdir()
  # Tibble of common issues in country names and their replacements
  commonProblems <- dplyr::tibble(
    problem = c('U.S.A.', 'US', 'USA', 'usa', 'UNITED STATES',
                'United States', 'U.S.A', 'MX', 'CA', 'Bras.', 'Braz.',
                'Brasil', 'CNMI', 'USA TERRITORY: PUERTO RICO'),
    fix = c('United States of America', 'United States of America',
            'United States of America', 'United States of America',
            'United States of America', 'United States of America',
            'United States of America', 'Mexico', 'Canada', 'Brazil', 'Brazil',
            'Brazil', 'Northern Mariana Islands', 'Puerto Rico'))
  beesFlagged <- beesFlagged %>%
    # Replace a name to test
    dplyr::mutate(country = stringr::str_replace_all(country, "Brazil", "Brasil"))
  beesFlagged_out <- countryNameCleanR(
    data = beesFlagged,
    commonProblems = commonProblems)
  suppressWarnings(
    countryOutput <- jbd_CfC_chunker(data = beesFlagged_out,
                                     lat = "decimalLatitude",
                                     lon = "decimalLongitude",
                                     country = "country",
                                     # How many rows to process at a time
                                     stepSize = 1000000,
                                     # Start row
                                     chunkStart = 1,
                                     path = HomePath,
                                     scale = "medium"),
    classes = "warning")
  # Left join these datasets
  beesFlagged_out <- left_join(beesFlagged_out, countryOutput, by = "database_id") %>%
    # merge the two country name columns into the "country" column
    dplyr::mutate(country = dplyr::coalesce(country.x, country.y)) %>%
    # remove the now redundant country columns
    dplyr::select(!c(country.x, country.y)) %>%
    # put the column back
    dplyr::relocate(country) %>%
    # Remove duplicates if they arose!
    dplyr::distinct()
  # Remove illegal characters
  beesFlagged_out$country <- beesFlagged_out$country %>%
    stringr::str_replace(., pattern = paste("\\[", "\\]", "\\?", sep = "|"), replacement = "")
} # END if require
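The chunking behaviour described by stepSize and chunkStart can be sketched in base R as below. This is a generic illustration under stated assumptions, not the package's code: `processInChunks()` and `lookup()` are hypothetical names, and the per-chunk work is a stand-in for the memory-hungry country lookup.

```r
# Hypothetical helper: apply FUN to successive row chunks of a data frame
processInChunks <- function(data, stepSize, chunkStart = 1, FUN) {
  nChunks <- ceiling(nrow(data) / stepSize)
  results <- vector("list", nChunks)
  for (i in seq_len(nChunks)) {
    # Skip chunks before chunkStart, e.g. when resuming after a failure
    if (i < chunkStart) next
    rows <- seq((i - 1) * stepSize + 1, min(i * stepSize, nrow(data)))
    results[[i]] <- FUN(data[rows, , drop = FALSE])
  }
  # Bind the per-chunk results back together (skipped chunks are NULL and dropped)
  do.call(rbind, results)
}

# Toy data and a cheap stand-in for the expensive per-chunk lookup
toy <- data.frame(database_id = paste0("id_", 1:10),
                  decimalLatitude = seq(-40, 5, by = 5),
                  stringsAsFactors = FALSE)
lookup <- function(chunk) {
  data.frame(database_id = chunk$database_id,
             hemisphere = ifelse(chunk$decimalLatitude < 0, "S", "N"),
             stringsAsFactors = FALSE)
}

# Process everything in chunks of four rows, then resume from the second chunk
allRows <- processInChunks(toy, stepSize = 4, FUN = lookup)
resumed <- processInChunks(toy, stepSize = 4, chunkStart = 2, FUN = lookup)
nrow(allRows)
nrow(resumed)
```

Because only one chunk is held by the lookup at a time, peak memory depends on stepSize rather than on the size of the full dataset, which is the motivation given above for the wrapper.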
Compares the stated country name in an occurrence record with the record's coordinates using rnaturalearth data. The prefix, jbd_, is meant to distinguish this function from the original bdc::bdc_coordinates_country_inconsistent().
This function will preferably use the countryCode and country_suggested columns generated by bdc::bdc_country_standardized(); please run it on your dataset prior to running this function.
jbd_coordCountryInconsistent( data = NULL, lon = "decimalLongitude", lat = "decimalLatitude", scale = 50, pointBuffer = 0.01, stepSize = 1e+06, mc.cores = 1 )
data |
A data frame or tibble. Occurrence records as input. |
lon |
Character. The name of the column to use as longitude. Default = "decimalLongitude". |
lat |
Character. The name of the column to use as latitude. Default = "decimalLatitude". |
scale |
Numeric or character. To be passed to |
pointBuffer |
Numeric. Amount to buffer points, in decimal degrees. If the point is outside of a country, but within this point buffer, it will not be flagged. Default = 0.01. |
stepSize |
Numeric. The number of occurrences to process in each chunk. Default = 1000000. |
mc.cores |
Numeric. If > 1, the st_intersects function will run in parallel using mclapply using the number of cores specified. If = 1 then it will be run using a serial loop. NOTE: Windows machines must use a value of 1 (see ?parallel::mclapply). Additionally, be aware that each thread can use large chunks of memory. Default = 1. |
The input occurrence data with a new column, .coordinates_country_inconsistent
if(requireNamespace("rnaturalearthdata")){
  beesRaw_out <- jbd_coordCountryInconsistent(
    data = BeeBDC::beesRaw,
    lon = "decimalLongitude",
    lat = "decimalLatitude",
    scale = 50,
    pointBuffer = 0.01)
} # END if require
This function flags occurrences where BOTH latitude and longitude values are rounded. This contrasts with the original function, bdc::bdc_coordinates_precision(), which will flag occurrences where only one of latitude OR longitude is rounded. The BeeBDC approach saves occurrences that may have had terminal zeros dropped in one coordinate column.
jbd_coordinates_precision( data, lat = "decimalLatitude", lon = "decimalLongitude", ndec = NULL, quieter = FALSE )
data |
A data frame or tibble. Occurrence records as input. |
lat |
Character. The name of the column to use as latitude. Default = "decimalLatitude". |
lon |
Character. The name of the column to use as longitude. Default = "decimalLongitude". |
ndec |
Numeric. The number of decimal places to flag in decimal degrees. For example, argument value of 2 would flag occurrences with nothing in the hundredths place (0.0x). |
quieter |
Logical. If TRUE, the function will run a little quieter. Default = FALSE. |
Returns the input data frame with a new column, .rou, where FALSE indicates occurrences that failed the test.
beesRaw_out <- jbd_coordinates_precision(
  data = BeeBDC::beesRaw,
  lon = "decimalLongitude",
  lat = "decimalLatitude",
  # number of decimals to be tested
  ndec = 2
)
table(beesRaw_out$.rou, useNA = "always")
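The difference between the "both" and "either" rules can be seen in the following base R sketch. It is an illustration under stated assumptions rather than the package's implementation: the `nDecimals()` helper is hypothetical, decimals are counted from the printed value, and the toy coordinates are invented.

```r
# Hypothetical helper: count the decimal places shown in a number's printed form
nDecimals <- function(x) {
  txt <- as.character(x)
  hasPoint <- grepl(".", txt, fixed = TRUE)
  out <- integer(length(txt))
  # Keep only what follows the decimal point and count its characters
  out[hasPoint] <- nchar(sub("^[^.]*\\.", "", txt[hasPoint]))
  out
}

# Toy coordinates (invented values)
coords <- data.frame(
  decimalLatitude  = c(-35.28, -35.3, -35, 10.5),
  decimalLongitude = c(149.13, 149.1287, 149, 20)
)
ndec <- 2

latRounded <- nDecimals(coords$decimalLatitude) < ndec
lonRounded <- nDecimals(coords$decimalLongitude) < ndec

# Rule described for this function: fail only if BOTH coordinates are rounded
coords$.rou <- !(latRounded & lonRounded)
# Stricter rule for comparison: fail if EITHER coordinate is rounded
coords$eitherRule <- !(latRounded | lonRounded)
print(coords)
```

The second record (-35.3, 149.1287) passes under the "both" rule but would fail under the "either" rule, which is the kind of record the description above aims to retain.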
This function flags and corrects records when latitude and longitude appear to be transposed.
This function will preferably use the countryCode column generated by bdc::bdc_country_standardized().
jbd_coordinates_transposed( data, idcol = "database_id", sci_names = "scientificName", lat = "decimalLatitude", lon = "decimalLongitude", country = "country", countryCode = "countryCode", border_buffer = 0.2, save_outputs = FALSE, fileName = NULL, scale = "large", path = NULL, mc.cores = 1 )
data |
A data frame or tibble. Containing a unique identifier for each record, geographical coordinates, and country names. Coordinates must be expressed in decimal degrees and WGS84. |
idcol |
A character string. The column name with a unique record identifier. Default = "database_id". |
sci_names |
A character string. The column name with species' scientific names. Default = "scientificName". |
lat |
A character string. The column name with latitudes. Coordinates must be expressed in decimal degrees and WGS84. Default = "decimalLatitude". |
lon |
A character string. The column name with longitudes. Coordinates must be expressed in decimal degrees and WGS84. Default = "decimalLongitude". |
country |
A character string. The column name with the country assignment of each occurrence record. Default = "country". |
countryCode |
A character string. The column name containing an ISO-2 country code for each record. |
border_buffer |
Numeric. Must have a value greater than or equal to 0. A distance in decimal degrees used to create a buffer around each country. Records within a given country and at a specified distance from the border will not be corrected. Default = 0.2 (~22 km at the equator). |
save_outputs |
Logical. Indicates if a table containing transposed coordinates should be saved for further inspection. Default = FALSE. |
fileName |
A character string. The out file's name. |
scale |
Passed to rnaturalearth's ne_countries(). Scale of map to return, one of 110, 50, 10 or 'small', 'medium', 'large'. Default = "large". |
path |
A character string. A path as a character vector for where to create the directories and save the figures. If no path is provided (the default), the directories will be created using |
mc.cores |
Numeric. If > 1, the jbd_correct_coordinates function will run in parallel using mclapply using the number of cores specified. If = 1 then it will be run using a serial loop. NOTE: Windows machines must use a value of 1 (see ?parallel::mclapply). Additionally, be aware that each thread can use large chunks of memory. Default = 1. |
This test identifies transposed coordinates based on mismatches between the country provided for a record and the record's latitude and longitude coordinates. Transposed coordinates often fall outside of the indicated country (i.e., in other countries or in the sea). Different coordinate transformations are performed to correct country/coordinates mismatches. Importantly, verbatim coordinates are replaced by the corrected ones in the returned database. A database containing verbatim and corrected coordinates is created in "Output/Check/01_coordinates_transposed.csv" if save_outputs == TRUE. The columns "country" and "countryCode" can be retrieved by using the function bdc::bdc_country_standardized.
A tibble containing the column "coordinates_transposed" which indicates if verbatim coordinates were not transposed (TRUE). Otherwise records are flagged as (FALSE) and, in this case, verbatim coordinates are replaced by corrected coordinates.
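The idea behind the test can be sketched in base R. This is only a minimal illustration under stated assumptions, not the package's implementation: a rough bounding box (illustrative values, not package data) stands in for the buffered country polygon, and `candidate_coords()` and `in_bbox()` are hypothetical helpers. The sketch tries swapped and sign-flipped versions of one record and keeps the candidates that land inside the stated country.

```r
# Hypothetical helper (not part of BeeBDC): list candidate corrections for one
# latitude/longitude pair by swapping the values and/or flipping their signs.
candidate_coords <- function(lat, lon) {
  base <- data.frame(
    swapped = c(FALSE, TRUE),
    lat = c(lat, lon),
    lon = c(lon, lat)
  )
  signs <- expand.grid(lat_sign = c(1, -1), lon_sign = c(1, -1))
  # No shared column names, so merge() returns the Cartesian product (2 x 4 rows)
  out <- merge(base, signs)
  out$lat <- out$lat * out$lat_sign
  out$lon <- out$lon * out$lon_sign
  out
}

# Hypothetical helper: is each point inside a simple bounding box?
in_bbox <- function(lat, lon, bbox) {
  lat >= bbox$lat[1] & lat <= bbox$lat[2] &
    lon >= bbox$lon[1] & lon <= bbox$lon[2]
}

# Rough, illustrative bounding box for Bolivia (assumed values)
bolivia_bbox <- list(lat = c(-23, -9.5), lon = c(-70, -57))

# A record labelled "BOLIVIA" whose verbatim coordinates fall far outside it
cands <- candidate_coords(lat = 63.43333, lon = -17.90000)
hits <- cands[in_bbox(cands$lat, cands$lon, bolivia_bbox), ]
hits
```

In this sketch the only candidate that falls inside the box is the one with latitude and longitude swapped and the sign of the longitude flipped. The real function works with country polygons and a border buffer rather than a box.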
if(requireNamespace("rnaturalearthdata")){
  database_id <- c(1, 2, 3, 4)
  scientificName <- c(
    "Rhinella major", "Scinax ruber",
    "Siparuna guianensis", "Psychotria vellosiana"
  )
  decimalLatitude <- c(63.43333, -14.43333, -41.90000, -46.69778)
  decimalLongitude <- c(-17.90000, -67.91667, -13.25000, -13.82444)
  country <- c("BOLIVIA", "bolivia", "Brasil", "Brazil")
  x <- data.frame(
    database_id, scientificName, decimalLatitude,
    decimalLongitude, country
  )
  # Get country codes
  x <- bdc::bdc_country_standardized(data = x, country = "country")
  jbd_coordinates_transposed(
    data = x,
    idcol = "database_id",
    sci_names = "scientificName",
    lat = "decimalLatitude",
    lon = "decimalLongitude",
    country = "country_suggested",
    countryCode = "countryCode",
    border_buffer = 0.2,
    save_outputs = FALSE,
    scale = "medium"
  )
} # END if require
Creates figures (i.e., bar plots, maps, and histograms) reporting the results of data quality tests implemented in the bdc and BeeBDC packages. Works like bdc::bdc_create_figures(), but it allows the user to specify a save path.
jbd_create_figures( data, path = OutPath_Figures, database_id = "database_id", workflow_step = NULL, save_figures = FALSE )
data |
A data frame or tibble. Needs to contain the results of data quality tests; that is, columns starting with ".". |
path |
A character directory. The path to a directory in which to save the figures. Default = OutPath_Figures. |
database_id |
A character string. The column name with a unique record identifier. Default = "database_id". |
workflow_step |
A character string. Name of the workflow step. Options available are "prefilter", "space", and "time". |
save_figures |
Logical. Indicates if the figures should be saved for further inspection or use. Default = FALSE. |
This function creates figures based on the results of data quality tests. A pre-defined list of test names is used for creating figures, depending on the name of the workflow step provided. Figures are saved in "Output/Figures" if save_figures = TRUE.
List containing figures showing the results of data quality tests implemented in one module of bdc/BeeBDC. When save_figures = TRUE, figures are also saved locally in a .png format.
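As a rough illustration of the kind of input these figures summarise, the sketch below counts how many records fail each test, using only base R. It assumes nothing about the internals of jbd_create_figures(); the object names `flag_cols` and `flag_summary` are made up for this example.

```r
# Small table of test results; flag columns start with "." and FALSE = flagged
x <- data.frame(
  database_id = c("GBIF_01", "GBIF_02", "GBIF_03", "FISH_04", "FISH_05"),
  .scientificName_emptys = c(TRUE, TRUE, TRUE, FALSE, FALSE),
  .coordinates_empty = c(TRUE, TRUE, TRUE, TRUE, TRUE),
  .invalid_basis_of_records = c(TRUE, FALSE, TRUE, FALSE, TRUE),
  .summary = c(TRUE, FALSE, TRUE, FALSE, FALSE)
)

# Find the test columns by their leading "."
flag_cols <- grep("^\\.", names(x), value = TRUE)

# Count the records flagged (FALSE) by each test
n_flagged <- vapply(x[flag_cols], function(col) sum(!col, na.rm = TRUE), integer(1))

flag_summary <- data.frame(
  test = flag_cols,
  n_flagged = unname(n_flagged),
  prop_flagged = unname(n_flagged) / nrow(x)
)
flag_summary
```

A table like this is what a bar plot of flagged records per test would be drawn from.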
database_id <- c("GBIF_01", "GBIF_02", "GBIF_03", "FISH_04", "FISH_05")
lat <- c(-19.93580, -13.01667, -22.34161, -6.75000, -15.15806)
lon <- c(-40.60030, -39.60000, -49.61017, -35.63330, -39.52861)
.scientificName_emptys <- c(TRUE, TRUE, TRUE, FALSE, FALSE)
.coordinates_empty <- c(TRUE, TRUE, TRUE, TRUE, TRUE)
.invalid_basis_of_records <- c(TRUE, FALSE, TRUE, FALSE, TRUE)
.summary <- c(TRUE, FALSE, TRUE, FALSE, FALSE)
x <- data.frame(
  database_id,
  lat,
  lon,
  .scientificName_emptys,
  .coordinates_empty,
  .invalid_basis_of_records,
  .summary
)
figures <- jbd_create_figures(
  data = x,
  database_id = "database_id",
  workflow_step = "prefilter",
  save_figures = FALSE
)
Because the jbd_coordinates_transposed() function is very RAM-intensive, this wrapper allows a user to specify chunk sizes and only analyse a small portion of the occurrence data at a time. The prefix jbd_ is used to highlight the difference between this function and the original bdc::bdc_coordinates_transposed().
This function will preferably use the countryCode column generated by bdc::bdc_country_standardized().
jbd_Ctrans_chunker( data = NULL, lat = "decimalLatitude", lon = "decimalLongitude", idcol = "databse_id", country = "country_suggested", countryCode = "countryCode", sci_names = "scientificName", border_buffer = 0.2, save_outputs = TRUE, stepSize = 1e+06, chunkStart = 1, progressiveSave = TRUE, path = tempdir(), append = TRUE, scale = "large", mc.cores = 1 )
data |
A data frame or tibble. Occurrence records as input. |
lat |
Character. The column with latitude in decimal degrees. Default = "decimalLatitude". |
lon |
Character. The column with longitude in decimal degrees. Default = "decimalLongitude". |
idcol |
Character. The column name with a unique record identifier. Default = "database_id". |
country |
Character. The name of the column containing country names. Default = "country_suggested". |
countryCode |
Character. Identifies the column containing ISO-2 country codes. Default = "countryCode". |
sci_names |
Character. The column containing scientific names. Default = "scientificName". |
border_buffer |
Numeric. The buffer, in decimal degrees, around points to help match them to countries. Default = 0.2 (~22 km at equator). |
save_outputs |
Logical. If TRUE, transposed occurrences will be saved to their own file. |
stepSize |
Numeric. The number of occurrences to process in each chunk. Default = 1000000. |
chunkStart |
Numeric. The chunk number to start from. This can be > 1 when you need to restart the function from a certain chunk; for example if R failed unexpectedly. |
progressiveSave |
Logical. If TRUE then the country output list will be saved between each iteration so that |
path |
Character. The path to a file in which to save the 01_coordinates_transposed_ output. |
append |
Logical. If TRUE, the function will look to append an existing file. |
scale |
Passed to rnaturalearth's ne_countries(). Scale of map to return, one of 110, 50, 10 or 'small', 'medium', 'large'. Default = "large". |
mc.cores |
Numeric. If > 1, the jbd_correct_coordinates function will run in parallel via mclapply using the number of cores specified. If = 1, it will run as a serial loop. NOTE: Windows machines must use a value of 1 (see ?parallel::mclapply). Additionally, be aware that each thread can use large chunks of memory. Default = 1. |
Returns the input data frame with a new column, coordinates_transposed, where FALSE indicates records whose coordinates were transposed.
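The chunking idea behind stepSize and chunkStart can be sketched as follows. This is a simplified stand-in, not the wrapper's actual code: `process_in_chunks()` is a hypothetical helper, and the function applied to each chunk is a placeholder for the RAM-intensive coordinate check.

```r
# Hypothetical helper: apply `fun` to successive blocks of `stepSize` rows,
# optionally resuming from a later chunk, and bind the results together.
process_in_chunks <- function(data, stepSize, chunkStart = 1, fun) {
  n_chunks <- ceiling(nrow(data) / stepSize)
  results <- list()
  for (i in seq_len(n_chunks)) {
    # Skip chunks that were finished in an earlier (interrupted) run
    if (i < chunkStart) next
    rows <- seq((i - 1) * stepSize + 1, min(i * stepSize, nrow(data)))
    results[[length(results) + 1]] <- fun(data[rows, , drop = FALSE])
  }
  do.call(rbind, results)
}

occ <- data.frame(database_id = 1:10)

# Placeholder for the expensive per-chunk check
flag_chunk <- function(chunk) {
  chunk$coordinates_transposed <- TRUE
  chunk
}

# Ten rows in chunks of four: chunks cover rows 1-4, 5-8, and 9-10
all_chunks <- process_in_chunks(occ, stepSize = 4, fun = flag_chunk)

# Resume from the second chunk, e.g. after R failed during chunk 2
resumed <- process_in_chunks(occ, stepSize = 4, chunkStart = 2, fun = flag_chunk)
```

Only one chunk is held by the expensive step at any time, which is why smaller stepSize values trade speed for lower peak memory.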
if(requireNamespace("rnaturalearthdata")){
  library(dplyr)
  # Import and prepare the data
  data(beesFlagged)
  beesFlagged <- beesFlagged %>%
    dplyr::select(!c(.val, .sea)) %>%
    # Cut down the dataset to run the example quicker
    dplyr::filter(dplyr::row_number() %in% 1:20)
  # Run the function
  beesFlagged_out <- jbd_Ctrans_chunker(
    # bdc_coordinates_transposed inputs
    data = beesFlagged,
    idcol = "database_id",
    lat = "decimalLatitude",
    lon = "decimalLongitude",
    country = "country_suggested",
    countryCode = "countryCode",
    # in decimal degrees (1 degree is ~111 km at the equator)
    border_buffer = 1,
    save_outputs = FALSE,
    sci_names = "scientificName",
    # chunker inputs
    # How many rows to process at a time
    stepSize = 1000000,
    # Chunk to start from
    chunkStart = 1,
    # Progressively save the output between each iteration?
    progressiveSave = FALSE,
    path = tempdir(),
    # If FALSE it may overwrite existing dataset
    append = FALSE,
    # Users should select scale = "large" as it is more thoroughly tested
    scale = "medium",
    mc.cores = 1
  )
  table(beesFlagged_out$coordinates_transposed, useNA = "always")
} # END if require
Uses expert-identified outliers with source spreadsheets that may be edited by users. The function will also use the duplicates file made using dupeSummary() to identify duplicates of the expert-identified outliers and flag those as well.
The function will add a flagging column called .expertOutlier where records that are FALSE are the expert outliers.
manualOutlierFindeR( data = NULL, DataPath = NULL, PaigeOutliersName = "removedBecauseDeterminedOutlier.csv", newOutliersName = "All_outliers_ANB.xlsx", ColombiaOutliers_all = "All_Colombian_OutlierIDs.csv", duplicates = NULL, NearTRUE = NULL, NearTRUE_threshold = 5 )
data |
A data frame or tibble. Occurrence records as input. |
DataPath |
A character path to the directory that contains the outlier spreadsheets. |
PaigeOutliersName |
A character path. Should lead to outlier spreadsheet from Paige Chesshire (csv file). |
newOutliersName |
A character path. Should lead to appropriate outlier spreadsheet (xlsx file). |
ColombiaOutliers_all |
A character path. Should lead to spreadsheet of bee outliers from Colombia (csv file). |
duplicates |
A data frame or tibble. The duplicate file produced by |
NearTRUE |
Optional. A character file name to the csv file. If you want to remove expert outliers that are too close to TRUE points, use the name of the NearTRUE.csv. Note: This implementation is only basic for now unless there is a greater need in the future. |
NearTRUE_threshold |
Numeric. The threshold (in km) for the distance to TRUE points to keep expert outliers. |
Returns the data with a new column, .expertOutlier, where records that are FALSE are the expert outliers.
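The flagging logic described above can be sketched in base R. This is only an illustration under assumptions: the duplicates table is taken to hold pairs of identifiers in two columns, here called `database_id` and `database_id_match` (assumed names for this example), and `expert_ids` stands in for the identifiers read from the outlier spreadsheets.

```r
# Toy occurrence data and a single expert-identified outlier
occ <- data.frame(database_id = paste0("id_", 1:5))
expert_ids <- "id_2"

# Assumed layout of a duplicates table: one row per duplicate pair
duplicates <- data.frame(
  database_id = "id_2",
  database_id_match = "id_4"
)

# Find duplicate pairs in which either member is an expert outlier
is_pair <- duplicates$database_id %in% expert_ids |
  duplicates$database_id_match %in% expert_ids
dupe_ids <- unique(c(
  duplicates$database_id[is_pair],
  duplicates$database_id_match[is_pair]
))

# FALSE marks expert outliers and their duplicates
occ$.expertOutlier <- !(occ$database_id %in% union(expert_ids, dupe_ids))
occ
```

Here id_2 is flagged directly and id_4 is flagged because it is recorded as a duplicate of id_2.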
## Not run:
# dplyr provides the %>% pipe used below
library(dplyr)
# Read example data
data(beesFlagged)
# Read in the most-recent duplicates file as well
if(!exists("duplicates")){
  duplicates <- fileFinder(path = DataPath,
                           fileName = "duplicateRun_") %>%
    readr::read_csv()}
# identify the outliers and get a list of their database_ids
beesFlagged_out <- manualOutlierFindeR(
  data = beesFlagged,
  DataPath = DataPath,
  PaigeOutliersName = "removedBecauseDeterminedOutlier.csv",
  newOutliersName = "^All_outliers_ANB_14March.xlsx",
  ColombiaOutliers_all = "All_Colombian_OutlierIDs.csv",
  duplicates = duplicates)
## End(Not run)
Replaces publicly available data with data that has been manually cleaned and error-corrected for use in the paper Chesshire, P. R., Fischer, E. E., Dowdy, N. J., Griswold, T., Hughes, A. C., Orr, M. J., . . . McCabe, L. M. (In Press). Completeness analysis for over 3000 United States bee species identifies persistent data gaps. Ecography.
PaigeIntegrater(db_standardized = NULL, PaigeNAm = NULL, columnStrings = NULL)
db_standardized |
A data frame or tibble. Occurrence records as input. |
PaigeNAm |
A data frame or tibble. The Paige Chesshire dataset. |
columnStrings |
A list of character vectors. Each vector is a set of columns that will be used to iteratively match the public dataset against the Paige dataset. |
Returns db_standardized (input occurrence records) with the Paige Chesshire data integrated.
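The role of columnStrings can be illustrated with a small base R sketch of iterative matching. This is not the PaigeIntegrater() implementation; `match_iteratively()` is a hypothetical helper, and the sketch ignores details such as preventing one curated record from being matched twice.

```r
# Hypothetical helper: for each public record, return the row of `curated`
# it matches, trying one set of columns at a time (strictest set first).
match_iteratively <- function(public, curated, columnStrings) {
  matched <- rep(NA_integer_, nrow(public))
  for (cols in columnStrings) {
    # Build one composite key per row from the current column set
    key_pub <- do.call(paste, c(unname(as.list(public[cols])), sep = "||"))
    key_cur <- do.call(paste, c(unname(as.list(curated[cols])), sep = "||"))
    # Rows with a missing value in any key column cannot be matched
    key_pub[!stats::complete.cases(public[cols])] <- NA_character_
    key_cur[!stats::complete.cases(curated[cols])] <- NA_character_
    hit <- match(key_pub, key_cur, incomparables = NA_character_)
    # Only fill records that are still unmatched from earlier iterations
    fill <- is.na(matched) & !is.na(hit)
    matched[fill] <- hit[fill]
  }
  matched
}

public <- data.frame(
  catalogNumber = c("A1", NA, "Z9"),
  recordedBy = c("Lee", "Smith", "Kim"),
  locality = c("x", "y", "q")
)
curated <- data.frame(
  catalogNumber = c("A1", "B2"),
  recordedBy = c("Lee", "Smith"),
  locality = c("x", "y")
)
columnList <- list(
  c("catalogNumber", "locality"), # Iteration 1
  c("recordedBy", "locality")     # Iteration 2
)
matched_rows <- match_iteratively(public, curated, columnList)
matched_rows
```

The second public record has no catalogNumber, so it is only picked up in the second, looser iteration; the third record matches nothing and stays NA.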
## Not run:
library(dplyr)
# set the DataPath to tempdir for this example
DataPath <- tempdir()
# Integrate Paige Chesshire's cleaned dataset.
PaigeNAm <- readr::read_csv(paste(DataPath, "Paige_data",
                                  "NorAmer_highQual_only_ALLfamilies.csv", sep = "/"),
                            col_types = ColTypeR()) %>%
  # Change the column name from Source to dataSource to match the rest of the data.
  dplyr::rename(dataSource = Source) %>%
  # add a NEW database_id column
  dplyr::mutate(
    database_id = paste0("Paige_data_", 1:nrow(.)),
    .before = scientificName)
# Set up the list of character vectors to iteratively check for matches with public data.
columnList <- list(
  c("decimalLatitude", "decimalLongitude",
    "recordNumber", "recordedBy", "individualCount", "samplingProtocol",
    "associatedTaxa", "sex", "catalogNumber", "institutionCode", "otherCatalogNumbers",
    "recordId", "occurrenceID", "collectionID"), # Iteration 1
  c("catalogNumber", "institutionCode", "otherCatalogNumbers",
    "recordId", "occurrenceID", "collectionID"), # Iteration 2
  c("decimalLatitude", "decimalLongitude",
    "recordedBy", "genus", "specificEpithet"), # Iteration 3
  c("id", "decimalLatitude", "decimalLongitude"), # Iteration 4
  c("recordedBy", "genus", "specificEpithet", "locality"), # Iteration 5
  c("recordedBy", "institutionCode", "genus",
    "specificEpithet", "locality"), # Iteration 6
  c("occurrenceID", "decimalLatitude", "decimalLongitude"), # Iteration 7
  c("catalogNumber", "decimalLatitude", "decimalLongitude"), # Iteration 8
  c("catalogNumber", "locality") # Iteration 9
)
# Merge Paige's data with downloaded data
db_standardized <- BeeBDC::PaigeIntegrater(
  db_standardized = db_standardized,
  PaigeNAm = PaigeNAm,
  columnStrings = columnList)
## End(Not run)
Creates a compound bar plot that shows the proportion of records that pass or fail each flag (rows) and for each data source (columns). The function can also optionally return a point map for a user-specified species when plotMap = TRUE. This function requires that your dataset has been run through some filtering functions, so that it can display logical columns starting with ".".
plotFlagSummary( data = NULL, flagColours = c("#127852", "#A7002D", "#BDBABB"), fileName = NULL, outPath = OutPath_Figures, width = 15, height = 9, units = "in", dpi = 300, bg = "white", device = "pdf", speciesName = NULL, saveFiltered = FALSE, filterColumn = ".summary", nameColumn = NULL, plotMap = FALSE, mapAlpha = 0.5, xbuffer = c(0, 0), ybuffer = c(0, 0), ptSize = 1, saveTable = FALSE, jitterValue = NULL, returnPlot = FALSE, ... )
data |
A data frame or tibble. Occurrence records as input. |
flagColours |
A character vector. Colours in order of pass (TRUE), fail (FALSE), and NA. Default = c("#127852", "#A7002D", "#BDBABB"). |
fileName |
Character. The name of the file to be saved, ending in ".pdf". If saving as a different file type, change file type suffix - See |
outPath |
A character path. The path to the directory in which the figure will be saved. Default = OutPath_Figures. |
width |
Numeric. The width of the output figure in user-defined units. Default = 15. |
height |
Numeric. The height of the output figure in user-defined units. Default = 9. |
units |
Character. The units for the figure width and height passed to |
dpi |
Numeric. Passed to |
bg |
Character. Passed to |
device |
Character. Passed to |
speciesName |
Optional. Character. A species name, as it occurs in the user-input nameColumn. If provided, the data will be filtered to this species for the plot. |
saveFiltered |
Optional. Logical. If TRUE, the filtered data will be saved to the computer as a .csv file. |
filterColumn |
Optional. The flag column to display on the map. Default = .summary. |
nameColumn |
Optional. Character. If speciesName is not NULL, enter the column to look for the species in. A user might realise that, combined with speciesName, figures can be made for a variety of factors. |
plotMap |
Logical. If TRUE, the function will produce a point map. Tested for use with one species at a time; i.e., when speciesName is not NULL. |
mapAlpha |
Optional. Numeric. The opacity for the points on the map. |
xbuffer |
Optional. Numeric vector. A buffer in degrees of the amount to increase the min and max bounds along the x-axis. This may require some experimentation, keeping in mind the negative and positive directionality of hemispheres. Default = c(0,0). |
ybuffer |
Optional. Numeric vector. A buffer in degrees of the amount to increase the min and max bounds along the y-axis. This may require some experimentation, keeping in mind the negative and positive directionality of hemispheres. Default = c(0,0). |
ptSize |
Optional. Numeric. The size of the points as passed to ggplot2. Default = 1. |
saveTable |
Optional. Logical. If TRUE, the function will save the data used to produce the compound bar plot. |
jitterValue |
Optional. Numeric. The value to jitter points by in the map in decimal degrees. |
returnPlot |
Logical. If TRUE, return the plot to the environment. Default = FALSE. |
... |
Optional. Extra variables to be fed into |
Exports a compound bar plot that summarises all flag columns. It can optionally also return a point map for a particular species in tandem with the summary plot.
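To make the structure of the compound bar plot concrete, the following base R sketch builds the kind of table it could be drawn from: counts of pass, fail, and missing values for each flag and each data source. It is an illustration only and does not reproduce the function's internals; the toy data and the name `plot_table` are assumptions for this example.

```r
occ <- data.frame(
  dataSource = c("GBIF_Apidae", "GBIF_Halictidae", "ALA_Apidae", "SCAN_Apidae"),
  .coordinates_empty = c(TRUE, FALSE, TRUE, NA),
  .summary = c(TRUE, FALSE, TRUE, FALSE)
)

# Simplify dataSource to the text before the first underscore
occ$source <- sub("_.*$", "", occ$dataSource)

flag_cols <- grep("^\\.", names(occ), value = TRUE)

# One row per source x status x flag, with the number of records in each cell
plot_table <- do.call(rbind, lapply(flag_cols, function(fc) {
  status <- ifelse(is.na(occ[[fc]]), "missing",
                   ifelse(occ[[fc]], "pass", "fail"))
  status <- factor(status, levels = c("pass", "fail", "missing"))
  counts <- as.data.frame(
    table(source = occ$source, status = status),
    stringsAsFactors = FALSE
  )
  counts$flag <- fc
  counts
}))
plot_table
```

Each flag becomes a row of panels and each source a column, with the three statuses mapped to the three flagColours.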
# import data
data(beesFlagged)
OutPath_Figures <- tempdir()
# Visualise all flags for each dataSource (simplified to the text before the first underscore)
plotFlagSummary(
  data = beesFlagged,
  # Colours in order of pass (TRUE), fail (FALSE), and NA
  flagColours = c("#127852", "#A7002D", "#BDBABB"),
  fileName = paste0("FlagsPlot_TEST_", Sys.Date(), ".pdf"),
  outPath = OutPath_Figures,
  width = 15, height = 9,
  # OPTIONAL:
  #   # Filter to species
  #   speciesName = "Holcopasites heliopsis",
  #   # column to look in
  #   nameColumn = "species",
  #   # Save the filtered data
  #   saveFiltered = TRUE,
  #   # Filter column to display on map
  #   filterColumn = ".summary",
  #   plotMap = TRUE,
  #   # amount to jitter points if desired, e.g. 0.25 or NULL
  #   jitterValue = NULL,
  #   # Map opacity value for points between 0 and 1
  #   mapAlpha = 1,
  # Extra variables can be fed into forcats::fct_recode() to change names on plot
  GBIF = "GBIF", SCAN = "SCAN", iDigBio = "iDigBio", USGS = "USGS", ALA = "ALA",
  ASP = "ASP", CAES = "CAES", 'B. Mont.' = "BMont", 'B. Minkley' = "BMin", Ecd = "Ecd",
  Gaiarsa = "Gai", EPEL = "EPEL"
)
Read in a variety of data files that are specific to certain smaller data providers. There is an internal readr function for each dataset and each one of these functions is called by readr_BeeBDC. While these functions are internal, they are displayed in the documentation of readr_BeeBDC for clarity.
readr_BeeBDC(dataset = NULL, path = NULL, inFile = NULL, outFile = NULL, dataLicense = NULL, sheet = NULL)
readr_EPEL(path = NULL, inFile = NULL, outFile = NULL, dataLicense = NULL)
readr_ASP(path = NULL, inFile = NULL, outFile = NULL, dataLicense = NULL)
readr_BMin(path = NULL, inFile = NULL, outFile = NULL, dataLicense = NULL)
readr_BMont(path = NULL, inFile = NULL, outFile = NULL, dataLicense = NULL)
readr_Ecd(path = NULL, inFile = NULL, outFile = NULL, dataLicense = NULL)
readr_Gai(path = NULL, inFile = NULL, outFile = NULL, dataLicense = NULL)
readr_CAES(path = NULL, inFile = NULL, outFile = NULL, dataLicense = NULL, sheet = "Sheet1")
readr_KP(path = NULL, inFile = NULL, outFile = NULL, dataLicense = NULL)
readr_EcoS(path = NULL, inFile = NULL, outFile = NULL, dataLicense = NULL)
readr_GeoL(path = NULL, inFile = NULL, outFile = NULL, dataLicense = NULL)
readr_EaCO(path = NULL, inFile = NULL, outFile = NULL, dataLicense = NULL)
readr_MABC(path = NULL, inFile = NULL, outFile = NULL, dataLicense = NULL, sheet = "Hoja1")
readr_Col(path = NULL, inFile = NULL, outFile = NULL, dataLicense = NULL, sheet = sheet)
readr_FSCA(path = NULL, inFile = NULL, outFile = NULL, dataLicense = NULL)
readr_SMC(path = NULL, inFile = NULL, outFile = NULL, dataLicense = NULL)
readr_Bal(path = NULL, inFile = NULL, outFile = NULL, dataLicense = NULL, sheet = "animal_data")
readr_Lic(path = NULL, inFile = NULL, outFile = NULL, dataLicense = NULL)
readr_Arm(path = NULL, inFile = NULL, outFile = NULL, dataLicense = NULL, sheet = "Sheet1")
readr_Dor(path = NULL, inFile = NULL, outFile = NULL, dataLicense = NULL)
readr_MEPB(path = NULL, inFile = NULL, outFile = NULL, dataLicense = NULL, sheet = NULL)
readr_BBD(path = NULL, inFile = NULL, outFile = NULL, dataLicense = NULL)
readr_MPUJ(path = NULL, inFile = NULL, outFile = NULL, dataLicense = NULL, sheet = sheet)
readr_STRI(path = NULL, inFile = NULL, outFile = NULL, dataLicense = NULL)
readr_PALA(path = NULL, inFile = NULL, outFile = NULL, dataLicense = NULL)
readr_JoLa(path = NULL, inFile = NULL, outFile = NULL, dataLicense = NULL, sheet = c("pre-1950", "post-1950"))
readr_VicWam(path = NULL, inFile = NULL, outFile = NULL, dataLicense = NULL, sheet = "Combined")
dataset |
Character. The name of the dataset to be read in. For example, readr_CAES can be called using "readr_CAES" or "CAES". This is not case sensitive. |
path |
A character path. The path to the directory containing the data. |
inFile |
Character or character path. The name of the file itself (can also be the remainder of a path including the file name). |
outFile |
Character or character path. The name of the Darwin Core format file to be saved. |
dataLicense |
Character. The license to accompany each record in the Darwin Core 'license' column. |
sheet |
A character string. For those datasets read from an .xlsx format, provide the sheet name. NOTE: This will be ignored for .csv readr_ functions and required for .xlsx readr_ functions. |
This function wraps several internal readr functions. Users may call readr_BeeBDC and select the dataset name to import a certain dataset. These datasets include:
Excel (.xlsx) formatted datasets: CAES, MABC, Col, Bal, MEPB, MPUJ, Arm, JoLa, and VicWam.
CSV (.csv) formatted datasets: EPEL, ASP, BMin, BMont, Ecd, Gai, KP, EcoS, GeoL, EaCO, FSCA, SMC, Lic, Dor, BBD, STRI, and PALA.
See Dorey et al. 2023 BeeBDC... for further details.
A data frame that is in Darwin Core format.
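The way a dataset name might be resolved to one of the internal readers can be sketched as below. This is an assumption-laden illustration, not the package's dispatch code: `resolve_reader()` is a hypothetical helper that accepts either form of the name, ignores case, and returns the name of the matching readr_ function.

```r
# Hypothetical helper: map a user-supplied dataset name to a readr_ function name
resolve_reader <- function(dataset) {
  # Dataset suffixes taken from the usage section of this help page
  known <- c("EPEL", "ASP", "BMin", "BMont", "Ecd", "Gai", "CAES", "KP",
             "EcoS", "GeoL", "EaCO", "MABC", "Col", "FSCA", "SMC", "Bal",
             "Lic", "Arm", "Dor", "MEPB", "BBD", "MPUJ", "STRI", "PALA",
             "JoLa", "VicWam")
  # Drop an optional "readr_" prefix, then compare without regard to case
  short <- sub("^readr_", "", dataset, ignore.case = TRUE)
  hit <- known[match(tolower(short), tolower(known))]
  if (is.na(hit)) {
    stop("Unknown dataset: ", dataset)
  }
  paste0("readr_", hit)
}

resolve_reader("CAES")
resolve_reader("readr_caes")
resolve_reader("vicwam")
```

A resolved name like this could then be used to look up and call the matching reader; an unrecognised name stops with an error instead of guessing.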
readr_EPEL(): Reads specific data files into Darwin Core format
readr_ASP(): Reads specific data files into Darwin Core format
readr_BMin(): Reads specific data files into Darwin Core format
readr_BMont(): Reads specific data files into Darwin Core format
readr_Ecd(): Reads specific data files into Darwin Core format
readr_Gai(): Reads specific data files into Darwin Core format
readr_CAES(): Reads specific data files into Darwin Core format
readr_KP(): Reads specific data files into Darwin Core format
readr_EcoS(): Reads specific data files into Darwin Core format
readr_GeoL(): Reads specific data files into Darwin Core format
readr_EaCO(): Reads specific data files into Darwin Core format
readr_MABC(): Reads specific data files into Darwin Core format
readr_Col(): Reads specific data files into Darwin Core format
readr_FSCA(): Reads specific data files into Darwin Core format
readr_SMC(): Reads specific data files into Darwin Core format
readr_Bal(): Reads specific data files into Darwin Core format
readr_Lic(): Reads specific data files into Darwin Core format
readr_Arm(): Reads specific data files into Darwin Core format
readr_Dor(): Reads specific data files into Darwin Core format
readr_MEPB(): Reads specific data files into Darwin Core format
readr_BBD(): Reads specific data files into Darwin Core format
readr_MPUJ(): Reads specific data files into Darwin Core format
readr_STRI(): Reads specific data files into Darwin Core format
readr_PALA(): Reads specific data files into Darwin Core format
readr_JoLa(): Reads specific data files into Darwin Core format
readr_VicWam(): Reads specific data files into Darwin Core format
## Not run:
# An example using a .xlsx file
Arm_Data <- readr_BeeBDC(
  dataset = "Arm",
  path = paste0(tempdir(), "/Additional_Datasets"),
  inFile = "/InputDatasets/Bee database Armando_Final.xlsx",
  outFile = "jbd_Arm_Data.csv",
  sheet = "Sheet1",
  dataLicense = "https://creativecommons.org/licenses/by-nc-sa/4.0/")
# An example using a .csv file
EPEL_Data <- readr_BeeBDC(
  dataset = "readr_EPEL",
  path = paste0(tempdir(), "/Additional_Datasets"),
  inFile = "/InputDatasets/bee_data_canada.csv",
  outFile = "jbd_EPEL_data.csv",
  dataLicense = "https://creativecommons.org/licenses/by-nc-sa/4.0/")
## End(Not run)
Find GBIF, ALA, iDigBio, and SCAN files in a directory
repoFinder(path)
path |
A directory as character. The path within which to recursively look for GBIF, ALA, iDigBio, and SCAN files. |
Returns a list of directories, one for each of the above data downloads.
## Not run:
# Where DataPath is made by [BeeBDC::dirMaker()]
BeeBDC::repoFinder(path = DataPath)
## End(Not run)
Locates data from GBIF, ALA, iDigBio, and SCAN within a directory and reads them in along with their EML metadata. Please keep the original download folder names and architecture unchanged. NOTE: This function uses family-level data to identify taxon downloads. If this, or something new, becomes an issue, please contact James Dorey (the developer), as there are likely to be exceptions to how files are downloaded. Current as of version 1.0.4.
repoMerge(path, save_type, occ_paths)
path |
A directory as a character. The directory to recursively look in for the above data. |
save_type |
Character. The data type to save the resulting file as. Options are: "csv_files" or "R_file". |
occ_paths |
A list of directories. Preferably produced using |
A list with a data frame of merged occurrence records, "Data_WebDL", and a list of eml files contained in "eml_files". Also saves these files in the requested format.
## Not run:
DataImp <- repoMerge(path = DataPath,
  # Find data - Many problems can be solved by running [BeeBDC::repoFinder(path = DataPath)]
  # And looking for problems
  occ_paths = BeeBDC::repoFinder(path = DataPath),
  save_type = "R_file")
## End(Not run)
Using all flag columns (column names starting with "."), this function either creates or updates the .summary flag column which is FALSE when ANY of the flag columns are FALSE. Columns can be excluded and removed after creating the .summary column. Additionally, the occurrence dataset can be filtered to only those where .summary = TRUE at the end of the function.
summaryFun( data = NULL, dontFilterThese = NULL, onlyFilterThese = NULL, removeFilterColumns = FALSE, filterClean = FALSE )
data |
A data frame or tibble. Occurrence records to use as input. |
dontFilterThese |
A character vector of flag columns to be ignored in the creation or updating of the .summary column. Cannot be specified with onlyFilterThese. |
onlyFilterThese |
A character vector. The inverse of dontFilterThese, where columns identified here will be filtered and no others. Cannot be specified with dontFilterThese. |
removeFilterColumns |
Logical. If TRUE all columns starting with "." will be removed in the output data. This only makes sense to use when filterClean = TRUE. Default = FALSE. |
filterClean |
Logical. If TRUE, the data will be filtered to only those occurrences where .summary = TRUE (i.e., completely clean according to the used flag columns). Default = FALSE. |
Returns a data frame or tibble of the input data but modified based on the above parameters.
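The flag-combining rule described above can be illustrated with a minimal base-R sketch. This is not the package's implementation: `sketchSummary()` is a hypothetical helper, the toy flag columns are invented, and treating NA flags as passing is an assumption made only for this illustration.

```r
# Toy occurrence table; the flag columns (names starting with ".") are made up
# for illustration and are not real BeeBDC output.
occ <- data.frame(
  scientificName = c("Apis mellifera", "Bombus terrestris", "Osmia bicornis"),
  .coordinates_empty = c(TRUE, TRUE, FALSE),
  .invalidName = c(TRUE, FALSE, TRUE),
  .unLicensed = c(FALSE, TRUE, TRUE),
  check.names = FALSE
)

# Hypothetical helper (not part of BeeBDC): rebuild .summary from flag columns
sketchSummary <- function(data, dontFilterThese = NULL, filterClean = FALSE) {
  # Flag columns are those whose names start with "."
  flagCols <- grep("^\\.", names(data), value = TRUE)
  # Leave out .summary itself and any flags the user wants ignored
  flagCols <- setdiff(flagCols, c(".summary", dontFilterThese))
  flags <- as.matrix(data[, flagCols, drop = FALSE])
  # .summary is FALSE when ANY included flag is FALSE
  # (assumption: NA flags are not counted as failures)
  data$.summary <- unname(rowSums(!flags, na.rm = TRUE) == 0)
  # Optionally keep only the records that passed every included flag
  if (filterClean) data <- data[data$.summary, , drop = FALSE]
  data
}

all_flags <- sketchSummary(occ)
skip_licence <- sketchSummary(occ, dontFilterThese = ".unLicensed")
clean_only <- sketchSummary(occ, dontFilterThese = ".unLicensed", filterClean = TRUE)
```

With every flag included, all three toy records fail at least one check; ignoring `.unLicensed` lets the first record pass, and `filterClean = TRUE` then keeps only that record.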
# Read in example data
data(beesFlagged)
# To only update the .summary column
beesFlagged_out <- summaryFun(
  data = beesFlagged,
  dontFilterThese = c(".gridSummary", ".lonFlag", ".latFlag", ".uncer_terms", ".unLicensed"),
  removeFilterColumns = FALSE,
  filterClean = FALSE)
# View output
table(beesFlagged_out$.summary, useNA = "always")
# Now filter to only the clean data and remove the flag columns
beesFlagged_out <- summaryFun(
  data = BeeBDC::beesFlagged,
  dontFilterThese = c(".gridSummary", ".lonFlag", ".latFlag", ".uncer_terms", ".unLicensed"),
  removeFilterColumns = TRUE,
  filterClean = TRUE)
# View output
table(beesFlagged_out$.summary, useNA = "always")
Builds an output figure that shows the number of species and the number of occurrences per country. Breaks the data into classes for visualisation. Users may filter data to their taxa of interest to produce figures of interest.
summaryMaps( data = NULL, class_n = 15, class_Style = "fisher", outPath = NULL, fileName = NULL, width = 5, height = 10, dpi = 300, returnPlot = FALSE, scale = 110, pointBuffer = 0.01 )
data |
A data frame or tibble. Occurrence records as input. |
class_n |
Numeric. The number of categories to break the data into. |
class_Style |
Character. The class style passed to |
outPath |
A character vector. The path to the save location for the output figure. |
fileName |
A character vector with file name for the output figure, ending with '.pdf'. |
width |
Numeric. The width, in inches, of the resulting figure. Default = 5. |
height |
Numeric. The height, in inches, of the resulting figure. Default = 10. |
dpi |
Numeric. The resolution of the resulting plot. Default = 300. |
returnPlot |
Logical. If TRUE, return the plot to the environment. Default = FALSE. |
scale |
Numeric or character. Passed to rnaturalearth's ne_countries(). Scale of map to return, one of 110, 50, 10 or 'small', 'medium', 'large'. Default = 110. |
pointBuffer |
Numeric. Amount to buffer points, in decimal degrees. If the point is outside of a country, but within this point buffer, it will count towards that country. It's a good idea to keep this value consistent with the prior flags applied. Default = 0.01. |
Saves a figure to the user-specified outPath and fileName, showing global maps of the number of bee species and the number of occurrences per country in the input dataset.
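The two quantities that the figure summarises, species per country and occurrences per country, can be sketched in base R. This is a simplified illustration only: it assumes the records already carry a Darwin Core `country` column, whereas the function itself works from country polygons and the pointBuffer described above. The toy records are invented.

```r
# Toy occurrence records (invented for illustration)
occ <- data.frame(
  country = c("Australia", "Australia", "Australia", "Fiji", "Fiji"),
  scientificName = c("Apis mellifera", "Apis mellifera", "Amegilla cingulata",
                     "Homalictus fijiensis", "Apis mellifera")
)

# Number of occurrence records per country
occurrences <- tapply(occ$scientificName, occ$country, length)
# Number of distinct species per country
species <- tapply(occ$scientificName, occ$country,
                  function(x) length(unique(x)))

# Combine into one per-country summary table
perCountry <- data.frame(
  country = names(occurrences),
  occurrences = as.integer(occurrences),
  species = as.integer(species)
)
perCountry
```

Values like these would then be broken into class_n classes for plotting; that classification step is not reproduced here.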
if(requireNamespace("rnaturalearthdata")){
  # Read in data
  data(beesFlagged)
  OutPath_Figures <- tempdir()
  # This simple example using the test data has very few classes due to the small amount of input
  # data.
  summaryMaps(
    data = beesFlagged,
    width = 10, height = 10,
    class_n = 4,
    class_Style = "fisher",
    outPath = OutPath_Figures,
    fileName = paste0("CountryMaps_fisher_TEST.pdf")
  )
} # END if require
Uses the taxadb R package to download a requested taxonomy and then transforms it into the input BeeBDC format. This means that any taxonomy in their databases can be used with BeeBDC. You can also save the output to your computer and to the R environment for immediate use. See details below for a list of providers or see taxadb::td_create().
taxadbToBeeBDC( name = NULL, rank = NULL, provider = "gbif", version = "22.12", collect = TRUE, ignore_case = TRUE, db = NULL, removeEmptyNames = TRUE, outPath = getwd(), fileName = NULL )
name |
Character. Taxonomic scientific name (e.g. "Aves").
As defined by |
rank |
Character. Taxonomic rank name. (e.g. "class").
As defined by |
provider |
Character. From which provider should the hierarchy be returned?
Default is 'gbif', which can also be configured using options(default_taxadb_provide = "...").
See |
version |
Character. Which version of the taxadb provider database should we use? Defaults
to latest. See tl_import for details. Default = 22.12.
As defined by |
collect |
Logical. Should we return an in-memory data.frame
(default, usually the most convenient), or a reference to lazy-eval table on disk
(useful for very large tables on which we may first perform subsequent filtering operations).
Default = TRUE.
As defined by |
ignore_case |
Logical. Should we ignore case (capitalization) in matching names?
Can be significantly slower to run. Default = TRUE.
As defined by |
db |
A connection to the taxadb database. See details of |
removeEmptyNames |
Logical. If TRUE (default), it will remove entries without an entry for specificEpithet. |
outPath |
Character. The path to a directory (folder) in which the output should be saved. |
fileName |
Character. The name of the output file, ending in '.csv'. |
Returns a taxonomy file (to the R environment and to the disk, if a fileName is provided) as a tibble that can be used with BeeBDC::harmoniseR().
beesTaxonomy() for the bee taxonomy and harmoniseR() for the taxon-cleaning function where these taxonomies are implemented.
## Not run:
# Run the function using the bee genus Apis as an example...
ApisTaxonomy <- BeeBDC::taxadbToBeeBDC(
  name = "Apis",
  rank = "Genus",
  provider = "gbif",
  version = "22.12",
  removeEmptyNames = TRUE,
  outPath = getwd(),
  fileName = NULL
)
## End(Not run)
A small test checklist file for package tests. This dataset was built by filtering the checklist data from the three test datasets, beesFlagged, beesRaw, bees3sp.
data("testChecklist", package = "BeeBDC")
An object of class "tibble"
The valid scientificName as it should occur in the scientificName column.
The full country name as it occurs on Discover Life.
Country name from rnaturalearth's name_long and type = "map_units".
A short version of the country name.
The continent where that country is found.
The ISO country name as it occurs on Discover Life.
Alpha-2 from rnaturalearth.
iso_a3_eh from rnaturalearth.
Official country name = "yes" or only a Discover Life name = "no".
A text string denoting the source or author of the name-country pair.
Quality of the name's match to the Discover Life checklist.
The valid species name without scientificNameAuthority.
The validName without the scientificNameAuthority but with Discover Life flags.
Bee family.
Bee subfamily.
Bee genus.
Bee subgenus.
Bee infraSpecificEpithet.
Bee specificEpithet.
Bee scientificNameAuthorship.
Rank of the taxon name.
Discover Life country name notes.
This dataset is a subset of the beesChecklist file described in: Dorey, J.B., Fischer, E.E., Chesshire, P.R., Nava-Bolaños, A., O’Reilly, R.L., Bossert, S., Collins, S.M., Lichtenberg, E.M., Tucker, E., Smith-Pardo, A., Falcon-Brindis, A., Guevara, D.A., Ribeiro, B.R., de Pedro, D., Hung, J.K.-L., Parys, K.A., McCabe, L.M., Rogan, M.S., Minckley, R.L., Velazco, S.J.E., Griswold, T., Zarrillo, T.A., Jetz, W., Sica, Y.V., Orr, M.C., Guzman, L.M., Ascher, J., Hughes, A.C. & Cobb, N.S. (2023) A globally synthesised and flagged bee occurrence dataset and cleaning workflow. Scientific Data, 10, 1–17. https://www.doi.org/10.1038/S41597-023-02626-W
testChecklist <- BeeBDC::testChecklist
head(testChecklist)
A small test taxonomy file for package tests. This dataset was built by filtering the taxonomy data from the three test datasets, beesFlagged, beesRaw, bees3sp.
data("testTaxonomy", package = "BeeBDC")
An object of class "tibble"
Taxonomic status. Values are "accepted" or "synonym".
Source of the name.
The id of the accepted taxon name or "0" if taxonomic_status == "accepted".
The id number for the taxon name.
The biological kingdom the taxon belongs to. For bees, kingdom == Animalia.
The biological phylum the taxon belongs to. For bees, phylum == Arthropoda.
The biological class the taxon belongs to. For bees, class == Insecta.
The biological order the taxon belongs to. For bees, order == Hymenoptera.
The family of bee which the species belongs to.
The subfamily of bee which the species belongs to.
The tribe of bee which the species belongs to.
The subtribe of bee which the species belongs to.
The valid scientific name as it should occur in the "scientificName" column in a Darwin Core file.
The scientificName without the scientificNameAuthority.
The scientificName without the scientificNameAuthority and with Discover Life taxonomy flags.
The genus the bee species belongs to.
The subgenus the bee species belongs to.
The specific epithet for the bee species.
The infraspecific epithet for the bee addressed.
The author who described the bee species.
Rank for the bee taxon addressed in the entry.
Additional notes about the name/taxon.
This dataset is a subset of the beesTaxonomy file described in: Dorey, J.B., Fischer, E.E., Chesshire, P.R., Nava-Bolaños, A., O’Reilly, R.L., Bossert, S., Collins, S.M., Lichtenberg, E.M., Tucker, E., Smith-Pardo, A., Falcon-Brindis, A., Guevara, D.A., Ribeiro, B.R., de Pedro, D., Hung, J.K.-L., Parys, K.A., McCabe, L.M., Rogan, M.S., Minckley, R.L., Velazco, S.J.E., Griswold, T., Zarrillo, T.A., Jetz, W., Sica, Y.V., Orr, M.C., Guzman, L.M., Ascher, J., Hughes, A.C. & Cobb, N.S. (2023) A globally synthesised and flagged bee occurrence dataset and cleaning workflow. Scientific Data, 10, 1–17. https://www.doi.org/10.1038/S41597-023-02626-W
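The relationship between a name's own id and the id of its accepted name, as described in the fields above, can be sketched with a toy table in base R. The column names used here (`id`, `accid`, `validName`) are assumptions made for illustration, since the field names are not shown in this section, and the three rows are invented.

```r
# Toy taxonomy table (invented rows; column names are assumptions)
taxonomy <- data.frame(
  id = c(1, 2, 3),
  # accid is 0 for accepted names, otherwise the id of the accepted name
  accid = c(0, 1, 0),
  taxonomic_status = c("accepted", "synonym", "accepted"),
  validName = c("Apis mellifera Linnaeus, 1758",
                "Apis mellifica Linnaeus, 1761",
                "Bombus terrestris (Linnaeus, 1758)")
)

# Accepted names point to themselves; synonyms point to their accepted id
acceptedId <- ifelse(taxonomy$accid == 0, taxonomy$id, taxonomy$accid)

# Look up the accepted name for every row by matching on id
taxonomy$acceptedName <- taxonomy$validName[match(acceptedId, taxonomy$id)]
taxonomy[, c("validName", "taxonomic_status", "acceptedName")]
```

In this sketch the synonym row resolves to the accepted name in row one, while accepted rows keep their own name.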
testTaxonomy <- BeeBDC::testTaxonomy
head(testTaxonomy)
The function finds, imports, formats, and creates metadata for the USGS dataset.
USGS_formatter(path, pubDate)
path |
A character path to a directory that contains the USGS data, which will be found using
|
pubDate |
Character. The publication date of the dataset to update the metadata and citation. |
Returns a list with the occurrence data, "USGS_data", and the EML data, "EML_attributes".
## Not run:
USGS_data <- USGS_formatter(path = DataPath, pubDate = "19-11-2022")
## End(Not run)