| Title: | Data Thinning of Species Occurrences in Environmental Space |
|---|---|
| Description: | A suite of tools to mitigate sampling bias in species occurrence records by thinning data in the environmental space (E-space). This process can improve the accuracy and precision of species distribution models (SDM, also known as ecological niche models, ENM). The package offers a data-driven protocol to determine thinning parameters using kernel-density bandwidth selection. Two thinning methods are provided (stochastic and deterministic) to reduce over-sampled environmental conditions and down-weight outlier observations. The name 'bean' reflects the core principle of the method: each 'pod' (a grid cell in E-space) is allowed to contain only a limited number of 'beans' (occurrence points). See Silverman (1986, ISBN:978-0-412-24620-3) and Rousseeuw and Leroy (2003, ISBN:978-0-471-48855-2) for the underlying statistical methods. |
| Authors: | Paanwaris Paansri [cre, aut] (ORCID: <https://orcid.org/0000-0001-9992-098X>), Luis E. Escobar [aut, ctb] (ORCID: <https://orcid.org/0000-0001-5735-2750>) |
| Maintainer: | Paanwaris Paansri <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.2.0 |
| Built: | 2026-05-30 17:03:54 UTC |
| Source: | https://github.com/cran/bean |
Calculates an objective, data-driven grid resolution for environmental thinning. For each environmental variable, the function selects a bandwidth for a kernel-density estimate (KDE) of the marginal distribution. The chosen bandwidth defines the spatial scale below which two observations carry essentially redundant information, and is therefore a natural choice for the edge length of an environmental grid cell.
Three established bandwidth selectors are supported (see Details):
"sheather-jones" (default) — the Sheather–Jones direct
plug-in estimator (Sheather & Jones, 1991), the modern recommended
default for non-Gaussian data;
"silverman" — Silverman's rule of thumb (Silverman, 1986);
"scott" — Scott's rule (Scott, 1992).
find_env_resolution( data, env_vars, method = c("sheather-jones", "silverman", "scott") )find_env_resolution( data, env_vars, method = c("sheather-jones", "silverman", "scott") )
data |
A |
env_vars |
A character vector specifying the environmental variables to analyse. |
method |
The bandwidth selector. One of |
Why a bandwidth? A good environmental grid cell should be small enough to distinguish ecologically meaningful differences, but large enough to absorb sampling noise. A kernel density bandwidth chosen from the data answers exactly that question: it is the scale at which the empirical density of observations becomes smooth. Using it as the grid resolution yields one occurrence per cell on average when the sampling intensity is near the mode of the data.
Selectors.
Sheather–Jones (stats::bw.SJ with method = "dpi")
is a plug-in selector that is robust for non-Gaussian densities and
is the standard recommendation in the modern literature (Sheather &
Jones, 1991; Jones, Marron & Sheather, 1996). Recommended default.
Silverman (stats::bw.nrd0) is the rule-of-thumb
(Silverman,
1986). Fast and stable, but assumes near-Gaussian shape.
Scott (stats::bw.nrd) is the Gaussian-optimal rule
(Scott, 1992). Simpler than
Silverman but less robust to outliers.
If "sheather-jones" fails (this can happen with strongly tied data),
the function falls back to Silverman's rule for that variable and emits a
message().
An object of class bean_env_resolution (a list) with:
suggested_resolutionA named numeric vector of the suggested grid resolution for each variable, in the units of that variable.
bandwidthsThe bandwidths used to derive each resolution
(identical to suggested_resolution).
density_dataA long-format data.frame of the kernel
density estimates, used by plot.bean_env_resolution.
methodThe bandwidth selector that was used.
Sheather, S. J. & Jones, M. C. (1991). A reliable data-based bandwidth selection method for kernel density estimation. Journal of the Royal Statistical Society: Series B, 53(3), 683–690.
Silverman, B. W. (1986). Density Estimation for Statistics and Data Analysis. Chapman & Hall, London.
Scott, D. W. (1992). Multivariate Density Estimation: Theory, Practice, and Visualization. Wiley, New York.
Jones, M. C., Marron, J. S. & Sheather, S. J. (1996). A brief survey of bandwidth selection for density estimation. Journal of the American Statistical Association, 91(433), 401–407.
thin_env_nd, thin_env_center,
bw.SJ, bw.nrd0.
set.seed(1) df <- data.frame( bio1 = c(rnorm(200, 15, 2), rnorm(50, 25, 1)), bio12 = c(rnorm(200, 1200, 200), rnorm(50, 2500, 100)) ) res <- find_env_resolution(df, env_vars = c("bio1", "bio12")) res$suggested_resolution plot(res)set.seed(1) df <- data.frame( bio1 = c(rnorm(200, 15, 2), rnorm(50, 25, 1)), bio12 = c(rnorm(200, 1200, 200), rnorm(50, 2500, 100)) ) res <- find_env_resolution(df, env_vars = c("bio1", "bio12")) res$suggested_resolution plot(res)
Fits an ellipsoid that encompasses a chosen proportion of the
data points in an environmental space of two or more dimensions. The
centroid and covariance matrix can be estimated either by the classical
sample moments ("covmat") or by the robust Minimum Volume Ellipsoid
("mve"; Rousseeuw, 1985). Points are classified as inside or outside
the ellipsoid using a cutoff on their squared Mahalanobis
distance.
fit_ellipsoid(data, env_vars, method = "covmat", level = 0.95)fit_ellipsoid(data, env_vars, method = "covmat", level = 0.95)
data |
A |
env_vars |
A character vector of at least two column names in
|
method |
One of |
level |
A single number in |
Methods. "covmat" uses the sample mean and sample covariance
matrix. It is optimal under multivariate normality but sensitive to
outliers. "mve" (Rousseeuw, 1985) finds the smallest-volume ellipsoid
that contains a fraction of the data and is robust to a moderate proportion
of contaminating points.
Confidence level. Assuming approximate multivariate normality, the
boundary of the ellipsoid is the set of points whose squared Mahalanobis
distance equals qchisq(level, df = n_dim).
An object of class bean_ellipsoid (a list) with:
centroidNamed vector of variable means / centre.
covariance_matrixThe covariance matrix used.
niche_ellipseA data.frame of polygon vertices for
the 2-D ellipse. NULL when more than two variables are
supplied (the 3-D mesh is generated lazily on plot).
all_points_usedComplete-case input data.
points_in_ellipseSubset inside the ellipsoid.
points_outside_ellipseSubset outside the ellipsoid.
inside_indicesRow indices (in all_points_used)
classified as inside.
parametersList with level and method.
Rousseeuw, P. J. (1985). Multivariate estimation with high breakdown point. In Mathematical Statistics and Applications, Vol. B, 283–297.
Van Aelst, S. & Rousseeuw, P. (2009). Minimum volume ellipsoid. Wiley Interdisciplinary Reviews: Computational Statistics, 1(1), 71–82.
Cobos, M. E., Osorio-Olvera, L., Soberón, J., Peterson, A. T., Barve, V. & Barve, N. (2024). ellipsenm: ecological niches' characterizations using ellipsoids. https://github.com/marlonecobos/ellipsenm.
set.seed(81) env_data <- data.frame( BIO1 = c(rnorm(50, 10, 1), 30), BIO12 = c(rnorm(50, 20, 2), 50) ) fit <- fit_ellipsoid(env_data, env_vars = c("BIO1", "BIO12"), method = "covmat", level = 0.95) print(fit) plot(fit)set.seed(81) env_data <- data.frame( BIO1 = c(rnorm(50, 10, 1), 30), BIO12 = c(rnorm(50, 20, 2), 50) ) fit <- fit_ellipsoid(env_data, env_vars = c("BIO1", "BIO12"), method = "covmat", level = 0.95) print(fit) plot(fit)
Raw, unthinned occurrence records for Rusa unicolor (Sambar deer) in Thailand. Used to demonstrate the spatial clustering and environmental bias typical of SDM datasets.
occ_data_rawocc_data_raw
A data.frame with one row per occurrence and the columns:
Species name.
Longitude (decimal degrees).
Latitude (decimal degrees).
Example dataset shipped with the bean package.
Output of prepare_bean applied to occ_data_raw using
the bundled environmental rasters. Missing coordinates and records outside
the raster extent have been removed, and environmental values have been
extracted and standardised.
origin_dat_preparedorigin_dat_prepared
A data.frame with the columns:
Species name.
Coordinates.
Scaled annual mean temperature.
Scaled temperature seasonality.
Scaled annual precipitation.
Scaled precipitation seasonality.
A bean_ellipsoid fitted to origin_dat_prepared
representing the baseline environmental niche of the species.
origin_ellipseorigin_ellipse
A bean_ellipsoid object (see fit_ellipsoid).
This function creates a scatterplot matrix (pairs plot) to visualize the results of n-dimensional environmental thinning using base R graphics. It can accept thinned objects from either density-based thinning ('thin_env_nd') or deterministic centroid thinning ('thin_env_center').
plot_bean(original_data, thinned_object, env_vars)plot_bean(original_data, thinned_object, env_vars)
original_data |
A data.frame of the prepared, unthinned occurrence points. |
thinned_object |
The output object from 'thin_env_nd()' or 'thin_env_center()'. |
env_vars |
A character vector of the environmental variables to plot. |
Invisibly returns 'NULL'. Draws a plot to the active graphics device.
data(origin_dat_prepared, package = "bean") env_vars <- c("bio_1", "bio_12") thinned <- thin_env_nd( data = origin_dat_prepared, env_vars = env_vars, grid_resolution = c(0.5, 0.5), seed = 1 ) plot_bean(origin_dat_prepared, thinned, env_vars = env_vars)data(origin_dat_prepared, package = "bean") env_vars <- c("bio_1", "bio_12") thinned <- thin_env_nd( data = origin_dat_prepared, env_vars = env_vars, grid_resolution = c(0.5, 0.5), seed = 1 ) plot_bean(origin_dat_prepared, thinned, env_vars = env_vars)
Computes Mahalanobis distance and suitability values deriving from a fitted
bean_ellipsoid object for new environmental data.
## S3 method for class 'bean_ellipsoid' predict( object, newdata, include_suitability = TRUE, suitability_truncated = FALSE, include_mahalanobis = TRUE, mahalanobis_truncated = FALSE, keep_data = NULL, ... )## S3 method for class 'bean_ellipsoid' predict( object, newdata, include_suitability = TRUE, suitability_truncated = FALSE, include_mahalanobis = TRUE, mahalanobis_truncated = FALSE, keep_data = NULL, ... )
object |
An object of class |
newdata |
Environmental predictors. Can be a |
include_suitability |
(logical) If |
suitability_truncated |
(logical) If |
include_mahalanobis |
(logical) If |
mahalanobis_truncated |
(logical) If |
keep_data |
(logical) If |
... |
Additional arguments (unused). |
A data.frame or SpatRaster (matching the input type of newdata)
containing the requested prediction layers.
This function serves as a pre-processing step to clean and prepare species occurrence data. It performs three key actions: 1. Removes records with missing longitude or latitude values. 2. Extracts environmental data from raster layers that are already scaled for each occurrence point. 3. Removes records that fall outside the raster extent or have missing environmental data. The final output is a clean data frame where the environmental variables have a mean of 0 and a standard deviation of 1.
prepare_bean( data, env_rasters, longitude, latitude, transform = c("scale", "pca", "none") )prepare_bean( data, env_rasters, longitude, latitude, transform = c("scale", "pca", "none") )
data |
A data.frame of species occurrences records, including columns for longitude and latitude. |
env_rasters |
A SpatRaster (from |
longitude |
(character) The name of the longitude column in |
latitude |
(character) The name of the latitude column in |
transform |
(character) The transformation to apply to the environmental rasters before extracting data. Options are "scale" (default), "pca", or "none". See Details. |
### Environmental Variable Transformation
The transform argument allows for different pre-processing of the
environmental raster layers to address issues like differing units and
multicollinearity.
- "scale" (Default):** This is the standard approach to handle variables with different units (e.g., °C vs. mm). It transforms each raster layer to have a mean of 0 and a standard deviation of 1 (Baddeley et al., 2016). This process makes the variables equal variance. As a result, each variable contributes equally to the analysis, ensuring that the resulting resolutions are based on the relative distribution of data points within each environmental dimension, not their arbitrary original units (Beaugrand, 2024; Kléparski et al., 2021).
- "pca": This option performs a Principal Component Analysis (PCA) on the environmental rasters. This is a powerful technique for dealing with multicollinearity (highly correlated variables). It transforms the original rasters into a new set of uncorrelated layers (Principal Components) (Qiao et al., 2016). The function then extracts the PC scores for each occurrence point.
- "none": This option extracts the raw environmental values from the rasters without any transformation. This is suitable if your rasters are already scaled or if you have a specific reason to use the raw values.
A data.frame containing the cleaned and scaled occurrence data, with the following columns:
Original Columns |
All columns from the input |
Environmental Variables |
New columns, named after the layers in |
Baddeley, A., Rubak, E. and Turner, R. (2016). Spatial point patterns: methodology and applications with R. CRC press.
Beaugrand, G. (2024). An ecological niche model that considers local relationships among variables: The Environmental String Model. Ecosphere, 15(10), e70015.
Kléparski, L., Beaugrand, G. and Edwards, M. (2021). Plankton biogeography in the North Atlantic Ocean and its adjacent seas: Species assemblages and environmental signatures. Ecology and Evolution, 11(10), 5135-5149.
Qiao, H., Peterson, A. T., Campbell, L. P., Soberón, J., Ji, L. and Escobar, L. E. (2016). NicheA: creating virtual species and ecological niches in multivariate environmental scenarios. Ecography, 39(8), 805-813.
env_file <- system.file("extdata", "thai_env.tif", package = "bean") occ_file <- system.file("extdata", "Rusa_unicolor.csv", package = "bean") if (nzchar(env_file) && nzchar(occ_file) && requireNamespace("terra", quietly = TRUE)) { env <- terra::rast(env_file) occ <- read.csv(occ_file) prepared <- prepare_bean( data = occ, env_rasters = env, longitude = "x", latitude = "y", transform = "scale" ) head(prepared) }env_file <- system.file("extdata", "thai_env.tif", package = "bean") occ_file <- system.file("extdata", "Rusa_unicolor.csv", package = "bean") if (nzchar(env_file) && nzchar(occ_file) && requireNamespace("terra", quietly = TRUE)) { env <- terra::rast(env_file) occ <- read.csv(occ_file) prepared <- prepare_bean( data = occ, env_rasters = env, longitude = "x", latitude = "y", transform = "scale" ) head(prepared) }
This function thins species occurrence records by finding all occupied cells in a 2D environmental grid and returning a single new point at the exact center of each of those cells. This is a deterministic method.
thin_env_center(data, env_vars, grid_resolution)thin_env_center(data, env_vars, grid_resolution)
data |
A data.frame containing species occurrence coordinates and the environmental variables. |
env_vars |
A character vector specifying the column names in data that represent the environmental variables to be used in the analysis. |
grid_resolution |
A numeric vector of length one or two specifying the resolution(s) for the grid axes. If length one, it is used for both axes. |
An object of class bean_thinned_center. which is a list containing:
thinned_points |
A data.frame with two columns representing the new points at the center of each occupied environmental grid cell. |
n_original |
An integer representing the number of complete occurrence records in the input data. |
n_thinned |
An integer representing the number of unique grid cells that were occupied, which is also the number of points returned. |
parameters |
A list of the key parameters used, such as whether scaling was applied. |
thin_env_nd, find_env_resolution
env_data <- data.frame( BIO1 = c(0.1, 0.2, 1.1, 1.2, 1.3), BIO12 = c(0.1, 0.2, 2.1, 2.2, 2.3) ) thin_env_center(env_data, env_vars = c("BIO1", "BIO12"), grid_resolution = c(0.1, 0.2))env_data <- data.frame( BIO1 = c(0.1, 0.2, 1.1, 1.2, 1.3), BIO12 = c(0.1, 0.2, 2.1, 2.2, 2.3) ) thin_env_center(env_data, env_vars = c("BIO1", "BIO12"), grid_resolution = c(0.1, 0.2))
This function thins species occurrence records in an n-dimensional environmental space by randomly sampling exactly one point from each occupied n-dimensional grid cell (hypercube).
thin_env_nd(data, env_vars, grid_resolution, seed = NULL)thin_env_nd(data, env_vars, grid_resolution, seed = NULL)
data |
A data.frame containing species occurrences and pre-scaled environmental variables, typically the output of 'prepare_bean()'. |
env_vars |
A character vector of two or more column names representing the environmental variables (dimensions) to use for thinning. |
grid_resolution |
A numeric vector of resolutions for each environmental axis. Its length must match the length of 'env_vars'. |
seed |
(numeric) An optional random seed for reproducibility. If provided, the random number generator state is safely isolated to this function call and will not affect the global environment. Default = NULL. |
An object of class 'bean_thinned', which is a list containing:
thinned_data |
A data.frame containing the occurrence records that were retained after the thinning process. |
n_original |
An integer representing the number of complete occurrence records in the input data before thinning. |
n_thinned |
An integer representing the number of occurrence records remaining after thinning. |
parameters |
A list of the key parameters used during the thinning process. |
data(origin_dat_prepared, package = "bean") thinned <- thin_env_nd( data = origin_dat_prepared, env_vars = c("bio_1", "bio_12"), grid_resolution = c(0.5, 0.5), seed = 123 ) print(thinned)data(origin_dat_prepared, package = "bean") thinned <- thin_env_nd( data = origin_dat_prepared, env_vars = c("bio_1", "bio_12"), grid_resolution = c(0.5, 0.5), seed = 123 ) print(thinned)
Result of thin_env_center applied to
origin_dat_prepared. Contains one calculated centroid per
occupied environmental grid cell.
thinned_deterministicthinned_deterministic
A bean_thinned_center object (see
thin_env_center).
Result of thin_env_nd applied to
origin_dat_prepared. Contains one randomly chosen occurrence
per occupied environmental grid cell.
thinned_stochasticthinned_stochastic
A bean_thinned object (see thin_env_nd).