Spatial subsampling tutorial


  • Why to subsample spatially

  • Prepare Paleobiology Database records

  • Subsampling examples

    • Circular subsampling

    • PBDB collections, references, and duplicate entries

    • Nearest-neighbour subsampling

    • Latitudinal subsampling

  • Analysis on subsampled data

  • References

This document introduces rationale for spatial subsampling and covers common use cases for divvy functions, including example R code. Sections below discuss ways to format data, implement three different subsampling routines, and calculate common biodiversity metrics.

For a more comprehensive review of spatial subsampling theory, applications in paleobiology and ecology, and implementation in the divvy package, refer to Antell, Benson, and Saupe (accepted), ‘Spatial standardization of taxon occurrence data—a call to action’.

Why to subsample spatially

Biodiversity measures, whether quantified from fossil or extant data, reflect the outcome of three types of processes: 1. the biological processes the measurements aim to capture, 2. stochastic processes that generate random error, and 3. observation processes, such as the distribution of collecting effort intensity. Compared to neontological data, observation densities of palaeontological data reflect the additional contributions of taphonomy (decay and preservation processes) and sedimentation (burial and lithification processes). The term ‘sampling’ is used herein to refer to the full suite of influences on how many of the fossil taxa that lived in a place are recorded as fossil occurrences there. Given the large number of processes involved in sampling fossils, it is unsurprising that for many palaeo-occurrence datasets, the combined effect size of unexplained error and variable observation probabilities on biodiversity parameter estimates is greater than that of true biological processes from the past.

The relative strength of sampling structure compared to biological structure is a well-known issue in quantitative palaeobiology and therefore has received considerable attention. The most mainstream method of standardising sampling in biodiversity data (fossil or extant) is rarefaction, a genre of statistical method that resamples taxon occurrences to a given threshold of sampling completeness. ‘Classical’ rarefaction uses cumulative specimen count as a proxy of sampling completeness, while ‘coverage-based’ rarefaction estimates the cumulative coverage of a taxon frequency distribution curve as a proxy (Alroy 2010; Chao and Jost 2012). Examples of both types of rarefaction abound in recent palaeobiology literature. Rarefaction largely succeeds in producing diversity estimates corrected for sampling differences between sites, i.e. diversity estimates at discrete sites (alpha diversity) that can be compared fairly. However, challenges arise in comparing diversity between times or regions that contain multiple sites.

Taxonomic composition turns over from one geographic location to another. The amount of this turnover (beta diversity) generally increases with distance, a relationship known as the species–area effect. In practical terms, the greater the spatial dispersion or area of sampling, the more taxa will tend to be recovered. Sampling processes control both the dispersion and area of localities with fossil data, which makes it essential for palaeobiologists to account for spatial coverage of sampling when conducting biodiversity analyses, regardless of the use of rarefaction (Benson et al. 2021). Without standardising sampling in a spatial context, any estimates of biodiversity parameters will be an indiscernible mix of true historic values conflated with the amount of observation of those values.

divvy implements several spatial subsampling methods that control both the geographic dispersion and area of taxon occurrences, so analysts can make fairer estimates of biodiversity parameters between regions or through time. Each method is iterative, drawing many, comparable spatial replicates.

Prepare Paleobiology Database records

We begin by loading the divvy package itself, and note its geospatial package dependencies units, sf, terra, and vegan, and the iNEXT package for taxonomic richness rarefaction.


divvy includes the attached dataset bivalves, a download of fossil occurrences from the Paleobiology Database (PBDB) that has been subset to a few relevant columns and minimally cleaned. bivalves contains ca. 8,000 (palaeo)coordinates and identifications for ca. 500 marine bivalve genera from the Pliocene. For more information about the dataset, query ?bivalves.

The attached dataset is provided only as an example for working with PBDB-structured occurrence records; formal analyses would be wise to vet downloaded data more rigorously, including revising taxonomy, time bin assignments, and environmental classifications. The fossilbrush package provides a palette of tools to help clean PBDB data.

#>         genus paleolng paleolat collection_no reference_no
#> 1        Lima   176.85   -42.69         34849         9315
#> 2 Nemocardium   176.85   -42.69         34849        11706
#> 3      Ostrea   176.85   -42.69         34849        11706
#> 4     Chlamys   176.85   -42.69         34849         9315
#> 5  Pycnodonte  -114.37    33.76         34857         9338
#> 6      Euvola  -114.37    33.76         34857         9338
#>              environment max_ma min_ma          accepted_name
#> 1             basin reef  5.333    3.6                   Lima
#> 2             basin reef  5.333    3.6 Nemocardium (Pratulum)
#> 3             basin reef  5.333    3.6       Ostrea chilensis
#> 4             basin reef  5.333    3.6                Chlamys
#> 5 marginal marine indet.  5.333    3.6   Pycnodonte heermanni
#> 6 marginal marine indet.  5.333    3.6           Euvola keepi
#> [1] 8095
#> [1] 550

The latitude-longitude coordinates in the PBDB come from many different studies, which means the precision and accuracy vary across records. Fossils collected from the same locality may be reported with different coordinates due purely to different GPS equipment, mapping, or decimal rounding, for example. The spatial subsampling procedures below put great weight on the number of unique localities in a region, so it is important to avoid inflating the site number artificially.

One standard way to smooth over slight discrepancies in occurrence positions is to convert point coordinates (‘vector’ spatial data) to grid cells (‘raster’ spatial data)1. Effectively, one lays a web over a study region and records the position of polygons (‘grid cells’) in which points fall, rather than the original xy point coordinates. An additional advantage of raster over vector data is the tremendous increase in efficiency of spatial computations.

Palaeobiologists often convert abundance counts to binary presence-absence data for taxon occurrences in grid cells. This practice is especially common for PBDB data because abundance information is usually non-standard, if it is recorded at all. (Read below for further notes about duplicate taxon occurrences.)

Many PBDB analyses are also global in extent, which makes the choice of coordinate reference system for the raster grid important. The areas and shapes of grid cells at the poles vs. the equator can differ widely with certain reference systems or map projections. The code below converts latitude-longitude coordinates to the Equal Earth reference system, an equal-area projection for world maps. Another option of friendly raster grid system is the is the icosa package, developed by palaeobiologist Ádám Kocsis, which creates tessellations of pentagons and hexagons. The polygons are approximately equal in area and shape across the globe. The spatial resolution of the grid is controlled by arguments in the hexagrid function.

# initialise Equal Earth projected coordinates
rWorld <- rast()
prj <- 'EPSG:8857'
rPrj <- project(rWorld, prj, res = 200000) # 200,000m is approximately 2 degrees
values(rPrj) <- 1:ncell(rPrj)

# coordinate column names for the current and target coordinate reference system
xyCartes <- c('paleolng','paleolat')
xyCell   <- c('cellX','cellY')

# retrieve coordinates of raster cell centroids
llOccs <- vect(bivalves, geom = xyCartes, crs = 'epsg:4326')
prjOccs <- project(llOccs, prj)
bivalves$cell <- cells(rPrj, prjOccs)[,'cell']
bivalves[, xyCell] <- xyFromCell(rPrj, bivalves$cell)

We can inspect the spatial distribution of data by converting occurrence points to spatial features (a class that contains coordinate system information) and plotting them against a basic world map, supplied here from the rnaturalearth and rnaturalearthdata packages.

occUniq <- uniqify(bivalves, xyCell)
ptsUniq <- st_as_sf(occUniq, coords = xyCell, crs = prj)

library(ggplot2) # for plot visualisation
world <- ne_countries(scale = "medium", returnclass = "sf")
worldP <- ggplot(data = world) +
  theme_bw() +
  geom_sf() +
  geom_sf(data = ptsUniq, shape = 17, color = 'blue')


Subsampling examples

divvy offers three approaches to spatially subsample data:

  • cookies: Imposes a radial constraint on the spatial bounds of a subsample and standardises area by rarefying the number of localities

  • clustr: Aggregates sites that are nearest neighbours (connecting them with a minimum spanning tree) to impose a maximum diameter on the spatial bounds of a subsample, and optionally rarefies localities

  • bandit: Rarefies the number of localities within bands of equal latitude

Circular subsampling

First let’s apply circular subsampling, which both constrains the spatial bounds of a sample to a specified radius from a random start point and standardises the spatial area of a sample to a specified number of sites. (Recall that sites were allocated to equal-area polygons in the preceding section.) The radius (1500 km) and the numbers of sites (n = 12) and iterations (n = 500) are specified here to match the subsampling parameters in the global analysis of Antell et al. (2020).

circLocs <- cookies(dat = bivalves,  
                    xy = xyCell,
                    iter = 500, 
                    nSite = 12, 
                    r = 1500, # radial distance in km
                    weight = TRUE, # probabilistically aggregate subsampling sites
                    crs = prj, # Equal Earth projection
                    output = 'locs')
#> [1] 500
#>         cellX   cellY
#> 1019 -6543959 4739440
#> 56   -7343959 3539440
#> 7586 -7143959 3939440
#> 1278 -7343959 3939440
#> 1049 -6543959 4539440
#> 2921 -7143959 3539440
#> 1054 -6743959 4339440
#> 1076 -6743959 4539440
#> 122  -7743959 3939440
#> 1251 -7543959 3939440
#> 2065 -7543959 3739440
#> 911  -6543959 4339440

Subsamples are returned as elements in a list of length iter. If output = "locs" (default), each element is a data.frame of coordinates for the nSite sites included in a subsample. This output may be useful as an intermediate object on which to run custom functions that calculate ecological parameters of interest, for instance metrics of spatial connectedness among fossil sites. We can also use location output to explore where the subsample plots on our earlier map. This subsample happens to fall along the East and Gulf Coasts of North America.

# over-plot the subsample locations
smplPts <- st_as_sf(circLocs[[1]], coords = xyCell, crs = prj)
worldP +
  geom_sf(data = smplPts, shape = 17, color = 'red')

Now note that the first code chunk in this section set the weight argument of cookies() to TRUE. Weighting is designed to cluster subsample sites more compactly than random selection; that is, distant points will tend to be left out. Weighting is achieved by increasing the probability of drawing a site the closer it falls to the central occurrence point (‘seed cell’) in a given subsample. The seed cell is always included in weighted subsamples (but not necessarily in unweighted ones) and corresponds to the first row listed in the output. To visibilise the weighted subsample method further, let’s extract the seed location and manually plot the circular constraint around it.

cntr <- smplPts[1,]
# distances inferred to be meters based on lat-long coord system
r <- 1500
buf <- st_buffer(cntr, dist = r*1000)

# over-plot subsample boundary region and included sites
worldP +
  geom_sf(data = smplPts, shape = 17, color = 'red') +
  geom_sf(data = buf, fill = NA, linewidth = 1, color = 'red')

This inspection also reveals that there happen to be 14 sites in the region from which to draw a subsample of 12.

inBuf <- st_intersects(ptsUniq, buf, sparse = FALSE) 
#> [1] 14

As demonstrated above, the location-type output of divvy subsampling functions can be useful in certain cases; however, more often researchers will want to retrieve taxon records. Changing output from "locs" to "full", each element of the returned object now contains the subset of occurrence rows from bivalves located at the sites in a subsample. This output will be useful for analysis in the final vignette section below.

# same parameter values as above except for 'output'
circOccs <- cookies(dat = bivalves, 
                    xy = xyCell, 
                    iter = 500, 
                    nSite = 12, 
                    r = 1500, 
                    weight = TRUE, 
                    crs = prj,
                    output = 'full')
head( circOccs[[8]] )
#>           genus paleolng paleolat collection_no reference_no    environment
#> 83      Anadara   -69.82    10.25         41552        11132  marine indet.
#> 409  Iliochione   -79.18     0.54         51887        13814       offshore
#> 410 Undulostrea   -79.18     0.54         51887        13814       offshore
#> 411  Argopecten   -79.18     0.54         51887        13814       offshore
#> 412  Argopecten   -79.18     0.54         51881        13814 coastal indet.
#> 413        Arca   -79.18     0.54         51881        13814 coastal indet.
#>     max_ma min_ma                accepted_name cell    cellX     cellY
#> 83   5.333  3.600 Anadara (Larkinia) notabilis 6074 -6543959 1339439.8
#> 409  5.333  2.588      Iliochione callistoides 7101 -7543959  139439.8
#> 410  5.333  2.588          Undulostrea megodon 7101 -7543959  139439.8
#> 411  5.333  2.588       Argopecten ventricosus 7101 -7543959  139439.8
#> 412  5.333  2.588       Argopecten ventricosus 7101 -7543959  139439.8
#> 413  5.333  2.588                         Arca 7101 -7543959  139439.8

PBDB collections, references, and duplicate entries

Data fed into divvy subsampling functions can contain duplicate taxon–location records, which are common from the collections-based format of PBDB data entry. For instance, the subsample output printed above shows Argopecten twice at the same coordinates. The duplicates stem from different collections (#51881 and #51887). In the previous section we standardised spatial dispersion and area, but at this point we could rarefy the number of collections or references, too, as a standardisation for sampling effort.

The study that developed the circular buffer approach for regional subsampling (implemented with cookies()) avoided rarefying collections/references, as this step would be largely redundant with rarefying sites/raster grid cells (Antell et al. 2020). The number of PBDB reference counts for marine invertebrate occurrences correlates nearly perfectly with grid cell counts (Alroy et al. 2008). Applying rarefaction to both grid cells and collections or references would compress the distribution of observed values, which could reduce the statistical power of analysis and heighten the risk of overlooking a true biological signal (type 2 error).

In contrast, the study that developed the nearest-neighbour subsampling approach (described below) rarefied collections within references, following Alroy (2014), but avoided rarefying sites/cells (Close et al. 2017). The richness estimation procedure involved drawing PBDB references, drawing up to three collections within each of those references, and evaluating only the taxon occurrences within those subsampled collections. The divvy diversity summary function (sdSumry()) and the most recent study to use nearest-neighbour subsamples (Close et al. 2020) call on the iNEXT implementation of coverage-based richness estimation, which ignores the PBDB collections/references data structure. Therefore, users wishing to to rarefy collections and/or references should write custom richness estimation scripts or adapt Alroy’s Perl scripts.

To filter out duplicate taxon–location records from a datset, thereby reducing object size and saving memory, divvy offers the uniqify() function. Omitting duplicate occurrences from bivalves removes more than 5,000 rows.

bivUniq <- uniqify(bivalves, taxVar = 'genus', xyCell)
#> [1] 3061

Nearest-neighbour subsampling

As mentioned in the preceding section, papers to date that analysed spatial subsamples constructed with the nearest-neighbour method constrained only the spatial dispersion and not cumulative area or number of sites (Close et al. 2017, 2020). For backwards compatibility with these originator publications, the clustr() function contains an option to turn off site rarefaction (argument nSite = NULL). However, depending on study design it may well be prudent to standardise the area/number of sites when using the nearest-neighbour method, as is mandated in the circular subsampling method. To set the site quota for either method, use the nSite argument.

nnLocs <- clustr(dat = bivalves, 
                 xy = xyCell, 
                 iter = 500, 
                 distMax = 3000, # diameter = 2x the circular radius set above
                 nSite = 12,
                 crs = prj
#>         cellX      cellY
#> 81   -6343959 1339439.78
#> 1891 -7343959  -60560.22
#> 5884 -7343959  539439.78
#> 6477 -7743959 -460560.22
#> 1491 -5743959 1339439.78
#> 2415 -7943959 1139439.78
#> 409  -7543959  139439.78
#> 5916 -6543959 1539439.78
#> 1936 -6743959 2339439.78
#> 3887 -7743959 1339439.78
#> 74   -5943959 1339439.78
#> 5751 -7143959 1339439.78

If we skip site rarefaction and instead include all locations within a cluster of maximum diameter deterministically, a subsample built on a given starting location could include any number of sites above the minimum threshold (here, nMin = 3, the default). The first replicate in this example contains 17 sites.

nnAllSites <- clustr(dat = bivalves, 
                     xy = xyCell, 
                     iter = 500,
                     distMax = 3000, # diameter = 2x the circular radius set above
                     nMin = 3,
                     crs = prj
nrow( nnAllSites[[1]] )
#> [1] 17

Latitudinal subsampling

Many biological and environmental variables of interest vary characteristically with latitude. Hence, depending on research question it may be exigent to control for latitudinal differences between items of comparison, e.g. occurrence data from a time step with predominantly low-latitude fossil localities vs. a time step with more mid-latitude localities.

The bandit() function returns subsamples of a given site quota within latitudinal bands of a given bin resolution/width. Optionally, the function will ignore hemisphere differences and consider absolute latitude. The iter argument in bandit() specifies the number of subsamples to take within each band, rather than the total number globally. No subsamples are returned in bands containing fewer than nSite localities.

bandLocs <- bandit(dat = bivalves,
                   xy = xyCell,
                   iter = 100, nSite = 12, 
                   bin = 20, # interval width in degrees
                   crs = prj
                   # ,absLat = TRUE # optional
nrow(bandLocs[[1]]) # number of sites in one subsampled band
#> [1] 12
length(bandLocs) # number of subsamples, tallied across all bands
#> [1] 500
unique(names(bandLocs)) # latitudinal degrees of subsampled intervals
#> [1] "[-50,-30)" "[-10,10)"  "[10,30)"   "[30,50)"   "[50,70)"

Analysis on subsampled data

The sdSumry() function returns a summary of the spatial characteristics of a dataset/subsample: number of unique locations, centroid coordinates, latitudinal range (degrees), great circle distance (km), and summed minimum spanning tree length (km) for occurrences. sdSumry() also tallies taxa in a sample and performs coverage-based rarefaction (if quotaQ supplied) and classical rarefaction (if quotaN supplied). Rarefied estimates are returned along with their associated 95% confidence interval (estimated by iNEXT).

When coverage-based rarefaction is applied, the specified coverage level (quotaQ) should be appropriate for the estimated coverage of the original sample. Any requested quota greater than empirically available will require diversity extrapolation. sdSumry() returns the estimated sample coverage whenever coverage-based rarefaction is applied. Also note, homogeneous evenness is an underlying assumption for fair comparisons of coverage-based rarefaction diversity estimates. To help check this assumption, sdSumry() also returns Pielou’s J evenness metric whenever coverage-based rarefaction is applied. Evenness metrics are an area of ongoing theoretical debate and methodological development in ecology, and the use of Pielou’s J in divvy is a reflection of this metrics’ widespread use rather than a singular endorsement. If J estimates vary widely between time intervals or other categories of comparison, it is worth investigating evenness differences in more nuanced detail, for instance by calculating a variety of metrics.

Compare the summary data from the original dataset vs. a single spatial subsample. First consider the summary of spatial metrics. The geographic dispersion of bivalves occurrences is enormous—this in itself indicates spatial standardisation is necessary to derive meaningful biological metrics. When we compare diversity estimates from the global vs. spatially-subsampled dataset, richness estimates are much higher for the global dataset, regardless of rarefaction method. This result demonstrates the species–area effect described in the introduction: even when rarefaction is applied to the same quota or coverage level, diversity estimates are larger for datasets with greater spatial coverage, because rarefaction fails to control spatial turnover in diversity.

unsamp <- sdSumry(dat = bivalves, 
                  taxVar = 'genus',
                  collections = 'collection_no',
                  xy = xyCell,
                  quotaQ = 0.8, quotaN = 100,
                  omitDom = TRUE,
                  crs = prj)
#>   nColl nOcc nLoc centroidX centroidY latRange greatCircDist meanPairDist
#> 1   863 3061  157 -434404.9   1647720 137.4159      28917.12     11375.07
#>   minSpanTree     SCOR nTax  evenness coverage  SQSdiv SQSlow95 SQSupr95
#> 1    93349.91 20.64401  550 0.9061343   0.9414 323.015 310.6638 335.3662
#>      CRdiv  CRlow95  CRupr95
#> 1 470.9424 454.3334 487.5515
samp1 <- sdSumry(dat = circOccs[[1]], 
                 taxVar = 'genus',
                 collections = 'collection_no',
                 xy = xyCell,
                 quotaQ = 0.8, quotaN = 100, 
                 omitDom = TRUE,
                 crs = prj)
#>   nColl nOcc nLoc centroidX centroidY latRange greatCircDist meanPairDist
#> 1   134  442   12  -7027292   4139440 10.25901      1612.452     779.4877
#>   minSpanTree     SCOR nTax  evenness coverage  SQSdiv SQSlow95 SQSupr95
#> 1    2648.528 46.01033  164 0.9572183   0.8704 136.402 123.5963 149.2078
#>      CRdiv CRlow95  CRupr95
#> 1 219.8183 190.056 249.5806

Iterative spatial subsampling addresses this issue in that it can control for beta diversity in global occurrence datasets. Even though any individual subsample includes only a fraction of the total occurrences, all data can contribute to analyses through repeated subsampling. Ecological variables calculated on any given subsample are directly comparable to those from other subsamples on the same dataset or a comparison dataset (e.g. a different time step or habitat type). The code chunk below demonstrates how to calculate summary statistics over all subsamples of Pliocene bivalve occurrences. The median taxon count in a 1500km-radius subsample is 162, with an interquartile range of 126–183 taxa. This range can be interpreted loosely as primary regional variation in estimated palaeodiversity.

# warning - it's slow to rarefy diversity for hundreds of subsamples!
# this code chunk skips it for quick demonstration purposes
sampsMeta <- sdSumry(dat = circOccs, 
                     taxVar = 'genus',
                     collections = 'collection_no',
                     xy = xyCell,
                     crs = prj
quantile(sampsMeta$nTax, c(0.25, 0.5, 0.75))
#> 25% 50% 75% 
#> 126 162 183


Alroy, John. 2010. “The Shifting Balance of Diversity Among Major Marine Animal Groups.” Science 329 (5996): 1191–94.
———. 2014. “Accurate and Precise Estimates of Origination and Extinction Rates.” Paleobiology 40 (3): 374–97.
Alroy, John, Martin Aberhan, David J Bottjer, Michael Foote, Franz T Fürsich, Peter J Harries, Austin J W Hendy, et al. 2008. “Phanerozoic Trends in the Global Diversity of Marine Invertebrates.” Journal Article. Science 321 (5885): 97–100.
Antell, Gawain T, Roger BJ Benson, and Erin E Saupe. accepted. “Spatial Standardization of Taxon Occurrence Data—a Call to Action.” Journal Article. Paleobiology, accepted.
Antell, Gawain T, Wolfgang Kiessling, Martin Aberhan, and Erin E Saupe. 2020. “Marine Biodiversity and Geographic Distributions Are Independent on Large Scales.” Journal Article. Current Biology 30 (1): 115–21.
Benson, Roger BJ, Richard Butler, Roger A Close, Erin Saupe, and Daniel L Rabosky. 2021. “Biodiversity Across Space and Time in the Fossil Record.” Current Biology 31 (19): R1225–36.
Chao, Anne, and Lou Jost. 2012. “Coverage-Based Rarefaction and Extrapolation: Standardizing Samples by Completeness Rather Than Size.” Ecology 93 (12): 2533–47.
Close, Roger A, Roger B J Benson, Erin E Saupe, Michael E Clapham, and Richard J Butler. 2020. “The Spatial Structure of Phanerozoic Marine Animal Diversity.” Journal Article. Science 368 (6489): 420–24.
Close, Roger A, Roger B J Benson, Paul Upchurch, and Richard J Butler. 2017. “Controlling for the Species-Area Effect Supports Constrained Long-Term Mesozoic Terrestrial Vertebrate Diversification.” Journal Article. Nature Communications 8.

  1. If vector and raster data are new concepts, a well-curated resource for learning the essentials of working with spatial data in R is↩︎