--- title: "dynamicSDM: Response variable data" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{dynamicSDM: Response variable data} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ## Introduction *dynamicSDM* offers novel tools for developing species geographical distribution and abundance modelling (SDM) at high spatiotemporal resolution. Across the dynamicSDM vignettes, we demonstrate package functions on a case study species and guide users through the four key species distribution modelling stages: + Stage 1: Organising response variable data including species occurrence and pseudo-absence records + Stage 2: Extracting spatio-temporally buffered explanatory variables + Stage 3: Fitting SDMs whilst accounting for spatial and temporal autocorrelation + Stage 4: Projecting intra- and inter-annual geographical distributions and abundances at high spatiotemporal resolution. ## Installation Before starting this tutorial, the *dynamicSDM* package needs to be installed. You can install the latest version from [GitHub](https://github.com/r-a-dobson/dynamicSDM) with: ```{r install package GitHub} # devtools::install_github('r-a-dobson/dynamicSDM') ``` or from [CRAN](https://CRAN.R-project.org/package=dynamicSDM) with ```{r install package CRAN} #install.packages("dynamicSDM") library(dynamicSDM) library(terra) ``` Due to the utilisation of Google Earth Engine and Google Drive across *dynamicSDM* functions, you must also set up a free Google account at [Google]( https://www.google.com/account/about/). Once you have your Google log-in details, these can be used to authorise Google Earth Engine in the `rgee` package and Google Drive in the `googledrive` package. ```{r check Google, eval=FALSE } library(rgee) #rgee::ee_install() rgee::ee_check() library(googledrive) #googledrive::drive_auth_configure() googledrive::drive_user() ``` ## Directory organisation As we continue through the species distribution modelling framework, functions will generate a range of outputs. To keep organised, we recommend a structured directory system. The code below will create a new project directory in your temporary folders, but you can edit the script to specify your home directory if you want to retain function outputs between R sessions. ```{r create directory} project_directory <- file.path(file.path(tempdir(), "dynamicSDM_vignette")) dir.create(project_directory) ``` ## Case study: the red-billed quelea In this four stage tutorial, we will be studying the case study species, the red-billed quelea (Quelea quelea), a nomadic bird inhabiting sub-Saharan Africa. With highly variable distributions driven by short-term weather and resource availability, *dynamicSDM* provides key functions to accurately model quelea’s distribution through space and time. ## Stage 1: Response variable data In this tutorial, we will be processing a sample of species occurrence records for the red-billed quelea (Quelea quelea) to generate an SDM response variable data frame. This will involve filtering records by spatio-temporal quality, extent and resolution; testing and accounting for spatio-temporal biases in records, and generating pseudo-absences through space and time. ### a) Import occurrence records A sample of red-billed quelea occurrence data can be loaded into your local R environment using the code below. ```{r import occurrence} data("sample_occ_data") ``` These records are formatted as exported from the Global Biodiversity Information Facility, with columns labelled “decimalLongitude” and “decimalLatitude”. For *dynamicSDM* functions, occurrence data must be in a specific format; containing numeric data in columns labelled “x” and “y” for the record co-ordinates longitude and latitude, and “day”, “month” and “year” for the record’s date. `convert_gbif()` can be used to convert between these formats, or this can be achieved manually for data not extracted from GBIF. ```{r example-convert_gbif} sample_occ <- convert_gbif(sample_occ_data) ``` ### b) Filter occurrence records #### Validity and completeness Species occurrence records may contain anomalous or missing values in the co-ordinates or dates given. `spatiotemp_check()` can be used to exclude records with missing (NA) values (argument `na.handle`), invalid co-ordinates (`coord.handle`), invalid dates (`date.handle`) and duplicate records (`duplicate.handle`). Here, we specify `date.res = "day"` to check all date columns from year to day. If you only want to check the year of record dates, then you could specify `date.res = "year"`. ```{r example-spatiotemp_check} sample_occ_filtered <- spatiotemp_check(occ.data = sample_occ, na.handle = "exclude", date.handle = "exclude", date.res = "day", coord.handle = "exclude", duplicate.handle = "exclude") nrow(sample_occ_filtered) ``` Additionally, users can opt to take advantage of the `CoordinateCleaner` package functions to exclude records flagged as containing potentially anomalous co-ordinates, based on a wider variety of spatial tests than *dynamicSDM* alone offers. ```{r example-spatiotemp_check with Coordinate Cleaner,eval=F} sample_occ_filtered <- spatiotemp_check(occ.data = sample_occ, na.handle = "exclude", date.handle = "exclude", date.res = "day", coord.handle = "exclude", duplicate.handle = "exclude", coordclean = TRUE, coordclean.species = "quelea", coordclean.handle = "exclude") ``` #### Spatiotemporal extent When generating dynamic species distribution models, it is important that occurrence records match the spatio-temporal extent of the study and any remote-sensing datasets to be utilised as explanatory variables. `spatiotemporal_extent()` removes any records outside of the set extents. In this case study, we want to model the subspecies Q. q. lathamii which is typically limited to southern Africa and we will incorporate remote sensing datasets available for the years 2001 to 2019. To filter records to these extents, we use a polygon of southern African countries, which can be read into your R environment and applied using the following code. ```{r example-spatiotemp_extent,fig.width=4, fig.height=4} data("sample_extent_data") sample_occ_cropped <- spatiotemp_extent(occ.data = sample_occ_filtered, temporal.ext = c("2001-01-01", "2018-12-01"), spatial.ext = terra::ext(sample_extent_data), prj = "+proj=longlat +datum=WGS84 +no_defs") ## Lets plot the change #terra::plot(sample_extent_data$geometry) terra::plot(terra::vect(sample_occ_filtered[, c("x", "y")], geom = c("x", "y"), crs = "+proj=longlat +datum=WGS84 +no_defs")) #terra::plot(sample_extent_data$geometry) terra::plot(terra::vect(sample_occ_cropped[, c("x", "y")], geom = c("x", "y"), crs = "+proj=longlat +datum=WGS84 +no_defs")) ``` #### Spatiotemporal resolution The spatial resolution of occurrence record co-ordinates and the temporal resolution of occurrence record dates are also important when generating dynamic SDMs. `spatiotemp_resolution()` can filter records by set thresholds. As quelea migrations are driven by highly local and short-term environmental factors, we will filter to exclude records not given to the specific day and co-ordinates to four decimal places. ```{r example-spatiotemp_resolution} sample_occ_cropped<-spatiotemp_resolution(occ.data = sample_occ_cropped, spatial.res = 4, temporal.res = "day") nrow(sample_occ_cropped) # Even more records have been removed! ``` ### c) Test for spatial and temporal sampling biases For many reasons, the collection of species occurrence records are biased through space and time. These biases can impact SDM performance by over- or under- representing conditions at a given site or time. `spatiotemp_bias()` returns the results of statistical tests and plots for visual assessment of such biases. The code below tests for bias across the entire dataset and species range. ```{r example-spatiotemp_bias simple} bias_results <- spatiotemp_bias(occ.data = sample_occ_cropped, temporal.level = c("year"), plot = FALSE, spatial.method = "simple", prj = "+proj=longlat +datum=WGS84") bias_results ``` Whereas, this code will test for bias across only the core of the species range. ```{r example-spatiotemp_bias core } bias_results <- spatiotemp_bias(occ.data = sample_occ_cropped, temporal.level = c("year"), plot = FALSE, spatial.method = "core", prj = "+proj=longlat +datum=WGS84") bias_results ``` Both approaches can generate important results. The latter may be a more reliable indicator if the species range is expanding or shifting in response to global change. This is because records at the range periphery that have been collected randomly may still not be evenly distributed through space and time due to underlying ecological processes. Therefore, testing for biases on the species range core may be more representative of true spatial and temporal sampling biases. ### d) Account for spatial and temporal sampling biases #### Occurrence record thinning One approach to correct spatio-temporal sampling biases are to “thin” records, which involves excluding records that have been collected too closely in space or time. `spatiotemp_thin()` allows users to specify the spatial and temporal distance to thin records by. An important argument of `spatiotemp_thin()` is `spatial.split.degrees` which specifies the size of grid cells to split occurrence records into before temporal thinning. This prevents spatially distant but temporally close records from being filtered out. ```{r example-spatiotemp_thin,fig.width=4, fig.height=4} occ_thin <- spatiotemp_thin(occ.data = sample_occ_cropped, temporal.method = "day", temporal.dist = 30, spatial.split.degrees = 3, spatial.dist = 100000, iterations = 5) # Plot the difference in spatial distribution after thinning # Non-thinned terra::plot(sample_extent_data$geometry) terra::plot(terra::vect(sample_occ_cropped[, c("x", "y")], geom = c("x", "y"), crs = "+proj=longlat +datum=WGS84 +no_defs"),add=T) # Thinned terra::plot(sample_extent_data$geometry) terra::plot(terra::vect(occ_thin[, c("x", "y")], geom = c("x", "y"), crs = "+proj=longlat +datum=WGS84 +no_defs"),add=T) ``` #### Weight records by sampling effort Another approach to account for spatio-temporal sampling biases is to weight records by sampling effort when fitting SDMs.This results in the downweighting of the importance of records from over-sampled regions and vice versa. Here, we will read in a sample of e-Bird avian sampling records in southern Africa to use as a proxy for sampling effort, and sum these across a spatial and temporal buffer from each occurrence record using `spatiotemp_weights()`. ```{r example-spatiotemp_weights} data("sample_events_data") sample_occ_cropped <- spatiotemp_weights(occ.data = sample_occ_cropped, samp.events = sample_events_data, spatial.dist = 100000, temporal.dist = 30, prj = "+proj=longlat +datum=WGS84") ``` ### e) Generate pseudo-absences through space and time Often the paucity of species absence records necessitates the generation of pseudo-absences, which are inferred absences based upon presence records. When modelling a species distribution at high spatiotemporal resolution, it may be best to generate pseudo-absences within close spatial and temporal buffers of species occurrence. This is because model presence-absence comparisons are made at fine scales, instead of coarsely between biomes and seasons. Using `spatiotemp_pseudoabs()` you can specify the spatial and temporal buffer size or extents to randomly generate pseudo-absences within. ```{r example-spatiotemp_pseudoabs,fig.width=4, fig.height=4} # Pseudo-absences generated within spatial and temporal buffer pseudo_abs_buff <- spatiotemp_pseudoabs(occ.data = sample_occ_cropped, spatial.method = "buffer", temporal.method = "buffer", spatial.ext = sample_extent_data, spatial.buffer = c(250000, 500000), temporal.buffer = c(42, 84), n.pseudoabs = nrow(sample_occ_cropped)) # Pseudo-absences generated randomly across given spatial and temporal extent pseudo_abs_rand <- spatiotemp_pseudoabs(occ.data = sample_occ_cropped, spatial.method = "random", temporal.method = "random", spatial.ext = sample_extent_data, temporal.ext = c("2002-01-01", "2019-12-01"), n.pseudoabs = nrow(sample_occ_cropped)) # Plot the spatial distribution of pseudo-absences (red) compared to occurrence records for: # Buffered terra::plot(sample_extent_data$geometry) terra::plot(terra::vect(sample_occ_cropped[, c("x", "y")], geom = c("x", "y"), crs = "+proj=longlat +datum=WGS84 +no_defs"),col = "black",add=T) terra::plot(terra::vect(pseudo_abs_buff[, c("x", "y")], geom = c("x", "y"), crs = "+proj=longlat +datum=WGS84 +no_defs"),col = "red", add=T) # Random terra::plot(sample_extent_data$geometry) terra::plot(terra::vect(sample_occ_cropped[, c("x", "y")], geom = c("x", "y"), crs = "+proj=longlat +datum=WGS84 +no_defs"),col = "black",add=T) terra::plot(terra::vect(pseudo_abs_rand[, c("x", "y")], geom = c("x", "y"), crs = "+proj=longlat +datum=WGS84 +no_defs"),col = "red", add=T) ``` Then, the generated pseudo-absences need to be added to the occurrence record data frame by adding relevant columns and binding the rows together. ```{r complete_dataset,eval=F} pseudo_abs_buff$decimalLatitude <- pseudo_abs_buff$y pseudo_abs_buff$decimalLongitude <- pseudo_abs_buff$x pseudo_abs_buff$occurrenceStatus <- rep("absent", nrow(pseudo_abs_buff)) pseudo_abs_buff$source <- rep("dynamicSDM", nrow(pseudo_abs_buff)) pseudo_abs_buff<-spatiotemp_weights(occ.data = pseudo_abs_buff, samp.events = sample_events_data, spatial.dist = 100000, temporal.dist = 30, prj = "+proj=longlat +datum=WGS84") complete.dataset <- as.data.frame(rbind(sample_occ_cropped, pseudo_abs_buff)) ``` ## Summary At the end of this vignette, we now have a complete data frame of filtered species occurrence and pseudo-absence records of appropriate spatio-temporal quality, extent and resolution. Let’s save this to our project directory for use in the next tutorial! ```{r export_dataset,eval=F} write.csv(complete.dataset, file = paste0(project_directory, "/filtered_quelea_occ.csv")) ```