Title: Survey Value of Information
Description: Decision support tool for prioritizing sites for ecological surveys based on their potential to improve plans for conserving biodiversity (e.g. plans for establishing protected areas). Given a set of sites that could potentially be acquired for conservation management, it can be used to generate and evaluate plans for surveying additional sites. Specifically, plans for ecological surveys can be generated using various conventional approaches (e.g. maximizing expected species richness, geographic coverage, diversity of sampled environmental conditions) and by maximizing value of information. After generating such survey plans, they can be evaluated using value of information. Please note that several functions depend on the 'Gurobi' optimization software (available from <https://www.gurobi.com>). Additionally, the 'JAGS' software (available from <https://mcmc-jags.sourceforge.io/>) is required to fit hierarchical generalized linear models. For further details, see Hanson et al. (2022) <doi:10.1111/1365-2664.14309>.
Authors: Jeffrey O Hanson [aut, cre], Iadine Chadès [aut], Emma J Hudgins [aut], Joseph R Bennett [aut]
Maintainer: Jeffrey O Hanson <[email protected]>
License: GPL-3
Version: 1.0.6
Built: 2024-11-25 06:58:47 UTC
Source: CRAN
Calculate the expected value of the management decision given survey information. This metric describes the value of the management decision that is expected when the decision maker surveys a set of sites to inform the decision. To speed up the calculations, an approximation method is used.
approx_evdsi(
  site_data, feature_data, site_detection_columns, site_n_surveys_columns,
  site_probability_columns, site_management_cost_column,
  site_survey_scheme_column, site_survey_cost_column, feature_survey_column,
  feature_survey_sensitivity_column, feature_survey_specificity_column,
  feature_model_sensitivity_column, feature_model_specificity_column,
  feature_target_column, total_budget,
  site_management_locked_in_column = NULL,
  site_management_locked_out_column = NULL,
  prior_matrix = NULL, n_approx_replicates = 100,
  n_approx_outcomes_per_replicate = 10000, seed = 500
)
Arguments: site_data, feature_data, site_detection_columns, site_n_surveys_columns, site_probability_columns, site_management_cost_column, site_survey_scheme_column, site_survey_cost_column, feature_survey_column, feature_survey_sensitivity_column, feature_survey_specificity_column, feature_model_sensitivity_column, feature_model_specificity_column, feature_target_column, total_budget, site_management_locked_in_column, site_management_locked_out_column, prior_matrix, n_approx_replicates, n_approx_outcomes_per_replicate, seed
This function uses approximation methods to calculate the expected value. The accuracy of these calculations depends on the arguments to n_approx_replicates and n_approx_outcomes_per_replicate, and so you may need to increase these parameters for large problems.
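Because the function returns one value per replicate, the Monte Carlo error of the approximation can be gauged directly from the spread of those values. A minimal sketch, using simulated numbers as a stand-in for real output from approx_evdsi():

```r
# stand-in for the per-replicate values returned by approx_evdsi()
# (simulated here so the sketch is self-contained)
set.seed(1)
ev <- rnorm(100, mean = 3.2, sd = 0.05)

# Monte Carlo standard error of the mean across replicates; if this is
# large relative to the differences between candidate survey schemes,
# consider increasing n_approx_replicates and/or
# n_approx_outcomes_per_replicate
mc_se <- sd(ev) / sqrt(length(ev))
print(c(mean = mean(ev), se = mc_se))
```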
A numeric vector containing the expected value for each replicate.
# set seeds for reproducibility
set.seed(123)

# load example site data
data(sim_sites)
print(sim_sites)

# load example feature data
data(sim_features)
print(sim_features)

# set total budget for managing sites for conservation
# (i.e. 50% of the cost of managing all sites)
total_budget <- sum(sim_sites$management_cost) * 0.5

# create a survey scheme that samples the first two sites that
# are missing data
sim_sites$survey_site <- FALSE
sim_sites$survey_site[which(sim_sites$n1 < 0.5)[1:2]] <- TRUE

# calculate expected value of management decision given the survey
# information using approximation method
approx_ev_survey <- approx_evdsi(
  sim_sites, sim_features,
  c("f1", "f2", "f3"), c("n1", "n2", "n3"), c("p1", "p2", "p3"),
  "management_cost", "survey_site", "survey_cost", "survey",
  "survey_sensitivity", "survey_specificity",
  "model_sensitivity", "model_specificity",
  "target", total_budget)

# print mean value
print(mean(approx_ev_survey))
Find a near optimal survey scheme that maximizes value of information. This function uses the approximation method for calculating the expected value of the decision given a survey scheme, and a greedy heuristic algorithm to maximize this metric.
approx_near_optimal_survey_scheme(
  site_data, feature_data, site_detection_columns, site_n_surveys_columns,
  site_probability_columns, site_management_cost_column,
  site_survey_cost_column, feature_survey_column,
  feature_survey_sensitivity_column, feature_survey_specificity_column,
  feature_model_sensitivity_column, feature_model_specificity_column,
  feature_target_column, total_budget, survey_budget,
  site_management_locked_in_column = NULL,
  site_management_locked_out_column = NULL,
  site_survey_locked_out_column = NULL,
  prior_matrix = NULL, n_approx_replicates = 100,
  n_approx_outcomes_per_replicate = 10000, seed = 500,
  n_threads = 1, verbose = FALSE
)
Arguments: site_data, feature_data, site_detection_columns, site_n_surveys_columns, site_probability_columns, site_management_cost_column, site_survey_cost_column, feature_survey_column, feature_survey_sensitivity_column, feature_survey_specificity_column, feature_model_sensitivity_column, feature_model_specificity_column, feature_target_column, total_budget, survey_budget, site_management_locked_in_column, site_management_locked_out_column, site_survey_locked_out_column, prior_matrix, n_approx_replicates, n_approx_outcomes_per_replicate, seed, n_threads, verbose
Ideally, the brute-force algorithm would be used to identify the optimal survey scheme. Unfortunately, it is not feasible to apply the brute-force algorithm to large problems because it can take an incredibly long time to complete. In such cases, it may be desirable to obtain a "relatively good" survey scheme, and the greedy heuristic algorithm is provided for such cases. The greedy heuristic algorithm – unlike the brute-force algorithm – is not guaranteed to identify an optimal solution – or even a "relatively good" solution for that matter – though greedy heuristic algorithms tend to deliver solutions within 15% of optimality. Specifically, the greedy algorithm is implemented as follows:
1. Initialize an empty list of survey scheme solutions, and an empty list of approximate expected values.
2. Calculate the expected value of current information.
3. Add a survey scheme with no sites selected for surveying to the list of survey scheme solutions, and add the expected value of current information to the list of approximate expected values.
4. Set the current survey solution as the survey scheme with no sites selected for surveying.
5. For each remaining candidate site that has not been selected for a survey, generate a new candidate survey scheme with that site added to the current survey solution.
6. Calculate the approximate expected value of each new candidate survey scheme. If the cost of a given candidate survey scheme exceeds the survey budget, then store a missing (NA) value instead.
7. Similarly, if the cost of a given candidate survey scheme plus the management costs of locked in planning units exceeds the total budget, then store a missing (NA) value too.
8. If all of the new candidate survey schemes are associated with missing (NA) values – because they all exceed the survey budget – then go to step 12.
9. Calculate the cost effectiveness of each new candidate survey scheme. This is calculated as the difference between the approximate expected value of a given new candidate survey scheme and that of the current survey solution, divided by the cost of the newly selected candidate site.
10. Find the new candidate survey scheme that is associated with the highest cost effectiveness value, ignoring any missing (NA) values. Set this new candidate survey scheme as the current survey scheme.
11. Store the current survey scheme in the list of survey scheme solutions and store its approximate expected value in the list of approximate expected values. Go to step 5.
12. Find the solution in the list of survey scheme solutions that has the highest expected value in the list of approximate expected values and return this solution.
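The greedy procedure described above can be sketched with a toy value function standing in for the approximate expected value (all names here are illustrative, not part of the package API):

```r
# toy greedy loop: repeatedly add the most cost-effective site that
# still fits within the budget, then return the best scheme found
greedy_scheme <- function(cost, budget, value) {
  current <- rep(FALSE, length(cost))
  solutions <- list(current)
  values <- value(current)
  repeat {
    # candidate sites that can still be added within the budget
    cand <- which(!current & (cost + sum(cost[current])) <= budget)
    if (length(cand) == 0) break
    # cost effectiveness: gain in value per unit survey cost
    ce <- vapply(cand, function(i) {
      s <- current
      s[i] <- TRUE
      (value(s) - values[length(values)]) / cost[i]
    }, numeric(1))
    current[cand[which.max(ce)]] <- TRUE
    solutions <- c(solutions, list(current))
    values <- c(values, value(current))
  }
  solutions[[which.max(values)]]
}

# toy diminishing-returns value function
v <- function(s) sum(sqrt(seq_len(sum(s))))
greedy_scheme(cost = c(1, 2, 3), budget = 3, value = v)
```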
A matrix of logical (TRUE/FALSE) values indicating if a site is selected in the scheme or not. Columns correspond to sites, and rows correspond to different schemes. If there are no ties for the best identified solution, then the matrix will only contain a single row.
# set seeds for reproducibility
set.seed(123)

# load example site data
data(sim_sites)
print(sim_sites)

# load example feature data
data(sim_features)
print(sim_features)

# set total budget for managing sites for conservation
# (i.e. 50% of the cost of managing all sites)
total_budget <- sum(sim_sites$management_cost) * 0.5

# set total budget for surveying sites for conservation
# (i.e. 40% of the cost of surveying all sites)
survey_budget <- sum(sim_sites$survey_cost) * 0.4

# find survey scheme using approximate method and greedy heuristic algorithm
# (using 10 replicates so that this example completes relatively quickly)
approx_near_optimal_survey <- approx_near_optimal_survey_scheme(
  sim_sites, sim_features,
  c("f1", "f2", "f3"), c("n1", "n2", "n3"), c("p1", "p2", "p3"),
  "management_cost", "survey_cost", "survey",
  "survey_sensitivity", "survey_specificity",
  "model_sensitivity", "model_specificity",
  "target", total_budget, survey_budget)

# print result
print(approx_near_optimal_survey)
Find the optimal survey scheme that maximizes value of information. This function uses the approximation method for calculating the expected value of the decision given a survey scheme.
approx_optimal_survey_scheme(
  site_data, feature_data, site_detection_columns, site_n_surveys_columns,
  site_probability_columns, site_management_cost_column,
  site_survey_cost_column, feature_survey_column,
  feature_survey_sensitivity_column, feature_survey_specificity_column,
  feature_model_sensitivity_column, feature_model_specificity_column,
  feature_target_column, total_budget, survey_budget,
  site_management_locked_in_column = NULL,
  site_management_locked_out_column = NULL,
  site_survey_locked_out_column = NULL,
  prior_matrix = NULL, n_approx_replicates = 100,
  n_approx_outcomes_per_replicate = 10000, seed = 500,
  n_threads = 1, verbose = FALSE
)
Arguments: site_data, feature_data, site_detection_columns, site_n_surveys_columns, site_probability_columns, site_management_cost_column, site_survey_cost_column, feature_survey_column, feature_survey_sensitivity_column, feature_survey_specificity_column, feature_model_sensitivity_column, feature_model_specificity_column, feature_target_column, total_budget, survey_budget, site_management_locked_in_column, site_management_locked_out_column, site_survey_locked_out_column, prior_matrix, n_approx_replicates, n_approx_outcomes_per_replicate, seed, n_threads, verbose
The "approximately" optimal survey scheme is determined using a brute-force algorithm. Initially, all feasible (valid) survey schemes are identified given the survey costs and the survey budget (using feasible_survey_schemes()). Next, the expected value of each feasible survey scheme is approximated (using approx_evdsi()). Finally, the greatest expected value is identified, and all survey schemes that share this greatest expected value are returned. Due to the nature of this algorithm, it can take a very long time to complete.
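Conceptually, the brute-force search amounts to enumerating every feasible scheme, scoring each one, and keeping all ties for the maximum. A self-contained toy sketch (with a stand-in value function, not the package's own):

```r
cost <- c(1, 2, 3)
budget <- 3

# enumerate all 2^n candidate schemes and keep the feasible ones
schemes <- as.matrix(expand.grid(rep(list(c(FALSE, TRUE)), length(cost))))
feasible <- schemes[as.vector(schemes %*% cost) <= budget, , drop = FALSE]

# toy stand-in for the approximate expected value of each scheme
ev <- apply(feasible, 1, function(s) sum(sqrt(which(s))))

# all schemes tied for the greatest value (rows = schemes)
feasible[ev == max(ev), , drop = FALSE]
```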
A matrix of logical (TRUE/FALSE) values indicating if a site is selected in the scheme or not. Columns correspond to sites, and rows correspond to different schemes. If there is only one optimal survey scheme, then the matrix will only contain a single row. This matrix also has a numeric "ev" attribute that contains a matrix with the approximate expected values. Within this attribute, each row corresponds to a different survey scheme and each column corresponds to a different replicate.
Please note that this function requires the Gurobi optimization software (https://www.gurobi.com/) and the gurobi R package if different sites have different survey costs. Installation instructions are available online for Linux, Windows, and Mac OS (see https://support.gurobi.com/hc/en-us/articles/4534161999889-How-do-I-install-Gurobi-Optimizer).
# set seeds for reproducibility
set.seed(123)

# load example site data
data(sim_sites)
print(sim_sites)

# load example feature data
data(sim_features)
print(sim_features)

# set total budget for managing sites for conservation
# (i.e. 50% of the cost of managing all sites)
total_budget <- sum(sim_sites$management_cost) * 0.5

# set total budget for surveying sites for conservation
# (i.e. 40% of the cost of surveying all sites)
survey_budget <- sum(sim_sites$survey_cost) * 0.4

## Not run:
# find optimal survey scheme using approximate method
# (using 10 replicates so that this example completes relatively quickly)
approx_opt_survey <- approx_optimal_survey_scheme(
  sim_sites, sim_features,
  c("f1", "f2", "f3"), c("n1", "n2", "n3"), c("p1", "p2", "p3"),
  "management_cost", "survey_cost", "survey",
  "survey_sensitivity", "survey_specificity",
  "model_sensitivity", "model_specificity",
  "target", total_budget, survey_budget)

# print result
print(approx_opt_survey)
## End(Not run)
Generate a survey scheme by maximizing the diversity of environmental conditions that are surveyed.
env_div_survey_scheme(
  site_data, cost_column, survey_budget, env_vars_columns,
  method = "mahalanobis", locked_in_column = NULL, locked_out_column = NULL,
  exclude_locked_out = FALSE, solver = "auto", verbose = FALSE
)
Arguments: site_data, cost_column, survey_budget, env_vars_columns, method, locked_in_column, locked_out_column, exclude_locked_out, solver, verbose
The integer programming formulation of the environmental diversity reserve selection problem (Faith & Walker 1996) is used to generate survey schemes.
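For intuition, the Mahalanobis dissimilarities that underpin method = "mahalanobis" can be computed directly from the environmental data; a sketch (illustrative only – env_div_survey_scheme() computes these internally):

```r
# environmental conditions for four hypothetical sites
env <- cbind(v1 = c(0.1, 0.2, 0.3, 10), v2 = c(5, 0.2, 4, 1))

# whiten the data with the inverse covariance so that Euclidean
# distances in the transformed space equal Mahalanobis distances
S_inv <- solve(cov(env))
d <- as.matrix(dist(env %*% t(chol(S_inv))))
round(d, 2)
```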
A matrix of logical (TRUE/FALSE) values indicating if a site is selected in a scheme or not. Columns correspond to sites, and rows correspond to different schemes.
This function can use the Rsymphony package and the Gurobi optimization software to generate survey schemes. Although the Rsymphony package is easier to install because it is freely available on the Comprehensive R Archive Network (CRAN), it is strongly recommended to install the Gurobi optimization software and the gurobi R package because they can generate survey schemes much faster. Note that special academic licenses are available at no cost. Installation instructions are available online for Linux, Windows, and Mac OS operating systems.
Faith DP & Walker PA (1996) Environmental diversity: on the best-possible use of surrogate data for assessing the relative biodiversity of sets of areas. Biodiversity & Conservation, 5, 399–415.
# set seed for reproducibility
set.seed(123)

# simulate data
x <- sf::st_as_sf(
  tibble::tibble(
    x = rnorm(4), y = rnorm(4),
    v1 = c(0.1, 0.2, 0.3, 10), # environmental axis 1
    v2 = c(0.1, 0.2, 0.3, 10), # environmental axis 2
    cost = rep(1, 4)),
  coords = c("x", "y"))

# plot the sites' environmental conditions
plot(x[, c("v1", "v2")], pch = 16, cex = 3)

# generate scheme with a budget of 2
s <- env_div_survey_scheme(x, "cost", 2, c("v1", "v2"), "mahalanobis")

# print scheme
print(s)

# plot scheme
x$scheme <- c(s)
plot(x[, "scheme"], pch = 16, cex = 3)
Calculate the expected value of the management decision given current information. This metric describes the value of the management decision that is expected when the decision maker is limited to existing biodiversity data (i.e. survey data and environmental niche models).
evdci(
  site_data, feature_data, site_detection_columns, site_n_surveys_columns,
  site_probability_columns, site_management_cost_column,
  feature_survey_sensitivity_column, feature_survey_specificity_column,
  feature_model_sensitivity_column, feature_model_specificity_column,
  feature_target_column, total_budget,
  site_management_locked_in_column = NULL,
  site_management_locked_out_column = NULL,
  prior_matrix = NULL
)
Arguments: site_data, feature_data, site_detection_columns, site_n_surveys_columns, site_probability_columns, site_management_cost_column, feature_survey_sensitivity_column, feature_survey_specificity_column, feature_model_sensitivity_column, feature_model_specificity_column, feature_target_column, total_budget, site_management_locked_in_column, site_management_locked_out_column, prior_matrix
This function calculates the expected value and does not use approximation methods. As such, this function can only be applied to very small problems.
A numeric value.
# set seeds for reproducibility
set.seed(123)

# load example site data
data(sim_sites)
print(sim_sites)

# load example feature data
data(sim_features)
print(sim_features)

# set total budget for managing sites for conservation
# (i.e. 50% of the cost of managing all sites)
total_budget <- sum(sim_sites$management_cost) * 0.5

# calculate expected value of management decision given current information
# using exact method
ev_current <- evdci(
  sim_sites, sim_features,
  c("f1", "f2", "f3"), c("n1", "n2", "n3"), c("p1", "p2", "p3"),
  "management_cost",
  "survey_sensitivity", "survey_specificity",
  "model_sensitivity", "model_specificity",
  "target", total_budget)

# print exact value
print(ev_current)
Calculate the expected value of the management decision given survey information. This metric describes the value of the management decision that is expected when the decision maker surveys a set of sites to help inform the decision.
evdsi(
  site_data, feature_data, site_detection_columns, site_n_surveys_columns,
  site_probability_columns, site_management_cost_column,
  site_survey_scheme_column, site_survey_cost_column, feature_survey_column,
  feature_survey_sensitivity_column, feature_survey_specificity_column,
  feature_model_sensitivity_column, feature_model_specificity_column,
  feature_target_column, total_budget,
  site_management_locked_in_column = NULL,
  site_management_locked_out_column = NULL,
  prior_matrix = NULL
)
Arguments: site_data, feature_data, site_detection_columns, site_n_surveys_columns, site_probability_columns, site_management_cost_column, site_survey_scheme_column, site_survey_cost_column, feature_survey_column, feature_survey_sensitivity_column, feature_survey_specificity_column, feature_model_sensitivity_column, feature_model_specificity_column, feature_target_column, total_budget, site_management_locked_in_column, site_management_locked_out_column, prior_matrix
This function calculates the expected value and does not use approximation methods. As such, this function can only be applied to very small problems.
A numeric value.
# set seeds for reproducibility
set.seed(123)

# load example site data
data(sim_sites)
print(sim_sites)

# load example feature data
data(sim_features)
print(sim_features)

# set total budget for managing sites for conservation
# (i.e. 50% of the cost of managing all sites)
total_budget <- sum(sim_sites$management_cost) * 0.5

# create a survey scheme that samples the first two sites that
# are missing data
sim_sites$survey_site <- FALSE
sim_sites$survey_site[which(sim_sites$n1 < 0.5)[1:2]] <- TRUE

# calculate expected value of management decision given the survey
# information using exact method
ev_survey <- evdsi(
  sim_sites, sim_features,
  c("f1", "f2", "f3"), c("n1", "n2", "n3"), c("p1", "p2", "p3"),
  "management_cost", "survey_site", "survey_cost", "survey",
  "survey_sensitivity", "survey_specificity",
  "model_sensitivity", "model_specificity",
  "target", total_budget)

# print value
print(ev_survey)
Generate a matrix representing all possible survey schemes given survey costs and a fixed budget.
feasible_survey_schemes(
  site_data, cost_column, survey_budget,
  locked_in_column = NULL, locked_out_column = NULL, verbose = FALSE
)
Arguments: site_data, cost_column, survey_budget, locked_in_column, locked_out_column, verbose
A matrix where each row corresponds to a different survey scheme, and each column corresponds to a different planning unit. Cell values are logical (TRUE/FALSE), indicating if a given site is selected in a given survey scheme.
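The enumeration can be sketched naively in base R (illustrative only; the function itself uses more efficient routines, and Gurobi when sites have different costs):

```r
cost <- c(1, 1, 1, 1)
budget <- 2

# all 2^n subsets of sites, one row per candidate scheme
g <- as.matrix(expand.grid(rep(list(c(FALSE, TRUE)), length(cost))))

# keep the schemes whose total survey cost fits the budget
g[as.vector(g %*% cost) <= budget, , drop = FALSE]
```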
Please note that this function requires the Gurobi optimization software (https://www.gurobi.com/) and the gurobi R package if different sites have different survey costs. Installation instructions are available online for Linux, Windows, and Mac OS (see https://support.gurobi.com/hc/en-us/articles/4534161999889-How-do-I-install-Gurobi-Optimizer).
## Not run:
# set seed for reproducibility
set.seed(123)

# simulate data
x <- sf::st_as_sf(
  tibble::tibble(
    x = rnorm(4), y = rnorm(4),
    cost = c(100, 200, 0.2, 1)),
  coords = c("x", "y"))

# print data
print(x)

# plot site locations
plot(sf::st_geometry(x), pch = 16, cex = 3)

# generate all feasible schemes given a budget of 4
s <- feasible_survey_schemes(x, "cost", survey_budget = 4)

# print schemes
print(s)

# plot first scheme
x$scheme_1 <- s[1, ]
plot(x[, "scheme_1"], pch = 16, cex = 3)
## End(Not run)
Estimate the probability of occupancy for a set of features in a set of planning units. Models are fitted as hierarchical generalized linear models that account for imperfect detection (following Royle & Link 2006) using JAGS (via runjags::run.jags()). To limit over-fitting, covariate coefficients are sampled using a Laplace prior distribution (equivalent to the L1 regularization used in machine learning contexts) (Park & Casella 2008).
fit_hglm_occupancy_models(
  site_data, feature_data, site_detection_columns, site_n_surveys_columns,
  site_env_vars_columns, feature_survey_sensitivity_column,
  feature_survey_specificity_column,
  jags_n_samples = rep(10000, length(site_detection_columns)),
  jags_n_burnin = rep(1000, length(site_detection_columns)),
  jags_n_thin = rep(100, length(site_detection_columns)),
  jags_n_adapt = rep(1000, length(site_detection_columns)),
  jags_n_chains = rep(4, length(site_detection_columns)),
  n_folds = rep(5, length(site_detection_columns)),
  n_threads = 1, seed = 500, verbose = FALSE
)
Arguments: site_data, feature_data, site_detection_columns, site_n_surveys_columns, site_env_vars_columns, feature_survey_sensitivity_column, feature_survey_specificity_column, jags_n_samples, jags_n_burnin, jags_n_thin, jags_n_adapt, jags_n_chains, n_folds, n_threads, seed, verbose
This function (i) prepares the data for model fitting, (ii) fits the models, and (iii) assesses the performance of the models. These analyses are performed separately for each feature. For a given feature:
The data are prepared for model fitting by partitioning the data using
k-fold cross-validation (set via the n_folds argument). The
training and evaluation folds are constructed so that each fold contains
at least one presence and one absence observation.
A model is fitted separately for each fold (see
inst/jags/model.jags
for the model code). To assess convergence,
the multivariate potential scale reduction factor
(MPSRF) statistic is calculated for each model.
The performance of the cross-validation models is evaluated.
Specifically, the TSS, sensitivity, and specificity statistics are
calculated (if relevant, weighted by the argument to
site_weights_data
). These performance values are calculated using
the models' training and evaluation folds. To assess convergence,
the maximum MPSRF statistic for the models fit for each feature
is calculated.
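The True Skill Statistic (TSS) reported here is sensitivity plus specificity minus one. A minimal sketch (shown in Python) of how these three statistics follow from a confusion matrix; this is illustrative only, not the package's internal code:

```python
def confusion_stats(actual, predicted):
    """Compute sensitivity, specificity, and TSS from binary vectors."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    sens = tp / (tp + fn)  # true positive rate
    spec = tn / (tn + fp)  # true negative rate
    return sens, spec, sens + spec - 1  # TSS

sens, spec, tss = confusion_stats([1, 1, 0, 0], [1, 0, 0, 0])
# sens = 0.5, spec = 1.0, tss = 0.5
```

A TSS of 1 indicates perfect discrimination, while 0 indicates performance no better than random.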
A list object containing: (i) a list of list objects containing the models; (ii) predictions, a tibble::tibble() object containing predictions for each feature; and (iii) performance, a tibble::tibble() object containing the performance of the best models for each feature. The performance table contains the following columns:
name of the feature.
maximum multivariate potential scale reduction factor (MPSRF) value for the models. An MPSRF value less than 1.05 means that all coefficients in a given model have converged, so a value less than 1.05 in this column means that all the models fitted for a given feature have successfully converged.
mean TSS statistic for models calculated using training data in cross-validation.
standard deviation in TSS statistics for models calculated using training data in cross-validation.
mean sensitivity statistic for models calculated using training data in cross-validation.
standard deviation in sensitivity statistics for models calculated using training data in cross-validation.
mean specificity statistic for models calculated using training data in cross-validation.
standard deviation in specificity statistics for models calculated using training data in cross-validation.
mean TSS statistic for models calculated using test data in cross-validation.
standard deviation in TSS statistics for models calculated using test data in cross-validation.
mean sensitivity statistic for models calculated using test data in cross-validation.
standard deviation in sensitivity statistics for models calculated using test data in cross-validation.
mean specificity statistic for models calculated using test data in cross-validation.
standard deviation in specificity statistics for models calculated using test data in cross-validation.
This function requires the JAGS software to be installed. For information on installing the JAGS software, please consult the documentation for the rjags package.
Park T & Casella G (2008) The Bayesian lasso. Journal of the American Statistical Association, 103: 681–686.
Royle JA & Link WA (2006) Generalized site occupancy models allowing for false positive and false negative errors. Ecology, 87: 835–841.
## Not run:
# set seed for reproducibility
set.seed(123)

# simulate data for 30 sites and 2 features
site_data <- simulate_site_data(n_sites = 30, n_features = 2, prop = 0.1)
feature_data <- simulate_feature_data(n_features = 2, prop = 1)

# print JAGS model code
cat(readLines(system.file("jags", "model.jags", package = "surveyvoi")),
    sep = "\n")

# fit models
# note that we use a small number of MCMC iterations so that the example
# finishes quickly; you probably want to use the defaults for real work
results <- fit_hglm_occupancy_models(
  site_data, feature_data,
  c("f1", "f2"), c("n1", "n2"), c("e1", "e2", "e3"),
  "survey_sensitivity", "survey_specificity",
  n_folds = rep(5, 2),
  jags_n_samples = rep(250, 2), jags_n_burnin = rep(250, 2),
  jags_n_thin = rep(1, 2), jags_n_adapt = rep(100, 2),
  n_threads = 1)

# print model predictions
print(results$predictions)

# print model performance
print(results$performance, width = Inf)
## End(Not run)
Estimate probability of occupancy for a set of features in a set of
planning units. Models are fitted using gradient boosted trees (via
xgboost::xgb.train()
).
fit_xgb_occupancy_models( site_data, feature_data, site_detection_columns, site_n_surveys_columns, site_env_vars_columns, feature_survey_sensitivity_column, feature_survey_specificity_column, xgb_tuning_parameters, xgb_early_stopping_rounds = rep(20, length(site_detection_columns)), xgb_n_rounds = rep(100, length(site_detection_columns)), n_folds = rep(5, length(site_detection_columns)), n_threads = 1, seed = 500, verbose = FALSE )
Arguments: site_data, feature_data, site_detection_columns, site_n_surveys_columns, site_env_vars_columns, feature_survey_sensitivity_column, feature_survey_specificity_column, xgb_tuning_parameters, xgb_early_stopping_rounds, xgb_n_rounds, n_folds, n_threads, seed, verbose
This function (i) prepares the data for model fitting, (ii) calibrates
the tuning parameters for model fitting (see xgboost::xgb.train()
for details on tuning parameters), (iii) generates predictions using
the best found tuning parameters, and (iv) assesses the performance of the
best supported models. These analyses are performed separately for each
feature. For a given feature:
The data are prepared for model fitting by partitioning the data using
k-fold cross-validation (set via the n_folds argument). The
training and evaluation folds are constructed so that each fold contains
at least one presence and one absence observation.
A grid search method is used to tune the model parameters. The
candidate values for each parameter (specified via
xgb_tuning_parameters) are
used to generate a full set of parameter combinations, and these
parameter combinations are subsequently used for tuning the models.
To account for unbalanced datasets, the
scale_pos_weight
xgboost::xgboost()
parameter
is calculated as the mean value across each of the training folds
(i.e. number of absences divided by number of presences per feature).
For a given parameter combination, models are fit using k-fold cross-
validation (via xgboost::xgb.cv()), using the previously
mentioned training and evaluation folds, and the True Skill Statistic
(TSS) calculated using the data held out from each fold is
used to quantify the performance (i.e. the "test_tss_mean"
column in the
output). These models are also fitted using the
early_stopping_rounds
parameter to reduce the time spent
tuning models. If relevant, they are also fitted using the supplied weights
(per the argument to site_weights_data). After exploring the
full set of parameter combinations, the best parameter combination is
identified, and the associated parameter values and models are stored for
later use.
The cross-validation models associated with the best parameter combination are used to predict the average probability that the feature occupies each site. These predictions include sites that have been surveyed before, as well as sites that have not been surveyed before.
The performance of the cross-validation models is evaluated.
Specifically, the TSS, sensitivity, and specificity statistics are
calculated (if relevant, weighted by the argument to
site_weights_data
). These performance values are calculated using
the models' training and evaluation folds.
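The grid expansion and class-weighting steps described above can be sketched as follows (in Python, with hypothetical toy data): the full set of parameter combinations is enumerated, and scale_pos_weight is computed as the mean absence-to-presence ratio across training folds.

```python
import itertools

# candidate tuning values (mirroring the R example further below)
params = {"eta": [0.1, 0.3, 0.5], "lambda": [0.1, 0.316, 1.0]}

# full set of parameter combinations for the grid search
grid = [dict(zip(params, combo))
        for combo in itertools.product(*params.values())]
# 3 eta values x 3 lambda values = 9 combinations to evaluate

def scale_pos_weight(folds):
    """Mean absence/presence ratio across training folds (toy labels)."""
    ratios = [sum(1 for y in f if y == 0) / sum(1 for y in f if y == 1)
              for f in folds]
    return sum(ratios) / len(ratios)

w = scale_pos_weight([[0, 0, 0, 1], [0, 0, 1, 1]])  # ratios 3.0 and 1.0
# w = 2.0
```

This weighting upweights the rarer presence class so that models are not dominated by absences.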
A list object containing: (i) parameters, a list of list objects containing the best tuning parameters for each feature; (ii) predictions, a tibble::tibble() object containing predictions for each feature; and (iii) performance, a tibble::tibble() object containing the performance of the best models for each feature. The performance table contains the following columns:
name of the feature.
mean TSS statistic for models calculated using training data in cross-validation.
standard deviation in TSS statistics for models calculated using training data in cross-validation.
mean sensitivity statistic for models calculated using training data in cross-validation.
standard deviation in sensitivity statistics for models calculated using training data in cross-validation.
mean specificity statistic for models calculated using training data in cross-validation.
standard deviation in specificity statistics for models calculated using training data in cross-validation.
mean TSS statistic for models calculated using test data in cross-validation.
standard deviation in TSS statistics for models calculated using test data in cross-validation.
mean sensitivity statistic for models calculated using test data in cross-validation.
standard deviation in sensitivity statistics for models calculated using test data in cross-validation.
mean specificity statistic for models calculated using test data in cross-validation.
standard deviation in specificity statistics for models calculated using test data in cross-validation.
## Not run:
# set seed for reproducibility
set.seed(123)

# simulate data for 30 sites, 2 features, and 3 environmental variables
site_data <- simulate_site_data(
  n_sites = 30, n_features = 2, n_env_vars = 3, prop = 0.1)
feature_data <- simulate_feature_data(n_features = 2, prop = 1)

# create list of possible tuning parameters for modeling
parameters <- list(
  eta = seq(0.1, 0.5, length.out = 3),
  lambda = 10 ^ seq(-1.0, 0.0, length.out = 3),
  objective = "binary:logistic")

# fit models
# note that we use a small set of candidate parameter values so that the
# example finishes quickly; you would probably want to explore more values
results <- fit_xgb_occupancy_models(
  site_data, feature_data,
  c("f1", "f2"), c("n1", "n2"), c("e1", "e2", "e3"),
  "survey_sensitivity", "survey_specificity",
  n_folds = rep(5, 2),
  xgb_early_stopping_rounds = rep(100, 2),
  xgb_tuning_parameters = parameters,
  n_threads = 1)

# print best found model parameters
print(results$parameters)

# print model predictions
print(results$predictions)

# print model performance
print(results$performance, width = Inf)
## End(Not run)
Generate a survey scheme by maximizing the geographic coverage of surveys.
geo_cov_survey_scheme( site_data, cost_column, survey_budget, locked_in_column = NULL, locked_out_column = NULL, exclude_locked_out = FALSE, solver = "auto", verbose = FALSE )
Arguments: site_data, cost_column, survey_budget, locked_in_column, locked_out_column, exclude_locked_out, solver, verbose
The integer programming formulation of the p-Median problem (Daskin & Maass 2015) is used to generate survey schemes.
A matrix
of logical
(TRUE
/ FALSE
)
values indicating if a site is selected in a scheme or not. Columns
correspond to sites, and rows correspond to different schemes.
This function can use the Rsymphony package and the Gurobi optimization software to generate survey schemes. Although the Rsymphony package is easier to install because it is freely available on the Comprehensive R Archive Network (CRAN), it is strongly recommended to install the Gurobi optimization software and the gurobi R package because they can generate survey schemes much faster. Note that special academic licenses are available at no cost. Installation instructions are available online for Linux, Windows, and Mac OS operating systems.
Daskin MS & Maass KL (2015) The p-median problem. In Location Science (pp. 21-45). Springer, Cham.
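For reference, a standard integer programming formulation of the p-median problem (after Daskin & Maass 2015) selects $p$ sites to minimize total assignment distance; the package's formulation may differ in details, such as replacing the cardinality constraint with a budget constraint:

```latex
\begin{align*}
\min\;& \sum_{i \in I} \sum_{j \in I} d_{ij}\, y_{ij}
  && \text{(total distance from sites to their assigned selected sites)} \\
\text{s.t.}\;& \sum_{j \in I} y_{ij} = 1 \quad \forall i \in I
  && \text{(each site is assigned to exactly one selected site)} \\
& y_{ij} \le x_j \quad \forall i, j \in I
  && \text{(sites can only be assigned to selected sites)} \\
& \sum_{j \in I} x_j = p
  && \text{(exactly } p \text{ sites are selected)} \\
& x_j,\, y_{ij} \in \{0, 1\}
\end{align*}
```

Here $d_{ij}$ is the distance between sites $i$ and $j$, $x_j$ indicates whether site $j$ is selected for surveying, and $y_{ij}$ records assignments.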
# set seed for reproducibility
set.seed(123)

# simulate data
x <- sf::st_as_sf(
  tibble::tibble(
    x = rnorm(4), y = rnorm(4),
    v1 = c(0.1, 0.2, 0.3, 10),  # environmental axis 1
    v2 = c(0.1, 0.2, 0.3, 10),  # environmental axis 2
    cost = rep(1, 4)),
  coords = c("x", "y"))

# plot the sites' locations
plot(st_geometry(x), pch = 16, cex = 3)

# generate scheme with a budget of 2
s <- geo_cov_survey_scheme(x, "cost", 2)

# print scheme
print(s)

# plot scheme
x$scheme <- c(s)
plot(x[, "scheme"], pch = 16, cex = 3)
Calculate the total number of presence/absence states for a given number of sites and features.
n_states(n_sites, n_features)
Arguments: n_sites, n_features
A numeric
value.
# calculate number of states for 2 sites and 3 features
n_states(n_sites = 2, n_features = 3)
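Assuming that each of the n_sites × n_features site-by-feature combinations is independently either a presence or an absence, the total count is 2^(n_sites × n_features). A quick sketch (in Python) of this counting argument; the exact formula is my assumption, so consult the function's output for authoritative values:

```python
def n_states_sketch(n_sites, n_features):
    # each site-feature cell takes one of two values (presence/absence),
    # so the number of joint states is 2 raised to the number of cells
    return 2 ** (n_sites * n_features)

n_states_sketch(2, 3)  # 64 joint occupancy states
```

This count grows exponentially, which is why exact value-of-information calculations become expensive for larger problems.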
Find the optimal survey scheme that maximizes value of information. This function uses the exact method for calculating the expected value of the decision given a survey scheme.
optimal_survey_scheme( site_data, feature_data, site_detection_columns, site_n_surveys_columns, site_probability_columns, site_management_cost_column, site_survey_cost_column, feature_survey_column, feature_survey_sensitivity_column, feature_survey_specificity_column, feature_model_sensitivity_column, feature_model_specificity_column, feature_target_column, total_budget, survey_budget, site_management_locked_in_column = NULL, site_management_locked_out_column = NULL, site_survey_locked_out_column = NULL, prior_matrix = NULL, n_threads = 1, verbose = FALSE )
Arguments: site_data, feature_data, site_detection_columns, site_n_surveys_columns, site_probability_columns, site_management_cost_column, site_survey_cost_column, feature_survey_column, feature_survey_sensitivity_column, feature_survey_specificity_column, feature_model_sensitivity_column, feature_model_specificity_column, feature_target_column, total_budget, survey_budget, site_management_locked_in_column, site_management_locked_out_column, site_survey_locked_out_column, prior_matrix, n_threads, verbose
The optimal survey scheme is determined using a brute-force algorithm.
Initially, all feasible (valid) survey schemes are identified given the
survey costs and the survey budget (using
feasible_survey_schemes()). Next, the expected value of each
feasible survey scheme is computed
(using evdsi()).
Finally, the greatest expected value is identified, and all survey schemes
that share this greatest expected value are returned. Due to the nature of
this algorithm, it can take a very long time to complete.
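The brute-force strategy can be sketched generically (in Python, with a toy expected-value function standing in for evdsi()):

```python
import itertools

def optimal_scheme(costs, budget, expected_value):
    """Enumerate all feasible site subsets and keep those with maximal value."""
    n = len(costs)
    best_value, best = float("-inf"), []
    for r in range(n + 1):
        for scheme in itertools.combinations(range(n), r):
            if sum(costs[i] for i in scheme) > budget:
                continue  # infeasible: exceeds the survey budget
            value = expected_value(scheme)
            if value > best_value:
                best_value, best = value, [scheme]
            elif value == best_value:
                best.append(scheme)  # ties: keep all optimal schemes
    return best_value, best

# toy example: value is simply the number of surveyed sites
value, schemes = optimal_scheme([1, 2, 2], budget=3, expected_value=len)
# value = 2; optimal schemes: (0, 1) and (0, 2)
```

Because the number of subsets doubles with each additional candidate site, this enumeration quickly becomes expensive, which is why the package also offers approximation methods.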
A matrix
of logical
(TRUE
/ FALSE
)
values indicating if a site is selected in the scheme or not. Columns
correspond to sites, and rows correspond to different schemes. If
there is only one optimal survey scheme then the matrix
will only
contain a single row. This matrix also has a numeric
"ev"
attribute that contains the expected value of each scheme.
Please note that this function requires the Gurobi optimization software (https://www.gurobi.com/) and the gurobi R package if different sites have different survey costs. Installation instructions are available online for Linux, Windows, and Mac OS (see https://support.gurobi.com/hc/en-us/articles/4534161999889-How-do-I-install-Gurobi-Optimizer).
# set seed for reproducibility
set.seed(123)

# load example site data
data(sim_sites)
print(sim_sites)

# load example feature data
data(sim_features)
print(sim_features)

# set total budget for managing sites for conservation
# (i.e. 50% of the cost of managing all sites)
total_budget <- sum(sim_sites$management_cost) * 0.5

# set total budget for surveying sites
# (i.e. 40% of the cost of surveying all sites)
survey_budget <- sum(sim_sites$survey_cost) * 0.4

## Not run:
# find optimal survey scheme using exact method
opt_survey <- optimal_survey_scheme(
  sim_sites, sim_features,
  c("f1", "f2", "f3"), c("n1", "n2", "n3"), c("p1", "p2", "p3"),
  "management_cost", "survey_cost",
  "survey", "survey_sensitivity", "survey_specificity",
  "model_sensitivity", "model_specificity",
  "target", total_budget, survey_budget)

# print result
print(opt_survey)
## End(Not run)
Create prior probability matrix for the value of information analysis.
prior_probability_matrix( site_data, feature_data, site_detection_columns, site_n_surveys_columns, site_probability_columns, feature_survey_sensitivity_column, feature_survey_specificity_column, feature_model_sensitivity_column, feature_model_specificity_column )
Arguments: site_data, feature_data, site_detection_columns, site_n_surveys_columns, site_probability_columns, feature_survey_sensitivity_column, feature_survey_specificity_column, feature_model_sensitivity_column, feature_model_specificity_column
A matrix
object containing the prior probabilities of each
feature occupying each site. Each row corresponds to a different
feature and each column corresponds to a different site.
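As general background, survey sensitivity and specificity can update an occupancy probability via Bayes' rule. The sketch below (in Python) shows the standard single-detection update; the package's actual calculation also accounts for the number of surveys and model performance, so treat this as an illustration only:

```python
def update_after_detection(prior, sensitivity, specificity):
    """P(occupied | detected), via Bayes' rule for a single survey visit."""
    true_pos = sensitivity * prior                # detected and occupied
    false_pos = (1 - specificity) * (1 - prior)   # detected but not occupied
    return true_pos / (true_pos + false_pos)

p = update_after_detection(prior=0.5, sensitivity=0.9, specificity=0.9)
# 0.45 / (0.45 + 0.05) = 0.9
```

With imperfect detection, even a recorded presence only shifts the prior rather than setting the probability to exactly one.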
# set seed for reproducibility
set.seed(123)

# load example site data
data(sim_sites)
print(sim_sites)

# load example feature data
data(sim_features)
print(sim_features)

# calculate prior probability matrix
prior_matrix <- prior_probability_matrix(
  sim_sites, sim_features,
  c("f1", "f2", "f3"), c("n1", "n2", "n3"), c("p1", "p2", "p3"),
  "survey_sensitivity", "survey_specificity",
  "model_sensitivity", "model_specificity")

# preview prior probability matrix
print(prior_matrix)
Calculate relative site richness scores. Sites with greater scores are predicted to be more likely to contain more species. Note that these scores are relative to each other and scores calculated using different matrices cannot be compared to each other.
relative_site_richness_scores(site_data, site_probability_columns)
Arguments: site_data, site_probability_columns
The relative site richness scores are calculated using the following procedure:
Let $I$ denote the set of sites (indexed by $i$), let $J$ denote the set of features (indexed by $j$), and let $x_{ij}$ denote the modeled probability of feature $j \in J$ occurring in site $i \in I$.
Next, the values are summed for each site: $y_i = \sum_{j \in J} x_{ij}$.
Finally, the $y_i$ values are linearly rescaled between 0.01 and 1 to produce the scores.
A numeric
vector of richness scores. Note that
these values are automatically rescaled between 0.01 and 1.
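The summation and rescaling procedure can be sketched as follows (in Python; illustrative only, not the package implementation):

```python
def richness_scores(prob_matrix):
    """Sum occupancy probabilities per site, then rescale to [0.01, 1]."""
    sums = [sum(row) for row in prob_matrix]  # y_i = sum_j x_ij
    lo, hi = min(sums), max(sums)
    # linear rescale: scores are only comparable within this matrix
    return [0.01 + (s - lo) * (1 - 0.01) / (hi - lo) for s in sums]

scores = richness_scores([[0.1, 0.2], [0.4, 0.5], [0.9, 0.9]])
# sums are 0.3, 0.9, 1.8, so scores are 0.01, ~0.406, 1.0
```

Because of the rescaling, the lowest-richness site always receives 0.01 and the highest receives 1, regardless of the absolute probabilities.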
# set seed for reproducibility
set.seed(123)

# simulate data for 3 features and 4 planning units
x <- tibble::tibble(
  x = rnorm(4), y = rnorm(4),
  p1 = c(0.095, 0.032, 0.5, 0.924),
  p2 = c(0.023, 0.014, 0.4, 0.919),
  p3 = c(0.075, 0.046, 0.9, 0.977))
x <- sf::st_as_sf(x, coords = c("x", "y"))

# print data; the fourth site has the highest modeled probabilities of
# occupancy across all species
print(x)

# plot sites' occupancy probabilities
plot(x[, c("p1", "p2", "p3")], pch = 16, cex = 3)

# calculate scores
s <- relative_site_richness_scores(x, c("p1", "p2", "p3"))

# print scores; site 4 has the highest richness score
print(s)

# plot sites' richness scores
x$s <- s
plot(x[, c("s")], pch = 16, cex = 3)
Calculate scores to describe the overall uncertainty of modeled species' occupancy predictions for each site. Sites with greater scores are associated with greater uncertainty. Note that these scores are relative to each other and uncertainty values calculated using different matrices cannot be compared to each other.
relative_site_uncertainty_scores(site_data, site_probability_columns)
Arguments: site_data, site_probability_columns
The relative site uncertainty scores are calculated as joint Shannon entropy statistics. Since the species are assumed to occur independently of each other, these statistics can be calculated separately for each species in each site and then summed across the species in the same site:
Let $I$ denote the set of sites (indexed by $i$), let $J$ denote the set of features (indexed by $j$), and let $x_{ij}$ denote the modeled probability of feature $j \in J$ occurring in site $i \in I$.
Next, the Shannon entropy statistic is calculated for each species in each site: $e_{ij} = -\big(x_{ij} \log_2 x_{ij} + (1 - x_{ij}) \log_2 (1 - x_{ij})\big)$.
Finally, the entropy statistics are summed for each site: $u_i = \sum_{j \in J} e_{ij}$.
A numeric
vector of uncertainty scores. Note that
these values are automatically rescaled between 0.01 and 1.
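The entropy summation can be sketched as follows (in Python; illustrative only, and the rescaling to between 0.01 and 1 follows the return-value description above):

```python
import math

def entropy(p):
    """Shannon entropy (in bits) of a single Bernoulli probability."""
    if p in (0.0, 1.0):
        return 0.0  # fully certain predictions carry no uncertainty
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def uncertainty_scores(prob_matrix):
    """Sum per-feature entropies within each site, then rescale to [0.01, 1]."""
    sums = [sum(entropy(p) for p in row) for row in prob_matrix]
    lo, hi = min(sums), max(sums)
    return [0.01 + (s - lo) * (1 - 0.01) / (hi - lo) for s in sums]

scores = uncertainty_scores([[0.5, 0.5], [1.0, 0.0], [0.5, 1.0]])
# entropy sums are 2, 0, 1, so scores are 1.0, 0.01, ~0.505
```

Probabilities near 0.5 maximize entropy, so sites whose predictions hover around 0.5 receive the highest uncertainty scores.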
# set seed for reproducibility
set.seed(123)

# simulate data for 3 features and 5 sites
x <- tibble::tibble(
  x = rnorm(5), y = rnorm(5),
  p1 = c(0.5, 0, 1, 0, 1),
  p2 = c(0.5, 0.5, 1, 0, 1),
  p3 = c(0.5, 0.5, 0.5, 0, 1))
x <- sf::st_as_sf(x, coords = c("x", "y"))

# print data; site (row) 1 has the least certain predictions
# because it has many values close to 0.5
print(x)

# plot sites' occupancy probabilities
plot(x[, c("p1", "p2", "p3")], pch = 16, cex = 3)

# calculate scores
s <- relative_site_uncertainty_scores(x, c("p1", "p2", "p3"))

# print scores; site 1 has the highest uncertainty score
print(s)

# plot sites' uncertainty scores
x$s <- s
plot(x[, c("s")], pch = 16, cex = 3)
Simulated data for prioritizing sites for ecological surveys.
data(sim_features) data(sim_sites)
sf::sf()
object.
tibble::tibble()
object.
The simulated datasets provide data for six sites and three features. The sites could potentially be acquired for protected area establishment. However, existing information on the spatial distribution of the features is incomplete. Only some of the sites have existing ecological survey data. To help inform management decisions, species distribution models have been fitted to predict the probability of each species occupying each site.
sim_sites
This object describes the sites and contains the
following data: cost of surveying the sites (survey_cost
column),
cost of acquiring sites for conservation (management_cost
column),
results from previous ecological surveys (f1
, f2
, f3
columns),
previous survey effort (n1
, n2
, n3
columns),
environmental conditions of the sites (e1
, e2
columns),
and modeled probability of the features occupying the sites
(p1
, p2
, p3
columns).
sim_features
This object describes the features and contains
the following data:
the name of each feature (name
column),
whether each feature should be considered in future surveys
(survey
column),
the sensitivity and specificity of the survey methodology for each
feature (survey_sensitivity, survey_specificity columns),
the sensitivity and specificity of the species distribution model
for each feature (model_sensitivity
, model_specificity
columns),
and the representation target thresholds for each feature
(target
column).
These datasets were simulated using simulate_feature_data()
and simulate_site_data()
.
# load data
data(sim_sites, sim_features)

# print feature data
print(sim_features, width = Inf)

# print site data
print(sim_sites, width = Inf)
Simulate feature data for developing simulated survey schemes.
simulate_feature_data(n_features, proportion_of_survey_features = 1)
Arguments: n_features, proportion_of_survey_features
A tibble::tibble()
object. It contains the following
data:
name
character
name of each feature.
survey
logical
(TRUE
/ FALSE
) values
indicating if each feature should be examined in surveys or not.
survey_sensitivity
numeric
sensitivity (true positive
rate) of the survey methodology for each feature.
survey_specificity
numeric
specificity (true negative
rate) of the survey methodology for each feature.
model_sensitivity
numeric
sensitivity (true positive
rate) of the occupancy models for each feature.
model_specificity
numeric
specificity (true negative
rate) of the occupancy models for each feature.
target
numeric
target values used to parametrize
the conservation benefit of managing each feature (defaults to 1).
# set seed for reproducibility
set.seed(123)

# simulate data
d <- simulate_feature_data(n_features = 5, proportion_of_survey_features = 0.5)

# print data
print(d, width = Inf)
Simulate site data for developing simulated survey schemes.
simulate_site_data( n_sites, n_features, proportion_of_sites_missing_data, n_env_vars = 3, survey_cost_intensity = 20, survey_cost_scale = 5, management_cost_intensity = 100, management_cost_scale = 30, max_number_surveys_per_site = 5, output_probabilities = TRUE )
Arguments: n_sites, n_features, proportion_of_sites_missing_data, n_env_vars, survey_cost_intensity, survey_cost_scale, management_cost_intensity, management_cost_scale, max_number_surveys_per_site, output_probabilities
A sf::sf()
object with site data.
The "management_cost"
column contains the site protection costs,
and the "survey_cost"
column contains the costs for surveying
each site.
Additionally, columns that start with
(i) "f"
(e.g. "f1"
) contain the proportion of
times that each feature was detected in each site,
(ii) "n"
(e.g. "n1"
) contain the number
of surveys for each feature within each site,
(iii) "p"
(e.g. "p1"
) contain prior
probability data, and
(iv) "e"
(e.g. "e1"
) contain environmental
data. Note that columns that contain the same integer value (excepting
environmental data columns) correspond to the same feature
(e.g. "f1", "n1", "p1" contain data that correspond
to the same feature).
# set seed for reproducibility
set.seed(123)

# simulate data
d <- simulate_site_data(n_sites = 10, n_features = 4, prop = 0.5)

# print data
print(d, width = Inf)

# plot cost data
plot(d[, c("survey_cost", "management_cost")], axes = TRUE, pch = 16, cex = 2)

# plot environmental data
plot(d[, c("e1", "e2", "e3")], axes = TRUE, pch = 16, cex = 2)

# plot feature detection data
plot(d[, c("f1", "f2", "f3", "f4")], axes = TRUE, pch = 16, cex = 2)

# plot feature survey effort
plot(d[, c("n1", "n2", "n3", "n4")], axes = TRUE, pch = 16, cex = 2)

# plot feature prior probability data
plot(d[, c("p1", "p2", "p3", "p4")], axes = TRUE, pch = 16, cex = 2)
# set seed for reproducibility set.seed(123) # simulate data d <- simulate_site_data(n_sites = 10, n_features = 4, prop = 0.5) # print data print(d, width = Inf) # plot cost data plot(d[, c("survey_cost", "management_cost")], axes = TRUE, pch = 16, cex = 2) # plot environmental data plot(d[, c("e1", "e2", "e3")], axes = TRUE, pch = 16, cex = 2) # plot feature detection data plot(d[, c("f1", "f2", "f3", "f4")], axes = TRUE, pch = 16, cex = 2) # plot feature survey effort plot(d[, c("n1", "n2", "n3", "n4")], axes = TRUE, pch = 16, cex = 2) # plot feature prior probability data plot(d[, c("p1", "p2", "p3", "p4")], axes = TRUE, pch = 16, cex = 2)
Decision support tool for prioritizing sites for ecological surveys based on their potential to improve plans for conserving biodiversity (e.g. plans for establishing protected areas). Given a set of sites that could potentially be acquired for conservation management – wherein some sites have previously been surveyed and other sites have not – it can be used to generate and evaluate plans for additional surveys. Specifically, plans for ecological surveys can be generated using various conventional approaches (e.g. maximizing expected species richness, geographic coverage, diversity of sampled environmental conditions) and by maximizing value of information. After generating plans for surveys, they can also be evaluated using value of information analysis.
Please note that several functions depend on the 'Gurobi' optimization software (available from https://www.gurobi.com) and the gurobi R package (installation instructions available online for Linux, Windows, and Mac OS). Additionally, the JAGS software (available from https://mcmc-jags.sourceforge.io/) is required to fit hierarchical generalized linear models.
Package authors:
Jeffrey O. Hanson [email protected]
Iadine Chadès [email protected]
Emma J. Hudgins [email protected]
Joseph R. Bennett [email protected]
The package vignette provides a tutorial
(accessible using the code vignette("surveyvoi")).
Generate a survey scheme by selecting the set of sites with the greatest overall weight value, subject to a maximum budget for the survey scheme.
weighted_survey_scheme(
  site_data,
  cost_column,
  survey_budget,
  weight_column,
  locked_in_column = NULL,
  locked_out_column = NULL,
  solver = "auto",
  verbose = FALSE
)
site_data | sf::sf() object with site data.
cost_column | character name of the column in site_data containing the cost of surveying each site.
survey_budget | numeric maximum budget for the survey scheme.
weight_column | character name of the column in site_data containing the weight (relative value) of surveying each site.
locked_in_column | character name of the column in site_data containing logical (TRUE / FALSE) values indicating which sites should be locked into the survey scheme. Defaults to NULL, meaning no sites are locked in.
locked_out_column | character name of the column in site_data containing logical (TRUE / FALSE) values indicating which sites should be locked out of the survey scheme. Defaults to NULL, meaning no sites are locked out.
solver | character name of the optimization solver to use. Available options include "Rsymphony", "gurobi", and "auto". The "auto" option will use the Gurobi solver if it is available. Defaults to "auto".
verbose | logical indicating if information should be printed while generating the survey scheme. Defaults to FALSE.
|
Let $J$ denote the set of sites (indexed by $j$), and let
$b$ denote the maximum budget available for surveying the sites.
Next, let $c_j$ represent the cost of surveying each site
$j \in J$, and $w_j$ denote the relative value (weight) for
surveying each site $j \in J$. The set of sites with the greatest
overall weight values, subject to a given budget, can then be identified by
solving the following integer programming problem. Here,
$x_j$ is the binary decision variable indicating if each site
$j \in J$ is selected in the survey scheme or not.

$$\text{maximize} \sum_{j \in J} w_j x_j \quad \text{subject to} \quad \sum_{j \in J} c_j x_j \leq b, \quad x_j \in \{0, 1\} \; \forall j \in J$$
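To make this formulation concrete, consider a small worked instance (values chosen for illustration): four sites with unit survey costs ($c_j = 1$), weights $w = (0.01, 10, 8, 1)$, and budget $b = 2$.

$$\text{maximize} \; 0.01 x_1 + 10 x_2 + 8 x_3 + x_4 \quad \text{subject to} \quad x_1 + x_2 + x_3 + x_4 \leq 2, \quad x_j \in \{0, 1\}$$

With unit costs, the budget simply limits the scheme to two sites, so the optimum selects the two largest weights: $x = (0, 1, 1, 0)$, i.e. surveying sites 2 and 3 for a total weight of $18$.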
A matrix of logical (TRUE / FALSE)
values indicating if a site is selected in a scheme or not. Columns
correspond to sites, and rows correspond to different schemes.
This function can use the Rsymphony package and the Gurobi optimization software to generate survey schemes. Although the Rsymphony package is easier to install because it is freely available on the Comprehensive R Archive Network (CRAN), it is strongly recommended to install the Gurobi optimization software and the gurobi R package because they can generate survey schemes much faster. Note that special academic licenses are available at no cost. Installation instructions are available online for Linux, Windows, and Mac OS operating systems.
# set seed for reproducibility
set.seed(123)

# simulate data
x <- sf::st_as_sf(
  tibble::tibble(
    x = rnorm(4), y = rnorm(4),
    w = c(0.01, 10, 8, 1),
    cost = c(1, 1, 1, 1)),
  coords = c("x", "y"))

# plot site locations and color by weight values
plot(x[, "w"], pch = 16, cex = 3)

# generate scheme without any sites locked in
s <- weighted_survey_scheme(
  x, cost_column = "cost", survey_budget = 2, weight_column = "w")

# print solution
print(s)

# plot solution
x$s <- c(s)
plot(x[, "s"], pch = 16, cex = 3)