Imagine you want to predict an environmental variable Y on one site based on the distribution of species of some kind you observed in this site. You will fit a model on data, gathered around the world, associating species distribution and Y.
Now, the problem is that species are not ubiquitous, and the distribution of species, although varying according to Y, also varies according to latitude and longitude. This means that a species A may be strongly linked to Y in England but not at all in Bulgaria for instance. Or a species B could be completely absent in some areas. Geographically weighted models tackle these issues by taking into account the spatial heterogeneity of species distribution.
It relies on two parameters :
The bandwidth determine the number of samples around our sample of interest used to fit the model. It can be of two kind :
Without giving much details, let’s say that lasso is a regression method that seeks to have the best accuracy while discarding variables that are not really informative for the prediction (the linera coefficient for these variables will be shrunk to zero).
It is especially useful in ecology to tackle the non ubiquitous nature of species. With each local regression, the species can have different informative values depending on the area they’re found. Lasso will take care of that.
One key element to fit geographically weighted models is the bandwidth. There is no rule of thumb, as it depends on the distribution of the samples used to fit the model on one hand, and the biogeographical properties of the investigated species. Nevertheless, if the samples used to fit the model are somewhat geographically evenly distributed, a fixed bandwidth will do the job.
To select the bandwidth you need :
The function gwl_bw_estimation run a bruteforce algorithm to select the bandwidth that produces the smallest RMSE.
In the following example :
# compute the distance matrix
distance_matrix <- compute_distance_matrix(Amesbury$coords, add.noise = TRUE)
# run the bw selection algorithm
bw_choice <- gwl_bw_estimation(x.var = Amesbury$spe.df,
y.var = Amesbury$WTD,
dist.mat = distance_matrix,
adaptive = TRUE,
adptbwd.thresh = 0.1,
kernel = "bisquare",
alpha = 1,
progress = TRUE,
n = 100)
This command is extremely long to run. For a dataset of 1100 samples and 45 species it takes approximately 24 hours.The time can be reduced by playing with the n parameter, that set the number of bandwidths to test. Parallelizing the code is a work in progress.
Once you have chosen a bandwidth, you can pursue. If you do not wich to run gwl_bw_estimation(), you can eyeball a bandwith, relying on your ecological expertise.
Let’s say that you decide to go for an adaptive bandwidth of 120. It means that for each sample, a model will be computed with the 120 closest sample.
# compute the distance matrix
distance_matrix <- compute_distance_matrix(Amesbury$coords, add.noise = TRUE)
my.gwl.fit <- gwl_fit(bw= 120,
x.var = Amesbury$spe.df,
y.var = Amesbury$WTD,
dist.mat = distance_matrix,
adaptive = TRUE,
kernel = "bisquare",
alpha = 1,
progress = TRUE)
plot(my.gwl.fit)
The graph rendered by this chunk displayed shows you whether coefficient was set to zero or not for each species, fo each model (remember that in a GWL there is one local modal per sample). The upper panel show you the mean of the y.var values of the samples used to fit the local model. In ecology, it may help to grasp a sense of the indicative value of a taxa for a given ecological variable.