This vignette provides an introduction to the
c4h_verify() function, used for forecast verification.
Forecast verification is an important step in evaluating the performance
of climate forecasts, especially relevant in the development of
climate-related early warning systems, since a forecast with poor skill
may not provide reliable information to users. Verification involves
comparing forecasted values (from a hindcast) to observed values
(e.g. from reanalysis) to assess the performance of the prediction
system at a given location, time, and leadtime. It is often necessary to
downscale the hindcast to the same spatial resolution as the
observations before performing the verification, which can be done using
c4h_downscale().
Within this vignette, the different verification metrics are described, and examples of how to use the function are provided.
First, let’s load the package.
Let’s load some sample data that we can use to demonstrate the verification methods.
hcst_path <- system.file("extdata/hindcast/", package = "clim4health")
hcst_path <- paste0(hcst_path, "/")
hcst <- c4h_load(hcst_path,
variable = "t2m",
year = 2010:2012,
month = 1,
leadtime_month = "all",
ext = "nc")
rean_path <- system.file("extdata/reanalysis/", package = "clim4health")
rean_path <- paste0(rean_path, "/")
rean <- c4h_load(rean_path,
variable = "t2m",
year = 2010:2012,
month = 1,
leadtime_month = 1:3,
ext = "nc")Look at the dimensions of the observed and hindcast datasets (recall
we can use dim(rean$data) or rean$dims). We
can see that rean has more latitude and longitude points,
despite covering the same area - it has higher spatial
resolution.
First, we need to downscale the hindcast to the same spatial resolution as the observations. In the same step, let’s convert to degrees Celsius.
hcst <- c4h_convert_units(hcst, to = "celsius")
rean <- c4h_convert_units(rean, to = "celsius")
dwn <- c4h_downscale("Intbc", exp = hcst, obs = rean,
method_remap = "bilinear", method_bc = "evmos")💡 Note: Instead of downscaling, you could also spatially aggregate the hindcast and reanalysis data using
c4h_space()to obtain the same spatial resolution, before calculating any skill metrics.
Let’s explore the functionality of c4h_verify() and the
different verification metrics that can be calculated, before applying
them in examples.
c4h_verify() takes many input parameters depending on
the skill metrics to be calculated. The following are the key input
arguments:
exp: an s2dv_cube object containing the
experimental data.obs: an s2dv_cube object containing the
observational data.metrics: a character vector identifying the
verification metrics to be calculated.Depending on the metrics chosen, additional optional arguments may be selected, which will be discussed below.
There are many widely-used metrics to assess the quality of a
forecast product. These typically compare the hindcasts and observations
to see when the hindcast makes reasonable predictions of the observed
values. The following are optional metrics that may be calculated within
c4h_verify() and are selected using the
metrics parameter.
The metrics are calculated by comparing forecast ensemble member predictions to observations.
"BSS" - Brier Skill ScoreThis is a skill score for binary or categorical outcomes (for example, does a forecast successfully predict above normal temperatures?). The Brier Score (BS) measures the mean squared error between forecast probabilities and actual outcomes, and the associated skill score (BSS) compares the forecast’s BS to a reference forecast BS (often the climatology).
In the example of predicting above-median temperatures in a given month:
Two commonly used thresholds for BSS are BSS10 and BSS90: these define the categories as the 10th and 90th percentiles respectively. They are useful to assess whether the forecast correctly predicts extremes.
For example, BSS90 assesses whether the forecast can predict when temperatures exceed the 90th percentile threshold (the 90th percentile is the value that 90% of the observed temperatures fall below). In this case:
"RPSS" - Ranked Probability Skill ScoreThis skill score is based on the Ranked Probability Score (RPS), which extends the Brier Score (and associated skill) to multiple ordered categories. For example, you may wish to assess a forecast’s ability to predict below-normal, near-normal and above-normal temperatures, where above-normal corresponds to temperatures over the 66th percentile, and below-normal to those below the 33rd.
"CRPSS" - Continuous Ranked Probability Skill
ScoreThe Continuous Ranked Probability Score (CRPS) measures the integrated squared difference between the forecast’s cumulative probability distribution and the observed value, represented as a step function. It differs from the BS and RPS in that it considers the full continuous probability distribution rather than discrete categories. The Continuous Ranked Probability Skill Score (CRPSS) compares the CRPS of a forecast to that of a reference (often climatology).
"AbsBiasSS" - Absolute Bias Skill ScoreAbsolute bias (AbsBias) measures the absolute difference between the mean forecast value (over ensemble members) and the mean observed value (the climatological mean). Its associated skill score (AbsBiasSS) compares the forecast AbsBias with that of a climatological reference.
"MSE" - Mean Squared ErrorThe average of the squared differences between forecast ensemble mean and observed values. The lower the value, the smaller the errors. Large errors contribute more to the MSE because of the squared transformation.
"MSSS" - Mean Squared Error Skill ScoreThe skill score associated with the MSE.
"RMSE" - Root Mean Squared ErrorThe square root of the MSE, returned in the same units as the original forecast and observation values. The lower the value, the smaller the errors. Large errors are still weighted more heavily due to the squaring step, but the square root makes the result more interpretable than the MSE.
"RMSSS" - Root Mean Square Skill ScoreThe skill score associated with the Root Mean Squared Error (RMSE), which is the square root of the MSE.
"ROCSS" - Relative Operating Characteristic Skill
ScoreThe Relative Operating Characteristic Skill Score (ROCSS) assesses the discriminatory ability of a forecast. It is based on the ROC curve, which plots the Hit Rate against the False Alarm Rate across multiple probability thresholds:
For a binary event such as “temperature above mean” and a probabilistic forecast, a probability threshold is applied to convert the forecast into a binary prediction (e.g. “predict the event if more than 50% of ensemble members predict it”). By varying this threshold, a Hit Rate and False Alarm Rate are calculated at each value, and the ROC curve is plotted comparing these (example below).
💡 Note: In the context of statistical modelling (for example, of disease cases), the Hit Rate is the same as the sensitivity, and the False Alarm Rate is the same as
1 - specificity.
A “good” forecast has a high hit rate and low false alarm rate, so lies above the diagonal
A random forecast performs no better than chance, producing equal hit and false alarm rates at every threshold, and so lies on the diagonal
A “bad” forecast has a higher false alarm rate than hit rate, so lies below the diagonal.
To calculate the ROCSS, we calculate the area under the curve (AUC), which ranges from 0 to 1 where a higher value indicates better discrimination. The ROCSS = 2*AUC - 1, giving:
ROCSS = 1: perfect discrimination — at some probability threshold, the forecast achieves a hit rate of 100% and a false alarm rate of 0%
ROCSS = 0: no discrimination — the forecast performs no better than chance
ROCSS < 0: worse than random — the forecast has a higher false alarm rate than hit rate, and would perform better if predictions were reversed.
For skill scores where significance testing is applicable (BSS, RPSS, CRPSS, MSSS, RMSSS, ROCSS), a p-value is calculated to assess whether the forecast skill is significantly different from zero (i.e. significantly better or worse than climatology). A significance threshold of α = 0.05 is applied by default, meaning that a skill score is considered statistically significant if there is less than a 5% probability that the observed skill arose by chance.
💡 Note: Significance testing is not typically applied to raw error metrics (MSE, RMSE, AbsBias) since these are not expressed relative to a reference and do not have a natural null hypothesis of zero.
Using the data loaded and downscaled above, let’s perform an example skill assessment. In this example, we have only loaded 3 years of data for each leadtime, which means that we do not have a lot of information to assess whether the hindcast is skillful or not. However, the general process is (for gridded data):
skill <- c4h_verify(exp = dwn$exp,
obs = dwn$obs,
metrics = c("BSS", "CRPSS"),
brier_thresholds = c(0.1, 0.9))The output of c4h_verify is a list of lists. The outer
list contains the names of the metrics, and the inner contains the value
of the metric and its significance, if applicable. In the above, we
calculated the Brier Skill Score (BSS) at the 10th and 90th percentiles
and Continuous Ranked Probability Skill Score (CRPSS), so we can find
them in the list as:
Within each metric, we save the value and the significance (by
default alpha = 0.05). Looking at the dimensions of the outputs, note
that the sdate dimension will now always have length 1,
because the skill is calculated by assessing statistical relationships
across the start dates.
💡 Note: Because we have loaded so little data, the significance of all our metrics is FALSE. If we were to load more data, we would see that our hindcast has significant skill in some locations or for some dates.
Each value and significance is an s2dv_cube object. They
can be plotted using c4h_plotskill().
There are many optional arguments in c4h_verify() that
the user can specify to tailor their skill assessment. Currently, we
recommend using the default values for most of these, but here we
provide some guidance on how to choose them.
💡 Note: Currently,
c4h_verify()applies arguments across all metrics (except for a minority of cases, explained below). This is why it is recommended to use the default values. In future, it will be possible to specify arguments for specific metrics.
refThis argument should include an optional s2dv_cube containing a
reference forecast to be used in the skill score calculations. If not
provided, the climatological forecast is used as the reference forecast.
In this case, the skill scores are then calculated by comparing the
forecast provided in exp to the climatology of a forecast
provided in ref - you would thus expect the skill scores to
be 0 everywhere if using the same forecast in exp and
ref.
brier_thresholdsThis argument is only relevant if “BSS” is included in
metrics. It is a numeric vector of thresholds to be applied
to calculate different Brier Skill Scores. The default is
c(0.1, 0.9), which corresponds to the BSS10 and BSS90
respectively. You could specify other thresholds, for example
c(0.33, 0.66) to calculate the BSS33 and BSS66, which would
correspond to the 33rd and 66th percentiles respectively. In this case,
you would then be assessing the forecast’s ability to predict
below-normal and above-normal values respectively.
prob_thresholdsprob_thresholds acts similarly to
brier_thresholds, but it is used when “RPSS” or “ROCSS” is
in metrics. The default is c(1/3, 2/3), which
corresponds to assessing the ability of the forecast to predict
below-normal, normal, and above-normal values simultaneously. You could
specify other thresholds, such as c(0.2, 0.4, 0.6, 0.8) to
assess the forecast’s ability to predict values in the 5 categories
defined by the quintiles.
sig_method and sig_method.typesig_method indicates the method used to calculate
significance. The default is different for different metrics. For “BSS”,
“RPSS”, “CRPSS”, and “AbsBiasSS”, the default is
"RandomWalk". For “MSSS” and “RMSSS”, the default is
“one-sided Fisher”. "Diebold-Mariano" is also possible for
“RPSS” and “BSS”.
sig_method.type indicates the type of test to be applied
if using a Random Walk sig_method. The default is
"two.sided.approx", which assesses whether exp
and ref are significantly different in skill with a
two-sided test using an approximation. Other options include
"two.sided", which uses an exact two-sided test, and
"greater" or "less". "greater"
assesses whether exp shows significantly better skill than
ref with a one-sided test for negatively-oriented scores
(scores where lower values are better), while "less"
assesses whether exp shows significantly better skill than
ref with a one-sided test for positively-oriented scores
(scores where higher values are better). Examples of negatively-oriented
scores are “RPS” and “RMSE”, and examples of positively-oriented scores
are “ROC”. Thus, if you wish to assess skill using a one-sided test, you
could set sig_method.type = "greater" for calculating
“RPSS” and “RMSSS”, and sig_method.type = "less" for
calculating “ROCSS”. More details can be found in the documentation of
s2dv::RandomWalkTest().
alphaThis is the significance level to be applied in the significance testing of the skill scores. The default is 0.05. This value is applied across all metrics, but in some specific cases, the value will be incompatible with a chosen significance test, and the default will be applied. In such cases, a warning is issued!
ncoresThis is the number of cores to be used in the parallelisation of the skill score calculations. The default is 1, so no parallelisation is applied. If you have a large dataset and want to speed up the calculations, you can increase this value. An error will be returned if you specify a value higher than the number of cores available on your machine.
N.effThis argument specifies the effective sample size to use in the significance testing for metrics “BSS”, “RPSS”, “CRPSS”, or “RMSSS”. The default is NA, which uses the effective number of independent observations calculated from the data using s2dv::Eno().
indices_for_climThis is a vector of the indices to be taken along the
sdate dimension for computing the thresholds between
probabilistic categories. If NULL (default), the whole time period is
used. If you wanted to specify the probability thresholds based only on
the first 10 years of the data, you could set
indices_for_clim = 1:10. This argument is only relevant if
“BSS”, “RPSS”, “CRPSS”, or “ROCSS” is in metrics.
cross.val and clim.cross.valThese arguments are logical values (TRUE/FALSE) that indicate whether to apply cross-validation (i.e. excluding values) in the calculation of the probability categories. It is recommended to use their default values (FALSE, TRUE, respectively).
weights_exp and weights_refThis is an array to use to weight the forecast/reference ensemble members in the probability category calculations. The default is NULL, which means that no weighting is applied. This is recommended.
comp_dim and limitscomp_dim refers to the dimension along which to only
take into account complete data. That is to say, data is removed along
comp_dim if there is at least one NA between
limits, where the argument limits specifies
the range of indices along comp_dim to be considered. It is
recommended to use the default values for these arguments, which means
that no data is removed.
confIf “MSE” or “RMSE” is in metrics, conf
specifies whether or not to return the confidence intervals. By default,
they are returned (conf = TRUE).
na.rmBy default, this is FALSE, meaning that NA values are not removed in the calculations of “BSS”, “RPSS”, or “AbsBiasSS”. If any NA value is present, then the function will return NA there.
sign and pvalFor the metrics “MSSS” and “RMSSS”, sign and
pval specify whether or not to compute the statistical
significance and p-values of the test Ho: (R)MSSS = 0. By default,
sign is TRUE and pval is FALSE. The default is
that the significance is returned (TRUE/FALSE), but the p-values used to
calculate this are not).
rocss_catThis argument should be an integer indicating the category to be
returned when calculating “ROCSS”. The default is 1, which corresponds
to the first category defined by prob_thresholds. With the
default prob_thresholds = c(1/3, 2/3), this corresponds to
the first category (i.e. below normal). If you wanted to calculate the
ROCSS for the second category (i.e. normal), you would set
rocss_cat = 2. It is currently only possible to return one
category at a time.