clim4health verification

Overview

This vignette provides an introduction to the c4h_verify() function, used for forecast verification. Forecast verification is an important step in evaluating the performance of climate forecasts, especially relevant in the development of climate-related early warning systems, since a forecast with poor skill may not provide reliable information to users. Verification involves comparing forecasted values (from a hindcast) to observed values (e.g. from reanalysis) to assess the performance of the prediction system at a given location, time, and leadtime. It is often necessary to downscale the hindcast to the same spatial resolution as the observations before performing the verification, which can be done using c4h_downscale().

Within this vignette, the different verification metrics are described, and examples of how to use the function are provided.

0. Load the package and sample data

First, let’s load the package.

library(clim4health)

Let’s load some sample data that we can use to demonstrate the verification methods.

hcst_path <- system.file("extdata/hindcast/", package = "clim4health")
hcst_path <- paste0(hcst_path, "/")
hcst <- c4h_load(hcst_path,
                 variable = "t2m",
                 year = 2010:2012,
                 month = 1,
                 leadtime_month = "all",
                 ext = "nc")

rean_path <- system.file("extdata/reanalysis/", package = "clim4health")
rean_path <- paste0(rean_path, "/")
rean <- c4h_load(rean_path,
                 variable = "t2m",
                 year = 2010:2012,
                 month = 1,
                 leadtime_month = 1:3,
                 ext = "nc")

Look at the dimensions of the observed and hindcast datasets (recall we can use dim(rean$data) or rean$dims). We can see that rean has more latitude and longitude points, despite covering the same area - it has higher spatial resolution.

First, we need to downscale the hindcast to the same spatial resolution as the observations. In the same step, let’s convert to degrees Celsius.

hcst <- c4h_convert_units(hcst, to = "celsius")
rean <- c4h_convert_units(rean, to = "celsius")

dwn <- c4h_downscale("Intbc", exp = hcst, obs = rean,
                     method_remap = "bilinear", method_bc = "evmos")

💡 Note: Instead of downscaling, you could also spatially aggregate the hindcast and reanalysis data using c4h_space() to obtain the same spatial resolution, before calculating any skill metrics.

Let’s explore the functionality of c4h_verify() and the different verification metrics that can be calculated, before applying them in examples.

1. Key c4h_verify arguments

c4h_verify() takes many input parameters depending on the skill metrics to be calculated. The following are the key input arguments:

exp: an s2dv_cube object containing the experimental data.
obs: an s2dv_cube object containing the observational data.
metrics: a character vector identifying the verification metrics to be calculated.

Depending on the metrics chosen, additional optional arguments may be selected, which will be discussed below.

2. Verification metrics

There are many widely-used metrics to assess the quality of a forecast product. These typically compare the hindcasts and observations to see when the hindcast makes reasonable predictions of the observed values. The following are optional metrics that may be calculated within c4h_verify() and are selected using the metrics parameter.

The metrics are calculated by comparing forecast ensemble member predictions to observations.

`"BSS"` - Brier Skill Score

This is a skill score for binary or categorical outcomes (for example, does a forecast successfully predict above normal temperatures?). The Brier Score (BS) measures the mean squared error between forecast probabilities and actual outcomes, and the associated skill score (BSS) compares the forecast’s BS to a reference forecast BS (often the climatology).

In the example of predicting above-median temperatures in a given month:

The median temperature for a given month is the 50th percentile temperature across all the years of the observations. Temperatures are classified as “above-median” if they exceed this average.
The climatological reference forecast assigns a 50% probability to above-median temperatures every time (since by definition half of all observed values lie above the median). The BSS measures how much better the ensemble forecast is compared to always predicting this 50% probability.
Comparing the Brier Scores of the forecast to the climatological reference:
- BSS_forecast = 1: perfect forecast - all ensemble members predict above-median temperatures on every occasion when the event (observed temperature above the median) occurs, and no ensemble members predict it when it does not occur
- BSS_forecast = 0: no added skill over climatology — the forecast performs no better than always predicting a 50% probability
- BSS_forecast < 0: the forecast performs worse than climatology and provides no useful additional information. For example, for a given observed “above-median” month, perhaps only 10% of ensemble members predict above-median temperatures.

Two commonly used thresholds for BSS are BSS10 and BSS90: these define the categories as the 10th and 90th percentiles respectively. They are useful to assess whether the forecast correctly predicts extremes.

For example, BSS90 assesses whether the forecast can predict when temperatures exceed the 90th percentile threshold (the 90th percentile is the value that 90% of the observed temperatures fall below). In this case:

The 90th percentile temperature for a given month is calculated by considering the observed temperature distribution for that month across all years of the observations. The climatological reference forecast assigns a 10% probability to above-90th-percentile temperatures every time (since by definition only 10% of observed values exceed this threshold). The BSS90 measures how much better the ensemble forecast is compared to always predicting this 10% probability.
Comparing the Brier Scores of the forecast to the climatological reference:
- BSS_forecast = 1: perfect forecast — all ensemble members predict above-90th-percentile temperatures on every occasion when the event occurs, and no ensemble members predict it when it does not
- BSS_forecast = 0.5: added skill over climatology - the forecast performs better than always predicting a 10% probability
- BSS_forecast = 0: no added skill over climatology — the forecast performs no better than always predicting a 10% probability
- BSS_forecast < 0: the forecast performs worse than climatology and provides no useful additional information. For example, for a given observed above-90th-percentile month, perhaps only 2% of ensemble members predict above-90th-percentile temperatures.

`"RPSS"` - Ranked Probability Skill Score

This skill score is based on the Ranked Probability Score (RPS), which extends the Brier Score (and associated skill) to multiple ordered categories. For example, you may wish to assess a forecast’s ability to predict below-normal, near-normal and above-normal temperatures, where above-normal corresponds to temperatures over the 66th percentile, and below-normal to those below the 33rd.

The climatology by definition correctly categorises temperatures as “above normal” 33% of the time, “near normal” 33% of the time, and “below normal” 33% of the time. The RPSS measures how much better the ensemble forecast is compared to always predicting this uniform 33%/33%/33% distribution across categories.
Comparing the Ranked Probability Scores of the forecast to the climatological reference:
- RPSS_forecast = 1: perfect forecast — the ensemble assigns 100% probability to the correct temperature category every time (every ensemble member predicts the correct observed category on every occasion)
- RPSS_forecast = 0: no added skill over climatology — the forecast performs no better than always predicting an equal probability across all three categories
- RPSS_forecast < 0: the forecast performs worse than climatology and provides no useful additional information. For example, the ensemble may systematically assign high probability to the wrong category.

`"CRPSS"` - Continuous Ranked Probability Skill Score

The Continuous Ranked Probability Score (CRPS) measures the integrated squared difference between the forecast’s cumulative probability distribution and the observed value, represented as a step function. It differs from the BS and RPS in that it considers the full continuous probability distribution rather than discrete categories. The Continuous Ranked Probability Skill Score (CRPSS) compares the CRPS of a forecast to that of a reference (often climatology).

CRPSS values:
- CRPSS_forecast = 1: perfect forecast — the forecast probability distribution perfectly matches the observed value every time (i.e. every ensemble member predicts the observed value on every occasion - a highly unlikely scenario!)
- CRPSS_forecast = 0: no added skill over climatology — the forecast performs no better than the climatological probability distribution
- CRPSS_forecast < 0: the forecast performs worse than climatology and provides no useful additional information.

`"AbsBiasSS"` - Absolute Bias Skill Score

Absolute bias (AbsBias) measures the absolute difference between the mean forecast value (over ensemble members) and the mean observed value (the climatological mean). Its associated skill score (AbsBiasSS) compares the forecast AbsBias with that of a climatological reference.

`"MSE"` - Mean Squared Error

The average of the squared differences between forecast ensemble mean and observed values. The lower the value, the smaller the errors. Large errors contribute more to the MSE because of the squared transformation.

`"MSSS"` - Mean Squared Error Skill Score

The skill score associated with the MSE.

As before, MSSS values for a forecast when compared to a reference (climatology):
- MSSS_forecast = 1: a perfect forecast
- MSSS_forecast = 0: the forecast performs as well as the climatology with no additional skill
- MSSS_forecast < 0: the forecast performs worse than the climatology.

`"RMSE"` - Root Mean Squared Error

The square root of the MSE, returned in the same units as the original forecast and observation values. The lower the value, the smaller the errors. Large errors are still weighted more heavily due to the squaring step, but the square root makes the result more interpretable than the MSE.

`"RMSSS"` - Root Mean Square Skill Score

The skill score associated with the Root Mean Squared Error (RMSE), which is the square root of the MSE.

`"ROCSS"` - Relative Operating Characteristic Skill Score

The Relative Operating Characteristic Skill Score (ROCSS) assesses the discriminatory ability of a forecast. It is based on the ROC curve, which plots the Hit Rate against the False Alarm Rate across multiple probability thresholds:

Hit Rate: the forecast predicts the event, and it occurs
False Alarm Rate: the forecast predicts the event, but it does not occur

For a binary event such as “temperature above mean” and a probabilistic forecast, a probability threshold is applied to convert the forecast into a binary prediction (e.g. “predict the event if more than 50% of ensemble members predict it”). By varying this threshold, a Hit Rate and False Alarm Rate are calculated at each value, and the ROC curve is plotted comparing these (example below).

💡 Note: In the context of statistical modelling (for example, of disease cases), the Hit Rate is the same as the sensitivity, and the False Alarm Rate is the same as 1 - specificity.

A “good” forecast has a high hit rate and low false alarm rate, so lies above the diagonal
A random forecast performs no better than chance, producing equal hit and false alarm rates at every threshold, and so lies on the diagonal
A “bad” forecast has a higher false alarm rate than hit rate, so lies below the diagonal.

To calculate the ROCSS, we calculate the area under the curve (AUC), which ranges from 0 to 1 where a higher value indicates better discrimination. The ROCSS = 2*AUC - 1, giving:

ROCSS = 1: perfect discrimination — at some probability threshold, the forecast achieves a hit rate of 100% and a false alarm rate of 0%
ROCSS = 0: no discrimination — the forecast performs no better than chance
ROCSS < 0: worse than random — the forecast has a higher false alarm rate than hit rate, and would perform better if predictions were reversed.

Statistical Significance

For skill scores where significance testing is applicable (BSS, RPSS, CRPSS, MSSS, RMSSS, ROCSS), a p-value is calculated to assess whether the forecast skill is significantly different from zero (i.e. significantly better or worse than climatology). A significance threshold of α = 0.05 is applied by default, meaning that a skill score is considered statistically significant if there is less than a 5% probability that the observed skill arose by chance.

A skill score marked as significant (p < 0.05) suggests that the forecast has genuine skill beyond climatology, and the result is unlikely to be due to random variation
A skill score that is not significant (p ≥ 0.05) does not necessarily mean the forecast has no skill — it may reflect insufficient sample size (e.g. a short hindcast period) rather than a truly unskillful forecast.

💡 Note: Significance testing is not typically applied to raw error metrics (MSE, RMSE, AbsBias) since these are not expressed relative to a reference and do not have a natural null hypothesis of zero.

3. Example skill assessment

Using the data loaded and downscaled above, let’s perform an example skill assessment. In this example, we have only loaded 3 years of data for each leadtime, which means that we do not have a lot of information to assess whether the hindcast is skillful or not. However, the general process is (for gridded data):

skill <- c4h_verify(exp = dwn$exp, 
                    obs = dwn$obs,
                    metrics = c("BSS", "CRPSS"),
                    brier_thresholds = c(0.1, 0.9))

The output of c4h_verify is a list of lists. The outer list contains the names of the metrics, and the inner contains the value of the metric and its significance, if applicable. In the above, we calculated the Brier Skill Score (BSS) at the 10th and 90th percentiles and Continuous Ranked Probability Skill Score (CRPSS), so we can find them in the list as:

names(skill)

Within each metric, we save the value and the significance (by default alpha = 0.05). Looking at the dimensions of the outputs, note that the sdate dimension will now always have length 1, because the skill is calculated by assessing statistical relationships across the start dates.

names(skill$BSS10)

print(skill$BSS10$bss$dims)

💡 Note: Because we have loaded so little data, the significance of all our metrics is FALSE. If we were to load more data, we would see that our hindcast has significant skill in some locations or for some dates.

Each value and significance is an s2dv_cube object. They can be plotted using c4h_plotskill().

c4h_plotskill(skill$BSS10$bss, skill$BSS10$sign, ensemble = TRUE, boundaries = "countries")

Choosing arguments in c4h_verify

There are many optional arguments in c4h_verify() that the user can specify to tailor their skill assessment. Currently, we recommend using the default values for most of these, but here we provide some guidance on how to choose them.

💡 Note: Currently, c4h_verify() applies arguments across all metrics (except for a minority of cases, explained below). This is why it is recommended to use the default values. In future, it will be possible to specify arguments for specific metrics.

`ref`

This argument should include an optional s2dv_cube containing a reference forecast to be used in the skill score calculations. If not provided, the climatological forecast is used as the reference forecast. In this case, the skill scores are then calculated by comparing the forecast provided in exp to the climatology of a forecast provided in ref - you would thus expect the skill scores to be 0 everywhere if using the same forecast in exp and ref.

`brier_thresholds`

This argument is only relevant if “BSS” is included in metrics. It is a numeric vector of thresholds to be applied to calculate different Brier Skill Scores. The default is c(0.1, 0.9), which corresponds to the BSS10 and BSS90 respectively. You could specify other thresholds, for example c(0.33, 0.66) to calculate the BSS33 and BSS66, which would correspond to the 33rd and 66th percentiles respectively. In this case, you would then be assessing the forecast’s ability to predict below-normal and above-normal values respectively.

`prob_thresholds`

prob_thresholds acts similarly to brier_thresholds, but it is used when “RPSS” or “ROCSS” is in metrics. The default is c(1/3, 2/3), which corresponds to assessing the ability of the forecast to predict below-normal, normal, and above-normal values simultaneously. You could specify other thresholds, such as c(0.2, 0.4, 0.6, 0.8) to assess the forecast’s ability to predict values in the 5 categories defined by the quintiles.

`sig_method` and `sig_method.type`

sig_method indicates the method used to calculate significance. The default is different for different metrics. For “BSS”, “RPSS”, “CRPSS”, and “AbsBiasSS”, the default is "RandomWalk". For “MSSS” and “RMSSS”, the default is “one-sided Fisher”. "Diebold-Mariano" is also possible for “RPSS” and “BSS”.

sig_method.type indicates the type of test to be applied if using a Random Walk sig_method. The default is "two.sided.approx", which assesses whether exp and ref are significantly different in skill with a two-sided test using an approximation. Other options include "two.sided", which uses an exact two-sided test, and "greater" or "less". "greater" assesses whether exp shows significantly better skill than ref with a one-sided test for negatively-oriented scores (scores where lower values are better), while "less" assesses whether exp shows significantly better skill than ref with a one-sided test for positively-oriented scores (scores where higher values are better). Examples of negatively-oriented scores are “RPS” and “RMSE”, and examples of positively-oriented scores are “ROC”. Thus, if you wish to assess skill using a one-sided test, you could set sig_method.type = "greater" for calculating “RPSS” and “RMSSS”, and sig_method.type = "less" for calculating “ROCSS”. More details can be found in the documentation of s2dv::RandomWalkTest().

`alpha`

This is the significance level to be applied in the significance testing of the skill scores. The default is 0.05. This value is applied across all metrics, but in some specific cases, the value will be incompatible with a chosen significance test, and the default will be applied. In such cases, a warning is issued!

`ncores`

This is the number of cores to be used in the parallelisation of the skill score calculations. The default is 1, so no parallelisation is applied. If you have a large dataset and want to speed up the calculations, you can increase this value. An error will be returned if you specify a value higher than the number of cores available on your machine.

`N.eff`

This argument specifies the effective sample size to use in the significance testing for metrics “BSS”, “RPSS”, “CRPSS”, or “RMSSS”. The default is NA, which uses the effective number of independent observations calculated from the data using s2dv::Eno().

`indices_for_clim`

This is a vector of the indices to be taken along the sdate dimension for computing the thresholds between probabilistic categories. If NULL (default), the whole time period is used. If you wanted to specify the probability thresholds based only on the first 10 years of the data, you could set indices_for_clim = 1:10. This argument is only relevant if “BSS”, “RPSS”, “CRPSS”, or “ROCSS” is in metrics.

`cross.val` and `clim.cross.val`

These arguments are logical values (TRUE/FALSE) that indicate whether to apply cross-validation (i.e. excluding values) in the calculation of the probability categories. It is recommended to use their default values (FALSE, TRUE, respectively).

`weights_exp` and `weights_ref`

This is an array to use to weight the forecast/reference ensemble members in the probability category calculations. The default is NULL, which means that no weighting is applied. This is recommended.

`comp_dim` and `limits`

comp_dim refers to the dimension along which to only take into account complete data. That is to say, data is removed along comp_dim if there is at least one NA between limits, where the argument limits specifies the range of indices along comp_dim to be considered. It is recommended to use the default values for these arguments, which means that no data is removed.

`conf`

If “MSE” or “RMSE” is in metrics, conf specifies whether or not to return the confidence intervals. By default, they are returned (conf = TRUE).

`na.rm`

By default, this is FALSE, meaning that NA values are not removed in the calculations of “BSS”, “RPSS”, or “AbsBiasSS”. If any NA value is present, then the function will return NA there.

`sign` and `pval`

For the metrics “MSSS” and “RMSSS”, sign and pval specify whether or not to compute the statistical significance and p-values of the test Ho: (R)MSSS = 0. By default, sign is TRUE and pval is FALSE. The default is that the significance is returned (TRUE/FALSE), but the p-values used to calculate this are not).

`rocss_cat`

This argument should be an integer indicating the category to be returned when calculating “ROCSS”. The default is 1, which corresponds to the first category defined by prob_thresholds. With the default prob_thresholds = c(1/3, 2/3), this corresponds to the first category (i.e. below normal). If you wanted to calculate the ROCSS for the second category (i.e. normal), you would set rocss_cat = 2. It is currently only possible to return one category at a time.

clim4health verification

Overview

0. Load the package and sample data

1. Key c4h_verify arguments

2. Verification metrics

"BSS" - Brier Skill Score

"RPSS" - Ranked Probability Skill Score

"CRPSS" - Continuous Ranked Probability Skill Score

"AbsBiasSS" - Absolute Bias Skill Score

"MSE" - Mean Squared Error

"MSSS" - Mean Squared Error Skill Score

"RMSE" - Root Mean Squared Error

"RMSSS" - Root Mean Square Skill Score

"ROCSS" - Relative Operating Characteristic Skill Score

Statistical Significance

3. Example skill assessment

Choosing arguments in c4h_verify

ref

brier_thresholds

prob_thresholds

sig_method and sig_method.type

alpha

ncores

N.eff

indices_for_clim

cross.val and clim.cross.val

weights_exp and weights_ref

comp_dim and limits

conf

na.rm

sign and pval

rocss_cat