clim4health is an R package designed to load, transform, process, plot, and save climate data. It is specifically designed for use in downscaling and verification of seasonal forecast data.
The clim4health package is intended for use with other packages developed within the HARMONIZE project, to spatially and temporally harmonize multiple sources of data for use in health impact studies.
This vignette provides an overview of the
clim4health package methodology, the core functions,
and a simple example workflow using real data. It deals with climate
data in multidimensional arrays called s2dv_cubes (see the
clim4health_s2dv_cubes vignette for more details). More
complex use cases and examples are described in other vignettes
(clim4health_downscaling regarding downscaling
methodologies and clim4health_verification regarding the
verification methodologies in clim4health). Terms are
defined in the glossary in the vignette
clim4health_glossary.
You can install clim4health directly from CRAN or download the development version:
# Install from CRAN
install.packages("clim4health")
# Get the development version from Gitlab
devtools::install_git('https://earth.bsc.es/gitlab/ghr/clim4health.git')Once the package is installed, you can load it into your R session
using library(clim4health).
Vignettes loaded in the package can be accessed in R by typing
vignette("vignettename") (e.g.,
vignette("clim4health_overview")).
clim4health can load and transform climate data from the Copernicus Climate Data Store (CDS). Specifically, it can currently load hourly, daily, or monthly reanalysis on single levels (ERA5 and ERA5-Land), as well as seasonal forecast monthly statistics on single levels. clim4health provides the option to download this data directly from the CDS.
Additionally, clim4health is able to load climate
data stored in a csv format (for example, weather station data). The
data should have a column for the date, as well as a column or columns
containing variable values. However, given the variety of formats used
to store climate data in csv files, the user may need to either reformat
the data before using clim4health’s load function, or follow the guide
for formatting data into an s2dv_cube object (see the
clim4health_s2dv_cubes vignette for more information and a
worked example).
Figure 1: Structure of the clim4health package, outlining its functions and general functionality.
This vignette outlines one full example workflow using the clim4health package. The process is organized into three key stages: obtaining input data, transforming and processing the data, and preparing outputs.
To use clim4health functions, climate data must be loaded into an R session in a consistent format called an s2dv_cube. This object can be created by:
downloading data from the CDS using c4h_get(), and
then loading it using c4h_load().
loading data from a local file and transforming it into an
s2dv_cube either manually or using CSTools::s2dv_cube().
The clim4health_s2dv_cubes vignette provides a worked
example of how to do this for weather station data.
To use the rest of the clim4health functions, the data must be in an s2dv_cube format with the following dimensions:
dataset - the datasets that are loaded.
var - the variables that are loaded, e.g. temperature, precipitation, etc.
sdate - the forecast initialisation or start dates.
time - the forecast lead time.
ensemble - the model ensemble members.
one or two spatial dimensions, either:
latitude and longitude for gridded data
area for polygon data, e.g. administrative areas such as states or provinces
location for point data, e.g. weather station data
💡 Tip: the dimension dataset will always have size 1 in clim4health cases. It only exists because it is required for package dependencies in the downscaling and verification functions.
Once the data has been loaded as an s2dv_cube, the user can move on to the transformation and postprocessing steps, which are described in the next section.
There are several functions available to transform and process the data, which can be applied in any order depending on the required output. These include:
For the more complex functions c4h_downscale() and
c4h_verify(), vignettes are available
(clim4health_downscaling and
clim4health_verification) which provide more detailed
information about the methodologies available and relevant choices of
input arguments.
Once the data has been transformed and processed as required, the user can choose to plot the data, or convert it to a different data type, and save it.
In this example, we demonstrate how to use clim4health to load and downscale seasonal forecast data, and plot the data at various stages.
In this example, we will perform the following steps:
Step 1:
Step 2:
Step 3:
📝 NOTE: the order of operations performed here is illustrative only, and users should decide based on their requirements.
First, load the required libraries.
The data used in this example are the monthly mean 2m temperature forecast (issued in January 2025 for 3 months lead time) from ECMWF’s System 5.1 seasonal forecast, its corresponding hindcast, as well as reanalysis data from ERA5Land covering the same hindcast period (2010-2012).
📝 NOTE: in reality, a much longer hindcast period is required to downscale and verify seasonal forecasts, but a short period is used here for demonstration purposes.
# Load the forecast data
fcst_path <- system.file("extdata/forecast/", package = "clim4health")
fcst_path <- paste0(fcst_path, "/")
fcst <- c4h_load(fcst_path,
variable = "t2m",
year = 2025,
month = 1,
leadtime_month = "all",
ext = "nc")
# Print a summary of the values of the forecast data
summary(fcst$data)We can also load the hindcast and reanalysis data, which should be initialized in the same month as the forecast (e.g. January), but for previous years. The hindcast and reanalysis should have the same time period for the downscaling and verifications steps.
# Load the hindcast data
hindcast_path <- system.file("extdata/hindcast/", package = "clim4health")
hindcast_path <- paste0(hindcast_path, "/")
hcst <- c4h_load(hindcast_path,
variable = "t2m",
year = 2010:2012,
month = 1,
leadtime_month = 1:3,
ext = "nc")
reanalysis_path <- system.file("extdata/reanalysis/", package = "clim4health")
reanalysis_path <- paste0(reanalysis_path, "/")
# Load the reanalysis data
rean <- c4h_load(reanalysis_path,
variable = "t2m",
year = 2010:2012,
month = 1,
leadtime_month = 1:3,
ext = "nc")Print the dimensions of the hindcast and reanalysis data, to compare with the forecast.
Follow the clim4health_s2dv_cubes vignette for more
information on the structure of the clim4health s2dv_cube objects, as
well as details about the types of climate data used here.
Now that we have loaded the data, let’s check the units.
We saw above that the temperature data are stored in Kelvin, which is
common for climate model output and reanalyses due to the model physics.
We can convert the units to Celsius, which is more useful in health
impact studies, using the c4h_convert_units() function.
# Convert from Kelvin to Celsius
fcst <- c4h_convert_units(fcst, to = "celsius", var = "t2m", from = "K")
hcst <- c4h_convert_units(hcst, to = "celsius", var = "t2m", from = "K")
rean <- c4h_convert_units(rean, to = "celsius", var = "t2m", from = "K")
# Display the new units
print(c4h_convert_units(fcst, var = "t2m"))📝 NOTE: If you are using monthly mean data from CDS, take care with units, in particular for variables such as precipitation. For example, the precipitation variable to be downloaded from the ERA5-Land monthly mean dataset is called
total_precipitation, and the data is stored in units ofm. However, this does not refer to monthly total precipitation. Instead, it is the monthly mean of the daily total precipitation. In this case, you would also need to perform an additional transformation to obtain the monthly sum precipitation. Unfortunately, clim4health cannot differentiate between the identical units ofm, and so you would have to make this conversion manually (by typingoutput_data$data <- output_data$data * 30.44).
We can use c4h_plot() to plot the forecast, hindcast,
and reanalysis data to visualize the raw data.
Once the data has been loaded and explored, we can use clim4health functions to transform, downscale, and verify the data as required. In this example, we will spatially aggregate the data to municipalities, calibrate the forecast and hindcast data to the reanalysis data, and perform a forecast verification to assess the skill of the forecast model. We will also apply a threshold-based mask to the data to identify only grid cells that exceed a certain temperature threshold.
Typically, climate data is stored on grids of latitude and longitude,
particularly if it comes from a climate model or reanalysis. However,
for health impact studies, we often want to aggregate the data to
administrative areas such as municipalities, states, or provinces, so
that it can be aligned with recorded health data. We can use the
c4h_space() function to spatially aggregate the gridded
data to these administrative areas.
# Load the sf package to read in spatial data
library(sf)
# Region municipalities
munip_path <- system.file("extdata", "areas", "munip_vallecauca.gpkg",
package = "clim4health")
munip <- read_sf(munip_path)
# Spatial aggregation
hcst_aggr <- c4h_space(hcst, munip)
rean_aggr <- c4h_space(rean, munip)The climate datasets hcst_aggr, and
rean_aggr are now aggregated to the same spatial areas,
and, instead of containing latitude and
longitude dimensions, they now contain an area
dimension, which corresponds to the municipalities.
We can plot the aggregated data in the same way as the gridded data,
as c4h_space() stores the corresponding polygon geometries
in cube$attrs$area in the output s2dv_cube
cube, and c4h_plot() can utilize these
geometries for visualization.
c4h_downscale() allows the user to downscale, calibrate,
and bias adjust the forecast and hindcast data to the finer spatial
resolution of the reanalysis data. This is an important step for
providing forecast data at a finer spatial resolution, and may help
correct any systematic biases in the forecast model data.
📝 NOTE: see more information about the downscaling and calibrating capabilities in the
clim4health_downscalingvignette.
To use c4h_downscale(), the user must specify different
parameters depending on the required methodology. The key parameters
are:
downscale_function: the downscaling/calibration
method to use. Options are "Interpolation",
"Intbc", "Intlr", and "Analogs".
See the clim4health_downscaling vignette for more
information about these methods.
obs: the observational data.
exp: the experimental data (generally speaking, the
hindcast data. It should have the same time period as the observational
data, and is used to calculate the bias adjustment parameters).
exp_cor: the experimental data to which the bias
adjustment parameters will be applied (generally speaking, the forecast
data). If it is NULL, the bias adjustment parameters will be applied to
exp instead. This is useful if the user wants to apply the
bias adjustment parameters to a different dataset (e.g. forecast data)
than the one used to calculate the bias adjustment parameters
(e.g. hindcast data).
First, let’s downscale the forecast data.
cal <- c4h_downscale(downscale_function = "Intbc", # downscaling method
obs = rean, # observational data, used to bias adjust the experiment
exp = hcst, # the experimental data to calculate the bias adjustment parameters
exp_cor = fcst, # the experimental data to which the bias adjustment parameters will be applied
method_bc = "bias", # for "Intbc", the bias correction method to use
method_remap = "bilinear") # the method to remap the data to the new grid📝 NOTE: you may see warnings when running
c4h_downscale(). These are completely normal. For example, if you are using ERA5-Land reanalysis near the ocean, you will have many NA values, which may generate warnings.
c4h_downscale() returns a list of two s2dv_cubes:
cal$exp contains the downscaled/calibrated
experimental data (in this case, the forecast data),
cal$obs contains the observational data.
📝 NOTE: in many cases,
cal$obswill be the same asrean_aggr(the inputobs), but there are certain cases where it may differ. See theclim4health_downscalingvignette for more information.
We can also downscale the hindcast data, which is used in the
verification step to assess the skill of the forecast model. In this
case, we leave the exp_cor parameter as NULL.
cal_hcst <- c4h_downscale(downscale_function = "Intbc",
obs = rean,
exp = hcst,
method_bc = "bias",
method_remap = "bilinear",
loocv = FALSE)Plot the downscaled forecast and hindcast data to visualise the results of the downscaling.
When using subseasonal to decadal predictions, we use hindcasts to
assess the skill of the predictions. We can use
c4h_verify() to perform a forecast verification and quality
assessment of the forecast model, by comparing the downscaled hindcast
data to the reanalysis data. This will give us an indication of how
skillful the forecast model is at predicting the observed values.
📝 NOTE: there are many different ways to perform a skill assessment or verification. See the
clim4health_verificationvignette for more information.
c4h_verify() requires the user to specify a selection of
metrics using the parameter metrics, as well the
experimental data (exp) and the observational data
(obs). Further parameters can be specified depending on the
selected metrics. The possible metrics to select are explored in detail
in the clim4health_verification vignette, but for this
example, we will calculate the Continuous Ranked Probability Skill Score
(CRPSS) of the forecast model compared to a reference forecast of
climatology.
skill <- c4h_verify(exp = cal_hcst$exp, # the experimental data (downscaled hindcast data)
obs = cal_hcst$obs, # the observational data (downscaled hindcast data)
metrics = c("CRPSS")) # the desired metrics to calculatec4h_verify() returns a list of lists, where each element
of the outer list corresponds to a metric, and each element of the inner
list contains both the values of the metric, its significance, and in
some cases additional information such as the upper and lower confidence
intervals.
For example, skill contains only one element,
skill$CRPSS, which is a list containing the values of CRPSS
in skill$CRPSS$crpss and the significance of the CRPSS
values in skill$CRPSS$sign. We can plot the skill of the
forecast model using c4h_plotskill(). This differs from
c4h_plot() in that it is specifically designed to plot
skill scores, and can add a layer to the plot to indicate the
significance of the skill values.
We can also explore the significance of the skill values by adding a
layer to the plot using the sign parameter.
📝 NOTE: in this case, there are no significant values! Don’t panic. This is because we only loaded three years of data to compare the hindcast and reanalysis, which is not enough to produce any significant result.
c4h_index() allows the user to apply threshold-based
indices to the data. For example, we can apply a mask to the forecast
data to identify only grid cells that exceed a certain temperature
threshold, which may be relevant for health impact studies (e.g. where
certain vectors are only active above a certain temperature threshold).
The user can specify:
data: the data to which the index will be
applied.
return_mask: whether to return a 0-1 mask of where
the threshold is exceeded, or to return the original data with values
that do not exceed the threshold masked to NA.
lower_threshold: the lower threshold to apply to the
data. Values below this threshold will be masked.
upper_threshold: the upper threshold to apply to the
data. Values above this threshold will be masked.
closed: whether the interval should be open or
closed (i.e. if the threshold limits should be included in the valid
interval or not).
For example, we can apply c4h_index() to the downscaled
forecast data to identify only grid cells where the temperature exceeds
20ºC, and then we can plot this masked dataset.
c4h_time() allows the user to aggregate their climate
data to different temporal resolutions. For example, the station data
included within clim4health contains daily data, but
perhaps the user would like to use weekly data instead, to align with
health data recorded by epiweek.
The user must specify:
data: the data to which the temporal aggregation
will be applied.
time_aggregation: the type of transformation to be
applied, e.g. "weekly".
dim_aggregation: the dimension across which the
temporal aggregation will be applied, e.g. "sdate" or
"time".
fun: the function to apply to the data for the
temporal aggregation, e.g. "mean" or
"max".
week_start: if time_aggregation is
"weekly", the day on which the week starts,
e.g. "Monday".
Let’s load some daily station data to demonstrate this function.
station_path <- system.file("extdata/stations/",
package = "clim4health")
station_data <- c4h_load(station_path, variable = "temp_mean",
ext = "csv")Since we have not specified any desired dates in our
c4h_load() function, all available data from the csv file
has been loaded into the time dimension, and the
sdate dimension has length 1.
We can use c4h_time() to aggregate the data to obtain
weekly mean daily-mean temperatures, by applying the mean function
across the time dimension.
weekly_data <- c4h_time(data = station_data,
time_aggregation = "weekly",
dim_aggregation = "time",
fun = "mean",
week_start = "Monday")If we wanted to obtain a climatology, we could load the data with
years in the sdate dimension, and then apply
c4h_time() to calculate the mean across the
sdate dimension.
station_data <- c4h_load(station_path, variable = "temp_mean",
ext = "csv", year = 2010:2012, month = 1, leadtime_month = 1:12)
print(station_data$dims)station_data now contains the sdate
dimension with length 93, which corresponds to the 93 start dates
specified (year = 3 x month = 1 x day = 31), and time
dimension with length 12, which corresponds to the 12 leadtime months
specified. We can apply c4h_time() to calculate the mean
across the sdate dimension to obtain a climatology (one
value for each month).
Climate data is generally multi-dimensional, and sometimes it can be
useful to collapse the data across one or more dimensions to obtain a
single value. For example, we may want to calculate the ensemble median
or mean across all ensemble members, or the mean value across years
(using the sdate dimension). We can use
c4h_collapse() to do this.
climatology <- c4h_collapse(data = yearly_stat,
dim = "sdate",
fun = "mean")
print(climatology$dims)We now have our climatology data, which has one value for each month.
Once the data has been transformed as required, we can use clim4health functions to convert to different formats, and then save any of these.
Once all the desired processing has been done, you can convert the
data from an s2dv_cube to a different data type using
c4h_convert(). It takes an s2dv_cube as input, and a string
containing the desired output type - SpatRaster objects
("terra"), stars ("stars"), data frames
("data.frame"), or polygon objects ("sf"). You
may also choose to use the flags crs to identify the
coordinate reference system for the output and drop to
remove any singleton dimensions.
Finally, you can save the data using c4h_save(). This
function allows you to save the processed data in a variety of formats,
determined by the requested file extension. Valid file extensions
include "nc", "tif", "csv",
"gpkg", and "shp", depending on whether the
data to be saved is gridded, areal, or point data.
📝 NOTE: We recommend using
c4h_save()at the end of your analysis, when the data is in its final format. If you think you will want to apply further clim4health functions later, we recommend saving an intermediate file usingsaveRDS()and loading it later usingreadRDS().