Title: | Validation of Seasonal Weather Forecasts |
---|---|
Description: | Provides tools for processing and evaluating seasonal weather forecasts, with an emphasis on tercile forecasts. We follow the World Meteorological Organization's "Guidance on Verification of Operational Seasonal Climate Forecasts", S.J.Mason (2018, ISBN: 978-92-63-11220-0, URL: <https://library.wmo.int/idurl/4/56227>). The development was supported by the European Union’s Horizon 2020 research and innovation programme under grant agreement no. 869730 (CONFER). A comprehensive online tutorial is available at <https://seasonalforecastingengine.github.io/SeaValDoc/>. |
Authors: | Claudio Heinrich-Mertsching [aut, cre, cph], Celine Cunen [ctb], Michael Scheuerer [ctb] |
Maintainer: | Claudio Heinrich-Mertsching <[email protected]> |
License: | GPL (>= 3) |
Version: | 1.2.0 |
Built: | 2024-11-12 06:49:07 UTC |
Source: | CRAN |
The climatology is the average over years (and members for ensemble forecasts), taken separately for each month, season, and coordinate. By default, the average is taken over all years in the data table, but you can change this using the years argument. By default, climatologies (averages) are calculated for each column that is not recognized as a dimension variable and does not contain characters.
add_climatology(dt, data_cols = NULL, years = NULL, by = dimvars(dt))
dt |
the data table. |
data_cols |
For which columns do you want to derive the climatology? The default is every column that is not recognized as a dimension variable and does not contain characters. |
years |
The years over which the average is taken as the climatology. The default is all years in dt. |
by |
column names to group by. |
The provided data table with an extra climatology column
dt = add_climatology(chirps_monthly)
This is a synonym for add_country_names, following a more intuitive naming convention that is more in line with add_climatology and add_tercile_cat.
add_country(dt, regions = EA_country_names())
dt |
the data table. |
regions |
Character vector of country names for which shapefiles are loaded. |
The provided data table with an extra column with country names
dt = add_country(chirps_monthly)
Takes a data table with lon/lat coordinates and adds a column 'country' to it, containing the name of the country each coordinate belongs to.
add_country_names(dt, regions = EA_country_names())
dt |
the data table. |
regions |
Character vector of country names for which shapefiles are loaded. By default, countries in East Africa are loaded, see EA_country_names. |
The provided data table with an extra column with country names
dt = add_country_names(chirps_monthly)
Given a data table with multiple years of data, this function derives the tercile category per year. It first derives terciles for the data and then returns, for each row, a -1 if the data falls below the first tercile, 0 if it falls between the first and second tercile, and +1 if it falls above the second tercile. Allows grouping by levels (e.g. months and location-coordinates): Tercile categories are derived separately for each level.
add_tercile_cat( dt, datacol = NULL, years = NULL, by = setdiff(dimvars(dt), c("year", "member")) )
dt |
the data table. |
datacol |
Name of the column where the data is stored. If NULL, the function guesses. |
years |
Optional, if provided only these years are used for establishing climatology terciles. |
by |
names of columns to group by. |
The provided data table with an extra column tercile_cat
dt = add_tercile_cat(chirps_monthly)
Adds columns 'below', 'normal' and 'above', containing predicted tercile probabilities, to a data table with ensemble forecasts.
The predicted probability is always the fraction of members ending up in the respective tercile.
The data table should either already have a column 'tercile_cat' (added by add_tercile_cat), or add_tercile_cat will be run first.
add_tercile_probs(dt, f = NULL, by = setdiff(dimvars(dt), "member"), ...)
dt |
the data table. |
f |
name of the column containing the forecast. |
by |
names of columns to group by |
... |
passed on to add_tercile_cat. |
The provided data table, with added columns 'above', 'normal', and 'below'
dt = add_tercile_probs(ecmwf_monthly)
Auxiliary function used for checking whether spatial grids are regular, while allowing for rounding errors.
are_all_elements_within_eps(x, y, eps)
x |
A numeric vector, sorted in increasing order. |
y |
A numeric vector, sorted in increasing order. |
eps |
The tolerance within which we consider two values to be equal. |
A boolean value, TRUE if all x are in y within tolerance eps, FALSE otherwise.
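A minimal usage sketch (the function may be internal, in which case it can be accessed as SeaVal:::are_all_elements_within_eps):
x = c(1, 2, 3)
y = c(0.9999, 2.0001, 3, 4) # a slightly perturbed superset of x
are_all_elements_within_eps(x, y, eps = 0.001) # TRUE: every element of x has a match in y within eps
are_all_elements_within_eps(x, y, eps = 1e-6) # FALSE: the perturbations exceed the tolerance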
Returns the default column names to group by when calculating scores of ensemble forecasts.
by_cols_ens_fc_score(dt = NULL)
dt |
optional. You can provide a data table, then the function returns the names of grouping variables in this data table. |
A vector of characters with the column names.
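A short sketch of the two call patterns; per the score documentation below, the defaults include month, season, lon, lat, system and lead_time:
by_cols_ens_fc_score() # all default grouping column names
dt = data.table(lon = 1:3, lat = 1:3, year = 2000, fc = rnorm(3), obs = rnorm(3))
by_cols_ens_fc_score(dt) # only the default grouping columns present in dt, here presumably 'lon' and 'lat'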
Returns the default column names to group by when calculating scores for tercile forecasts.
by_cols_terc_fc_score(dt = NULL)
dt |
optional. You can provide a data table, then the function returns the names of grouping variables in this data table. |
A vector of characters with the column names.
Gets column names to group by when calculating scores for tercile forecasts. Some scores for tercile forecasts, such as the ROC score or the SRC (slope of reliability curve), require many data points and should therefore be pooled in space. This auxiliary function returns the default column names to group by for these scores. The suffix _sp stands for spatial pooling.
by_cols_terc_fc_score_sp(dt = NULL)
dt |
optional. You can provide a data table, then the function returns the names of grouping variables in this data table. |
A vector of characters with the column names.
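The difference to the unpooled default can be inspected directly; a brief sketch (the spatially pooled default presumably omits the coordinates lon and lat):
by_cols_terc_fc_score() # default grouping columns for tercile-forecast scores
by_cols_terc_fc_score_sp() # defaults for spatially pooled scores (ROC, SRC, ...)
setdiff(by_cols_terc_fc_score(), by_cols_terc_fc_score_sp()) # the columns pooled over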
Checks whether the data table contains columns with names that are not allowed, or whether it is missing columns that are required.
checks_ens_fc_score()
Checks whether the data table contains columns with names that are not allowed, or whether it is missing columns that are required.
checks_terc_fc_score()
Auxiliary function to access/set the directory for loading and saving CHIRPS data.
chirps_dir(dir = file.path(data_dir(), "CHIRPS"))
dir |
The directory |
The directory path.
if(interactive()){chirps_dir()}
This dataset contains observed monthly mean precipitation for the Greater Horn of Africa, for November - December 1991-2020. The unit of precipitation is mm/day. It also contains the tercile category, where -1 means below normal rainfall (lowest tercile for this location and month), 0 is normal and 1 is above normal. The data source is CHIRPS-blended, upscaled to a half-degree grid.
data(chirps_monthly)
An object of class data.table (inherits from data.frame) with 209040 rows and 6 columns.
http://iridl.ldeo.columbia.edu/SOURCES/.UCSB/.CHIRPS/.v2p0/.monthly/.global/.precipitation/
Calculates and saves the quantiles of CHIRPS data required for verification maps.
chirps_ver_map_quantiles( clim_period = 1991:2020, version = "UCSB", resolution = "low", CHIRPS_dir = chirps_dir(), seasons = TRUE )
clim_period |
which years should be considered for the quantiles. |
version |
which version of CHIRPS, 'UCSB' or 'ICPAC'? Can be a vector with both. |
resolution |
If this is set to 'high', the quantiles are also calculated for high-resolution CHIRPS data. This is not nicely implemented right now and will take a lot of memory and time. |
CHIRPS_dir |
directory the CHIRPS data is stored in. |
seasons |
Are we plotting for seasonal or monthly forecasts? |
data table with quantiles.
## Not run: chirps_ver_map_quantiles()
For a given year, the ensemble forecast simply consists of the observations in all other years. This is essentially an auxiliary function for computing skill scores relative to climatology.
climatology_ens_forecast(obs_dt, by = setdiff(dimvars(obs_dt), "year"))
obs_dt |
Data table containing observations, must contain a column 'year'. |
by |
character vector containing the column names of the grouping variables, e.g. |
Long data table in the typical ensemble-forecast format, i.e. containing a column 'member'.
dt = climatology_ens_forecast(chirps_monthly)
The climatological prediction for exceedence probabilities is the fraction of observed years where the observation exceeded the threshold. It's calculated from leave-one-year-out climatology.
climatology_threshold_exceedence( obs_dt, o = "prec", by = setdiff(dimvars(obs_dt), "year"), thresholds = c(200, 300, 350, 400) )
obs_dt |
Data table containing observations. |
o |
column name of the observation. Mostly observed precipitation in mm. |
by |
By which columns should be grouped? |
thresholds |
vector of thresholds for which the exceedence probabilities should be derived. |
Data table with the climatological probabilities of exceedence for the provided thresholds.
dt = climatology_threshold_exceedence(chirps_monthly)
Function for combining two data tables, e.g. with predictions and observations.
This is a user-friendly wrapper for merge. It guesses the columns to merge by (the dimension variables contained in both data tables) and adds some warnings when merges are attempted that are likely not correctly specified by the user.
combine(dt1, dt2, ...)
dt1 |
first data table |
dt2 |
second data table |
... |
passed on to data.table::merge |
The merged data table
# merge ECMWF-forecasts and CHIRPS observations:
dt = ecmwf_monthly[month == 11]
setnames(dt,'prec','forecast') # forecasts and observations both have a column 'prec'
dt_new = combine(dt,chirps_monthly)
First checks whether the spatial coordinates in a data table are part of a regular grid.
If they are, the function returns the smallest regular complete grid including all coordinates.
See set_spatial_grid
for more information.
complete_regular_grid(dt)
dt |
A data table object containing the spatial grid with coordinates. |
A data table with the completed spatial grid. Has the grid-attribute.
dt = data.table(lon = c(1, 2, 3), lat = c(1, 2, 3))
completed_grid = complete_regular_grid(dt)
print(completed_grid)
Converts monthly to seasonal data. The function default values are set for precipitation. In particular, the default behavior is to sum values over all months contained in a season. If you want to average instead (for temperature, for example), you can change the aggregation function FUN.
convert_monthly_to_seasonal( dt, vars = NULL, by = NULL, FUN = sum, ..., seasons = c("MAM", "JJAS", "OND"), only_complete_seasons = TRUE )
dt |
A data table containing the values for conversion. |
vars |
Character vector of names of the columns containing the values for conversion. Default is to try converting everything that is not contained in |
by |
Character vector of column names to group by. Separate values are derived for each unique combination of values in |
FUN |
function for aggregation. |
... |
arguments passed to |
seasons |
Vector of character strings specifying seasons. See details. Defaults to c('MAM', 'JJAS', 'OND'). |
only_complete_seasons |
Logical. If |
Note that it is impossible to derive seasonal tercile categories from monthly ones (and similar for seasonal tercile forecasts). For getting these, you should convert to seasonal before deriving the tercile categories or forecasts, e.g. by using add_tercile_cat() or tfc_from_efc().
Seasons are provided as sequences of capitalized initial characters of the months they contain, e.g. 'MAM' for March-April-May. They can have any length from 1 to 12 months and are allowed to overlap and wrap over the end of the year (for example, you can provide seasons = c('OND', 'NDJ') to derive values for October-December and November-January).
If a season includes months from 2 years, it gets assigned the year of its starting month. For example, season = 'NDJ' and year = 2000 refers to values for the season November 2000 to January 2001.
Factor- or character-valued columns cannot be aggregated to seasonal values. If vars contains columns that are factor- or character-valued, it is checked whether they take a unique value for each grouping level provided in by. If yes, they are kept, else they are discarded. A typical case where this is useful is when your data table contains country names (see add_country()). The grouping levels usually include 'lon', 'lat', so there is only one country name per grouping level and the name is kept.
# returns empty data table, because the example data does not contain data for a full season:
dt = convert_monthly_to_seasonal(chirps_monthly)
# remove terc_cat first to avoid the warning,
# and set season to the months we actually have data for:
dt2 = convert_monthly_to_seasonal(copy(chirps_monthly)[,terc_cat := NULL], seasons = c('ND'))
print(dt2)
# season OND, get monthly averages rather than sums, and force the function to derive values
# even though we do not have October-data:
dt3 = convert_monthly_to_seasonal(chirps_monthly, seasons = c('OND'), FUN = mean, only_complete_seasons = FALSE)
print(dt3)
Function for calculating coefficients of predictive ability (CPAs) of ensemble mean forecasts stored in long data tables. Can also handle point forecasts. Warning: This metric always needs several years of data since the ranks on which it is based are calculated across multi-year samples.
CPA( dt, f, o = "obs", by = by_cols_ens_fc_score(dt), pool = "year", mem = "member", dim.check = TRUE )
dt |
Data table containing the predictions. |
f |
column name of the prediction. |
o |
column name of the observations. |
by |
column names of grouping variables, all of which need to be columns in dt. A separate CPA is computed for each value of the grouping variables. Default is to group by all instances of month, season, lon, lat, system and lead_time that are columns in dt. |
pool |
column name(s) for the variable(s) along which is averaged. Needs to contain 'year' per warning above. |
mem |
Name of the column identifying the ensemble member. |
dim.check |
Logical. If True, a simple test whether the dimensions match up is conducted: The data table should only have one row for each level of c(by,pool,mem) |
A data table with the scores
dt = data.table(fc = 1:4,obs = c(4,4,7,7),member = c(1,2,1,2),year = c(1999,1999,2000,2000))
CPA(dt,f = 'fc')
Creates the diagram generated by FUN separately for each level of the grouping variables in by. Only works for functions that return a single plot if by == NULL. This is not the case for some functions plotting results for all three categories, e.g. reliability diagrams or ROC curves.
create_diagram_by_level(FUN, by, dt, ...)
FUN |
The name of the function creating the diagram |
by |
Column names in dt to group by |
dt |
data table (cannot be part of ..., because a sub-data-table is passed to FUN) |
... |
arguments passed to FUN |
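A minimal sketch of the intended call pattern. The function my_diag below is a hypothetical stand-in for a plotting function that returns a single ggplot when by is NULL; exactly how the sub-data-table is handed to FUN (here as its first argument) is an assumption:
library(ggplot2)
my_diag = function(dt, ...) ggplot(dt, aes(x = year, y = obs)) + geom_line() # hypothetical diagram function
dt = data.table(lon = rep(1:2, each = 3), year = rep(2000:2002, 2), obs = rnorm(6))
plots = create_diagram_by_level(FUN = my_diag, by = 'lon', dt = dt) # presumably one diagram per longitude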
Computes CRPSs of ensemble forecasts stored in long data tables.
CRPS( dt, f, o = "obs", by = by_cols_ens_fc_score(), pool = "year", mem = "member", dim.check = TRUE, ens_size_correction = FALSE )
dt |
Data table containing predictions and observations. |
f |
column name of the forecasts. May not be called |
o |
column name of the observations. |
by |
column names of grouping variables, all of which need to be columns in dt. Default is to group by all instances of month, season, lon, lat, system and lead_time that are columns in dt. |
pool |
column name(s) for the variable(s) over which is averaged. Typically just 'year'. |
mem |
Name of the column identifying the ensemble member. |
dim.check |
Logical. If True, a simple test whether the dimensions match up is conducted: The data table should only have one row for each level of c(by,pool,mem) |
ens_size_correction |
logical. If TRUE, the CRPS is corrected for sample size (see Ferro et al. 2008: 'On the effect of ensemble size on the discrete and continuous ranked probability scores'). This is slower, but you should do it if you compare ensembles of different size. |
A data table with the scores
dt = data.table(fc = 1:4,obs = c(4,4,7,7),member = c(1,2,1,2),year = c(1999,1999,2000,2000))
CRPS(dt,f = 'fc')
Mostly copy-paste from scoringRules:::crps_edf. Adjusted to the data table format, where the observation is a vector of the same length as the ensemble forecast, but is just repeated (which is why only y[1] is used).
crps_aux(y, dat)
y |
vector of length m with m identical entries, the observation |
dat |
vector of length m containing the m ensemble forecasts |
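A sketch of this calling convention (crps_aux is an internal helper, so it may need to be accessed as SeaVal:::crps_aux; the result should agree with scoringRules::crps_sample for the same inputs):
obs = 3
ens = c(1, 2, 4, 5)
# in the data table format, the observation is repeated to match the ensemble length:
SeaVal:::crps_aux(y = rep(obs, length(ens)), dat = ens)
# for comparison, the standard empirical-CDF CRPS:
scoringRules::crps_sample(y = obs, dat = ens)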
Mostly copy-paste from scoringRules:::crps_edf. Adjusted to the data table format, where the observation is a vector of the same length as the ensemble forecast, but is just repeated (which is why only y[1] is used).
crps_aux_esc(y, dat)
y |
vector of length m with m identical entries, the observation |
dat |
vector of length m containing the m ensemble forecasts |
Function for taking CRPS skill scores of ensemble forecasts stored in long data tables. The skill score needs a climatological forecast as reference. This is so far always based on the leave-one-year-out climatology.
CRPSS(dt, f, o = "obs", by = by_cols_ens_fc_score(), pool = c("year"), ...)
dt |
Data table containing predictions and observations. |
f |
column name of the prediction. |
o |
column name of the observations. |
by |
column names of grouping variables, all of which need to be columns in dt. A separate CRPS is computed for each value of the grouping variables. Default is to group by all instances of month, season, lon, lat, system and lead_time that are columns in dt. |
pool |
column name(s) for the variable(s) along which is averaged. Needs to contain 'year' since the reference climatology forecast is leave-one-year-out. |
... |
passed on to CRPS_ens_fc, in particular mem and dim.check |
A data table with the scores
dt = data.table(fc = 1:4,obs = c(4,4,7,7),member = c(1,2,1,2),year = c(1999,1999,2000,2000))
CRPSS(dt,f = 'fc')
The package allows downloading and organizing CHIRPS data. This function specifies the directory where the data is stored. The first time this function is called, it asks the user to configure the directory.
data_dir(set_dir = FALSE)
set_dir |
logical. Set this to TRUE if you have to redefine your data directory. |
The current data directory as string.
if(interactive()){ data_dir() }
Auxiliary function cleaning out the directories, called at the end of the CHIRPS download.
delete_redundant_files(dir)
dir |
the directory of the high dimensional CHIRPS data. |
The function returns all names currently considered dimension variables. Following the logic of netcdfs, data tables usually have columns specifying coordinates (or dimvars) and other columns containing data for these dimvars. Dimension variables can be spatial or temporal coordinates, or the lead time of a forecast or the member in an ensemble forecast, etc...
dimvars(dt = NULL)
dt |
Optional data table. If a data table is provided only the dimvars of the data table are returned. |
A vector of characters with the column names considered dimvars.
dimvars()
Calculate the Generalized discrimination score from a data.table with data belonging to a single group (as defined by the by variable in the DISS function), for example a single location and month. Formula (5a) from Mason (2018) is used in the calculation. Mostly an auxiliary function for the DISS function.
disc_score_dt(year, obs, pB, pN, pA)
year |
a vector of pool variables, typically year. |
obs |
a vector of observations; needs to contain -1 if the observation falls into the first category, 0 for the second and 1 for the third category. |
pB |
a vector of probabilities for the first category. |
pN |
a vector of probabilities for the second category. |
pA |
a vector of probabilities for the third category. |
A data table with the scores
disc_score_dt(year = 1999:2001, obs = c(-1,0,0), pB = c(0.5,0.3,0), pN = c(0.3,0.3,0.7), pA = c(0.2,0.4,0.3))
A generalisation of the ROC score for more than two categories. This score is not proper, but can be used to assess the discrimination of a tercile forecast.
DISS( dt, f = c("below", "normal", "above"), o = tc_cols(dt), by = by_cols_terc_fc_score_sp(), pool = "year", dim.check = TRUE )
dt |
Data table containing the predictions. |
f |
column names of the prediction. |
o |
column name of the observations (either in obs_dt, or in dt if obs_dt = NULL). The observation column needs to contain -1 if the observation falls into the first category, 0 for the second, and 1 for the third category. |
by |
column names of grouping variables, all of which need to be columns in dt. Default is to group by all instances of month, season, lon, lat, system and lead_time that are columns in dt. |
pool |
column name(s) for the variable(s) along which is averaged, typically just 'year'. |
dim.check |
Logical. If TRUE, the function tests whether the data table contains only one row per coordinate-level, as should be the case. |
A data table with the scores
dt = data.table(below = c(0.5,0.3,0), normal = c(0.3,0.3,0.7), above = c(0.2,0.4,0.3), tc_cat = c(-1,0,0), year = 1:3)
print(dt)
DISS(dt)
Download CHIRPS monthly data for the GHA-region and save it as netcdfs.
The data is downloaded either from the IRI data library or from ICPAC (depending on version), because these data libraries allow subsetting the data before downloading, unlike the original source at UCSB.
As of Feb 2022, the entire CHIRPS-monthly data for the GHA-region is roughly 800MB on disk.
The original spatial resolution of CHIRPS is 0.05 degree lon/lat. However, for many applications a coarser resolution is perfectly fine.
The function therefore offers the option to also create and save a coarser, upscaled version of the CHIRPS data that allows much faster data processing.
Alternatively you can also ONLY save the upscaled version to save disk space (roughly 8MB on disk).
download_chirps_monthly( resolution = "both", update = TRUE, version = "UCSB", years = NULL, months = NULL, extent = GHA_extent(), timeout_limit = 300, upscale_grid = data.table(expand.grid(lon = seq(extent[1], extent[2], 0.5), lat = seq(extent[3], extent[4], 0.5))) )
resolution |
Shall the data be upscaled? Takes one of three arguments: 'both' (keep both the original 0.05-degree resolution and the upscaled version), 'high' (only the original resolution), or 'low' (only the upscaled version). |
update |
Logical, if TRUE, previously created files are skipped. |
version |
Should be 'UCSB' (for University of California Santa Barbara, the original source of CHIRPS) or 'ICPAC' (for downloading the ICPAC version CHIRPS blended) |
years, months |
Which years and months do you want to load? NULL loads everything there is. |
extent |
vector of length four (xmin,xmax,ymin,ymax), restricting the spatial area. |
timeout_limit |
how many seconds (per file, i.e. per yearmonth) before the download is aborted? |
upscale_grid |
The coarse grid to which the data is upscaled (only used when resolution is either 'both' or 'high'). Only change this if you know what you are doing. |
Nothing.
if(interactive()){ download_chirps_monthly(years = 2020, months = 1) }
Auxiliary function called by download_chirps_monthly
download_chirps_monthly_high( update, version, years, months, extent, timeout_limit, save_dir = file.path(chirps_dir(), version) )
update, version, years, months, extent, timeout_limit |
see download_chirps_monthly. |
save_dir |
directory where the chirps data is stored. |
Auxiliary function called by download_chirps_monthly
download_chirps_monthly_low( update, version, years, months, extent, timeout_limit, upscale_grid, root_dir = file.path(chirps_dir(), version) )
update, version, years, months, extent, timeout_limit |
see download_chirps_monthly. |
upscale_grid |
To which grid shall we upscale? Needs a data table with lon/lat columns |
root_dir |
directory where the high-dimensional chirps data would be stored. The upscaled data is then stored in root_dir/upscaled/. |
This data becomes available earlier, but it has to be downloaded from UCSB. The function checks whether the non-preliminary version exists and only downloads otherwise. Annoyingly, the grids of UCSB and IRIDL are shifted against each other. Therefore this function also interpolates the UCSB data to the IRIDL grid, which makes it a bit slower. In particular, everything will crash if you have never downloaded a non-preliminary file and try to download a preliminary one.
download_chirps_prelim_aux( years, months, extent, timeout_limit = 300, nonprelim_dir = file.path(chirps_dir(), "monthly"), save_dir = file.path(nonprelim_dir, "prelim") )
years |
years for which you want to download |
months |
months for which you want to download |
extent |
Spatial window for downloading |
timeout_limit |
How many seconds before download is aborted. |
nonprelim_dir |
Directory where the non-preliminary CHIRPS data is stored. |
save_dir |
Directory where the function stores the preliminary data. |
nothing
if(interactive()){ download_chirps_prelim_aux(years = 2023, months = 10) }
This function writes a netcdf from a long data table, the usual data format in SeaVal. If not specified, it guesses (based on column names) which columns contain dimension variables and which contain variables. The function currently does not support writing netcdfs with multiple variables that have different sets of dimension variables!
It allows storing character columns in netcdfs (essentially labelling them as integers and storing a legend). This legend is automatically interpreted when the netcdf is read with netcdf_to_dt(), but is also human-readable.
dt_to_netcdf( dt, nc_out, vars = NULL, units = NULL, dim_vars = dimvars(dt), dim_var_units = NULL, check = interactive(), description = NULL )
dt |
a data.table |
nc_out |
File name (including path) of the netcdf to write. |
vars |
names of columns in dt containing variables. If this is NULL, the function guesses and asks for confirmation. |
units |
character vector containing the units for vars (in the same order). If this is NULL (default), the user is prompted for input. |
dim_vars |
names of columns in dt containing dimension variables. If this is NULL, the function guesses and asks for confirmation. |
dim_var_units |
character vector containing the units for dim_vars (in the same order). If this is NULL (default), the user is prompted for input (except for lon/lat). |
check |
If check is TRUE, the function asks the user whether an existing file should be overwritten, and whether the correct dimvars have been guessed. |
description |
For adding a global attribute 'Description' as a string. |
none.
example_dt = data.table(lon = 1:3, month = 4:6, prec = 7:9)
file_name = tempfile()
dt_to_netcdf(dt = example_dt, nc_out = file_name, vars = "prec", units = "mm", dim_vars = c("lon","month"), dim_var_units = c("degree longitude","month"))
This is an auxiliary function used in add_country_names, so only these names are recognized by default.
EA_country_names()
A character-vector of country names.
EA_country_names()
This is a small example dataset containing hindcasts of monthly mean precipitation for illustration purposes. Forecasts are provided for the entire GHA-region, for November and December 2018-2020. The forecasts are issued by the ECMWF SEAS 5 model and initialized in August. The unit of precipitation is mm/day. Only the first 3 ensemble members are provided. The dataset also contains tercile probability forecasts, which are derived from the full 51 member ensemble. The probability for a tercile for a given year, month and location is always computed as the fraction of ensemble members falling into that tercile, computed from all ensemble predictions for the month and location under consideration. This dataset was generated using Copernicus Climate Change Service information (2020).
data(ecmwf_monthly)
An object of class data.table (inherits from data.frame) with 37224 rows and 9 columns.
https://cds.climate.copernicus.eu
This score is suitable for tercile category forecasts. The logarithm is taken to base 2. According to Mason, the averaging here should be over many years at a single location and for discrete time-periods (so Mason prefers to take the average after averaging over different locations, but I keep it like this for now).
EIR( dt, f = c("below", "normal", "above"), o = tc_cols(dt), by = by_cols_terc_fc_score(), pool = "year", dim.check = TRUE )
dt |
Data table containing the predictions. |
f |
column names of the prediction. |
o |
column name of the observations (either in |
by |
column names of grouping variables, all of which need to be columns in dt. Default is to group by all instances of month, season, lon, lat, system and lead_time that are columns in dt. |
pool |
column name(s) for the variable(s) along which is averaged, typically just 'year'. |
dim.check |
Logical. If TRUE, the function tests whether the data table contains only one row per coordinate-level, as should be the case. |
A data table with the scores
dt = data.table(below = c(0.5,0.3,0), normal = c(0.3,0.3,0.7), above = c(0.2,0.4,0.3), tc_cat = c(-1,0,0), lon = 1:3)
print(dt)
EIR(dt)
Returns the column names that are recognized as (ensemble-) forecast values.
fc_cols(dt = NULL)
dt |
optional data table. If provided, the function guesses which column contains the forecast values. Else it returns all recognized forecast column names. |
Character vector with column names.
fc_cols()
A gridpoint is masked for a given season (either 'MAM', 'JJAS' or 'OND') if, on average, less than 10% of the annual total of rainfall occurs during the season. This function loads CHIRPS data and derives this mask as a data table of lon, lat coordinates, only containing the coordinates that shouldn't be masked. You can apply the mask to an existing data table using dt = combine(dt,mask).
get_mask( season, clim_years = 1990:2020, version = "UCSB", resolution = "low", us = (resolution == "low") )
season |
For which season do you want to calculate the mask? Needs to be either 'MAM', 'JJAS' or 'OND'. |
clim_years |
Numeric vector of years. Which years should be used to establish the mask? |
version, resolution, us |
Passed to load_chirps. |
if(interactive()) get_mask('MAM')
The quantiles are saved in/returned as a list with the following elements:
dt - A data table with quantiles for each level of by (not the same as the input-dt).
quantiles - the vector of quantiles that were used.
group - a data table containing the levels the quantiles are grouped over, e.g. all years the quantiles are calculated over.
data_col_name - the name of data_col, see below, so that you know what the quantiles actually were computed from.
description - the description string, if provided.
get_quantiles( dt, data_col = setdiff(names(dt), dimvars(dt))[1], qqs = c(10, 20, 33, 67, 80, 90), by = setdiff(dimvars(dt), c("year", "member")), description = NULL, save_file = NULL )
dt |
Data table containing the data. |
data_col |
The name of the column in dt containing the data for which the quantiles are derived. By default the first column that is not a dimension variable is selected. |
qqs |
Vector of quantiles. If one of them is larger than 1, they are interpreted as percentages. Default is the quantiles used in the verification maps. |
by |
Column names in dt. Levels by which the quantiles are calculated |
description |
Optional description string. |
save_file |
Optional name of save file. |
Nothing if save_file is provided. Otherwise the list described above
get_quantiles(chirps_monthly)
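Since the return value is a list (unless save_file is provided), its elements can be accessed by the names described above; a brief sketch:
qq = get_quantiles(chirps_monthly)
names(qq) # should contain 'dt', 'quantiles', 'group', 'data_col_name', 'description'
qq$quantiles # the quantiles that were used
head(qq$dt) # quantile values for each level of by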
This function wraps get_quantiles
with the fixed quantiles 0.33 and 0.67.
get_terciles(...)
... |
passed on to get_quantiles. |
See get_quantiles.
# takes a few seconds:
get_terciles(chirps_monthly)
Plots spatial data from a data.table. The data table needs to contain columns named 'lon' and 'lat'. The grid needs to be regular.
If spatial data is contained for several levels (e.g. multiple times or multiple ensemble members), only the data for the first level is plotted.
By default, the first column that is not recognized as a dimension variable is plotted, see data_col. For the most common data columns, reasonable color scales are selected automatically.
ggplot_dt( dt, data_col = NULL, mn = NULL, discrete_cs = FALSE, rr = NULL, low = NULL, mid = NULL, high = NULL, name = data_col, midpoint = NULL, breaks = NULL, na.value = "gray75", oob = NULL, guide = guide_colorbar(barwidth = 0.5, barheight = 10), ..., binwidth = NULL, bin_midpoint = midpoint, add_map = TRUE, extent = NULL, expand.x = c(0, 0), expand.y = c(0, 0), dimension_check = TRUE )
dt |
Data table containing the data for plotting. |
data_col |
The name of the column in dt containing the data for plotting. If nothing is provided (the default), the first column that is not a dimension variable or 'member' is selected. |
mn |
optional plot title |
discrete_cs |
Logical. Should the color scale be discretized? |
rr, low, mid, high, name, breaks, na.value, oob, guide, ... |
Arguments for the color scale, passed to scale_fill_gradient2 or scale_fill_steps2 (depending on whether discrete_cs == TRUE).
rr replaces limits (specifying the range of the color scale) for consistency with the older plotting functions from the PostProcessing package.
|
midpoint |
midpoint of the color scale, passed to |
binwidth, bin_midpoint |
only used when |
add_map |
logical, defaults to TRUE. |
extent |
An optional four-element vector in the order xmin,xmax,ymin,ymax for specifying the spatial extent of the plot. Default is to fit the extent to the data. |
expand.x, expand.y |
vectors with two entries to be added to xlims/ylims of the plot. E.g. expand.x = c(-0.5,0.5) expands the plot by half a longitude both on the right and left hand side |
dimension_check |
Logical. By default the function checks that there are not multiple values per coordinate (and subsets to the first level if there are several, e.g. to the first year and month by order of appearance in the data table). |
a ggplot object.
Claudio Heinrich
ex_dt = chirps_monthly[lat < 0 & month == 12 & year == 2020]
pp = ggplot_dt(ex_dt)
if(interactive()) plot(pp)
Returns a lon/lat bounding box for the Greater Horn of Africa region. Format is c(xmin,xmax,ymin,ymax), as for raster::extent.
GHA_extent()
A numeric vector of length 4.
GHA_extent()
This essentially wraps ggplot_dt, but uses a different map for borders.
The map is part of the package and is the one currently used during GHACOFs at ICPAC.
gha_plot(..., expand.x = c(-0.5, 0.5), expand.y = c(-0.5, 2))
ggplot_dt_shf(...)
ggplot_dt_gha_map(...)
..., expand.x, expand.y |
passed to ggplot_dt. |
dt = chirps_monthly[lon %between% c(30,40) & lat < 0 & month == 11 & year == 2020]
pp = gha_plot(dt)
if(interactive()) plot(pp)
This function prints out spatial grid information from a data table. If the grid-attribute does not exist, set_spatial_grid is called first.
grid_info(dt)
dt |
A data table |
This function does not return a value; instead, it prints a message to the console with the grid information.
dt = data.table(lon = runif(10), lat = runif(10))
grid_info(dt)
This score is suitable for tercile category forecasts. It is the frequency at which the highest-probability category actually happens. The function also provides the frequencies at which the second-highest and the lowest probability categories actually happen.
HS( dt, f = c("below", "normal", "above"), o = tc_cols(dt), by = by_cols_terc_fc_score(), pool = "year", dim.check = TRUE )
dt |
Data table containing the predictions. |
f |
column names of the prediction. |
o |
column name of the observations (either in |
by |
column names of grouping variables, all of which need to be columns in dt. Default is to group by all instances of month, season, lon, lat, system and lead_time that are columns in dt. |
pool |
column name(s) for the variable(s) along which is averaged, typically just 'year'. |
dim.check |
Logical. If TRUE, the function tests whether the data table contains only one row per coordinate-level, as should be the case. |
A data table with the scores
dt = data.table(below = c(0.5,0.3,0), normal = c(0.3,0.3,0.7), above = c(0.2,0.4,0.3), tc_cat = c(-1,0,0), lon = 1:3)
print(dt)
HS(dt)
This score is suitable for tercile category forecasts. The skill score is the difference between the hit scores for the categories with the highest and lowest probabilities.
HSS( dt, f = c("below", "normal", "above"), o = tc_cols(dt), by = by_cols_terc_fc_score(), pool = "year", dim.check = TRUE )
dt |
Data table containing the predictions. |
f |
column names of the prediction. |
o |
column name of the observations (either in |
by |
column names of grouping variables, all of which need to be columns in dt. Default is to group by all instances of month, season, lon, lat, system and lead_time that are columns in dt. |
pool |
column name(s) for the variable(s) along which is averaged, typically just 'year'. |
dim.check |
Logical. If TRUE, the function tests whether the data table contains only one row per coordinate-level, as should be the case. |
A data table with the scores
dt = data.table(below = c(0.5,0.3,0), normal = c(0.3,0.3,0.7), above = c(0.2,0.4,0.3), tc_cat = c(-1,0,0), year = 1999:2001)
print(dt)
HSS(dt)
This score is suitable for tercile category forecasts. The logarithm is taken to base 2.
IGS( dt, f = c("below", "normal", "above"), o = tc_cols(dt), by = by_cols_terc_fc_score(), pool = "year", dim.check = TRUE )
dt |
Data table containing the predictions. |
f |
column names of the prediction. |
o |
column name of the observations (either in |
by |
column names of grouping variables, all of which need to be columns in dt. Default is to group by all instances of month, season, lon, lat, system and lead_time that are columns in dt. |
pool |
column name(s) for the variable(s) along which is averaged, typically just 'year'. |
dim.check |
Logical. If TRUE, the function tests whether the data table contains only one row per coordinate-level, as should be the case. |
A data table with the scores
dt = data.table(below = c(0.5,0.3,0), normal = c(0.3,0.3,0.7), above = c(0.2,0.4,0.3), tc_cat = c(-1,0,0), lon = 1:3)
print(dt)
IGS(dt)
This score is suitable for tercile category forecasts. The logarithm is taken to base 2. This is the "usual" skill score (not the effective interest rate).
IGSS( dt, f = c("below", "normal", "above"), o = tc_cols(dt), by = by_cols_terc_fc_score(), pool = "year", dim.check = TRUE )
dt |
Data table containing the predictions. |
f |
column names of the prediction. |
o |
column name of the observations (either in |
by |
column names of grouping variables, all of which need to be columns in dt. Default is to group by all instances of month, season, lon, lat, system and lead_time that are columns in dt. |
pool |
column name(s) for the variable(s) along which is averaged, typically just 'year'. |
dim.check |
Logical. If TRUE, the function tests whether the data table contains only one row per coordinate-level, as should be the case. |
A data table with the scores
dt = data.table(below = c(0.5,0.3,0), normal = c(0.3,0.3,0.7), above = c(0.2,0.4,0.3), tc_cat = c(-1,0,0), lon = 1:3)
print(dt)
IGSS(dt)
Auxiliary function for multiplying two numbers such that 0 x Inf is 0. Needed for the ignorance score: 0 * log(0) should be 0.
indicator_times_value_aux(indicator, value)
indicator |
logical input vector |
value |
numeric input vector |
indicator * value, with 0 * Inf = 0.
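A small illustration of this convention (the function is an internal helper, so it may need to be accessed as SeaVal:::indicator_times_value_aux):
ind = c(TRUE, FALSE)
val = c(2, Inf)
ind * val # base R gives 2 and NaN, since 0 * Inf is NaN
SeaVal:::indicator_times_value_aux(ind, val) # presumably 2 and 0: a zero indicator forces a zero contribution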
The data has to be previously downloaded, see download_chirps_monthly. The resulting data table contains precipitation in mm/day.
load_chirps( years = NULL, months = NULL, version = "UCSB", resolution = "low", us = (resolution == "low"), load_prelim = TRUE )
years, months |
Optional subset of years and months you want to load. The default is to load everything that has been downloaded locally. You can update your local CHIRPS download by calling download_chirps_monthly |
version |
Either 'UCSB' to load the original version from UCSB or 'ICPAC' to load CHIRPS blended (both need to be downloaded first). |
resolution |
Either 'low' for loading the coarser upscaled version, or 'high' for loading the data on full resolution |
us |
logical. If TRUE, the upscaled version is loaded. Takes precedence over resolution. |
load_prelim |
logical. Should preliminary data be loaded? Note that the preliminary data is always from UCSB, not from ICPAC. |
the derived data table
if(interactive()){ load_chirps() }
Data table column names that are recognized as lead time.
lt_cols()
lt_cols()
This score is suitable for tercile category forecasts.
MB( dt, f = c("below", "normal", "above"), o = tc_cols(dt), by = by_cols_terc_fc_score(), pool = "year", dim.check = TRUE )
dt |
Data table containing the predictions. |
f |
column names of the prediction. |
o |
column name of the observations (either in |
by |
column names of grouping variables, all of which need to be columns in dt. Default is to group by all instances of month, season, lon, lat, system and lead_time that are columns in dt. |
pool |
column name(s) for the variable(s) along which is averaged, typically just 'year'. |
dim.check |
Logical. If TRUE, the function tests whether the data table contains only one row per coordinate-level, as should be the case. |
A data table with the scores
dt = data.table(below = c(0.5,0.3,0), normal = c(0.3,0.3,0.7), above = c(0.2,0.4,0.3), tc_cat = c(-1,0,0), lon = 1:3)
print(dt)
MB(dt)
This score is suitable for tercile category forecasts.
MBS( dt, f = c("below", "normal", "above"), o = tc_cols(dt), by = by_cols_terc_fc_score(), pool = "year", dim.check = TRUE )
dt |
Data table containing the predictions. |
f |
column names of the prediction. |
o |
column name of the observations (either in |
by |
column names of grouping variables, all of which need to be columns in dt. Default is to group by all instances of month, season, lon, lat, system and lead_time that are columns in dt. |
pool |
column name(s) for the variable(s) along which is averaged, typically just 'year'. |
dim.check |
Logical. If TRUE, the function tests whether the data table contains only one row per coordinate-level, as should be the case. |
A data table with the scores
dt = data.table(below = c(0.5,0.3,0), normal = c(0.3,0.3,0.7), above = c(0.2,0.4,0.3), tc_cat = c(-1,0,0), lon = 1:3)
print(dt)
MBS(dt)
Auxiliary function for checking dimensions for map-plotting
modify_dt_map_plotting(dt, data_col)
dt |
Data table containing the data for plotting |
data_col |
Name of column containing the data for plotting |
Converts time given as 'months since date' (MSD) into years and months (YM)
MSD_to_YM(dt, timecol = "time", origin = "1981-01-01")
dt |
a data table. |
timecol |
name of the column containing the time. |
origin |
Origin of the time axis: the time column gives the number of months since this date. |
data table with two new columns 'month' and 'year', the timecol is deleted.
dt = MSD_to_YM(data.table(time = 0:12))
Derives the MSE of ensemble forecasts stored in long data tables. Can also handle point forecasts.
MSE( dt, f, o = "obs", by = by_cols_ens_fc_score(), pool = "year", mem = "member", dim.check = TRUE )
dt |
Data table containing the predictions. |
f |
column name of the prediction. |
o |
column name of the observations. |
by |
column names of grouping variables, all of which need to be columns in dt. Default is to group by all instances of month, season, lon, lat, system and lead_time that are columns in dt. |
pool |
column name(s) for the variable(s) along which is averaged, typically just 'year'. |
mem |
Name of the column identifying the ensemble member. Only used if check_dimension is TRUE. Is NULL for a point forecast. |
dim.check |
Logical. If True, a simple test whether the dimensions match up is conducted: The data table should only have one row for each level of c(by,pool,mem) |
A data table with the scores
dt = data.table(fc = 1:4,obs = c(4,4,7,7),member = c(1,2,1,2),year = c(1999,1999,2000,2000))
MSE(dt,f = 'fc')
Function for taking MSE skill scores of ensemble forecasts stored in long data tables. Can also handle point forecasts. The skill score needs a climatological forecast as reference. This is so far always based on the leave-one-year-out climatology.
MSES(dt, f, o = "obs", by = by_cols_ens_fc_score(), pool = c("year"), ...)
dt |
Data table containing the predictions. |
f |
column name of the prediction. |
o |
column name of the observations. |
by |
column names of grouping variables, all of which need to be columns in dt. A separate MSE is computed for each value of the grouping variables. Default is to group by all instances of month, season, lon, lat, system and lead_time that are columns in dt. |
pool |
column name(s) for the variable(s) along which is averaged. Needs to contain 'year' since the reference climatology forecast is leave-one-year-out. |
... |
passed on to MSE |
A data table with the scores
dt = data.table(fc = 1:4,obs = c(4,4,7,7),member = c(1,2,1,2),year = c(1999,1999,2000,2000))
MSES(dt,f = 'fc')
The function converts netcdfs into long data.tables. Be aware that the data table can be much larger in memory, especially if you have many dimension variables.
netcdf_to_dt( nc, vars = NULL, verbose = 2, trymerge = TRUE, subset_list = NULL, keep_nas = FALSE )
nc |
Either a character string with the name of the .nc file (including path), or an object of type ncdf4. |
vars |
Which variables should be read from the netcdf? Either a character vector of variable names, or an integer vector of variable indices. If this is NULL, all variables are read. |
verbose |
Either 0, 1 or 2. How much information should be printed?
The default (2) is to print the entire netcdf information (as output by |
trymerge |
logical. If TRUE, a single data table containing all variables is returned, else a list of data tables, one for each variable. The latter is much more memory efficient if you have multiple variables depending on different dimensions. |
subset_list |
A named list for reading only subsets of the data. Currently only 'rectangle subsetting' is provided, i.e. you can provide two limit values for each dimension and everything between will be read. The names of the pages of subset_list must correspond to the names of dimension variables in the netcdf, and each page should contain a (two-element-)range vector. For example, subsetting a global dataset to just East Africa could look like this: subset_list = list(latitude = c(-15,25),longitude = c(20,55)). Non-rectangular subsetting during reading a netcdf seems to be difficult, see ncvar_get. Every dimension variable not named in subset_list is read entirely. |
keep_nas |
Should missing values be kept? If FALSE (the default), missing values are not included in the returned data table. If this is set to TRUE, the data table is constructed from the full data-cube (meaning its number of rows is the product of the length of the dimension variables, even if many coordinates have missing data). This makes the returned data table potentially much larger and is almost never an advantage. It is only allowed, because it can make complex bookkeeping tasks easier (specifically upscaling many CHIRPS-netcdfs with the same coordinates while saving the upscaling weights in a matrix). |
A data table if trymerge == TRUE, or else a list of data tables.
# filename of example-netcdf file:
fn = system.file("extdata", "example.nc", package="SeaVal")
dt = netcdf_to_dt(fn)
print(dt)
Note that this function guesses column names for observed precip, not observed tercile category.
obs_cols(dt = NULL)
dt |
optional data table. If provided, the function guesses which column contains the observations. Else it returns all recognized observation column names. |
Character vector with column names.
obs_cols()
Observation dimvars are column names in a data table that resemble coordinates for which only one observation may exist.
obs_dimvars(dt = NULL)
dt |
optional. You can provide a data table, then the function returns the names of coordinate columns in this data table. |
Character vector with column names.
obs_dimvars
Function for calculating Pearson correlation coefficients (PCCs) of ensemble mean forecasts stored in long data tables. Can also handle point forecasts. This metric always needs several years of data since the means and standard deviations are calculated across time.
PCC( dt, f, o = "obs", by = by_cols_ens_fc_score(dt), pool = "year", mem = "member", dim.check = TRUE )
PCC( dt, f, o = "obs", by = by_cols_ens_fc_score(dt), pool = "year", mem = "member", dim.check = TRUE )
dt |
Data table containing the predictions. |
f |
column name of the prediction. |
o |
column name of the observations. |
by |
column names of grouping variables, all of which need to be columns in dt. A separate PCC is computed for each value of the grouping variables. Default is to group by all instances of month, season, lon, lat, system and lead_time that are columns in dt. |
pool |
column name(s) for the variable(s) over which the correlation is computed. Needs to contain 'year', since means and standard deviations are calculated across time. |
mem |
Name of the column identifying the ensemble member. Only used if dim.check is TRUE. Set to NULL for a point forecast. |
dim.check |
Logical. If TRUE, a simple test whether the dimensions match up is conducted: the data table should have exactly one row for each level of c(by, pool, mem). |
A data table with the scores
dt = data.table(fc = 1:4, obs = c(4,4,7,7), member = c(1,2,1,2), year = c(1999,1999,2000,2000))
PCC(dt, f = 'fc')
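A separate PCC per location can be obtained via the by argument. The following is a minimal sketch with made-up data for two locations, two years and two members; it assumes 'lon' is recognized as a grouping column.
# Sketch: one PCC per longitude, using toy data.
dt = data.table(fc = c(1:4, 4:1),
                obs = c(4,4,7,7, 5,5,2,2),
                member = rep(1:2, 4),
                year = rep(c(1999,1999,2000,2000), 2),
                lon = rep(c(10, 20), each = 4))
PCC(dt, f = 'fc', by = 'lon')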
These graphs really only make sense if you have 50 or fewer observations. A typical application is comparing seasonal mean forecasts to station data for a single location.
profit_graph( dt, accumulative = TRUE, f = c("below", "normal", "above"), o = tc_cols(dt), by = NULL, pool = setdiff(dimvars(dt), by), dim.check = TRUE )
profit_graph( dt, accumulative = TRUE, f = c("below", "normal", "above"), o = tc_cols(dt), by = NULL, pool = setdiff(dimvars(dt), by), dim.check = TRUE )
dt |
Data table containing tercile forecasts |
accumulative |
Logical. Should the cumulative profit be plotted, or the profit per forecast? |
f |
column names of the prediction columns |
o |
column name of the observation column |
by |
column names of grouping variables. Default is NULL. |
pool |
column names of pooling variables (used for the dimension check). Default is all dimvars. |
dim.check |
Logical. If TRUE, the function checks whether the columns in by and pool span the entire data table. |
A list of gg objects which can be plotted by ggpubr::ggarrange (for example)
dt = data.table(below = c(0.5,0.3,0), normal = c(0.3,0.3,0.7), above = c(0.2,0.4,0.3), tc_cat = c(-1,0,0), lon = 1:3)
print(dt)
p1 = profit_graph(dt)
p2 = profit_graph(dt, accumulative = FALSE)
if(interactive()){
  plot(p1)
  plot(p2)
}
Computes the reliability component of both the Brier score and the Ignorance score. Mason (2018) states a preference for the Ignorance-score version, but it has a very high chance of being NA. Mason also writes that the scores are unstable for single locations and that one should pool over many locations. Requires the specification of probability bins. One score is returned for each category (below, normal, above), as well as the sum of the three scores.
Values close to 0 indicate reliable forecasts; higher values mean less reliable forecasts.
REL( dt, bins = c(0.3, 0.35001), f = c("below", "normal", "above"), o = tc_cols(dt), by = by_cols_terc_fc_score(), pool = "year", dim.check = TRUE )
REL( dt, bins = c(0.3, 0.35001), f = c("below", "normal", "above"), o = tc_cols(dt), by = by_cols_terc_fc_score(), pool = "year", dim.check = TRUE )
dt |
Data table containing the predictions. |
bins |
probability bins. Defaults to c(0.30, 0.35001), corresponding to the three bins "<30%", "30-35%" and ">35%". |
f |
column names of the prediction. |
o |
column name of the observations (either in obs_dt, or in dt if obs_dt = NULL). The observation column needs to
contain -1 if it falls into the first category (corresponding to |
by |
column names of grouping variables, all of which need to be columns in dt. Default is to group by all instances of month, season, lon, lat, system and lead_time that are columns in dt. |
pool |
column name(s) for the variable(s) along which is averaged, typically just 'year'. |
dim.check |
Logical. If TRUE, the function tests whether the data table contains only one row per coordinate-level, as should be the case. |
A data table with the scores
dt = data.table(below = c(0.5,0.3,0), normal = c(0.3,0.3,0.7), above = c(0.2,0.4,0.3), tc_cat = c(-1,0,0), year = 1:3)
print(dt)
REL(dt)
Creates reliability diagrams from a data table containing tercile forecasts.
It wraps rel_diag_vec, see ?rel_diag_vec for more details about the output diagrams.
The output format is very much inspired by Figure 5 of Mason (2018). By default, 4 diagrams are drawn,
one each for the predictions of the above-, normal- and below-category, plus one for all forecasts together.
You can provide a 'by' argument to obtain separate reliability diagrams for different values of the by-columns. E.g., when your data table contains
a column named 'season' (with four seasons), you can set by = 'season'. Then, the function will output a list of 16 diagrams, 4 for each season.
rel_diag( dt, f = c("below", "normal", "above"), o = tc_cols(dt), by = NULL, pool = setdiff(dimvars(dt), by), binwidth = 0.05, dim.check = TRUE )
rel_diag( dt, f = c("below", "normal", "above"), o = tc_cols(dt), by = NULL, pool = setdiff(dimvars(dt), by), binwidth = 0.05, dim.check = TRUE )
dt |
Data table containing tercile forecasts |
f |
column names of the prediction columns |
o |
column name of the observation column |
by |
column names of grouping variables. Default is to not group. |
pool |
column names of pooling variables (used for the dimension check). Default is all dimvars. |
binwidth |
bin width for discretizing probabilities. |
dim.check |
Logical. If TRUE, the function checks whether the columns in by and pool span the entire data table. |
A list of gg objects which can be plotted by ggpubr::ggarrange (for example)
dt = data.table(below = c(0.5,0.3,0), normal = c(0.3,0.3,0.7), above = c(0.2,0.4,0.3), tc_cat = c(-1,0,0), lon = 1:3)
print(dt)
pp = rel_diag(dt)
if(interactive()) plot(pp)
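As described above, the by argument yields separate diagrams per group. The sketch below uses a hypothetical 'season' column with two seasons, so a list of 2 x 4 = 8 diagrams would be returned.
# Sketch: separate reliability diagrams per season ('season' is a made-up column here).
dt = data.table(below = rep(c(0.5,0.3,0), 2),
                normal = rep(c(0.3,0.3,0.7), 2),
                above = rep(c(0.2,0.4,0.3), 2),
                tc_cat = rep(c(-1,0,0), 2),
                lon = rep(1:3, 2),
                season = rep(c('MAM','OND'), each = 3))
pp = rel_diag(dt, by = 'season')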
The probabilities have to be rounded beforehand (see round_probs), because the diagram draws a point for each level of the probabilities. The diagram includes a histogram indicating
the forecast relative frequency for each probability bin, and it shows the reliability curve and the diagonal for reference.
Moreover, it shows a regression line fitted by weighted linear regression, where the forecast relative frequencies are used as weights.
A horizontal and a vertical line indicate the frequency of observation = TRUE over the entire dataset.
rel_diag_vec(discrete_probs, obs, slope_only = FALSE)
rel_diag_vec(discrete_probs, obs, slope_only = FALSE)
discrete_probs |
Vector of (rounded) probabilities. |
obs |
Vector of logical observations. |
slope_only |
logical. If set to TRUE, only the slope of the reliability curve is returned |
A gg object.
discrete_probs = seq(0, 1, length.out = 5)
obs = c(FALSE,FALSE,TRUE,TRUE,TRUE)
pp = rel_diag_vec(discrete_probs, obs)
if(interactive()) plot(pp)
Computes the resolution component of both the Brier score and the Ignorance score. Mason (2018) states a preference for the Ignorance-score version, but it has a very high chance of being NA (much higher than for the full Ignorance score itself, which is an argument for dropping it). Mason writes that the scores are unstable for single locations and that one should pool over many locations. Requires the specification of probability bins. One score is returned for each category (below, normal, above), as well as the sum of the three scores. Values close to 0 mean low resolution; higher values mean higher resolution.
RES( dt, bins = c(0.3, 0.35001), f = c("below", "normal", "above"), o = tc_cols(dt), by = by_cols_terc_fc_score(), pool = "year", dim.check = TRUE )
RES( dt, bins = c(0.3, 0.35001), f = c("below", "normal", "above"), o = tc_cols(dt), by = by_cols_terc_fc_score(), pool = "year", dim.check = TRUE )
dt |
Data table containing the predictions. |
bins |
probability bins. Defaults to c(0.30, 0.35001), corresponding to the three bins "<30%", "30-35%" and ">35%". |
f |
column names of the prediction. |
o |
column name of the observations (either in obs_dt, or in dt if obs_dt = NULL). The observation column needs to
contain -1 if it falls into the first category (corresponding to |
by |
column names of grouping variables, all of which need to be columns in dt. Default is to group by all instances of month, season, lon, lat, system and lead_time that are columns in dt. |
pool |
column name(s) for the variable(s) along which is averaged, typically just 'year'. |
dim.check |
Logical. If TRUE, the function tests whether the data table contains only one row per coordinate-level, as should be the case. |
A data table with the scores
dt = data.table(below = c(0.5,0.3,0), normal = c(0.3,0.3,0.7), above = c(0.2,0.4,0.3), tc_cat = c(-1,0,0), year = 1:3)
print(dt)
RES(dt)
Restricts a dataset to one or more countries, specified by their names. If you have lon/lat data and don't know
which countries these coordinates belong to, see add_country_names. Can restrict data to a rectangle around a given country
as well (usually looks nicer for plotting).
restrict_to_country(dt, ct, rectangle = FALSE, tol = 1)
restrict_to_country(dt, ct, rectangle = FALSE, tol = 1)
dt |
the data table. |
ct |
name of the country, or vector containing multiple country names |
rectangle |
logical. If FALSE (default), the data is restricted to the gridcells for which the centerpoint lies within the selected country (e.g. for computing mean scores for a country). If TRUE, the data is kept for a rectangle containing the entire country, therefore also containing gridpoints outside the country. This is the preferred option for plotting data for a specific country. |
tol |
Only used when |
the data table, restricted to the selected country
# example data:
ex_dt = chirps_monthly[lat < 0 & month == 11 & year == 2020]
dt = restrict_to_country(ex_dt, 'Kenya')
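A sketch of the rectangle option, which keeps a rectangular cutout around the country (plus the tolerance margin) and is usually preferable for plotting:
# keep a rectangle around Kenya instead of cutting along the border:
dt_rect = restrict_to_country(ex_dt, 'Kenya', rectangle = TRUE, tol = 1)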
Wraps restrict_to_country, and restricts to the GHA-region usually considered in CONFER, see EA_country_names.
restrict_to_GHA(dt, ...) restrict_to_confer_region(...)
restrict_to_GHA(dt, ...) restrict_to_confer_region(...)
dt |
the data table. |
... |
passed on to |
the data table, restricted to the selected country
ex_dt = chirps_monthly[lat < 0 & month == 11 & year == 2020]
dt = restrict_to_GHA(ex_dt)
Creates ROC curves from a data table containing tercile forecasts. It wraps roc_curve_vec.
By default, 4 ROC curves are drawn, one each for the predictions of the above-, normal- and below-category, plus one for all forecasts together.
You can provide a 'by' argument to obtain separate ROC curves for different values of the by-columns. E.g., when your data table contains
a column named 'season' (with four seasons), you can set by = 'season'. Then, the function will output a list of 16 ROC curves, 4 for each season.
ROC_curve( dt, f = c("below", "normal", "above"), o = tc_cols(dt), by = NULL, pool = setdiff(dimvars(dt), by), interpolate = TRUE, dim.check = TRUE )
ROC_curve( dt, f = c("below", "normal", "above"), o = tc_cols(dt), by = NULL, pool = setdiff(dimvars(dt), by), interpolate = TRUE, dim.check = TRUE )
dt |
Data table containing tercile forecasts |
f |
column names of the prediction columns |
o |
column name of the observation column |
by |
column names of grouping variables. Default is to not group. |
pool |
column names of pooling variables (used for the dimension check). Default is all dimvars. |
interpolate |
Logical. If TRUE, the curve connects the dots making up the ROC curve (which looks nicer), if not a step function is drawn (which is closer to the mathematical definition of the ROC curve). |
dim.check |
Logical. If TRUE, the function checks whether the columns in by and pool span the entire data table. |
A list of gg objects which can be plotted by ggpubr::ggarrange (for example)
dt = data.table(below = c(0.5,0.3,0), normal = c(0.3,0.3,0.7), above = c(0.2,0.4,0.3), tc_cat = c(-1,0,0), lon = 1:3)
print(dt)
pp = ROC_curve(dt)
if(interactive()) plot(pp)
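A sketch of the step-function variant, which is closer to the mathematical definition of the ROC curve:
# same data as above, but drawn as a step function:
pp_step = ROC_curve(dt, interpolate = FALSE)
if(interactive()) plot(pp_step)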
Plot the ROC-curve for a vector of probabilities and corresponding observations.
roc_curve_vec(probs, obs, interpolate = TRUE)
roc_curve_vec(probs, obs, interpolate = TRUE)
probs |
vector with probabilities (between 0 and 1) |
obs |
vector with categorical observations |
interpolate |
logical. If TRUE the ROC-curve is interpolated and drawn as a continuous function. Otherwise it is drawn as a step function. |
a gg object
probs = seq(0, 1, length.out = 5)
obs = c(FALSE,FALSE,TRUE,FALSE,TRUE)
pp = roc_curve_vec(probs, obs)
if(interactive()) plot(pp)
Calculates the area under the curve (AUC), or ROC score, from a vector of probabilities and corresponding observations. Formula (1a) from Mason (2018) is used in the calculation, corresponding to trapezoidal interpolation. This is mostly an auxiliary function for the ROCS function, but it is also used in the ROC-diagram function, where the AUC is added to the diagrams.
roc_score_vec(probs, obs)
roc_score_vec(probs, obs)
probs |
vector with probabilities (between 0 and 1) |
obs |
vector with categorical observations (as TRUE/FALSE) |
numeric. The ROC score.
roc_score_vec(probs = c(0.1,0.6,0.3,0.4), obs = c(FALSE,TRUE,TRUE,FALSE))
roc_score_vec(probs = c(0.1,0.6,0.3,0.4), obs = c(FALSE,TRUE,TRUE,FALSE))
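The trapezoidal AUC can also be computed by hand, which may help to see what the score measures. The base-R sketch below only illustrates trapezoidal integration of the ROC curve; it is not the package's implementation.
# Sketch: AUC by trapezoidal integration, computed by hand (illustration only).
probs = c(0.1, 0.6, 0.3, 0.4)
obs = c(FALSE, TRUE, TRUE, FALSE)
thresholds = sort(unique(c(0, probs, 1)), decreasing = TRUE)
hit_rate = sapply(thresholds, function(t) mean(probs[obs] >= t))
false_alarm_rate = sapply(thresholds, function(t) mean(probs[!obs] >= t))
# trapezoidal rule over the (false alarm rate, hit rate) points:
sum(diff(false_alarm_rate) * (head(hit_rate, -1) + tail(hit_rate, -1)) / 2)
# should agree with:
roc_score_vec(probs, obs)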
This score is not proper, but can be used to assess the resolution of a tercile forecast. The ROC score requires more datapoints to be robust than e.g. the ignorance or Brier score. Therefore the default is to pool the data in space and only calculate one score per season.
ROCS( dt, f = c("below", "normal", "above"), o = tc_cols(dt), by = by_cols_terc_fc_score_sp(dt), pool = c("year", space_dimvars(dt)), dim.check = TRUE )
ROCS( dt, f = c("below", "normal", "above"), o = tc_cols(dt), by = by_cols_terc_fc_score_sp(dt), pool = c("year", space_dimvars(dt)), dim.check = TRUE )
dt |
Data table containing the predictions. |
f |
column names of the prediction. |
o |
column name of the observations (either in obs_dt, or in dt if obs_dt = NULL). The observation column needs to
contain -1 if it falls into the first category (corresponding to |
by |
column names of grouping variables, all of which need to be columns in dt. Default is to group by all instances of month, season, system and lead_time that are columns in dt. |
pool |
column name(s) for the variable(s) along which is averaged, typically just 'year'. |
dim.check |
Logical. If TRUE, the function tests whether the data table contains only one row per coordinate-level, as should be the case. |
A data table with the scores
dt = data.table(below = c(0.5,0.3,0), normal = c(0.3,0.3,0.7), above = c(0.2,0.4,0.3), tc_cat = c(-1,0,0), lon = 1:3)
print(dt)
ROCS(dt)
takes a vector of probabilities (between 0 and 1) and rounds them to the scale specified by binwidth. This is used for reliability diagrams, where one point is drawn for each bin. 0 is always at the center of the first interval for rounding: E.g. if binwidth = 0.05 (the default), then probabilities up to 0.025 are rounded to 0, probs between 0.025 and 0.075 are rounded to 0.05, etc.
round_probs(probs, binwidth = 0.05)
round_probs(probs, binwidth = 0.05)
probs |
vector of probabilities (between 0 and 1, not percent) |
binwidth |
width of the bins for rounding. |
vector with rounded probabilities
round_probs(c(0.001,0.7423))
round_probs(c(0.001,0.7423))
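The interval behaviour described above can be checked directly; with the default binwidth of 0.05, the first bin covers probabilities up to 0.025:
round_probs(c(0.02, 0.03, 0.74))
# expected: 0.00 0.05 0.75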
Computes the ranked probability score (RPS). This score is suitable for tercile category forecasts.
RPS( dt, f = c("below", "normal", "above"), o = tc_cols(dt), by = by_cols_terc_fc_score(), pool = "year", dim.check = TRUE )
RPS( dt, f = c("below", "normal", "above"), o = tc_cols(dt), by = by_cols_terc_fc_score(), pool = "year", dim.check = TRUE )
dt |
Data table containing the predictions. |
f |
column names of the prediction. |
o |
column name of the observations (either in obs_dt, or in dt if obs_dt = NULL). The observation column needs to
contain -1 if it falls into the first category (corresponding to |
by |
column names of grouping variables, all of which need to be columns in dt. Default is to group by all instances of month, season, lon, lat, system and lead_time that are columns in dt. |
pool |
column name(s) for the variable(s) along which is averaged, typically just 'year'. |
dim.check |
Logical. If TRUE, the function tests whether the data table contains only one row per coordinate-level, as should be the case. |
A data table with the scores
dt = data.table(below = c(0.5,0.3,0), normal = c(0.3,0.3,0.7), above = c(0.2,0.4,0.3), tc_cat = c(-1,0,0), year = 1:3)
print(dt)
RPS(dt)
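For a single tercile forecast, the RPS can be computed by hand from cumulative probabilities, which the sketch below illustrates. This is only meant to clarify the score; conventions differ slightly in the literature (e.g. some definitions normalize by the number of categories), so it is not necessarily the exact formula used internally.
# Sketch: RPS of a single forecast from cumulative probabilities (illustration only).
fc = c(below = 0.5, normal = 0.3, above = 0.2)  # forecast probabilities
ob = c(below = 1, normal = 0, above = 0)        # observation fell into 'below'
sum((cumsum(fc) - cumsum(ob))^2)
# (0.5 - 1)^2 + (0.8 - 1)^2 + (1 - 1)^2 = 0.29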
Computes the ranked probability skill score (RPSS). This score is suitable for tercile category forecasts.
RPSS( dt, f = c("below", "normal", "above"), o = tc_cols(dt), by = by_cols_terc_fc_score(), pool = "year", dim.check = TRUE )
RPSS( dt, f = c("below", "normal", "above"), o = tc_cols(dt), by = by_cols_terc_fc_score(), pool = "year", dim.check = TRUE )
dt |
Data table containing the predictions. |
f |
column names of the prediction. |
o |
column name of the observations (either in obs_dt, or in dt if obs_dt = NULL). The observation column needs to
contain -1 if it falls into the first category (corresponding to |
by |
column names of grouping variables, all of which need to be columns in dt. Default is to group by all instances of month, season, lon, lat, system and lead_time that are columns in dt. |
pool |
column name(s) for the variable(s) along which is averaged, typically just 'year'. |
dim.check |
Logical. If TRUE, the function tests whether the data table contains only one row per coordinate-level, as should be the case. |
A data table with the scores
dt = data.table(below = c(0.5,0.3,0), normal = c(0.3,0.3,0.7), above = c(0.2,0.4,0.3), tc_cat = c(-1,0,0), year = 1:3)
print(dt)
RPSS(dt)
called inside functions that calculate scores for ensemble forecasts. Checks whether the provided data table has the right format.
run_dimension_check_ens_fc_score()
run_dimension_check_ens_fc_score()
called inside functions that calculate scores for tercile forecasts. Checks whether the provided data table has the right format.
run_dimension_check_terc_forecast()
run_dimension_check_terc_forecast()
Auxiliary function for decoding season-strings
season_strings_to_int(seasons = c("MAM", "JJAS", "OND"))
season_strings_to_int(seasons = c("MAM", "JJAS", "OND"))
seasons |
A character vector of season strings, see |
This function creates the spatial grid attribute for a data table. If the data table already has such an attribute, missing information is filled in. In particular, the function checks whether a grid is regular, allowing for rounding errors in the grid coordinates, see details below. By default the grid coordinates are rounded to a regular grid if they are very close to being regular. While this sounds dangerous, it is almost always desirable to treat coordinates like that when working with data tables.
set_spatial_grid( dt, coor_cns = NULL, check_regular = TRUE, regular_tolerance = 1, verbose = FALSE )
set_spatial_grid( dt, coor_cns = NULL, check_regular = TRUE, regular_tolerance = 1, verbose = FALSE )
dt |
A data table object. |
coor_cns |
Optional character vector of length two indicating the names of the spatial coordinates
within the data table in order |
check_regular |
A logical indicating whether to check for regularity of the grid. This should essentially always be done but can be suppressed for speed.
Defaults to |
regular_tolerance |
Value >= 0 specifying the amount of rounding error we allow for still recognizing a grid as regular.
Given in percent of the minimum of |
verbose |
Logical. If |
The grid attribute is a named list with (some of) the following pages:
coor_cns: Character vector of length two specifying the names of the data-table-columns containing the spatial grids (in order x, y).
x, y: Numeric vectors of all unique x- and y-coordinates in increasing order (NAs not included).
regular: Logical. Is the grid regular? See details below.
dx, dy: Step sizes of the regular grid (only contained if regular = TRUE). By convention we set dx to 9999 if only one x-coordinate is present, likewise for dy.
complete: Logical. Is the regular grid complete? See details below.
We call a grid regular if there is a coordinate (x0, y0) and positive values dx, dy, such that each coordinate of the grid can be written as (x0 + n*dx, y0 + m*dy) for integers n, m.
Importantly, a regular grid does not need to be "a complete rectangle"; we allow for missing coordinates, see details below.
We call it a regular complete grid if the grid contains these coordinates for all integers n, m between some limits n_min and n_max, respectively m_min and m_max.
Checking regularity properly is a difficult problem, because we allow for missing coordinates in the grid as well as for rounding errors.
For the treatment of rounding errors it is not recommended to set regular_tolerance to NULL or to a very small value (e.g. 0.1 or smaller).
In this case, grids that are regular in practice are frequently not recognized as regular:
take for example the three x-coordinates 1, 1.5001, 2.4999. They are supposed to be rounded to one digit after the comma, and then the grid is regular with dx = 0.5.
However, if regular_tolerance is NULL, the grid will be marked as irregular.
Similarly, if regular_tolerance is too small, the function is not allowed to make rounding errors of 0.0001 and the grid will also not be recognized as regular.
When it comes to the issue of missing values in the grid, we are (deliberately) a bit sloppy and only check whether
the coordinates are part of a grid with dx being the minimum x-difference between two coordinates, and similarly for dy.
This may not detect regularity when the data is sparse on a regular grid.
An example would be the three lon/lat coordinates c(0,0), c(2,0), c(5,0). They clearly lie on the regular integer lon/lat grid.
However, the grid would show as not regular, because dx is not checked for values smaller than 2.
This choice is on purpose, since for most applications grids with many (or mostly) holes should be treated as irregular (e.g. plotting, upscaling, etc.).
The most important case of regular but incomplete grids is gridded data that is restricted to a certain region, e.g. a country
or restricted to land. This is what we think of when we think of a regular incomplete grid, and for such data the check works perfectly.
Note that, at the very bottom, it is the definition of regularity itself that is a bit tricky:
if we allow dx, dy to go all the way down to the machine delta, then pretty much any set of coordinates represented in a computer is part of a regular grid.
This shows that testing and detecting regularity actually depends on how small you are willing to make dx, dy.
An example in one dimension: consider the three coordinates 0, 1, and m/n, with m and n integers without common divisors and m > n.
It is not difficult to see that these coordinates are part of a regular grid, and that the largest dx for detecting this is 1/n.
So you can have very small coordinate sets that are in theory regular, but whose regularity is arbitrarily hard to detect.
An example of a grid that is truly not regular is given by the three x-coordinates 0, 1, a with a irrational.
Nothing; the attributes of dt are set in the parent environment. Moreover, the grid coordinates may be rounded if they are very close to a regular grid (see regular_tolerance).
dt = data.table(lon = 1:4, lat = rep(1:2, each = 2), some_data = runif(4))
print(dt)
attr(dt, 'grid')
set_spatial_grid(dt)
attr(dt, 'grid')
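A sketch of a regular but incomplete grid in the sense described above: one corner of the 2x2 rectangle is missing, so the grid should be flagged as regular but not complete (the exact attribute values depend on the implementation).
# regular but incomplete grid: one of the four corner points is missing
dt_inc = data.table(lon = c(1, 2, 1), lat = c(1, 1, 2), some_data = runif(3))
set_spatial_grid(dt_inc)
attr(dt_inc, 'grid')   # expect regular = TRUE, complete = FALSE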
returns all column names indicating a spatial coordinate.
space_dimvars(dt = NULL)
space_dimvars(dt = NULL)
dt |
optional. You can provide a data table, then the function returns the names of spatial coordinate columns in this data table. |
Character vector with column names.
space_dimvars()
space_dimvars()
Values below 1 indicate a lack of resolution or overconfidence, 1 is perfect, and values above 1 indicate underconfidence. This score requires more datapoints to be robust than e.g. the Ignorance or Brier score. Therefore the default is to pool the data in space and only calculate one score per season.
SRC( dt, f = c("below", "normal", "above"), o = tc_cols(dt), by = by_cols_terc_fc_score_sp(dt), pool = c("year", space_dimvars(dt)), dim.check = TRUE )
SRC( dt, f = c("below", "normal", "above"), o = tc_cols(dt), by = by_cols_terc_fc_score_sp(dt), pool = c("year", space_dimvars(dt)), dim.check = TRUE )
dt |
Data table containing the predictions. |
f |
column names of the prediction. |
o |
column name of the observations (either in obs_dt, or in dt if obs_dt = NULL). The observation column needs to
contain -1 if it falls into the first category (corresponding to |
by |
column names of grouping variables, all of which need to be columns in dt. Default is to group by all instances of month, season, lon, lat, system and lead_time that are columns in dt. |
pool |
column name(s) for the variable(s) along which is averaged, typically just 'year'. |
dim.check |
Logical. If TRUE, the function tests whether the data table contains only one row per coordinate-level, as should be the case. |
A data table with the scores
dt = data.table(below = c(0.5,0.3,0), normal = c(0.3,0.3,0.7), above = c(0.2,0.4,0.3), tc_cat = c(-1,0,0), year = 1:3)
print(dt)
SRC(dt)
which column names are interpreted as observed tercile categories
tc_cols(dt = NULL)
tc_cols(dt = NULL)
dt |
optional data table. If provided, the function guesses which column contains the observations. Else it returns all recognized column names. |
Character vector with column names.
tc_cols()
tc_cols()
Tendency diagram from a data table containing tercile forecasts.
tendency_diag( dt, f = c("below", "normal", "above"), o = tc_cols(dt), by = NULL, pool = setdiff(dimvars(dt), by), dim.check = TRUE )
tendency_diag( dt, f = c("below", "normal", "above"), o = tc_cols(dt), by = NULL, pool = setdiff(dimvars(dt), by), dim.check = TRUE )
dt |
Data table containing tercile forecasts |
f |
column names of the prediction columns |
o |
column name of the observation column |
by |
column names of grouping variables. Default is to not group. |
pool |
column names of pooling variables (used for the dimension check). Default is all dimvars. |
dim.check |
Logical. If TRUE, the function checks whether the columns in by and pool span the entire data table. |
If by == NULL a gg object, otherwise a list of gg objects that can be plotted by ggpubr::ggarrange (for example)
dt = data.table(below = c(0.5,0.3,0), normal = c(0.3,0.3,0.7), above = c(0.2,0.4,0.3), tc_cat = c(-1,0,0), lon = 1:3)
print(dt)
pp = tendency_diag(dt)
if(interactive()) plot(pp)
Function for plotting observed tercile categories on a map.
tercile_plot( dt, data_col = tc_cols(dt), mn = NULL, low = "orange", mid = "cyan", high = "green1", name = "", labels = c("Wetter", "Average", "Drier"), na.value = "white", extent = NULL, expand.x = c(-0.5, 0.5), expand.y = c(-0.5, 2), dimension_check = TRUE )
tercile_plot( dt, data_col = tc_cols(dt), mn = NULL, low = "orange", mid = "cyan", high = "green1", name = "", labels = c("Wetter", "Average", "Drier"), na.value = "white", extent = NULL, expand.x = c(-0.5, 0.5), expand.y = c(-0.5, 2), dimension_check = TRUE )
dt |
data table |
data_col |
Name of the column containing the observed tercile category |
mn |
optional title for the plot. |
low , mid , high
|
colors for the three categories |
name |
optional title for the colorscale |
labels |
How to label the three categories |
na.value |
How to color missing values |
extent |
Optional vector of length 4 specifying the plotting borders in order xmin, xmax, ymin, ymax. |
expand.x , expand.y
|
How far should the plotting borders be extended (beyond the data range)? |
dimension_check |
Logical. By default the function checks that there are not multiple values per coordinate
(and subsets to the first level if there are several, e.g. to the first year and month (by appearance in |
dt = chirps_monthly[month == 12 & lat < 0 & year == 2018]
p = tercile_plot(dt = dt)
if(interactive()) plot(p)
The function takes a data table containing ensemble predictions and reduces it to predicted tercile probabilities.
The data table should either have a column 'tercile_cat', or it will be generated in the process (by add_tercile_cat).
In particular, if you don't know the tercile category of the ensemble predictions, your data table should contain hindcasts as well,
such that the tercile categories are calculated correctly.
The probability for 'below', for example, is the fraction of ensemble members predicting below normal (for this coordinate).
tfc_from_efc(dt, by = setdiff(dimvars(dt), "member"), keep_cols = NULL, ...)
tfc_from_efc(dt, by = setdiff(dimvars(dt), "member"), keep_cols = NULL, ...)
dt |
The data table. |
by |
Names of columns to group by. |
keep_cols |
A vector of column names that you want to keep. Column names in by are kept automatically. |
... |
passed on to |
A new data table with tercile forecasts
test_dt = ecmwf_monthly[lat < 0 & month == 11]
tfc = tfc_from_efc(test_dt)
This function wraps tfc_plot(), but uses a different map for borders. The map is part of the package and is the one currently used during GHACOFs at ICPAC.
tfc_gha_plot( ..., expand.x = c(-0.5, 0.5), expand.y = c(-0.5, 2), showplot = TRUE )
tfc_gha_plot( ..., expand.x = c(-0.5, 0.5), expand.y = c(-0.5, 2), showplot = TRUE )
... , expand.x , expand.y , showplot
|
passed to |
dt = tfc_from_efc(ecmwf_monthly[month == 11 & lat < 0])
pp = tfc_gha_plot(dt[year == 2018], expand.y = c(0.5, 0.5))
if(interactive()) plot(pp)
Plots spatial maps of tercile forecasts. Requires a data table with three columns "below", "normal", "above" which sum to 1.
For each gridpoint, only the highest of the three values is plotted, so there are three colorscales.
tfc_plot( dt, discrete_cs = TRUE, rmax = NULL, below = "brown", normal = "gold", above = "forestgreen", na.value = "gray75", cs_names = c("below", "normal", "above"), oob = NULL, guide_barwidth = grid::unit(0.01, units = "npc"), guide_barheight = grid::unit(0.15, units = "npc"), legend_horizontal = FALSE, binwidth = "auto", add_map = TRUE, extent = NULL, expand.x = c(0, 0), expand.y = c(0, 0), showplot = TRUE, dimension_check = TRUE )
tfc_plot( dt, discrete_cs = TRUE, rmax = NULL, below = "brown", normal = "gold", above = "forestgreen", na.value = "gray75", cs_names = c("below", "normal", "above"), oob = NULL, guide_barwidth = grid::unit(0.01, units = "npc"), guide_barheight = grid::unit(0.15, units = "npc"), legend_horizontal = FALSE, binwidth = "auto", add_map = TRUE, extent = NULL, expand.x = c(0, 0), expand.y = c(0, 0), showplot = TRUE, dimension_check = TRUE )
dt |
Data table containing the data for plotting. |
discrete_cs |
Logical. Do you want to use discrete color scales (default) or not. |
rmax |
Optional value to fix the range of the colorscale (lower limit is always 0.33). |
below , normal , above
|
Colors to use for the three categories. Default is |
na.value |
Color to use for missing value. Default is |
cs_names |
Character vector of length three giving the legend titles for the below-, normal-, and above category. |
oob |
Behavior for data above |
guide_barwidth , guide_barheight
|
value to specify the width and height of the color guide. Are flipped if |
legend_horizontal |
Logical. Set to |
binwidth |
Width of the steps when a discrete colorscale is used. |
add_map |
logical, defaults to |
extent |
An optional four-element vector in the order xmin,xmax,ymin,ymax for specifying the spatial extent of the plot. Default is to fit the extent to the data. |
expand.x , expand.y
|
vectors with two entries to be added to xlims/ylims of the plot. E.g. expand.x = c(-0.5,0.5) expands the plot by half a longitude both on the right and left hand side. |
showplot |
Logical. Should the plot be displayed at the end? |
dimension_check |
Logical. By default the function checks that there are not multiple values per coordinate
(and subsets to the first level if there are several, e.g. to the first year and month (by appearance in |
a ggplot object.
Claudio Heinrich
# dt = tfc_from_efc(ecmwf_monthly[month == 11 & lat < 0])
# pp = tfc_plot(dt[year == 2018])
# if(interactive()) plot(pp)
returns all column names indicating a temporal coordinate.
time_dimvars(dt = NULL)
time_dimvars(dt = NULL)
dt |
optional. You can provide a data table, then the function returns the names of temporal coordinate columns in this data table. |
Character vector with column names.
time_dimvars()
time_dimvars()
this is mostly auxiliary and called from download_chirps_monthly. Uses the function upscale_regular_lon_lat, but derives the weights for upscaling only once for efficiency and avoids simultaneous loading of all CHIRPS data.
upscale_chirps( update = TRUE, years = NULL, months = NULL, upscale_grid = data.table(expand.grid(lon = seq(GHA_extent()[1], GHA_extent()[2], 0.5), lat = seq(GHA_extent()[3], GHA_extent()[4], 0.5))), root_dir = NULL, version = "UCSB", us_dir = file.path(root_dir, "upscaled") )
upscale_chirps( update = TRUE, years = NULL, months = NULL, upscale_grid = data.table(expand.grid(lon = seq(GHA_extent()[1], GHA_extent()[2], 0.5), lat = seq(GHA_extent()[3], GHA_extent()[4], 0.5))), root_dir = NULL, version = "UCSB", us_dir = file.path(root_dir, "upscaled") )
update |
Logical, if TRUE, files that have already been upscaled are skipped |
years , months
|
Which years and months do you want to upscale? NULL upscales everything there is (except if update is TRUE). |
upscale_grid |
A regular lon/lat grid for upscaling. Defaults to half degrees. |
root_dir |
directory where the high-resolution file is stored. |
version |
Version specifier, should be 'UCSB' or 'ICPAC'. The latter only works if you have access to CHIRPS blended. |
us_dir |
Directory where the low-resolution file will be stored. |
Nothing.
if(interactive()){ upscale_chirps() }
if(interactive()){ upscale_chirps() }
Upscales data from one regular lon-lat grid to another lon-lat grid that is coarser or of the same resolution. It uses conservative interpolation (rather than bilinear interpolation) which is the better choice for upscaling, see details below. If the fine grid and coarse grid are of the same resolution but shifted, results are (almost) identical to bilinear interpolation (almost because bilinear interpolation does not account for the fact that grid cells get smaller towards the pole, which this function does).
The function addresses the following major challenges:
The fine grid does not need to be nested in the coarse grid, creating different partial overlap scenarios. Therefore, the value of each fine grid cell may contribute to multiple (up to four) coarse grid cells.
Grid cell area varies with latitude, grid cells at the equator are much larger than at the poles. This affects the contribution of grid cells (grid cells closer to the pole contribute less to the coarse grid cell average).
Frequently, it is required to upscale repeated data between the same grids, for example when you want to upscale observations for many different years. In this case, the calculation of grid cell overlaps is only done once, and not repeated every time.
For coarse grid cells that are only partially covered, a minimal required fraction of coverage can be specified.
It is memory efficient: Naive merging of data tables or distance-based matching of grid cells is avoided, since it results in unnecessary large lookup tables that may not fit into memory when both your fine and your coarse grid are high-resolution.
upscale_regular_lon_lat( dt, coarse_grid, uscols, bycols = setdiff(dimvars(dt), c("lon", "lat")), save_weights = NULL, req_frac_of_coverage = 0 )
upscale_regular_lon_lat( dt, coarse_grid, uscols, bycols = setdiff(dimvars(dt), c("lon", "lat")), save_weights = NULL, req_frac_of_coverage = 0 )
dt |
data table containing the data you want to upscale. |
coarse_grid |
data table containing lons/lats of the grid you want to upscale to. |
uscols |
column name(s) of the data you want to upscale (can take multiple columns at once, but assumes that the different columns have missing values at the same position). |
bycols |
optional column names for grouping if you have repeated data on the same grid, e.g. use bycols = 'date' if your data table contains observations for many dates on the same grid (and the column specifying the date is in fact called 'date'). |
save_weights |
optional file name for saving the weights for upscaling. Used for the CHIRPS data. |
req_frac_of_coverage |
Numeric value between 0 and 1. All coarse grid cells with less coverage than this value get assigned a missing value. In particular, setting this to 0 (the default) means a value is assigned to each coarse grid cell that overlaps with at least one fine grid cell. Setting this to 1 means only coarse grid cells are kept for which we have full coverage. |
Bilinear interpolation is generally not appropriate for mapping data from finer to coarser grids. The reason is that in bilinear interpolation the value of a coarse grid cell only depends on the four fine grid cells surrounding its center coordinate, even though many fine grid cells may overlap the coarse grid cell. Conservative interpolation calculates the coarse grid cell value by averaging all fine grid cells overlapping with it, weighted by the fraction of overlap. This is the appropriate way of upscaling when predictions and observations constitute grid point averages, which is usually the case (Göber et al. 2008).
The grids are assumed to be regular, but are not required to be complete (see set_spatial_grid). The function is faster when missing-data grid points are not contained in dt (then fewer grid points need to be matched).
A data table with the upscaled values.
Göber, M., Ervin Z., and Richardson, D.S. (2008): "Could a perfect model ever satisfy a naïve forecaster? On grid box mean versus point verification." Meteorological Applications: A journal of forecasting, practical applications, training techniques and modelling 15, no. 3 (2008): 359-365.
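A minimal usage sketch, upscaling the example precipitation data to a coarser one-degree grid. The target grid and the assumption that the precipitation column of chirps_monthly is called 'prec' are made up for illustration and may need adjusting.
# Sketch: upscale monthly precipitation to a 1-degree grid (illustration only;
# the column name 'prec' and the target grid are assumptions).
coarse_grid = data.table(expand.grid(lon = seq(34, 42, 1), lat = seq(-5, 0, 1)))
dt_fine = chirps_monthly[year == 2020 & month == 11]
dt_coarse = upscale_regular_lon_lat(dt_fine, coarse_grid, uscols = 'prec')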
For each location, the map shows whether the observed value was below, normal, or above average. This makes it possible to visually compare with the usual tercile forecasts.
ver_map( dt, o = obs_cols(dt), yy = dt[, max(year)], climatology_period = unique(dt[, year]), out_file = NULL )
ver_map( dt, o = obs_cols(dt), yy = dt[, max(year)], climatology_period = unique(dt[, year]), out_file = NULL )
dt |
input data table. This has to contain the observations for the year to plot, as well as for many other years (which are used to calculate the climatological reference).
The data table should have columns named |
o |
name of the column containing the observation. |
yy |
The year for which to show the verification map. Defaults to the last year available in dt |
climatology_period |
which years should the climatology be calculated on? Defaults to all years (except |
out_file |
optional path and file name (including valid filetype, like .pdf or .png) for saving the file. If not provided, the function just shows the plot in the running R session. |
a gg object
# takes some time:
pp = ver_map(chirps_monthly[month == 11], yy = 2018)
if(interactive()) plot(pp)
The quantiles should be computed and saved by the function chirps_ver_map_quantiles.
ver_map_chirps( mm = month(Sys.Date() - 60), yy = year(Sys.Date() - 60), version = "UCSB", resolution = "low", ... )
ver_map_chirps( mm = month(Sys.Date() - 60), yy = year(Sys.Date() - 60), version = "UCSB", resolution = "low", ... )
yy , mm
|
The year and month for which to show the verification map. Defaults to the month 60 days ago (in order to avoid using preliminary data). |
version |
which CHIRPS version to use. |
resolution |
Spatial resolution, 'high' or 'low' |
... |
passed on to ver_map. |
A gg object
# takes a while:
if(interactive()) ver_map_chirps(mm = 12, yy = 2022)