--- author: "Nuria Perez" date: "`r Sys.Date()`" revisor: "Eva Rifà" revision date: "October 2023" output: rmarkdown::html_vignette vignette: > %\VignetteEngine{knitr::knitr} %\VignetteIndexEntry{Data Storage and Retrieval} %\usepackage[utf8]{inputenc} --- Data Storage and Retrieval ----------------------------------------- CSTools is aim for post-processing seasonal climate forecast with state-of-the-art methods. However, some doubts and issues may arise the first time using the package: do I need an specific R version? how much RAM memory I need? where I can find the datasets? Should I format the datasets? etc. Therefore, some recommendations and key points to take into account are gathered here in order to facilitate the use of CSTools. ### 1. System requirements The first question may come to a new user is the requirements of my computer to run CSTools. Here, the list of most frequent needs: - netcdf library version 4.1 or later - cdo (I am currently using 1.6.3) - R 3.4.2 or later On the other hand, the computational power of a computer could be a limitation, but it will depends on the size of the data that the users need for their analysis. For instance, they can estimate the memory they will require by multiplying the following values: - Area of the study region (km^2) - Area of the desired grid cell (km) (or square of the grid cell size) - Number of models + 1 observational dataset - Forecast time length (days or months) - Temporal resolution (1 for daily, 4 for 6 hourly or 24 for daily data) - Hindcast length (years) - Number of start date (or season) - Number of members - Extra factor for functions computation(*) For example, if they want to use the hindcast of 3 different seasonal simulations with 9 members, in daily resolution, for performing a regional study let's say in a region of 40000 km2 with a resolution of 5 km: > 200km x 200km / (5km * 5km) * (3 + 1) models * 214 days * 30 hindcast years * 9 members x 2 start dates x 8 bytes ~ 6 GB (*)Furthermore, some of the functions need to duplicated or triplicate (even more) the inputs for performing their analysis. Therefore, between 12 and 18 GB of RAM memory would be necessary, in this example. ### 2. Overview of CSTools structure All CSTools functions have been developed following the same guidelines. The main point, interesting for the users, is that that one function is built on several nested levels, and it is possible to distinguish at least three levels: - `CST_FunctionName()` this function works on s2dv_cube objects which is exposed to the users. - `FunctionName()`this function works on N-dimensional arrays with named dimensions and it is exposed to the users. - lower level functions such as `.functionname()` which works in the minimum required elements and it is not exposed to the user. A reasonable important doubt that a new user may have at this point is: what 's2dv_cube' object is? 's2dv_cube' is a class of an object storing the data and metadata in several elements: + $data element is an N-dimensional array with named dimensions containing the data (e.g.: temperature values), + $dims vector with the dimensions of $data + $coords is a named list with elements of the coordinates vectors corresponding to the dimensions of the $data, + $attrs is a named list with elements corresponding to attributes of the object. 
### 2. Overview of CSTools structure

All CSTools functions have been developed following the same guidelines. The main point of interest for the users is that each functionality is built on several nested levels, and it is possible to distinguish at least three levels:

- `CST_FunctionName()` works on 's2dv_cube' objects and is exposed to the users.
- `FunctionName()` works on N-dimensional arrays with named dimensions and is also exposed to the users.
- Lower-level functions, such as `.functionname()`, work on the minimum required elements and are not exposed to the users.

A reasonable doubt that a new user may have at this point is: what is an 's2dv_cube' object? 's2dv_cube' is a class of object that stores the data and metadata in several elements:

+ $data is an N-dimensional array with named dimensions containing the data (e.g. temperature values),
+ $dims is a vector with the dimensions of $data,
+ $coords is a named list whose elements are the coordinate vectors corresponding to the dimensions of $data,
+ $attrs is a named list whose elements correspond to the attributes of the object. It has the following elements:
  + $Variable is a list with the variable name in the $varName element and the metadata of all the variables in the $metadata element,
  + $Dates is an array with the dates of the $data element,
  + other elements for extra metadata information.

It is possible to visualize an example of the structure of an 's2dv_cube' object by opening an R session and running:

```
library(CSTools)
class(lonlat_temp_st$exp) # check the class of the object lonlat_temp_st$exp
names(lonlat_temp_st$exp) # show the names of the elements in the object lonlat_temp_st$exp
str(lonlat_temp_st$exp) # show the full structure of the object lonlat_temp_st$exp
```

### 3. Data storage recommendations

The main objective of CSTools is to share state-of-the-art post-processing methods with the scientific community. However, in order to facilitate its use, the CSTools package includes a function, `CST_Load`, to read the files and make the data available as an 's2dv_cube' in the R session memory to conduct the analysis. Some benefits of using this function are:

- CST_Load can read multiple experimental or observational datasets at once,
- CST_Load can regrid all datasets to a common grid,
- CST_Load can reformat observational datasets into the same structure as the experiments (i.e. matching start dates and forecast lead times between experiments and observations) or keep the observations as a regular time series (i.e. a continuous temporal dimension),
- CST_Load can subset a region from global files,
- CST_Load can read multiple members in monthly, daily or other resolutions,
- CST_Load can perform spatial averages over a defined region or return the lat-lon grid, and
- CST_Load can read from files using multiple parallel processes, among other possibilities.

CSTools also has the function `CST_Start`, based on [startR](https://CRAN.R-project.org/package=startR), which is more flexible than `CST_Load`. We recommend using `CST_Start` since it is more efficient and flexible. If you plan to use `CST_Load` or `CST_Start`, we have developed guidelines to download and format the data. See [CDS_Seasonal_Downloader](https://earth.bsc.es/gitlab/es/cds-seasonal-downloader).

There are alternatives to these functions. For instance, the user can:

1) use another tool to read the data from files (e.g. the ncdf4, easyNCDF or startR packages) and then convert them to the class 's2dv_cube' with the `s2dv_cube()` function, or
2) if they keep facing problems converting the data to that class, they can just skip it and work with the functions without the 'CST_' prefix; in this case, they will work with the basic class 'array' (see the sketch at the end of this section).

Independently of the tool used to read the data from your local storage into your R session, this step can be automatized by giving a common structure and format to all the datasets in your local storage. Here is the list of minimum requirements that `CST_SaveExp` follows so that a stored experiment can later be loaded with `CST_Load`:

- it creates one NetCDF file per start date, named with the variable name and the start date: `$VARNAME$_$YEAR$$MONTH$.nc`, and
- each file has the dimensions: lon, lat, ensemble and time.
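As a minimal sketch of these alternatives, the lines below read one hypothetical file with the ncdf4 package and name the array dimensions so that the functions without the 'CST_' prefix can operate on it. The file name, variable name and dimension order are assumptions and must be adapted to your own files:

```r
library(ncdf4)

# Hypothetical file following the naming convention $VARNAME$_$YEAR$$MONTH$.nc
nc <- nc_open("tas_199305.nc")
data <- ncvar_get(nc, "tas")  # assumed to come out as lon x lat x ensemble x time
nc_close(nc)

# Name the dimensions: the functions without the 'CST_' prefix work on
# N-dimensional arrays with named dimensions
names(dim(data)) <- c("lon", "lat", "ensemble", "time")
dim(data)
```

From this point, the array can either be passed directly to the array-based functions or, together with its coordinates and metadata, be converted to the 's2dv_cube' class with `s2dv_cube()`.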
### 4. CST_Load example

```
library(CSTools)
library(zeallot)
path <- "/esarchive/exp/meteofrance/system6c3s/$STORE_FREQ$_mean/$VAR_NAME$_f6h/$VAR_NAME$_$START_DATE$.nc"
ini <- 1993
fin <- 2012
month <- '05'
start <- as.Date(paste(ini, month, "01", sep = ""), "%Y%m%d")
end <- as.Date(paste(fin, month, "01", sep = ""), "%Y%m%d")
dateseq <- format(seq(start, end, by = "year"), "%Y%m%d")

c(exp, obs) %<-% CST_Load(var = 'sfcWind',
                          exp = list(list(name = 'meteofrance/system6c3s', path = path)),
                          obs = 'erainterim',
                          sdates = dateseq, leadtimemin = 2, leadtimemax = 4,
                          lonmin = -19, lonmax = 60.5, latmin = 0, latmax = 79.5,
                          storefreq = "daily", sampleperiod = 1, nmember = 9,
                          output = "lonlat", method = "bilinear",
                          grid = "r360x180")
```

### 5. CST_Start example

```r
path_exp <- paste0('/esarchive/exp/meteofrance/system6c3s/monthly_mean/',
                   '$var$_f6h/$var$_$sdate$.nc')
sdates <- sapply(1993:2012, function(x) paste0(x, '0501'))

lonmax <- 60.5
lonmin <- -19
latmax <- 79.5
latmin <- 0
exp <- CST_Start(dataset = path_exp,
                 var = 'sfcWind',
                 ensemble = startR::indices(1:9),
                 sdate = sdates,
                 time = startR::indices(1:3),
                 latitude = startR::values(list(latmin, latmax)),
                 latitude_reorder = startR::Sort(decreasing = TRUE),
                 longitude = startR::values(list(lonmin, lonmax)),
                 longitude_reorder = startR::CircularSort(0, 360),
                 synonims = list(longitude = c('lon', 'longitude'),
                                 latitude = c('lat', 'latitude')),
                 return_vars = list(latitude = NULL,
                                    longitude = NULL, time = 'sdate'),
                 retrieve = TRUE)

path_obs <- paste0('/esarchive/recon/ecmwf/erainterim/daily_mean/',
                   '$var$_f6h/$var$_$sdate$.nc')
dates <- as.POSIXct(sdates, format = '%Y%m%d', tz = 'UTC')
obs <- CST_Start(dataset = path_obs,
                 var = 'sfcWind',
                 sdate = unique(format(dates, '%Y%m')),
                 time = startR::indices(2:4),
                 latitude = startR::values(list(latmin, latmax)),
                 latitude_reorder = startR::Sort(decreasing = TRUE),
                 longitude = startR::values(list(lonmin, lonmax)),
                 longitude_reorder = startR::CircularSort(0, 360),
                 synonims = list(longitude = c('lon', 'longitude'),
                                 latitude = c('lat', 'latitude')),
                 transform = startR::CDORemapper,
                 transform_extra_cells = 2,
                 transform_params = list(grid = 'r360x181',
                                         method = 'conservative'),
                 transform_vars = c('latitude', 'longitude'),
                 return_vars = list(longitude = NULL,
                                    latitude = NULL, time = 'sdate'),
                 retrieve = TRUE)
```

Extra lines to see the size of the objects and visualize the data:

```
library(pryr)
object_size(exp)
# 27.7 MB
object_size(obs)
# 3.09 MB

library(s2dv)
PlotEquiMap(exp$data[1, 1, 1, 1, 1, , ], lon = exp$coords$longitude, lat = exp$coords$latitude,
            filled.continents = FALSE, fileout = "Meteofrance_r360x180.png")
```

![Meteofrance](../vignettes/Figures/Meteofrance_r360x180.png)

### Managing big datasets and memory issues

Depending on the user's needs, limitations can be found when trying to process big datasets, e.g. depending on the number of ensemble members, the resolution and the region that the user wants to process. CSTools has been developed to be compatible with the startR package, which covers these aims:

- retrieving data from NetCDF files to RAM memory in a flexible way,
- automatically dividing datasets into pieces to perform an analysis while avoiding memory issues, and
- running the workflow on your local machine or submitting it to an HPC cluster.

This is especially useful when a user does not have access to an HPC and must work with a small amount of RAM:

![](./Figures/CSTvsNonCST.png)

There is a [video tutorial](https://earth.bsc.es/wiki/lib/exe/fetch.php?media=tools:startr_tutorial_2020.mp4) about the startR package, together with the tutorial material.
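To give an idea of the chunking mentioned above, the sketch below declares the same experiment as in the `CST_Start` example without loading it into memory (`retrieve = FALSE`) and applies a simple operation by pieces with startR's `Step()`, `AddStep()` and `Compute()`. This is a minimal, untested sketch under the assumptions of the example above; the path, dimension names, operation and chunk sizes are only illustrative and should be adapted (see the startR documentation for complete workflows):

```r
library(startR)

path_exp <- paste0('/esarchive/exp/meteofrance/system6c3s/monthly_mean/',
                   '$var$_f6h/$var$_$sdate$.nc')
sdates <- sapply(1993:2012, function(x) paste0(x, '0501'))

# Declare the data without retrieving it into memory
data <- Start(dataset = path_exp,
              var = 'sfcWind',
              ensemble = indices(1:9),
              sdate = sdates,
              time = indices(1:3),
              latitude = values(list(0, 79.5)),
              latitude_reorder = Sort(decreasing = TRUE),
              longitude = values(list(-19, 60.5)),
              longitude_reorder = CircularSort(0, 360),
              synonims = list(longitude = c('lon', 'longitude'),
                              latitude = c('lat', 'latitude')),
              return_vars = list(latitude = NULL, longitude = NULL,
                                 time = 'sdate'),
              retrieve = FALSE)

# Define the operation applied to each piece: here, anomalies along time
step <- Step(function(x) x - mean(x),
             target_dims = 'time', output_dims = 'time')
wf <- AddStep(data, step)

# Run the workflow locally, splitting the dataset into pieces along latitude
res <- Compute(wf, chunks = list(latitude = 2))
```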
The functions in CSTools (with or without the CST_ prefix) include a parameter called 'ncores' that automatically parallelizes the code over multiple cores when it is set to a value greater than one.
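For instance, a calibration of the sample data shipped with CSTools could be run on four cores as sketched below. This is only an illustration of the 'ncores' parameter: it assumes the default calibration options and that the sample object `lonlat_temp_st` contains matching `$exp` and `$obs` elements, as used earlier in this vignette:

```r
library(CSTools)

# Calibrate the sample experiment against the sample observations,
# distributing the computation over 4 cores
cal <- CST_Calibration(exp = lonlat_temp_st$exp, obs = lonlat_temp_st$obs,
                       ncores = 4)
```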