Title: | Effects of External Conditions on Air Quality |
---|---|
Description: | Analyzes the impact of external conditions on air quality using counterfactual approaches, featuring methods for data preparation, modeling, and visualization. |
Authors: | Raphael Franke [aut], Imke Voss [aut, cre] |
Maintainer: | Imke Voss <[email protected]> |
License: | GPL (>= 3) |
Version: | 1.1.0 |
Built: | 2025-02-27 07:21:05 UTC |
Source: | CRAN |
Model-agnostic function to calculate a number of common performance metrics on the reference time window. Uses the true data (column value) and the predictions (column prediction) for this calculation. The coverage is calculated from the columns value, prediction_lower and prediction_upper.
Removes dates in the effect and buffer range, as the model is not expected to perform correctly for these times. This incorrectness is precisely what we use for estimating the effect.
calc_performance_metrics(predictions, date_effect_start = NULL, buffer = 0)
predictions |
data.table or data.frame with the following columns
|
date_effect_start |
A date. Start date of the effect that is to be evaluated. The data from this point onwards is disregarded for calculating model performance |
buffer |
Integer. An additional buffer window before date_effect_start to account for uncertainty in the effect start point. Disregards additional buffer data points for model evaluation |
Named vector with performance metrics of the model
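A minimal sketch of a call on mock data (not part of the original documentation): the column names follow the interface described above, while the dates, values, and the fixed-width interval are illustrative assumptions.

```r
# Sketch: mock hourly predictions with the documented columns.
library(data.table)
set.seed(1)
predictions <- data.table(
  date = seq(as.POSIXct("2023-01-01"), by = "hour", length.out = 240),
  value = rnorm(240, mean = 40, sd = 5),
  prediction = rnorm(240, mean = 40, sd = 5)
)
# Fixed-width mock interval, only to populate the coverage columns.
predictions[, prediction_lower := prediction - 8]
predictions[, prediction_upper := prediction + 8]

# Evaluate the reference window only: data from the (assumed) effect
# start onwards, plus a 24-hour buffer before it, is disregarded.
metrics <- calc_performance_metrics(
  predictions,
  date_effect_start = as.POSIXct("2023-01-08"),
  buffer = 24
)
print(metrics)
```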
Helps with analyzing predictions by comparing them with the true values on a number of relevant summary statistics.
calc_summary_statistics(predictions, date_effect_start = NULL, buffer = 0)
predictions |
Data.table or data.frame with the following columns
|
date_effect_start |
A date. Start date of the effect that is to be evaluated. The data from this point onwards is disregarded for calculating model performance |
buffer |
Integer. An additional buffer window before date_effect_start to account for uncertainty in the effect start point. Disregards additional buffer data points for model evaluation |
data.frame of summary statistics with columns true and prediction
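As a sketch on the same kind of mock input (illustrative data, not from the original documentation):

```r
# Sketch: compare predictions with true values on the reference window.
library(data.table)
set.seed(1)
predictions <- data.table(
  date = seq(as.POSIXct("2023-01-01"), by = "hour", length.out = 240),
  value = rnorm(240, mean = 40, sd = 5),
  prediction = rnorm(240, mean = 40, sd = 5)
)
# Data from the (assumed) effect start onwards, plus a 24-hour buffer,
# is disregarded for the summary statistics.
stats <- calc_summary_statistics(
  predictions,
  date_effect_start = as.POSIXct("2023-01-08"),
  buffer = 24
)
print(stats)
```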
Cleans a data table of environmental measurements by filtering for a specific station, removing duplicates, and optionally aggregating the data on a daily basis using the mean.
clean_data(env_data, station, aggregate_daily = FALSE)
env_data |
A data table in long format. Must include columns:
|
station |
Character. Name of the station to filter by. |
aggregate_daily |
Logical. If |
Duplicate rows (by date, Komponente, and Station) are removed. A warning is issued if duplicates are found.
A data.table:
If aggregate_daily = TRUE: Contains columns for station, component, day, year, and the daily mean value of the measurements.
If aggregate_daily = FALSE: Contains cleaned data with duplicates removed.
# Example data
env_data <- data.table::data.table(
  Station = c("DENW094", "DENW094", "DENW006", "DENW094"),
  Komponente = c("NO2", "O3", "NO2", "NO2"),
  Wert = c(45, 30, 50, 40),
  date = as.POSIXct(c(
    "2023-01-01 08:00:00", "2023-01-01 09:00:00",
    "2023-01-01 08:00:00", "2023-01-02 08:00:00"
  )),
  Komponente_txt = c(
    "Nitrogen Dioxide", "Ozone", "Nitrogen Dioxide", "Nitrogen Dioxide"
  )
)
# Clean data for station DENW094 without aggregation
cleaned_data <- clean_data(env_data, station = "DENW094", aggregate_daily = FALSE)
print(cleaned_data)
Copies the default params.yaml file, included with the package, to a specified destination directory. This is useful for initializing parameter files for custom edits.
copy_default_params(dest_dir)
dest_dir |
Character. The path to the directory where the |
The params.yaml file contains default model parameters for various configurations such as LightGBM, dynamic regression, and others. See the load_params() documentation for an example of the file's structure.
Nothing is returned. A message is displayed upon successful copying.
copy_default_params(tempdir())
Takes a list of train and application data as prepared by split_data_counterfactual() and removes a polynomial, exponential or cubic spline trend function. The trend is obtained from the train data only. Use as part of preprocessing before training a model based on decision trees, i.e. random forest and lightgbm. For the other methods it may be helpful, but they are generally able to deal with trends themselves. Therefore we recommend trying out different versions and guiding decisions using the model evaluation metrics from calc_performance_metrics().
detrend(split_data, mode = "linear", num_splines = 5, log_transform = FALSE)
split_data |
List of two named dataframes called train and apply |
mode |
String which defines type of trend is present. Options are "linear", "quadratic", "exponential", "spline", "none". "none" returns original data |
num_splines |
Defines the number of cubic splines if |
log_transform |
If |
Apply retrend_predictions() to the predictions to return to the original data units.
List of 3 elements: two dataframes (detrended train and apply) and the trend model.
data(mock_env_data)
split_data <- list(
  train = mock_env_data[1:80, ],
  apply = mock_env_data[81:100, ]
)
detrended_list <- detrend(split_data, mode = "linear")
detrended_train <- detrended_list$train
detrended_apply <- detrended_list$apply
trend <- detrended_list$model
Calculates an estimate for the absolute and relative effect size of the external effect. The absolute effect is the difference between the model bias in the reference time and the effect time windows. The relative effect is the absolute effect divided by the mean true value in the reference window.
estimate_effect_size(df, date_effect_start, buffer = 0, verbose = FALSE)
df |
Data.table or data.frame with the following columns
|
date_effect_start |
A date. Start date of the effect that is to be evaluated. The data from this point onward is disregarded for calculating model performance. |
buffer |
Integer. An additional buffer window before date_effect_start to account for uncertainty in the effect start point. Disregards additional buffer data points for model evaluation |
verbose |
Prints an explanation of the results if TRUE |
Note: Since the bias of the model is an average over predictions and true values, it is important that the effect window is specified correctly. Imagine a scenario like a fire which strongly affects the outcome for one hour and is gone the next hour. If we use a two-week effect window, the estimated effect will be 14*24 = 336 times smaller compared to using a 1-hour effect window. Generally, we advise against studying very short effects (a single hour or day): the variability of the results will be too large to learn anything meaningful.
A list with two numbers: Absolute and relative estimated effect size.
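As a sketch, the function can be chained with run_counterfactual(); the chosen effect start date and buffer below are illustrative assumptions on the packaged mock data.

```r
# Sketch: effect size from a counterfactual on mock data.
data(mock_env_data)
split_data <- list(
  train = mock_env_data[1:80, ],
  apply = mock_env_data[81:100, ]
)
params <- load_params()
res <- run_counterfactual(split_data, params, detrending_function = "linear")
# Assume the effect starts at observation 90 of the application window.
effect <- estimate_effect_size(
  res$retrended_predictions,
  date_effect_start = mock_env_data$date[90],
  buffer = 5,
  verbose = TRUE
)
print(effect)  # list with absolute and relative effect size
```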
Identifies unique meteorological components from the provided environmental data, filtering only those that match the predefined UBA naming conventions. These components include "GLO", "LDR", "RFE", "TMP", "WIG", "WIR", "WIND_U", and "WIND_V".
get_meteo_available(env_data)
env_data |
Data table containing environmental data. Must contain column "Komponente" |
A vector of available meteorological components.
# Example environmental data
env_data <- data.table::data.table(
  Komponente = c("TMP", "NO2", "GLO", "WIR"),
  Wert = c(25, 40, 300, 50),
  date = as.POSIXct(c(
    "2023-01-01 08:00:00", "2023-01-01 09:00:00",
    "2023-01-01 10:00:00", "2023-01-01 11:00:00"
  ))
)
# Get available meteorological components
meteo_components <- get_meteo_available(env_data)
print(meteo_components)
Reads a YAML file containing model parameters, including station settings, variables, and configurations for various models. If no file path is provided, the function defaults to loading params.yaml from the package's extdata directory.
load_params(filepath = NULL)
filepath |
Character. Path to the YAML file. If |
The YAML file should define parameters in a structured format, such as:
target: 'NO2'
lightgbm:
  nrounds: 200
  eta: 0.03
  num_leaves: 32
dynamic_regression:
  ntrain: 8760
random_forest:
  num.trees: 300
  max.depth: 10
meteo_variables:
  - GLO
  - TMP
A list containing the parameters loaded from the YAML file.
params <- load_params()
This function loads data from CSV files in the specified directory. It supports two formats:
load_uba_data_from_dir(data_dir)
data_dir |
Character. Path to the directory containing |
"inv": Files must contain the following columns:
Station
, Komponente
, Datum
, Uhrzeit
, Wert
.
"24Spalten": Files must contain:
Station
, Komponente
, Datum
, and columns Wert01
, ..., Wert24
.
File names should include "inv" or "24Spalten" to indicate their format. The function scans recursively for .csv files in subdirectories and combines the data into a single data.table in long format. Files that are not in the expected format will be ignored.
A data.table containing the loaded data in long format. Returns an error if no valid files are found or the resulting dataset is empty.
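A minimal usage sketch (the directory path below is a placeholder, not a real location):

```r
# Load all "inv" / "24Spalten" CSV exports found beneath a directory.
# Placeholder path; the function errors if no valid files are found.
env_data <- load_uba_data_from_dir(data_dir = "path/to/uba/exports")
```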
A small dataset of environmental variables created for testing and examples. This dataset includes hourly observations with random values for meteorological and temporal variables.
mock_env_data
A data frame with 100 rows and 12 variables:
POSIXct. Date and time of the observation (hourly increments).
Numeric. Randomly generated target variable.
Numeric. Global radiation in W/m² (random values between 0 and 1000).
Numeric. Temperature in °C (random values between -10 and 35).
Numeric. Rainfall in mm (random values between 0 and 50).
Numeric. Wind speed in m/s (random values between 0 and 20).
Numeric. Wind direction in degrees (random values between 0 and 360).
Numeric. Longwave downward radiation in W/m² (random values between 0 and 500).
Integer. Julian day of the year, ranging from 1 to 10.
Integer. Day of the week, ranging from 1 (Monday) to 7 (Sunday).
Integer. Hour of the day, ranging from 0 to 23.
Numeric. UNIX timestamp (seconds since 1970-01-01 00:00:00 UTC).
Generated within the package for example purposes.
data(mock_env_data) head(mock_env_data)
Smooths the predictions using a rolling mean, prepares the data for plotting, and generates the counterfactual plot for the application window. Data before the red box belong to the reference window, the red box marks the buffer, and values after the black dotted line belong to the effect window.
plot_counterfactual(
  predictions,
  params,
  window_size = 14,
  date_effect_start = NULL,
  buffer = 0,
  plot_pred_interval = TRUE
)
predictions |
The data.table containing the predictions (hourly) |
params |
Parameters for plotting, including the target variable. |
window_size |
The window size for the rolling mean (default is 14 days). |
date_effect_start |
A date. Start date of the effect that is to be evaluated. The data from this point onwards is disregarded for calculating model performance |
buffer |
Integer. An additional, optional buffer window before
|
plot_pred_interval |
Boolean. If |
The optional grey ribbon is a prediction interval for the hourly values. The interpretation for a 90% prediction interval (defined via the alpha parameter of run_counterfactual()) is that 90% of the true hourly values (not the rolled means) lie within the grey band. This can be helpful for getting an idea of the variance of the data and predictions.
A ggplot object with the counterfactual plot. Can be adjusted further, e.g. set limits for the y-axis for better visualisation.
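A sketch of the plotting step on top of run_counterfactual() output; the effect start date, buffer, and the y-axis limits are illustrative assumptions.

```r
# Sketch: counterfactual plot for the mock application window.
data(mock_env_data)
split_data <- list(
  train = mock_env_data[1:80, ],
  apply = mock_env_data[81:100, ]
)
params <- load_params()
res <- run_counterfactual(split_data, params, detrending_function = "linear")
p <- plot_counterfactual(
  res$retrended_predictions,
  params,
  window_size = 14,
  date_effect_start = mock_env_data$date[90],  # assumed effect start
  buffer = 5
)
# The returned ggplot object can be adjusted further, e.g. y-axis limits:
p + ggplot2::ylim(0, 100)
```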
This function produces descriptive time-series plots with smoothing for the meteorological and potential target variables that were measured at a station.
plot_station_measurements(
  env_data,
  variables,
  years = NULL,
  smoothing_factor = 1
)
env_data |
A data table of measurements of one air quality measurement station. The data should contain the following columns:
|
variables |
list of variables to plot. Must be in |
years |
Optional. A numeric vector, list, or a range specifying the years to restrict the plotted data. You can provide:
|
smoothing_factor |
A number that defines the magnitude of smoothing. Default is 1. Smaller numbers correspond to less smoothing, larger numbers to more. |
A ggplot object. This object contains:
A time-series line plot for each variable in variables.
Smoothed lines, with smoothing defined by smoothing_factor.
library(data.table)
env_data <- data.table(
  Station = "Station_1",
  Komponente = rep(c("TMP", "NO2"), length.out = 100),
  Wert = rnorm(100, mean = 20, sd = 5),
  date = rep(
    seq.POSIXt(as.POSIXct("2022-01-01"), by = "hour", length.out = 50),
    each = 2
  ),
  year = 2022,
  Komponente_txt = rep(c("Temperature", "NO2"), length.out = 100)
)
plot <- plot_station_measurements(env_data, variables = c("TMP", "NO2"))
Prepares environmental data by filtering for relevant components, converting the data to a wide format, and adding temporal features. Should be called before split_data_counterfactual().
prepare_data_for_modelling(env_data, params)
env_data |
A data table in long format. Must include the following columns:
|
params |
A list of modelling parameters loaded from
|
A data.table in wide format, with columns: date, one column per component, and temporal features like date_unix, day_julian, weekday, and hour.
env_data <- data.table::data.table( Station = c("StationA", "StationA", "StationA"), Komponente = c("NO2", "TMP", "NO2"), Wert = c(50, 20, 40), date = as.POSIXct(c("2023-01-01 10:00:00", "2023-01-01 11:00:00", "2023-01-02 12:00:00")) ) params <- list(meteo_variables = c("TMP"), target = "NO2") prepared_data <- prepare_data_for_modelling(env_data, params) print(prepared_data)
env_data <- data.table::data.table( Station = c("StationA", "StationA", "StationA"), Komponente = c("NO2", "TMP", "NO2"), Wert = c(50, 20, 40), date = as.POSIXct(c("2023-01-01 10:00:00", "2023-01-01 11:00:00", "2023-01-02 12:00:00")) ) params <- list(meteo_variables = c("TMP"), target = "NO2") prepared_data <- prepare_data_for_modelling(env_data, params) print(prepared_data)
This function rescales the predicted values (prediction, prediction_lower, prediction_upper). The scaling is reversed using the means and standard deviations that were saved from the training data. It is the inverse function to scale_data() and should only be used in combination with it.
rescale_predictions(scale_result, dt_predictions)
scale_result |
A list object returned by |
dt_predictions |
A data frame containing the predictions,
including columns |
A data frame with the predictions and numeric columns rescaled back to their original scale.
data(mock_env_data)
scale_res <- scale_data(
  train_data = mock_env_data[1:80, ],
  apply_data = mock_env_data[81:100, ]
)
params <- load_params()
res <- run_lightgbm(
  train = scale_res$train,
  test = scale_res$apply,
  params$lightgbm,
  alpha = 0.9,
  calc_shaps = FALSE
)
dt_predictions <- res$dt_predictions
rescaled_predictions <- rescale_predictions(scale_res, dt_predictions)
Takes a dataframe of predictions as returned by any of the 'run_model' functions and restores a trend which was previously removed via detrend(). This is necessary for the predictions and the true values to have the same units. The function is essentially the inverse of detrend() and should only be used in combination with it.
retrend_predictions(dt_predictions, trend, log_transform = FALSE)
dt_predictions |
Dataframe of predictions with columns |
trend |
lm object generated by |
log_transform |
Returns values to solution space, if they have been
log transformed during detrending. Use only in combination with |
Retrended dataframe with the same structure as dt_predictions, as returned by any of the run_model() functions.
data(mock_env_data)
split_data <- list(
  train = mock_env_data[1:80, ],
  apply = mock_env_data[81:100, ]
)
params <- load_params()
detrended_list <- detrend(split_data, mode = "linear")
trend <- detrended_list$model
detrended_train <- detrended_list$train
detrended_apply <- detrended_list$apply
result <- run_lightgbm(
  train = detrended_train,
  test = detrended_apply,
  model_params = params$lightgbm,
  alpha = 0.9,
  calc_shaps = FALSE
)
retrended_predictions <- retrend_predictions(result$dt_predictions, trend)
Chains detrending, training of a selected model, prediction and retrending together for ease of use. See documentation of individual functions for details.
run_counterfactual(
  split_data,
  params,
  detrending_function = "none",
  model_type = "rf",
  alpha = 0.9,
  log_transform = FALSE,
  calc_shaps = FALSE
)
split_data |
List of two named dataframes called train and apply |
params |
A list of parameters that define the following:
|
detrending_function |
String which defines type of trend to remove.
Options are "linear","quadratic", "exponential", "spline", "none". See |
model_type |
String to decide which model to use. Current options random forest "rf", gradient boosted decision trees "lightgbm", "dynamic_regression" and feedforward neural network "fnn" |
alpha |
Confidence level of the prediction interval between 0 and 1. |
log_transform |
If TRUE, uses log transformation during detrending and
retrending. For details see |
calc_shaps |
Boolean value. If TRUE, calculate SHAP values for the
method used and format them so they can be visualised with |
Data frame of predictions, model and importance
data(mock_env_data)
split_data <- list(
  train = mock_env_data[1:80, ],
  apply = mock_env_data[81:100, ]
)
params <- load_params()
res <- run_counterfactual(split_data, params, detrending_function = "linear")
prediction <- res$retrended_predictions
random_forest_model <- res$model
This function trains a dynamic regression model with Fourier-transformed temporal features and meteorological variables as external regressors on the specified training dataset and makes predictions on the test dataset in a counterfactual scenario. This is referred to as a dynamic regression model in Forecasting: Principles and Practice, Chapter 10 (Dynamic regression models).
run_dynamic_regression(train, test, params, alpha, calc_shaps)
train |
Dataframe of train data as returned by the |
test |
Dataframe of test data as returned by the |
params |
list of hyperparameters to use in dynamic_regression call. Only uses ntrain to specify the number of data points to use for training. Default is 8760 which results in 1 year of hourly data |
alpha |
Confidence level of the prediction interval between 0 and 1. |
calc_shaps |
Boolean value. If TRUE, calculate SHAP values for the
method used and format them so they can be visualised with |
Note: Runs the dynamic regression model for individualised use with your own data pipeline. Otherwise use run_counterfactual() to call this function.
Data frame of predictions and model
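A standalone call can be sketched in the same way as the documented run_lightgbm() example; the result element names dt_predictions and model mirror that example and are assumed to carry over.

```r
# Sketch: dynamic regression on the packaged mock data.
data(mock_env_data)
params <- load_params()
res <- run_dynamic_regression(
  train = mock_env_data[1:80, ],
  test = mock_env_data[81:100, ],
  params = params$dynamic_regression,  # only ntrain is used
  alpha = 0.9,
  calc_shaps = FALSE
)
prediction <- res$dt_predictions  # assumed element names, as for run_lightgbm()
model <- res$model
```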
Trains a feedforward neural network (FNN) model on the specified training dataset and makes predictions on the test dataset in a counterfactual scenario. The model uses meteorological variables and sin/cosine-transformed features. Scales the data before training and rescales predictions, as the model does not converge with unscaled data.
run_fnn(train, test, params, calc_shaps)
train |
A data frame or tibble containing the training dataset,
including the target variable ( |
test |
A data frame or tibble containing the test dataset on which predictions will be made, using the same meteorological variables as in the training dataset. |
params |
A list of parameters that define the following:
|
calc_shaps |
Boolean value. If TRUE, calculate SHAP values for the
method used and format them so they can be visualised with
|
This function provides flexibility for users with their own data pipelines or workflows. For a simplified pipeline, consider using run_counterfactual().
Experiment with hyperparameters such as learning_rate, batchsize, hidden_layers, and num_epochs to improve performance.
Warning: Using many or large hidden layers in combination with a high number of epochs can lead to long training times.
A list with three elements:
dt_predictions: A data frame containing the test data along with the predicted values:
prediction: The predicted values from the FNN model.
prediction_lower: The same predicted values, as no quantile model is available yet for FNN.
prediction_upper: The same predicted values, as no quantile model is available yet for FNN.
model: The trained FNN model object from the deepnet::nn.train() function.
importance: SHAP importance values (if calc_shaps = TRUE), otherwise NULL.
data(mock_env_data)
params <- load_params()
res <- run_fnn(
  train = mock_env_data[1:80, ],
  test = mock_env_data[81:100, ],
  params,
  calc_shaps = FALSE
)
This function trains a gradient boosting model (lightgbm) on the specified training dataset and makes predictions on the test dataset in a counterfactual scenario. The model uses meteorological variables and temporal features.
run_lightgbm(train, test, model_params, alpha, calc_shaps)
train |
Dataframe of train data as returned by the |
test |
Dataframe of test data as returned by the |
model_params |
list of hyperparameters to use in lgb.train call.
See |
alpha |
Confidence level of the prediction interval between 0 and 1. |
calc_shaps |
Boolean value. If TRUE, calculate SHAP values for the
method used and format them so they can be visualised with |
Note: Runs the gradient boosting model for individualised use with your own data pipeline. Otherwise use run_counterfactual() to call this function.
List with data frame of predictions and model
data(mock_env_data)
split_data <- list(
  train = mock_env_data[1:80, ],
  apply = mock_env_data[81:100, ]
)
params <- load_params()
variables <- c("day_julian", "weekday", "hour", params$meteo_variables)
res <- run_lightgbm(
  train = mock_env_data[1:80, ],
  test = mock_env_data[81:100, ],
  params$lightgbm,
  alpha = 0.9,
  calc_shaps = FALSE
)
prediction <- res$dt_predictions
model <- res$model
This function trains a random forest model (ranger) on the specified training dataset and makes predictions on the test dataset in a counterfactual scenario. The model uses meteorological variables and temporal features.
run_rf(train, test, model_params, alpha, calc_shaps)
train |
Dataframe of train data as returned by the |
test |
Dataframe of test data as returned by the |
model_params |
list of hyperparameters to use in ranger call. See |
alpha |
Confidence level of the prediction interval between 0 and 1. |
calc_shaps |
Boolean value. If TRUE, calculate SHAP values for the
method used and format them so they can be visualised with |
Note: Runs the random forest model for individualised use with your own data pipeline. Otherwise use run_counterfactual() to call this function.
List with data frame of predictions and model
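A standalone call can be sketched analogously to the documented run_lightgbm() example; the result element names dt_predictions and model mirror that example and are assumed to carry over.

```r
# Sketch: random forest (ranger) on the packaged mock data.
data(mock_env_data)
params <- load_params()
res <- run_rf(
  train = mock_env_data[1:80, ],
  test = mock_env_data[81:100, ],
  model_params = params$random_forest,
  alpha = 0.9,
  calc_shaps = FALSE
)
prediction <- res$dt_predictions  # assumed element names, as for run_lightgbm()
model <- res$model
```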
This dataset contains environmental measurements from the Leipzig Mitte station provided by the Sächsisches Landesamt für Umwelt, Landwirtschaft und Geologie (LfULG). Alterations in the data: Codes for incorrect values have been removed.
sample_data_DESN025
A data table with the following columns:
Station identifier where the data was collected.
The environmental component being measured (e.g., temperature, NO2).
The measured value of the component.
The timestamp for the observation, formatted as a Date-Time object in the format "YYYY-MM-DD HH:MM:SS" (e.g., "2010-01-01 07:00:00").
A textual description or label for the component.
The dataset is structured in a long format and is prepared for further transformation into a wide format for modelling.
The dataset is licensed under the "Data Licence Germany – attribution – version 2.0 (DL-DE-BY-2.0)". (1) Any use will be permitted provided it fulfils the requirements of this "Data licence Germany – attribution – Version 2.0".
The data and meta-data provided may, for commercial and non-commercial use, in particular
be copied, printed, presented, altered, processed and transmitted to third parties;
be merged with own data and with the data of others and be combined to form new and independent datasets;
be integrated in internal and external business processes, products and applications in public and non-public electronic networks.
(2) The user must ensure that the source note contains the following information:
the name of the provider,
the annotation "Data licence Germany – attribution – Version 2.0" or "dl-de/by-2-0" referring to the licence text available at www.govdata.de/dl-de/by-2-0, and
a reference to the dataset (URI).
This applies only if the entity keeping the data provides the pieces of information 1-3 for the source note.
(3) Changes, editing, new designs or other amendments must be marked as such in the source note.
For more information on the license, visit https://www.govdata.de/dl-de/by-2-0.
Sächsisches Landesamt für Umwelt, Landwirtschaft und Geologie (LfULG).
data(sample_data_DESN025)
params <- load_params()
dt_prepared <- prepare_data_for_modelling(sample_data_DESN025, params)
This function standardizes numeric columns of the train_data and applies the same scaling (mean and standard deviation) to the corresponding columns in apply_data. It returns the standardized data along with the scaling parameters (means and standard deviations). This is particularly important for neural network approaches, as they tend to be numerically unstable and deteriorate otherwise.
scale_data(train_data, apply_data)
train_data |
A data frame containing the training dataset to be standardized. It must contain numeric columns. |
apply_data |
A data frame containing the dataset to which the scaling
from |
A list containing the following elements:
train |
The standardized training data. |
apply |
The |
means |
The means of the numeric columns in |
sds |
The standard deviations of the numeric columns in |
data(mock_env_data)
detrended_list <- list(
  train = mock_env_data[1:80, ],
  apply = mock_env_data[81:100, ]
)
scale_result <- scale_data(
  train_data = detrended_list$train,
  apply_data = detrended_list$apply
)
scaled_train <- scale_result$train
scaled_apply <- scale_result$apply
Splits prepared data into training and application datasets based on specified date ranges for a business-as-usual scenario. Data before application_start and after application_end is used as training data, while data within the date range is used for application.
split_data_counterfactual(dt_prepared, application_start, application_end)
dt_prepared |
The prepared data table. |
application_start |
The start date (a Date object) for the application period of the business-as-usual simulation. This coincides with the start of the reference window. Can be created e.g. by lubridate::ymd("20191201") |
application_end |
The end date (a Date object) for the application period of the business-as-usual simulation. This coincides with the end of the effect window. Can be created e.g. by lubridate::ymd("20191201") |
A list with two elements:
Data outside the application period.
Data within the application period.
dt_prepared <- data.table::data.table(
  date = as.Date(c("2023-01-01", "2023-01-05", "2023-01-10")),
  value = c(50, 60, 70)
)
result <- split_data_counterfactual(
  dt_prepared,
  application_start = as.Date("2023-01-03"),
  application_end = as.Date("2023-01-08")
)
print(result$train)
print(result$apply)