--- title: "Regression case: Assessing model agreement in wheat grain nitrogen content prediction" author: "Leo Bastos & Adrian Correndo" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Regression case} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ## 1. Introduction
The *`metrica`* package was developed to visualize and compute the level of agreement between observed ground-truth values and model-derived (e.g., mechanistic or empirical) predictions. This package is intended to fit into the following workflow: 1. a data set containing the observed values is used to train a model 2. the trained model is used to generate predictions 3. a data frame containing at least the **observed** and model-**predicted** values is created 4. *`metrica`* package is used to compute goodness of fit and error metrics based on observed and predicted values 5. *`metrica`* package is used to visualize model fit and selected fit metrics This vignette introduces the functionality of the *`metrica`* package applied to observed and model-predicted values of wheat grain nitrogen (N) content (in grams of N $m^{-2}$). ## 2. Wheat grain N content Let's begin by loading the packages needed. ```{r libraries, message=F, warning=F} library(ggplot2) library(dplyr) library(metrica) ``` Now we load the `wheat` data set included in the `metrica` package. ```{r load data} # Load data(wheat) # Printing first observations head(wheat) ``` This data set contains two columns: - **pred**: model-predicted wheat grain N content, in g N $m^{-2}$, - **obs**: ground-truth observed wheat grain N content, in g N $m^{-2}$ ## 3. Visual assessment of agreement ### 3.1 Scatterplot of pred vs. obs The simplest way to visually assess agreement between observed and predicted values is with a scatterplot. We can use the function `scatter_plot()` from the *metrica* package to create a scatterplot. The function requires specifying at least: - the data frame object name (`data` argument) - the name of the column containing observed values (`obs` argument) - the name of the column containing predicted values (`pred` argument) Besides a scatterplot, this function also adds to the plot the **1:1 line** (solid line) and the **linear regression line** (dashed line). ```{r scatter_plot PO, fig.width=5, fig.height=4, dpi=90, warning=FALSE, message=FALSE} scatter_plot(data = wheat, obs = obs, pred = pred) ``` The default behavior of `scatter_plot()` places the `obs` column on the x axis and the `pred` column on the y axis (`orientation = "PO"`). This can be inverted by changing the argument `orientation` to "OP": ```{r scatter_plot OP, fig.width=5, fig.height=4, dpi=90, warning=FALSE, message=FALSE} scatter_plot(data = wheat, obs = obs, pred = pred, orientation = "OP") ``` The output of the `scatter_plot()` function is a `ggplot2` object that can be further customized: ```{r scatter_plot custom, fig.width=5, fig.height=4, dpi=90, warning=FALSE, message=FALSE} scatter_plot(data = wheat, obs = obs, pred = pred, orientation = "OP", regline_color = "#d0f4de", shape_color = "#80ed99", eq_color = "white", )+ labs(x ="Predicted wheat N content (g N/m2)", y = "Observed wheat N content (g N/m2)")+ theme_dark() ``` ## 3.2 Bland-Altman plot The Bland-Altman plot is another way of visually assessing observed vs. predicted agreement. It plots the difference between observed and predicted values on the y axis, and the observed values on the x axis: ```{r bland-altman, fig.width=5, fig.height=4, dpi=90, warning=FALSE, message=FALSE} bland_altman_plot(data = wheat, obs = obs, pred = pred) ``` ## 4. Numerical assessment of agreement The *metrica* package contains functions for **41 metrics** to assess agreement between observed and predicted values for continuous data (i.e., regression error). A list with all the the metrics including their name, definition, details, formula, and function name, please check [here]. All of the metric functions take the same three arguments as the plotting functions: - the data frame object name (`data` argument) - the name of the column containing observed values (`obs` argument) - the name of the column containing predicted values (`pred` argument) The user can choose to calculate a single metric, or to calculate all metrics at once. To calculate a single metric, the metric function can be called. For example, to calculate $R^{2}$, we can use the `R2()` function: ```{r r2} R2(data = wheat, obs = obs, pred = pred, tidy = TRUE) ``` Similarly, to calculate root mean squared error, we can use the `RMSE()` function: ```{r rmse} RMSE(data = wheat, obs = obs, pred = pred) ``` The user can also calculate all 41 metrics at once using the function `metrics_summary()`: ```{r metrics summary} metrics_summary(data = wheat, obs = obs, pred = pred, type = "regression") ``` If the user wants just specific metrics, within the same function `metrics_summary()`, user can pass a list of desired metrics using the argument "metrics_list" as follows: ```{r metrics summary list} my.metrics <- c("R2","MBE", "RMSE", "RSR", "NSE", "KGE", "CCC") metrics_summary(data = wheat, obs = obs, pred = pred, type = "regression", metrics_list = my.metrics) ``` ## 5. Time series
### 5.1. Example of timeseries prediction In some cases, we may count with time-series predictions (e.g. cumulative values from daily simulations). For example, let's say that we evaluate the production of drymass during the season. For this specific case, the Mean Absolute Scaled Error is a more solid metric compared to conventional RMSE or similar metrics.
Let's suppose that we have predictions of wheat grain N over the years on the same location for a series of 20 years from 2001 to 2020. Thus, we may get a random sample from the wheat data set and assume they represent the time series of interest. Therefore, we create a new `time` variable called `Year` that will serve to sort the observations. ```{r metrics time-series, fig.width=6, fig.height=5, dpi=90} set.seed(165) wheat_time <- metrica::wheat %>% sample_n(., size = 20) %>% mutate(Year = seq(2001,2020, by =1)) # Plot wheat_time %>% ggplot2::ggplot(aes(x = Year))+ geom_point(aes(y = pred, fill = "Predicted", shape = "Predicted"))+ geom_point(aes(y = obs, fill = "Observed", shape = "Observed"))+ geom_line(aes(y = pred, col = "Predicted", linetype = "Predicted"), size = .75)+ geom_line(aes(y = obs, col = "Observed", linetype = "Observed"), size = .75)+ scale_fill_manual(name = "", values = c("dark red","steelblue"))+ scale_shape_manual(name = "", values = c(21,24))+ scale_color_manual(name = "", values = c("dark red","steelblue"))+ scale_linetype_manual(name = "", values = c(1,2))+ labs(x = "Year", y = "Wheat Grain N (g/m2)")+ theme_bw()+ theme(legend.position = "top") ``` ### 5.2. Use MASE for timeseries
In the case of timeseries analysis, the Mean Absolute Scaled Error (MASE, Hyndman & Koehler, 2006), a scaled error metric, is preferable over other classic metrics such as the RMSE. With `metrica`, we can use the function MASE. Please, be aware that MASE requires the `obs` and `pred` data along with a third column corresponding to the temporal variable that sorts the data (use the `time` argument to specify it). The default method to scale the MASE is the `naive` forecast (random-walk), which requires the user to define the size of the `naive_step`. Otherwise, an out-of-bag MAE can be specified with the `oob_mae` argument.
```{r MASE} # MASE estimate, with naive approach (random-walk, i.e. using observation of t-1 as prediction) metrica::MASE(data = wheat_time, obs = obs, pred = pred, naive_step = 1, tidy = FALSE, time = "Year") metrica::MASE(data = wheat_time, obs = obs, pred = pred, naive_step = 1, tidy = FALSE) # MASE estimate, with mae coming from an independent training set. metrica::MASE(data = wheat_time, obs = obs, pred = pred, naive_step = 1, tidy = FALSE, time = "Year", oob_mae = 6) ``` ## 6. Visual and numerical assessment combined The user can also create a scatter plot that includes not only the **predicted** vs. **observed** points, **1:1 line**, and **regression line**, but also **selected metrics and their values** plus the **SMA regression equation**. This is accomplished with the function `scatter_plot()`: ```{r scatter_plot, fig.width=6, fig.height=5, dpi=90} scatter_plot(data = wheat, obs = obs, pred = pred) ``` To print the metrics on the `scatter_plot()`, just use print.metrics. Warning: do not forget to specify your 'metrics.list': ```{r scatter_plot print_metrics, fig.width=6, fig.height=5, dpi=90} my.metrica.plot <- scatter_plot(data = wheat, obs = obs, pred = pred, print_metrics = TRUE, metrics_list = my.metrics) my.metrica.plot ``` Also, as a ggplot element, outputs are flexible of further edition: ```{r scatter_plot.edit, fig.width=6, fig.height=5, dpi=90} my.metrica.plot + # Modify labels labs(x = "Observed (days to emergence)", y = "Predicted (days to emergence)")+ # Modify theme theme_light() my.metrica.plot + # Modify labels labs(x = "Observed (Mg/ha)", y = "Predicted (Mg/ha)")+ # Modify theme theme_dark() ``` ## 7. Exporting To export the metrics summary table, the user can simply write it to file with the function `write.csv()`: ```{r export metrics_summary, eval=F } metrics_summary(data = wheat, obs = obs, pred = pred, type = "regression") %>% write.csv("metrics_summary.csv") ``` Similarly, to export a plot, the user can simply write it to file with the function `ggsave()`: ```{r export plot, eval=F} ggsave(plot = my.metrica.plot, "scatter_metrics.png", width = 5, height = 5) ```