--- title: "Setup of data for an onlineforecast model" author: "Peder Bacher" date: "`r Sys.Date()`" output: rmarkdown::html_vignette: toc: true toc_debth: 3 vignette: > %\VignetteIndexEntry{Setup of data for an onlineforecast model} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r external-code, cache=FALSE, include=FALSE, purl = FALSE} # Have to load the knitr to use hooks library(knitr) ## This vignettes name vignettename <- "setup-data" # REMEMBER: IF CHANGING IN THE shared-init (next block), then copy to the others! ``` ```{r init, cache=FALSE, include=FALSE, purl=FALSE} # Width will scale all figwidth <- 12 # Scale the wide figures (100% out.width) figheight <- 4 # Heights for stacked time series plots figheight1 <- 5 figheight2 <- 6.5 figheight3 <- 8 figheight4 <- 9.5 figheight5 <- 11 # Set the size of squared figures (same height as full: figheight/figwidth) owsval <- 0.35 ows <- paste0(owsval*100,"%") ows2 <- paste0(2*owsval*100,"%") # fhs <- figwidth * owsval # Set for square fig: fig.width=fhs, fig.height=fhs, out.width=ows} # If two squared the: fig.width=2*fhs, fig.height=fhs, out.width=ows2 # Check this: https://bookdown.org/yihui/rmarkdown-cookbook/chunk-styling.html # Set the knitr options knitr::opts_chunk$set( collapse = TRUE, comment = "##!! ", prompt = FALSE, cache = TRUE, cache.path = paste0("../tmp/vignettes/tmp-",vignettename,"/"), fig.align="center", fig.path = paste0("../tmp/vignettes/tmp-",vignettename,"/"), fig.height = figheight, fig.width = figwidth, out.width = "100%" ) options(digits=3) # For cropping output and messages cropfun <- function(x, options, func){ lines <- options$output.lines ## if (is.null(lines)) { ## return(func(x, options)) # pass to default hook ## } if(!is.null(lines)){ x <- unlist(strsplit(x, "\n")) i <- grep("##!!",x) if(length(i) > lines){ # truncate the output, but add .... #x <- c(x[-i[(lines+1):length(i)]], "```", "## ...output cropped", "```") x <- x[-i[(lines+1):length(i)]] x[i[lines]] <- pst(x[i[lines]], "\n\n## ...output cropped") } # paste these lines together x <- paste(c(x, ""), collapse = "\n") } x <- gsub("!!","",x) func(x, options) } hook_chunk <- knit_hooks$get("chunk") knit_hooks$set(chunk = function(x, options) { cropfun(x, options, hook_chunk) }) ``` [onlineforecasting]: https://arxiv.org/abs/2109.12915 [building heat load forecasting]: https://onlineforecasting.org/examples/building-heat-load-forecasting.html [onlineforecasting.org]: https://onlineforecasting.org ## Intro This vignette explains how to setup data consisting of observations and forecasts, such that it can be used for onlineforecast models. A generic introduction and description is in available in [onlineforecasting]. The code is available [here](setup-data.R). More information on [onlineforecasting.org]. ## Data example First load the package: ```{r} # Load the package library(onlineforecast) ``` In the package different data sets are included. The `Dbuilding` holds the data used for the example of heat load forecasting in the building-heat-load-forecasting vignette. When the package is loaded the data is also loaded, so we can access it directly. Let's start out by: ```{r} # Keep it in D to simplify notation D <- Dbuilding ``` The class is 'data.ĺist': ```{r} # The class of D class(D) ``` Actually, a 'data.list' is simply a 'list', but we made the class 'data.list' in order to have functions for the particular format of data - the format is explained in this document. It consists of vectors of time, vectors of observations (model output) and data.frames of forecasts (model input): ```{r} # Print the names to see the variables in the data names(D) ``` An overview of the content can be generated by: ```{r} summary.default(D) ``` where it can be seen that `t` is a time vector, `heatload` is a vector, and `Ta` and `I` are data.frames. A function giving a summary, including checks of the format of the 'data.list' is: ```{r} summary(D) ``` The 'NA' columns indicate the proportion of NAs. If there is a `ok` in a column, then the check of the variables format is passed. See the help with `?summary.data.list` to learn which checks are performed. ### Time First, lets have a look at `D$t`, which is the vector of time points: ```{r} # The time class(D$t) head(D$t) tail(D$t) ``` Hence, the vector is of the class `POSIXct`. It is not a necessity, `t` can also simply be a numeric, but for plotting and many operations, its very useful to use the 'POSIXct' class (see `?POSIXt`). Rules for the time vector: - It must be named `t`. - There must be no gaps or NA values in `t`, since only equidistant time series can be used in the models (the other variables can have NAs). - Its best to keep the time zone in `UTC` or `GMT` (not providing any time zone `tz` can give rise to problems). Use the basic R functions for handling the time class. Most needed operations can be done with: ```{r, eval=FALSE} ?as.POSIXct ?strftime ``` A helper function is provided with the `ct` function which can be called using `?`, or `?ct`. See example below: ```{r} # Convert from a time stamp (tz="GMT" per default) ct("2019-01-01 11:00") # Convert from unix time ct(3840928387) ``` Note that for all functions where a time value as a character is given, the time zone is always "GMT" (or "UTC", but this can result in warnings, but they can be ignored). For some operations the package `lubridate` can be very helpful. ### Observations Note the rules for observations: - In a `data.list` observations must be vectors. - The vectors must have the same length as the time `t` vector. - Observation as numerical vectors can be used directly as model output (if observations are to used as model inputs, they must be setup in a data.frame as explained below in Section [Forecasts]). In the current data, a time series of hourly heat load observations is included: ```{r} str(D$heatload) ``` It must have the same length as the time vector: ```{r} # Same length as time length(D$t) length(D$heatload) ``` A simple plot can be generated by: ```{r} plot(D$t, D$heatload, type="l", xlab="Time", ylab="Headload (kW)") ``` The convention used in all examples is that the time points are always set to the time interval end point, e.g.: ```{r} # The observation D$heatload[2] # Represents the average load between D$t[1] # and D$t[2] ``` The main idea behind setting the time point at the end of the interval is: Working with values averaged over the time interval, such values are available at the end of the time interval, not before. Especially, in real-time applications this is a useful convention. ### Forecasts As described in [onlineforecasting] the setup of forecasts for model inputs always follows the same format - as presented in the following. This is also the format of the forecasts generated by functions in the package. Hence all forecasts must follow this format. The rules are: - All values at row `i` are available at the `i`'th value in time `t`. - All columns must be named with `k` followed by an integer indicating the horizon in steps (e.g. the column named `k8` hold the 8-step forecasts). Have a look at the forecasts of the global radiation: ```{r} # Global radiation forecasts head(D$I) ``` At the first time point: ```{r} # First time point D$t[1] ``` the available forecast ahead in time is at the first row: ```{r} # The forecast available ahead in time is in the first row D$I[1, ] ``` We can plot that by: ```{r} i <- 1:ncol(D$I) plot(D$t[i], D$I[1, ], type="l", xlab="Time", ylab="Global radiation forecast (I in W/m²)") ``` So this is the forecast available ahead in time at `r D$t[1]`. The column in `I` named `k8` holds the 8-step horizon forecasts, which, since the steps are hourly, is an equi-distant time series. Picking out the entire series can be done by `D$I$k8` - hence a plot (together with the observations) can be generated by: ```{r} # Just pick some points by i <- 200:296 plot(D$t[i], D$I$k8[i], type="l", col=2, xlab="Time", ylab="Global radiation (W/m²)") # Add the observations lines(D$t[i], D$Iobs[i]) legend("topright", c("8-step forecasts","Observations"), bg="white", lty=1, col=2:1) ``` Notice how the are not aligned, since the forecasts are 8 hours ahead. To align them the forecasts must be lagged 8 steps by: ```{r} plot(D$t[i], lagvec(D$I$k8[i], 8), type="l", col=2, xlab="Time", ylab="Global radiation (W/m²)") lines(D$t[i], D$Iobs[i]) legend("topright", c("8-step forecasts lagged","Observations"), bg="white", lty=1, col=2:1) ``` ## Plotting A few simple plotting functions are included in the package. ### Time series plots The plot function provided with the package actually does this lagging with plotting forecasts: ```{r} plot_ts(D, patterns=c("^I"), c("2010-12-15","2010-12-18"), kseq=c(1,8,24,36)) ``` The argument `patterns` is vector of a regular expressions (see `?regex`), which is used to match the variables to include in the plot. See the help with `?plot_ts` for more details. An interactive plot can be generated using (first install the package `plotly`): ```{r, eval=FALSE} plotly_ts(D, patterns=c("heatload$","^I"), c("2010-12-15","2010-12-18"), kseq=c(1,8,24,36)) ``` ```{r, warning=FALSE, message=FALSE, echo=FALSE, purl=FALSE, output="hide", eval=FALSE} L <- plotly_ts(D, patterns=c("heatload$","^I"), c("2010-12-15","2010-12-18"), kseq=c(1,8,24,36), plotit=FALSE) plotly::subplot(L, shareX=TRUE, nrows=length(L), titleY = TRUE) ``` Note that the `patterns` argument is a vector of regular expressions, which determines which variables from `D` to plot. ### Scatter plots When modelling with the objective of forecasting, it's always a good start to have a look at scatter plots between the model inputs and the model output. For example the heatload vs. ambient temperature 8-step forecast: ```{r, fig.width=2*fhs, fig.height=fhs, out.width=ows2} par(mfrow=c(1,2)) plot(D$Ta$k8, D$heatload) plot(lagvec(D$Ta$k8, 8), D$heatload) ``` So lagging (thus aligning in time) makes less slightly less scatter. A wrapper for the `pairs` function is provided for a `data.list`, which can generate very useful explorative plots: ```{r, fig.height=figwidth} pairs(D, nms=c("heatload","Taobs","Ta","t"), kseq=c(1,8,24)) ``` Note how the sequence of included horizons are specified in the `kseq` argument, and note that the forecasts are lagged to be aligned in time. See `?pairs.data.list` for more details. Just as a quick side note: This is the principle used for fitting onlineforecast models, simply shift forecasts to align with the observations: ```{r, fig.width=fhs, fig.height=fhs, out.width=ows} # Lag the 8-step forecasts to be aligned with the observations x <- lagvec(D$I$k8, 8) # Take a smaller range x <- x[i] # Take the observations y <- D$Iobs[i] # Fit a linear regression model fit <- lm(y ~ x) # Plot the result plot(x, y, xlab="8-step forecasts (W/m²)", ylab="Obsservations (W/m²)", main="Global radiation") abline(fit) ``` Seen over time the 8-step forecasts are: ```{r} plot(D$t[i], predict.lm(fit, newdata=data.frame(x)), type="l", ylim=c(0,max(y)), xlab="Time", ylab="Global radiation (W/m^2)", col=2) lines(D$t[i], y) legend("topright", c("8-step forecasts lagged","Observations"), lty=1, col=2:1) ``` Of course that model was very simple, see how to make a better model in [building-heat-load-forecasting] and more information on the [website]. ## Subset Taking a subset of a `data.list` is very useful and it can easily be done in different ways using the `subset` function (i.e. it's really the `subset.data.list` function called when: ```{r} # Take the 1 to 4 values of each variable in D Dsub <- subset(D, 1:4) summary(Dsub) ``` Another useful function for taking data in a time range is: ```{r} which(in_range("2010-12-20",D$t,"2010-12-21")) ``` always check the help of function for more details (i.e. `?in_range`) Actually, it's easy to take subset from a period by: ```{r} Dsub <- subset(D, c("2010-12-20","2010-12-21")) summary(Dsub) Dsub$t ``` ## Data.list to data.frame (or data.table) It can be really useful to bring the data.list on a format of a `data.frame` or equivalently `data.table` for processing. Bringing to `data.frame` can easily be done by: ```{r} Df <- as.data.frame(Dsub) names(Df) ``` So the forecasts are just bind with the time and observations, and `.kxx` is added to the column names. It can be converted to a `data.table` by: ```{r} library(data.table) setDT(Df) class(Df) ``` After processing it is easily converted back to the `data.list` again by: ```{r} # Set back to data.frame setDF(Df) # Convert to a data.list Dsub2 <- as.data.list(Df) # Compare it with the original Dsub summary(Dsub2) summary(Dsub) ```