---
title: "Introduction to Time Series"
author: "Peter Hurford, Madeleine Mott"
date: "`r Sys.Date()`"
output:
  rmarkdown::html_vignette:
    fig_caption: yes
vignette: >
  %\VignetteIndexEntry{Introduction to Time Series}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

DataRobot now includes the ability to make time series projects via the API. Time series projects, like OTV projects, use datetime partitioning, and all the workflow changes that apply to other datetime partitioned projects also apply to them. Unlike other projects, time series projects produce different types of models that forecast multiple future values instead of an individual prediction for each row.

DataRobot uses a general time series framework to configure how time series features are created and what future values the models will output. This framework consists of a Forecast Point (the time at which a prediction is made), a Feature Derivation Window (a rolling window used to create features), and a Forecast Window (a rolling window of future values to predict). These components are described in more detail below.

Time series projects will automatically transform the dataset provided in order to apply this framework. During the transformation, DataRobot uses the Feature Derivation Window to derive time series features (such as lags and rolling statistics), and uses the Forecast Window to provide examples of forecasting different distances in the future (such as time shifts). After project creation, a new dataset and a new feature list are generated and used to train the models. This process is reapplied automatically at prediction time to generate future predictions based on the original data features.

The `timeUnit` and `timeStep` used to define the Feature Derivation and Forecast Windows are taken from the datetime partition column, and can be retrieved for a given column in the input data by using `GetFeatureInfo`.

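
To make the three components of the framework concrete, the following is a minimal base R sketch. It does not call the DataRobot API: the forecast point, the window offsets, and the daily time step are all hypothetical values chosen for illustration.

```r
# Hypothetical values for illustration only; a daily time step is assumed.
forecastPoint <- as.Date("2017-01-08")
featureDerivationWindow <- c(start = -5, end = -3)  # offsets into the past
forecastWindow <- c(start = 1, end = 3)             # offsets into the future

# Both windows are closed, so the edge offsets are included.
derivationDates <- forecastPoint + seq(featureDerivationWindow[["start"]],
                                       featureDerivationWindow[["end"]])
forecastDates <- forecastPoint + seq(forecastWindow[["start"]],
                                     forecastWindow[["end"]])

format(derivationDates)  # "2017-01-03" "2017-01-04" "2017-01-05"
format(forecastDates)    # "2017-01-09" "2017-01-10" "2017-01-11"
```

In this sketch, features would be derived from the three days in `derivationDates`, and the model would produce one forecast for each day in `forecastDates`.
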
## Setting Up A Time Series Project

To set up a time series project, use the time series specific parameters found in `CreateDatetimePartitionSpecification`:

* **useTimeSeries** - set this to TRUE to enable time series for the project.
* **defaultToKnownInAdvance** - set this to TRUE to default to treating all features as known in advance features. Otherwise they will not be handled as known in advance features. See the prediction documentation for more information.
* **featureDerivationWindowStart** - the offset into the past to the start of the feature derivation window.
* **featureDerivationWindowEnd** - the offset into the past to the end of the feature derivation window.
* **forecastWindowStart** - the offset into the future to the start of the forecast window.
* **forecastWindowEnd** - the offset into the future to the end of the forecast window.
* **featureSettings** - A list of settings. Can be used to set individual features to "known in advance".
* **treatAsExponential** - Used to specify whether to treat the data as an exponential trend, which will apply a log-transform. Defaults to "auto", in which case this is inferred automatically. See possible values in `TreatAsExponential`.
* **differencingMethod** - Used to specify a differencing method to apply if the data is not stationary. Defaults to "auto", in which case this is inferred automatically. See possible values in `DifferencingMethod`.
* **periodicities** - A list of periodicities of different timestamps, specified in a list of lists.
* **windowsBasisUnit** - The unit to use for feature derivation and forecast windows. Defaults to the inferred time step. If set to `"ROW"`, the windows are defined by a number of rows.

When uploading a dataset to a time series project, the dataset might look something like the following, if `time` is the datetime partition column, `target` is the target column, and `temp` is an input feature.
If the dataset was uploaded with a forecast point of "2017-01-08" and during partitioning the feature derivation window start and end were set to -5 and -3 and the forecast window start and end were set to 1 and 3, then rows 1 through 3 are historical data, row 6 is the forecast point, and rows 7 through 9 are forecast rows that will have predictions when predictions are computed.

```{r, echo = FALSE}
library(knitr)
# Nine daily rows; only the first three have observed values
data <- data.frame(row = seq(9),
                   time = as.Date("2017-01-02") + seq(9),
                   target = c(16443, 3013, 1643, rep(NA, 6)),
                   temp = c(72, 72, 68, rep(NA, 6)))
kable(data)
```

On the other hand, if the project instead used `holiday` as a known in advance input feature, the uploaded dataset might look like the following:

```{r, echo = FALSE}
library(knitr)
# A known in advance feature has values for the forecast rows as well
data <- data.frame(row = seq(9),
                   time = as.Date("2017-01-02") + seq(9),
                   target = c(16443, 3013, 1643, rep(NA, 6)),
                   holiday = c(TRUE, rep(FALSE, 5), TRUE, rep(FALSE, 2)))
kable(data)
```

Here's a simple example of using this data in a time series project:

```{r results = "asis", message = FALSE, warning = FALSE, eval = FALSE}
# The datetime partition column must match a column in the data (here, `time`)
partition <- CreateDatetimePartitionSpecification(datetimePartitionColumn = "time",
                                                  useTimeSeries = TRUE)
StartProject(dataSource = data, target = "target", partition = partition, metric = "RMSE")
```

## Feature Derivation Window

The Feature Derivation Window represents the rolling window that is used to derive time series features and lags, relative to the Forecast Point. It is defined in terms of `featureDerivationWindowStart` and `featureDerivationWindowEnd`, which are integer values representing datetime offsets in terms of the `timeUnit` (e.g. hours or days).

The Feature Derivation Window start and end must be less than or equal to zero, indicating they are positioned before the forecast point. Additionally, the window must be specified as an integer multiple of the `timeStep`, which defines the expected difference in time units between rows in the data.

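
The constraints above can be checked locally before creating a partition specification. The helper below is a hypothetical sketch and not part of the datarobot package. It assumes that "an integer multiple of the `timeStep`" applies to both offsets, and that the start offset lies at or before the end offset.

```r
# Hypothetical helper, not part of the datarobot package.
# Returns TRUE when the offsets satisfy the constraints described above.
isValidFeatureDerivationWindow <- function(start, end, timeStep = 1) {
  offsets <- c(start, end)
  all(offsets == round(offsets)) &&  # offsets are whole numbers
    all(offsets <= 0) &&             # window lies at or before the forecast point
    start <= end &&                  # assumption: start is not after end
    all(offsets %% timeStep == 0)    # assumption: each offset is a multiple of timeStep
}

isValidFeatureDerivationWindow(-24, -12, timeStep = 12)  # TRUE
isValidFeatureDerivationWindow(-24, -12, timeStep = 5)   # FALSE
isValidFeatureDerivationWindow(-5, 2)                    # FALSE
```

The server performs its own validation, so a check like this is only a convenience for catching obvious mistakes early.
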
Enough rows of historical data must be provided to cover the span of the effective Feature Derivation Window (which may be longer than the project's Feature Derivation Window depending on the differencing settings chosen). The effective Feature Derivation Window of any model can be checked via the `effectiveFeatureDerivationWindowStart` and `effectiveFeatureDerivationWindowEnd` attributes of a datetime model. See `GetDatetimeModel`.

The window is closed, meaning the edges are considered to be inside the window.

This information is added to your `CreateDatetimePartitionSpecification` call like so:

```{r results = "asis", message = FALSE, warning = FALSE, eval = FALSE}
partition <- CreateDatetimePartitionSpecification(datetimePartitionColumn = "timestamp",
                                                  featureDerivationWindowStart = -24,
                                                  featureDerivationWindowEnd = -12,
                                                  useTimeSeries = TRUE)
```

## Forecast Window

The Forecast Window represents the rolling window of future values to predict, relative to the Forecast Point. It is defined in terms of the `forecastWindowStart` and `forecastWindowEnd`, which are positive integer values indicating datetime offsets in terms of the `timeUnit` (e.g. hours or days).

The Forecast Window start and end must be positive integers, indicating they are positioned after the forecast point. Additionally, the window must be specified as an integer multiple of the `timeStep` which defines the expected difference in time units between rows in the data.

The window is closed, meaning the edges are considered to be inside the window.

This information is added to your `CreateDatetimePartitionSpecification` call like so:

```{r results = "asis", message = FALSE, warning = FALSE, eval = FALSE}
partition <- CreateDatetimePartitionSpecification(datetimePartitionColumn = "timestamp",
                                                  forecastWindowStart = 1,
                                                  forecastWindowEnd = 10,
                                                  useTimeSeries = TRUE)
```

## Modeling Data and Time Series Features

In time series projects, a new set of modeling features is created after setting the partitioning options. If a featurelist is specified with the partitioning options, it will be used to select which features should be used to derive modeling features; if a featurelist is not specified, the default featurelist will be used.

These features are automatically derived from those in the project's dataset and are the features used for modeling - note that `ListFeaturelists` and `ListModelingFeaturelists` will return different data in time series projects. Modeling featurelists are the ones that can be used for modeling and will be accepted by the backend, while regular featurelists will continue to exist but cannot be used for modeling. Modeling features are only accessible once the target and partitioning options have been set. In projects that don't use time series modeling, once the target has been set, both modeling and regular features and featurelists will behave the same.

## Making Predictions

Prediction datasets are uploaded in the same way as for other projects. However, when uploading a prediction dataset, a new parameter `forecastPoint` can be specified. The forecast point of a prediction dataset identifies the point in time relative to which predictions should be generated. If one is not specified when uploading a dataset, the server will choose the most recent possible forecast point. The forecast window specified when setting the partitioning options for the project determines how far into the future from the forecast point predictions should be calculated.

For example:

```{r results = "asis", message = FALSE, warning = FALSE, eval = FALSE}
predictions <- Predict(timeSeriesModel, testData, forecastPoint = "1958-01-01")
```

## Feature Settings

When setting up a time series project, input features can be identified as known in advance features. These features are not used to generate lags, and are expected to be known for the rows in the forecast window at predict time (e.g. "how much money will have been spent on marketing", "is this a holiday").

To start a time series project, use `CreateDatetimePartitionSpecification` and specify the `featureSettings`. (Note that this is for illustrative purposes only - this project will not actually build because the example dataset is smaller than the 100 data point minimum required.)

```{r results = "asis", message = FALSE, warning = FALSE, eval = FALSE}
partition <- CreateDatetimePartitionSpecification(datetimePartitionColumn = "timestamp",
                                                  useTimeSeries = TRUE,
                                                  featureSettings = list("featureName" = "holiday",
                                                                         "knownInAdvance" = TRUE))
project <- StartProject(data,
                        projectName = "test-TimeSeries",
                        target = "target",
                        partition = partition,
                        metric = "RMSE")
```

If you have another known in advance feature (e.g., `weekend`), just add it to feature settings as a list of lists:

```{r results = "asis", message = FALSE, warning = FALSE, eval = FALSE}
partition <- CreateDatetimePartitionSpecification(datetimePartitionColumn = "timestamp",
                                                  useTimeSeries = TRUE,
                                                  featureSettings = list(list("featureName" = "holiday",
                                                                              "knownInAdvance" = TRUE),
                                                                         list("featureName" = "weekend",
                                                                              "knownInAdvance" = TRUE)))
```

## Formatting Durations

Some parts of time series require specifying durations. When a date range is specified with a start and an end date, the start date is inclusive, so dates equal to or later than the start date are included, and the end date is exclusive, so only dates earlier than the end date are included.
If the start and end date are the same, then no dates are included in the range.

Durations are specified using a subset of ISO8601. Durations will be of the form `PnYnMnDTnHnMnS` where each “n” may be replaced with an integer value. Within the duration string,

* nY represents the number of years
* the nM following the “P” represents the number of months
* nD represents the number of days
* nH represents the number of hours
* the nM following the “T” represents the number of minutes
* nS represents the number of seconds
* “P” is used to indicate that the string represents a period and “T” indicates the beginning of the time component of the string.

Any section with a value of 0 may be excluded. As with datetimes, if the partition column did not include a time component in its date format, the time component of any duration must be either unspecified or consist only of zeros.

Example Durations:

* “P3Y6M” (three years, six months)
* “P1Y0M0DT0H0M0S” (one year)
* “P1Y5DT10H” (one year, five days, ten hours)

## Multiseries

The API also supports **multiseries**, or data with multiple time series delineated by multiseries ID columns. To set up a multiseries project, create a datetime partition specification that specifies the `datetimePartitionColumn` (the column with your date in it) and the `multiseriesIdColumns` (a list of columns specifying the ids that delineate the multiseries), and pass that specification when starting the project.

```{r results = "asis", message = FALSE, warning = FALSE, eval = FALSE}
data <- read.csv(system.file("extdata", "multiseries.csv", package = "datarobot"))
partition <- CreateDatetimePartitionSpecification(datetimePartitionColumn = "timestamp",
                                                  useTimeSeries = TRUE,
                                                  multiseriesIdColumns = "series_id")
project <- StartProject(data,
                        projectName = "test-TimeSeries",
                        target = "target",
                        partition = partition,
                        metric = "RMSE",
                        mode = AutopilotMode$Manual,
                        targetType = "Regression")
```

## Prediction Intervals

For each model, prediction intervals estimate the range of values DataRobot expects actual values of the target to fall within. They are similar to a confidence interval of a prediction, but are based on the residual errors measured during the backtesting for the selected model.

Note that because calculation depends on the backtesting values, prediction intervals are not available for predictions on models that have not had all backtests completed. Additionally, prediction intervals are not available when the number of points per forecast distance is less than 10, due to insufficient data.

In a prediction request, users can specify a prediction intervals size, which specifies the desired probability of actual values falling within the interval range. Larger values are less precise, but more conservative. For example, specifying a size of 80 will result in a lower bound of 10% and an upper bound of 90%. More generally, for a specific `predictionIntervalsSize`, the upper and lower bounds will be calculated as follows:

* predictionIntervalUpperBound = 50% + (`predictionIntervalsSize` / 2)
* predictionIntervalLowerBound = 50% - (`predictionIntervalsSize` / 2)

To view prediction intervals data for a prediction, the prediction needs to have been created using `Predict` and specifying `includePredictionIntervals = TRUE`.
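
The bound formulas above can be written as a short base R function. `predictionIntervalBounds` is a hypothetical helper for illustration and is not part of the datarobot package. The range check on the size is an assumption made for this sketch, not a documented limit.

```r
# Hypothetical helper, not part of the datarobot package.
# Converts a prediction intervals size into lower and upper percentile bounds.
predictionIntervalBounds <- function(predictionIntervalsSize = 80) {
  # Assumption for this sketch: the size is a percentage in (0, 100]
  stopifnot(predictionIntervalsSize > 0, predictionIntervalsSize <= 100)
  c(lower = 50 - predictionIntervalsSize / 2,
    upper = 50 + predictionIntervalsSize / 2)
}

predictionIntervalBounds(80)  # lower = 10, upper = 90
predictionIntervalBounds(95)  # lower = 2.5, upper = 97.5
```
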
The size of the prediction interval is set with the `predictionIntervalsSize` parameter of `Predict`, and defaults to 80 if left unspecified. Specifying these fields will result in prediction interval bounds being included in the retrieved prediction data for that request. See `Predict` for more details.

## Disabling Derived Features

DataRobot automatically derives features that may be useful (e.g., lags). You can inspect these derived features by calling `GetTimeSeriesFeatureDerivationLog`. However, it is sometimes useful to disable DataRobot's automatic feature engineering for a particular feature (e.g., so you can derive lags yourself manually). To do this, use `featureSettings` to turn off derived features for a particular base feature:

```{r results = "asis", message = FALSE, warning = FALSE, eval = FALSE}
partition <- CreateDatetimePartitionSpecification(datetimePartitionColumn = "timestamp",
                                                  useTimeSeries = TRUE,
                                                  featureSettings = list(list("featureName" = "sales",
                                                                              "doNotDerive" = TRUE)))
```
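
If derivation is turned off for a feature, lags can be built by hand before uploading the data. The following is a minimal base R sketch under stated assumptions: the `sales` values are made up, `lagBy` is a hypothetical helper, and the rows are assumed to be sorted by time with no gaps and a single series.

```r
# Hypothetical data for illustration; rows must already be sorted by time.
salesData <- data.frame(timestamp = as.Date("2017-01-01") + 0:5,
                        sales = c(10, 12, 9, 14, 13, 15))

# Hypothetical helper: shifts a vector k rows into the future, padding with NA.
lagBy <- function(x, k) {
  stopifnot(k >= 1, k < length(x))
  c(rep(NA, k), head(x, -k))
}

salesData$sales_lag1 <- lagBy(salesData$sales, 1)
salesData$sales_lag2 <- lagBy(salesData$sales, 2)

salesData$sales_lag1  # NA 10 12  9 14 13
salesData$sales_lag2  # NA NA 10 12  9 14
```

For multiseries data, a lag like this would need to be computed within each series rather than across the whole column.
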