--- title: "Datetime Partitioning" author: "Sergey Yurgenson, Madeleine Mott, Zach Mayer, Igor Veksler, Thakur Raj Anand" date: "`r Sys.Date()`" output: rmarkdown::html_vignette: fig_caption: yes vignette: > %\VignetteIndexEntry{Datetime Partitioning} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ### Datetime Partitioning Background When dividing your data for model training and validation, DataRobot will typically choose random rows of your dataset to assign amongst different cross validation folds. This will verify you have not overfit your model to the training set and the model can perform well on new data. However when your data has an intrinsic time based component, you have to be careful to always use data from the past to predict the future and never use the future to predict the past. The latter is known as lookahead bias and can be thought of as another form of a data leak. DataRobot now posses datetime partitioning which will be diligent within model training & validation to guard against lookahead bias. Let's look at how we would frame a problem with a time component within DataRobot. We will use a sample dataset from LendingClub, similar to the Prediction Explanations Vignette. We want to train the model on historical loans and validate on recent loans and would therefore like to use a datetime Partition. Cross Validation folds are now known as `Backtests` with each backtest corresponding to a sliding window of historical training data and more recent validation data. By default DataRobot will create a single backtesting window. We can control the number of backtests to use (up to 10), so we will use 5 as a best practice similar to a cross sectional problem. ### Load the useful libraries Let's load **datarobot** ```{r results = "asis", message = FALSE, warning = FALSE, eval = FALSE} library(datarobot) ``` ### Running a DataRobot Project with a datetime partition ```{r datetime Partition Base, echo = TRUE, eval = FALSE} lending <- read.csv("https://s3.amazonaws.com/datarobot_public_datasets/10K_Lending_Club_Loans.csv") partition <- CreateDatetimePartitionSpecification(datetimePartitionColumn = "earliest_cr_line", numberOfBacktests = 5) proj <- StartProject(dataSource = lending, projectName = "Lending_Club_Time_Series", target = "is_bad", mode = "quick", partition = partition) ``` We took advantage of DataRobot's automated partition date selection after we specified the number of backtests to use. DataRobot allows further control, where we can further specify the validation start date as well as duration. Let's look at an example below. #### Create Backtest Specifications ```{r backtest_specification_example, echo = TRUE, eval = FALSE} backtest <- list() # Dates are not project specific but rather example dates backtest[[1]] <- CreateBacktestSpecification(0, ConstructDurationString(), "1989-12-01", ConstructDurationString(days = 100)) backtest[[2]] <- CreateBacktestSpecification(1, ConstructDurationString(), "1999-10-01", ConstructDurationString(days = 100)) # create desired partition specification partition <- CreateDatetimePartitionSpecification("earliest_cr_line", numberOfBacktests = 2, backtests = backtest) ``` Let's continue with our original project. Often when training time-based models we would like to iterate within our workflow by by running all the backtest folds within a model to verify its stability. Finally we can retrain the best model on a larger or more recent time slice to prepare the model for model for deployment. Let's look how we can accomplish these actions below: ### Model Iteration ```{r model_iteration, echo = TRUE, eval = FALSE} # Request more granular information on the datetime partition specification GetDatetimePartition(proj) # View blueprints associated with a project bps <- ListBlueprints(proj) # View the the models within the model leaderboard models <- ListModels(proj) # Retrieve a datetime model. There is now a new retrieval function specific to datetime partitioning dt_model <- GetDatetimeModel(proj, models[[1]]$modelId) # Score all Backtests scoreJobId <- ScoreBacktests(dt_model) WaitForJobToComplete(proj, scoreJobId) # To make synchronous # now model information will also contain information about backtest scores dtModelWithBt <- GetDatetimeModel(proj, dt_model$modelId) # Retrain a model using a different start & end date. # One has to request a `Frozen` model to keep the hyper-parameters static and avoid lookahead bias. # Within the context of deployment, this can be used to retrain a resulting model on more recent data. UpdateProject(proj, holdoutUnlocked = TRUE) # If retraining on 100% of the data, we need to unlock the holdout set. modelJobId_frozen <- RequestFrozenDatetimeModel(dt_model, trainingStartDate = as.Date("1950/12/1"), trainingEndDate = as.Date("1998/3/1")) new_dt_model_frozen <- GetDatetimeModelFromJobId(proj, modelJobId_frozen) # Train & retrieve a new date-time model based on rowcount modelJobId <- RequestNewDatetimeModel(proj, bps[[1]], trainingRowCount = 100) new_dt_model <- GetDatetimeModelFromJobId(proj, modelJobId) # Train & retrieve a new date-time model based on duration modelJobId <- RequestNewDatetimeModel(proj, bps[[1]], trainingDuration = ConstructDurationString(months=10)) new_dt_model <- GetDatetimeModelFromJobId(proj, modelJobId) ```