---
title: "Introduction to the DataRobot R Package"
author: "Ron Pearson, Peter Hurford, Matt Harris"
date: "`r Sys.Date()`"
output:
  rmarkdown::html_vignette:
    fig_caption: yes
vignette: >
  %\VignetteIndexEntry{Introduction to the DataRobot R Package}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

The name DataRobot refers to three things: a Boston-based software company, the massively parallel modeling engine developed by that company, and an open-source R package that allows interactive R users to connect to this modeling engine. This vignette provides a brief introduction to the datarobot R package, highlighting the following key details of its use:

* connecting to the DataRobot modeling engine from an interactive R session;
* creating a new modeling project in the DataRobot modeling engine;
* retrieving the results from a DataRobot modeling project;
* generating predictions from any DataRobot model.

To illustrate how the datarobot package is used, it is applied here to the `Ames` dataframe from the `AmesHousing` package, providing simple demonstrations of all of the above steps.

```{r results = "asis", message = FALSE, warning = FALSE, eval = FALSE}
library(datarobot)
```

## The DataRobot modeling engine

The DataRobot modeling engine is a commercial product that supports massively parallel modeling applications, building and optimizing models of many different types and evaluating and ranking their relative performance. This modeling engine exists in a variety of implementations, some cloud-based and accessed via the Internet, others residing in customer-specific on-premises computing environments. The datarobot R package described here allows anyone with access to one of these implementations to interact with it from an interactive R session. Communication between the R session and the modeling engine is accomplished via HTTP requests, with an initial connection established in one of the two ways described in the next section.

The DataRobot modeling engine is organized around *modeling projects*, each based on a single data source, a single target variable to be predicted, and a single metric to be optimized in fitting and ranking project models. This information is sufficient to create a project, identified by a unique alphanumeric `projectId` label, and start the DataRobot Autopilot, which builds, evaluates, and summarizes a collection of models. While the Autopilot is running, intermediate results are saved in a list that is updated until the project completes. The last stage of the modeling process constructs *blender* models: ensemble models that combine two or more of the best-performing individual models in various ways. These models are ranked in the same way as the individual models and are included in the final project list. When the project is complete, the essential information about all project models may be obtained with the `ListModels` function described later in this note. This function returns an S3 object of class 'listOfModels', a list with one element for each project model. A plot method has been defined for this object class, providing a convenient way to visualize the relative performance of the project models.

## Connecting to DataRobot

To access the DataRobot modeling engine, it is necessary to establish an authenticated connection, which can be done in one of two ways. In both cases, the necessary information consists of an `endpoint` - the URL of the specific DataRobot server being used - and a `token`, a previously validated access token. The `token` is unique to each DataRobot account; it looks like a string of letters and numbers and can be found in the account profile section of the DataRobot web application. The `endpoint` depends on the DataRobot installation (cloud-based or on-premises) you are using; contact your DataRobot administrator if you are unsure which endpoint to use. The `endpoint` for DataRobot cloud accounts is `https://app.datarobot.com/api/v2`.

The first access method uses a YAML configuration file with these two elements - labeled `token` and `endpoint` - located at `$HOME/.config/datarobot/drconfig.yaml`. If this file exists when the datarobot package is loaded, a connection to the DataRobot modeling engine is established automatically. It is also possible to use a YAML file at another location by passing its path to the `ConnectToDataRobot` function via the `configPath` parameter.
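For reference, a minimal `drconfig.yaml` might look like the following sketch; both values are placeholders, to be replaced with your own endpoint and token:

```yaml
# $HOME/.config/datarobot/drconfig.yaml
endpoint: "https://app.datarobot.com/api/v2"
token: "YOUR-API-TOKEN-HERE"
```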
The second method of establishing a connection is to call the `ConnectToDataRobot` function with the `endpoint` and `token` parameters directly:

```{r results = "asis", message = FALSE, warning = FALSE, eval = FALSE}
ConnectToDataRobot(endpoint = "YOUR-ENDPOINT-HERE", token = "YOUR-API_TOKEN-HERE")
```

The DataRobot API can also work behind a non-transparent HTTP proxy server. To route all DataRobot traffic through a proxy server, set the environment variable `http_proxy` to the proxy URL before starting R, e.g. `http_proxy="http://my-proxy.local:3128" R -f my_datarobot_script.r`, or set it from within R with `Sys.setenv(http_proxy = "http://my-proxy.local:3128")` before connecting.

## Creating a new project

One of the most common and important uses of the datarobot R package is the creation of a new modeling project. This task is supported by the following two functions:

* __StartProject__ creates a new project, generating a unique alphanumeric project identifier (__projectId__); it uploads the modeling data, allows the specification of a project name, specifies the target variable, optionally specifies a fitting metric (if none is specified, the DataRobot Autopilot selects one), supports a number of other optional parameters (refer to the help file for __SetTarget__ for details), and starts the model-fitting task;
* __GetValidMetrics__ allows the user to obtain a list of valid fitting metrics for the intended target variable.

The first step in creating a new DataRobot modeling project uses the `StartProject` function, which has one required parameter, `dataSource`: a dataframe, an object whose class inherits from dataframe (e.g., a `data.table`), or the path to a CSV file. Although it is not required, the optional parameter `projectName` can be extremely useful in managing projects, especially as their number grows; in particular, while every project has a unique alphanumeric identifier `projectId` associated with it, this string is not easy to remember. Another optional parameter is `maxWait`, which specifies the maximum time in seconds before the project creation task aborts; increasing this parameter from its default value can be useful when working with large datasets.

`StartProject` also starts the model-building process by specifying a `target`, a character string that names the response variable to be predicted by all models in the project. Of the optional parameters for the `StartProject` function, the only one discussed here is `metric`, a character string that specifies the measure to be optimized in fitting project models. Admissible values for this parameter are determined by the DataRobot modeling engine based on the nature of the `target` variable, and a list of them can be obtained using the `GetValidMetrics` function, whose required parameters are `project` and `target` (it has no optional parameters).
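For example, before choosing a metric, the valid options for a given target can be listed with a call like the following sketch, where `project` is assumed to be an existing project object or identifier and `Sale_Price` the intended target:

```{r, echo = TRUE, eval = FALSE}
# List the fitting metrics the DataRobot engine considers valid
# for the intended target variable
validMetrics <- GetValidMetrics(project, target = "Sale_Price")
validMetrics
```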
The default value for the optional `metric` parameter in the `StartProject` call is `NULL`, which causes the default metric recommended by the DataRobot modeling engine to be adopted. For a complete discussion of the other optional parameters for the `StartProject` function, refer to the help files.

#### Offsets

DataRobot also supports an `offset` parameter in `StartProject`. Offsets are commonly used in insurance modeling to include effects that lie outside of the training data, for example due to regulatory compliance or business constraints. You can specify the names of one or more columns in the project dataset to be used as offset columns.

#### Exposure

DataRobot also supports an `exposure` parameter in `StartProject`. Exposure is often used when modeling insurance premiums, where strict proportionality of premiums to duration is required. You can specify the name of a single column in the project dataset to be used as the exposure column.
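As an illustration, a hypothetical insurance project using both parameters might be started as in the sketch below; the dataframe `claimsData` and the column names `logPolicyCount` and `Exposure` are assumptions for this sketch, not part of the Ames example that follows:

```{r, echo = TRUE, eval = FALSE}
# Hypothetical insurance example: the offset and exposure arguments
# name columns that must be present in the uploaded dataset
project <- StartProject(dataSource = claimsData,
                        projectName = "ClaimsWithOffsetAndExposure",
                        target = "Losses",
                        offset = c("logPolicyCount"),
                        exposure = "Exposure")
```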
## The Ames housing dataframe

```{r, echo = FALSE, message = FALSE}
library(AmesHousing)
Ames <- make_ames()
Ames <- Ames[sapply(Ames, is.numeric)]
```

To provide a specific illustration of how a new DataRobot project is created, the following discussion shows the creation of a project based on the `Ames` dataframe from the `AmesHousing` package. This dataframe characterizes housing prices in Ames, Iowa from 2006 to 2010, with 2930 observations and a large number of explanatory variables (23 nominal, 23 ordinal, 14 discrete, and 20 continuous) involved in assessing home values; see De Cock (2011), "Ames, Iowa: Alternative to the Boston Housing Data as an End of Semester Regression Project," *Journal of Statistics Education*, vol. 19, no. 3. For this vignette, only the numeric variables are used, although DataRobot can also handle non-numeric, text, image, and geospatial variables. The dataframe is described in more detail in the associated help file from the `AmesHousing` package, but the `head` function shows its basic structure:

```{r, echo = TRUE, message = FALSE}
head(Ames)
```

To create the modeling project for this dataframe, we first use the `StartProject` function:

```{r, echo = TRUE, eval = FALSE}
project <- StartProject(dataSource = Ames,
                        projectName = "AmesVignetteProject",
                        target = "Sale_Price",
                        wait = TRUE)
```

Here, `dataSource` supplies the modeling data, `projectName` gives the project its name, `target` names the variable to predict, and `wait = TRUE` tells the function to wait until all modeling is complete before returning. The list returned by this function gives the project name, the project identifier (`projectId`), the name of the temporary CSV file used to save and upload the `Ames` dataframe, and the time and date the project was created. We specify `Sale_Price` (the sale price of the home) as the response variable and elect to use the default `metric` value chosen by the DataRobot modeling engine:

```{r, echo = FALSE}
project <- readRDS("AmesprojectObject.rds")
project
```

## Retrieving project results

```{r echo = FALSE, message = FALSE, warning = FALSE}
library(datarobot)
listOfAmesModels <- readRDS("listOfAmesModels.rds")
fullFrame <- as.data.frame(listOfAmesModels, simple = FALSE)
```

The DataRobot project created by the command described above fits `r length(listOfAmesModels)` models to the `Ames` dataframe. Detailed information about all of these models can be obtained with the `ListModels` function, invoked with the `project` list returned by the `StartProject` function:

```{r, echo = TRUE, eval = FALSE}
listOfAmesModels <- ListModels(project)
```

The `ListModels` function returns an S3 object of class 'listOfModels', with one element for each model in the project. A summary method has been implemented for this object class, and it provides the following view of the contents of this list:

```{r, echo = TRUE}
summary(listOfAmesModels)
```

The first element of this list is `generalSummary`, which tells us that the project includes `r length(listOfAmesModels)` models and that the second list element describes the first 6 of them. This number is determined by the optional parameter `nList` for the `summary` method, which has the default value 6. The second list element is `detailedSummary`, which gives the first `nList` rows of the dataframe created when the `as.data.frame` method is applied to `listOfAmesModels`. Methods for the `as.data.frame` generic function are included in the datarobot package for all four 'list of' S3 model object classes: `listOfBlueprints`, `listOfFeaturelists`, `listOfModels`, and `projectSummaryList`. (Use of this function is illustrated in the following discussion; see the help files for more complete details.) This dataframe has the following eight columns:

1. __modelType__: character string; describes the structure of each model
1. __expandedModel__: character string; __modelType__ with short descriptions of any preprocessing steps appended
1. __modelId__: unique alphanumeric model identifier
1. __blueprintId__: unique alphanumeric identifier for the blueprint used to fit the model
1. __featurelistName__: name of the featurelist defining the modeling variables
1. __featurelistId__: unique alphanumeric featurelist identifier
1. __samplePct__: fraction of the training dataset used in fitting the model
1. __validationMetric__: the value of the metric optimized in fitting the model, evaluated for the validation dataset

It is possible to obtain a more complete dataframe from any object of class 'listOfModels' by calling `as.data.frame` with the optional parameter `simple = FALSE`. Besides the eight characteristics listed above, this larger dataframe includes, for every model in the project, additional project information along with validation, cross-validation, and holdout values for all of the metrics available for the project. For the project considered in this note, the result is a dataframe with `r nrow(fullFrame)` rows and `r ncol(fullFrame)` columns.
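Although the result is too large to display here, it can be constructed and inspected with a call like this:

```{r, echo = TRUE, eval = FALSE}
# Build the complete model summary dataframe, including validation,
# cross-validation, and holdout values for all project metrics
fullFrame <- as.data.frame(listOfAmesModels, simple = FALSE)
dim(fullFrame)
```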
In addition to the summary method, a plot method has also been provided for objects of class 'listOfModels':

```{r, echo = TRUE, fig.width = 7, fig.height = 6, fig.cap = "Horizontal barplot of modelType and validation set Gamma Deviance values for all project models"}
plot(listOfAmesModels, orderDecreasing = TRUE)
```

This function generates a horizontal barplot that lists the name of each model (i.e., `modelType`) in the center of each bar, with the bar length corresponding to the value of the model fitting metric, evaluated for the validation dataset (i.e., the `validationMetric` value). The only required parameter for this function is the 'listOfModels' class S3 object to be plotted, but a number of optional parameters allow the plot to be customized. In the plot shown above, the logical parameter `orderDecreasing` has been set to `TRUE` so that the plot - generated from the bottom up - shows the models in decreasing order of `validationMetric`. For a complete list of optional parameters for this function, refer to the help files.

Since smaller values of `Gamma Deviance.validation` are better, this plot shows the worst model at the bottom and the best model at the top. The identities of these models are most conveniently obtained by coercing `listOfAmesModels` into a dataframe with the `as.data.frame` generic function mentioned above, which also makes it easier to inspect specific details such as model metrics:

```{r, echo = TRUE}
modelFrame <- as.data.frame(listOfAmesModels)
head(modelFrame[, c("modelType", "validationMetric")])
```

It is interesting to note that this single best model, which is fairly complex in structure, actually outperforms the blender model (the second model in the barplot above) formed by averaging the best individual project models. This behavior is unusual, since blender models usually achieve at least a small performance advantage over the component models on which they are based. In fact, since the individual component models may be regarded as constrained versions of the blender models (e.g., as a weighted average with all of the weight concentrated on one component), the *training set* performance can never be worse for a blender than it is for its components, but this need not be true of validation set performance, as this example demonstrates.

You can also see the worst models:

```{r, echo = TRUE}
tail(modelFrame[, c("modelType", "validationMetric")])
```

It is also important to note that several of the models appear to be identical, based on their `modelType` values, yet they exhibit different performances. This is most obvious for the four models labelled "eXtreme Gradient Boosted", but it is also true of six other `modelType` values, each of which appears two or three times in the plot, generally with different values of `Gamma Deviance.validation`. In fact, these models are not identical: they differ in the preprocessing applied to them or in other details. In most cases, these differences may be seen by examining the `expandedModel` values from `modelFrame`. For example, let's get all the models that use the Elastic-Net Regressor:

```{r, echo = TRUE}
Filter(function(m) grepl("Elastic-Net", m), modelFrame$expandedModel)
```

Note that the `modelType` value appears at the beginning of the `expandedModel` character string, followed by any preprocessing applied in fitting the model. Comparing the elements of this list, we can see that the models differ in their preprocessing steps and, in some cases, in the primary machine-learning model itself: in the case of the Light Gradient Boosting entry, the Elastic-Net model is a preprocessing step whose output is fed into a different model.

## Generating model predictions

Model predictions are generated with the `Predict` function:

```{r, echo = TRUE, eval = FALSE}
bestModel <- GetRecommendedModel(project, type = RecommendedModelType$RecommendedForDeployment)
bestPredictions <- Predict(bestModel, Ames)
```

`GetRecommendedModel` gives us the best model without our having to go through the metrics manually. We can see what this model is:

```{r, echo = TRUE, eval = FALSE}
bestModel$modelType
```

```{r, echo = FALSE, eval = TRUE}
"eXtreme Gradient Boosted Trees Regressor (Gamma Loss)"
```

How good are our predictions? The plot below shows predicted versus observed `Sale_Price` values for this model. If the predictions were perfect, all of these points would lie on the dashed red equality line. The relatively narrow scatter of most points around this reference line suggests that the model performs reasonably well for most of the dataset, with a few significant exceptions.

```{r, echo = FALSE, fig.width = 7, fig.height = 6}
Sale_Price <- Ames$Sale_Price
bestPredictions <- readRDS("bestPredictionsAmes.rds")
plot(Sale_Price, bestPredictions, xlab = "Observed Sale Price", ylab = "Predicted Sale Price", ylim = c(0, 800000))
abline(a = 0, b = 1, lty = 2, lwd = 3, col = "red")
title("Best model")
```

## Feature Impact

But which features most drive our predictions? Which features are most important to the model? For this, we turn to Feature Impact:

```{r, echo = TRUE, eval = FALSE}
impact <- GetFeatureImpact(bestModel)
head(impact)
```

```{r, echo = FALSE}
impact <- readRDS("IntroFeatureImpactAmes.RDS")
head(impact)
```

We can now see that `Gr_Liv_Area` (above-grade living area) and `Year_Built` (the year of construction) are the two strongest predictors of our target `Sale_Price` (the home sale price). Normalized impact expresses each feature's impact as a ratio relative to the most impactful feature, whereas unnormalized impact is the raw impact statistic.
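A quick visual summary of these results can be generated with base R graphics. This is only a sketch; it assumes the `featureName` and `impactNormalized` columns shown in the output above:

```{r, echo = TRUE, eval = FALSE}
# Horizontal barplot of normalized impact for the ten most impactful features
topImpact <- head(impact[order(impact$impactNormalized, decreasing = TRUE), ], 10)
barplot(rev(topImpact$impactNormalized),
        names.arg = rev(topImpact$featureName),
        horiz = TRUE, las = 1, cex.names = 0.7,
        xlab = "Normalized feature impact")
```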
## Summary

This note has presented a general introduction to the datarobot *R* package, describing and illustrating its most important functions. To keep this summary to a manageable length, no attempt has been made to describe all of the package's capabilities; for a more detailed discussion, refer to the help files.