The name DataRobot refers to three things: a Boston-based software company, the massively parallel modeling engine developed by the DataRobot company, and an open-source R package that allows interactive R users to connect to this modeling engine. This vignette provides a brief introduction to the datarobot R package, highlighting the following key details of its use:
To illustrate how the datarobot package is used, it is applied here to the Ames dataframe from the AmesHousing package, providing simple demonstrations of all of the above steps.
The DataRobot modeling engine is a commercial product that supports massively parallel modeling applications, building and optimizing models of many different types, and evaluating and ranking their relative performance. This modeling engine exists in a variety of implementations, some cloud-based, accessed via the Internet, and others residing in customer-specific on-premises computing environments. The datarobot R package described here allows anyone with access to one of these implementations to interact with it from an interactive R session. Connection between the R session and the modeling engine is accomplished via HTTP requests, with an initial connection established in one of two ways described in the next section.
The DataRobot modeling engine is organized around modeling projects, each based on a single data source, a single target variable to be predicted, and a single metric to be optimized in fitting and ranking project models. This information is sufficient to create a project, identified by a unique alphanumeric projectId label, and start the DataRobot Autopilot, which builds, evaluates, and summarizes a collection of models. While the Autopilot is running, intermediate results are saved in a list that is updated until the project completes. The last stage of the modeling process constructs blender models, ensemble models that combine two or more of the best-performing individual models in various ways. These models are ranked in the same way as the individual models and are included in the final project list. When the project is complete, the essential information about all project models may be obtained with the ListModels function described later in this note. This function returns an S3 object of class ‘listOfModels’, which is a list with one element for each project model. A plot method has been defined for this object class, providing a convenient way to visualize the relative performance of these project models.
To access the DataRobot modeling engine, it is necessary to establish an authenticated connection, which can be done in one of two ways. In both cases, the necessary information is an endpoint (the URL of the specific DataRobot server being used) and a token (a previously validated access token).
The token is unique to each DataRobot modeling engine account and can be found in the account profile section of the DataRobot webapp. It looks like a string of letters and numbers.
The endpoint depends on the DataRobot modeling engine installation (cloud-based, on-premises, and so on) you are using. Contact your DataRobot administrator for the endpoint to use. The endpoint for DataRobot cloud accounts is https://app.datarobot.com/api/v2
The first access method uses a YAML configuration file containing these two elements, labeled token and endpoint, located at $HOME/.config/datarobot/drconfig.yaml. If this file exists when the datarobot package is loaded, a connection to the DataRobot modeling engine is established automatically. It is also possible to establish a connection from this YAML file by calling the ConnectToDataRobot function and specifying the configPath parameter.
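As a minimal sketch of this first method, the following writes a configuration file with the two keys named above and shows how its path could be passed to ConnectToDataRobot. The token value is a placeholder, the file is written to a temporary directory rather than to $HOME/.config/datarobot/ so that nothing is overwritten, and the connection call is left commented out because it needs a real account.

```r
# Write a drconfig.yaml-style file with the two required keys.
# "YOUR_API_TOKEN" is a placeholder, not a real credential.
configPath <- file.path(tempdir(), "drconfig.yaml")
writeLines(c("token: YOUR_API_TOKEN",
             "endpoint: https://app.datarobot.com/api/v2"),
           configPath)

# Inspect what was written.
configLines <- readLines(configPath)
print(configLines)

# With a valid token in the file, the connection would be made with:
# library(datarobot)
# ConnectToDataRobot(configPath = configPath)
```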
The second method of establishing a connection to the DataRobot modeling engine is to call the function ConnectToDataRobot with the endpoint and token parameters.
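One way this second method might be used without hard-coding the token in a script is sketched below. The helper connect_from_env and the environment variable names are assumptions of this sketch, chosen for illustration; only ConnectToDataRobot and its endpoint and token parameters come from the text above.

```r
# Hypothetical helper: read credentials from environment variables
# (the variable names are an assumption of this sketch) and connect.
connect_from_env <- function(tokenVar = "DATAROBOT_API_TOKEN",
                             endpointVar = "DATAROBOT_API_ENDPOINT") {
  token <- Sys.getenv(tokenVar)
  # Fall back to the cloud endpoint quoted in the text if none is set.
  endpoint <- Sys.getenv(endpointVar,
                         unset = "https://app.datarobot.com/api/v2")
  if (!nzchar(token)) {
    stop("Environment variable ", tokenVar, " is not set.")
  }
  # Requires the datarobot package and a valid token.
  datarobot::ConnectToDataRobot(endpoint = endpoint, token = token)
}
```

Calling connect_from_env() with no token set stops with an error before any request is attempted.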
The DataRobot API can work behind a non-transparent HTTP proxy server. Set the environment variable http_proxy to the proxy URL to route all DataRobot traffic through that proxy server, e.g. http_proxy="http://my-proxy.local:3128" R -f my_datarobot_script.r.
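The same variable can also be set from inside an R session before any DataRobot calls are made. This is only a sketch: it assumes the underlying HTTP client reads http_proxy at request time, as libcurl-based clients generally do, and it reuses the example proxy URL from the text.

```r
# Set the proxy for the current R session only.
# The URL is the example value from the text, not a real server.
Sys.setenv(http_proxy = "http://my-proxy.local:3128")

# Confirm the value that later HTTP requests would see.
Sys.getenv("http_proxy")
```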
One of the most common and important uses of the datarobot R package is the creation of a new modeling project. This task is supported by the following three functions:
The first step in creating a new DataRobot modeling project uses the StartProject function, which has one required parameter, dataSource, that can be a dataframe, an object whose class inherits from dataframe (e.g., a data.table), or a CSV file. Although it is not required, the optional parameter projectName can be extremely useful in managing projects, especially as their number grows; in particular, while every project has a unique alphanumeric identifier projectId associated with it, this string is not easy to remember. Another optional parameter is maxWait, which specifies the maximum time in seconds before the project creation task aborts; increasing this parameter from its default value can be useful when working with large datasets.
StartProject also starts the model-building process by specifying a target, a character string that names the response variable to be predicted by all models in the project. Of the optional parameters for the StartProject function, the only one discussed here is metric, a character string that specifies the measure to be optimized in fitting project models. Admissible values for this parameter are determined by the DataRobot modeling engine based on the nature of the target variable. A list of these values can be obtained using the function GetValidMetrics. The required parameters for this function are project and target, and it has no optional parameters. The default value for the optional metric parameter in the StartProject function call is NULL, which causes the default metric recommended by the DataRobot modeling engine to be adopted. For a complete discussion of the other optional parameters for the StartProject function, refer to the help files.
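A call that overrides the default metric might look like the sketch below. The wrapper start_project_with_metric and the project name are hypothetical, and "RMSE" in the commented example is only assumed to be among the values that GetValidMetrics reports for a numeric target; the wrapper is defined but not called, since it needs an authenticated connection.

```r
# Hypothetical wrapper around StartProject; not called here because it
# needs an authenticated connection to a DataRobot server.
start_project_with_metric <- function(data, target, metric = NULL,
                                      projectName = "MyProject") {
  datarobot::StartProject(dataSource = data,
                          projectName = projectName,
                          target = target,
                          metric = metric,  # NULL keeps the recommended default
                          wait = TRUE)
}

# Example call (assumes "RMSE" is a valid metric for this target):
# project <- start_project_with_metric(Ames, "Sale_Price", metric = "RMSE")
```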
DataRobot also supports using an offset parameter in StartProject. Offsets are commonly used in insurance modeling to include effects that are outside of the training data due to regulatory compliance or constraints. You can specify the names of several columns in the project dataset to be used as the offset columns.
DataRobot also supports using an exposure parameter in StartProject. Exposure is often used to model insurance premiums where strict proportionality of premiums to duration is required. You can specify the name of the column in the project dataset to be used as an exposure column.
To provide a specific illustration of how a new DataRobot project is created, the following discussion shows the creation of a project based on the Ames dataframe from the AmesHousing package. This dataframe characterizes housing prices in Ames, Iowa from 2006 to 2010 with 2930 observations and a large number of explanatory variables (23 nominal, 23 ordinal, 14 discrete, and 20 continuous) involved in assessing home values. For this vignette, only the numeric variables are used (DataRobot can also handle non-numeric, text, image, and geospatial variables). See De Cock (2011), “Ames, Iowa: Alternative to the Boston Housing Data as an End of Semester Regression Project,” Journal of Statistics Education, vol. 19, no. 3. The dataframe is described in more detail in the associated help file from the AmesHousing package, but the head function shows its basic structure:
## # A tibble: 6 × 35
## Lot_Frontage Lot_Area Year_Built Year_Remod_Add Mas_Vnr_Area BsmtFin_SF_1
## <dbl> <int> <int> <int> <dbl> <dbl>
## 1 141 31770 1960 1960 112 2
## 2 80 11622 1961 1961 0 6
## 3 81 14267 1958 1958 108 1
## 4 93 11160 1968 1968 0 1
## 5 74 13830 1997 1998 0 3
## 6 78 9978 1998 1998 20 3
## # ℹ 29 more variables: BsmtFin_SF_2 <dbl>, Bsmt_Unf_SF <dbl>,
## # Total_Bsmt_SF <dbl>, First_Flr_SF <int>, Second_Flr_SF <int>,
## # Low_Qual_Fin_SF <int>, Gr_Liv_Area <int>, Bsmt_Full_Bath <dbl>,
## # Bsmt_Half_Bath <dbl>, Full_Bath <int>, Half_Bath <int>,
## # Bedroom_AbvGr <int>, Kitchen_AbvGr <int>, TotRms_AbvGrd <int>,
## # Fireplaces <int>, Garage_Cars <dbl>, Garage_Area <dbl>, Wood_Deck_SF <int>,
## # Open_Porch_SF <int>, Enclosed_Porch <int>, Three_season_porch <int>, …
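The restriction to numeric variables mentioned above could be carried out along the following lines. The small dataframe here is a stand-in with illustrative values and an invented Neighborhood column; the same filter would be applied to the full housing dataframe.

```r
# Toy dataframe standing in for the full housing data (illustrative values).
housing <- data.frame(
  Lot_Area = c(31770L, 11622L, 14267L),
  Year_Built = c(1960L, 1961L, 1958L),
  Neighborhood = c("A", "B", "A"),
  Sale_Price = c(215000, 105000, 172000),
  stringsAsFactors = FALSE
)

# Keep only the columns that are numeric (integer or double).
isNumeric <- vapply(housing, is.numeric, logical(1))
numericHousing <- housing[, isNumeric, drop = FALSE]
names(numericHousing)
```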
To create the modeling project for this dataframe, we first use the StartProject function:
project <- StartProject(dataSource = Ames,
                        projectName = "AmesVignetteProject",
                        target = "Sale_Price",
                        wait = TRUE)
Here, dataSource defines the data used for modeling, projectName defines the name of the project, target defines what to predict, and wait = TRUE tells the function to wait until all modeling is complete before executing other computations.
The list returned by this function gives the project name, the project identifier (projectId), the name of the temporary CSV file used to save and upload the Ames dataframe, and the time and date the project was created. Here, we specify Sale_Price (sale price of the home) as the response variable and we elect to use the default metric value chosen by the DataRobot modeling engine:
## $projectName
## [1] "AmesVignetteProject"
##
## $projectId
## [1] "62dd89feb85947ad980ba5f0"
##
## $fileName
## [1] "file14bfb1c4da321_autoSavedDF.csv"
##
## $created
## [1] "2022-07-24T18:06:13.420660Z"
##
## attr(,"class")
## [1] "dataRobotProject"
The DataRobot project created by the command described above fits 13 models to the Ames dataframe. Detailed information about all of these models can be obtained with the ListModels function, invoked with the project list returned by the StartProject function.
The ListModels function returns an S3 object of class ‘listOfModels’, with one element for each model in the project. A summary method has been implemented for this object class, and it provides the following view of the contents of this list:
## $generalSummary
## [1] "First 6 of 13 models from: listOfAmesModels (S3 object of class listOfModels)"
##
## $detailedSummary
## modelType
## 1 eXtreme Gradient Boosted Trees Regressor (Gamma Loss)
## 2 AVG Blender
## 3 eXtreme Gradient Boosted Trees Regressor (Gamma Loss)
## 4 eXtreme Gradient Boosted Trees Regressor (Gamma Loss)
## 5 Light Gradient Boosting on ElasticNet Predictions (Gamma Loss)
## 6 Light Gradient Boosted Trees Regressor with Early Stopping (Gamma Loss)
## expandedModel
## 1 eXtreme Gradient Boosted Trees Regressor (Gamma Loss)::Geospatial Location Converter::Missing Values Imputed
## 2 AVG Blender::Average Blender
## 3 eXtreme Gradient Boosted Trees Regressor (Gamma Loss)::Geospatial Location Converter::Missing Values Imputed
## 4 eXtreme Gradient Boosted Trees Regressor (Gamma Loss)::Geospatial Location Converter::Missing Values Imputed
## 5 Light Gradient Boosting on ElasticNet Predictions (Gamma Loss)::Numeric Data Cleansing::Smooth Ridit Transform::Elastic-Net Regressor (L2 / Gamma Deviance)
## 6 Light Gradient Boosted Trees Regressor with Early Stopping (Gamma Loss)::Geospatial Location Converter::Missing Values Imputed
## modelId blueprintId
## 1 62dd8b7b6dcff238c40735cd bda168d7725c76940aaf66a5b87dd936
## 2 62dd8b924e47bc6fed6916c2 00a968c06d6324a9630b1357a0a96748
## 3 62dd8a50748806bd8b60c35d 782081a59965e1cb2c171d21b47505d1
## 4 62dd8b5e7665693929b242b2 782081a59965e1cb2c171d21b47505d1
## 5 62dd8a4f748806bd8b60c35a acd16a05344349c0f17a3c99675ddeb2
## 6 62dd8a4f748806bd8b60c359 6ee066b11e7e618b16ab5f9adf2ec9f8
## featurelistName featurelistId samplePct validationMetric
## 1 Informative Features 62dd8a24b4b51b9ba2954729 100.0000 0.01535
## 2 Multiple featurelists Multiple featurelist ids 63.9932 0.01538
## 3 Informative Features 62dd8a24b4b51b9ba2954729 63.9932 0.01539
## 4 Informative Features 62dd8a24b4b51b9ba2954729 80.0000 0.01549
## 5 Informative Features 62dd8a24b4b51b9ba2954729 63.9932 0.01549
## 6 Informative Features 62dd8a24b4b51b9ba2954729 63.9932 0.01572
The first element of this list is generalSummary, which lets us know that the project includes 13 models, and that the second list element describes the first 6 of these models. This number is determined by the optional parameter nList for the summary method, which has the default value 6. The second list element is detailedSummary, which gives the first nList rows of the dataframe created when the as.data.frame method is applied to listOfAmesModels. Methods for the as.data.frame generic function are included in the datarobot package for all four ‘list of’ S3 model object classes: listOfBlueprints, listOfFeaturelists, listOfModels, and projectSummaryList. (Use of this function is illustrated in the following discussion; see the help files for more complete details.) This dataframe has the following eight columns:
It is possible to obtain a more complete dataframe from any object of class ‘listOfModels’ by using the function as.data.frame with the optional parameter simple = FALSE. Besides the eight characteristics listed above, this larger dataframe includes, for every model in the project, additional project information along with validation, cross-validation, and holdout values for all of the available metrics for the project. For the project considered in this note, the result is a dataframe with 13 rows and 51 columns.
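The three views discussed so far (the summary, the simple dataframe, and the larger dataframe) could be gathered in one place as sketched below. The helper summarize_project_models is hypothetical and is only defined, not called, because it needs a finished project and a live connection.

```r
# Hypothetical helper; needs a finished project and a live connection.
summarize_project_models <- function(project, nList = 6) {
  models <- datarobot::ListModels(project)
  list(
    summary = summary(models, nList = nList),          # first nList models
    simpleFrame = as.data.frame(models),               # the eight basic columns
    fullFrame = as.data.frame(models, simple = FALSE)  # adds all available metrics
  )
}

# Example call, assuming `project` was returned by StartProject:
# views <- summarize_project_models(project, nList = 10)
# dim(views$fullFrame)
```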
In addition to the summary method, a plot method has also been provided for objects of class ‘listOfModels’:
This function generates a horizontal barplot that lists the name of each model (i.e., modelType) in the center of each bar, with the bar length corresponding to the value of the model fitting metric, evaluated for the validation dataset (i.e., the validationMetric value). The only required parameter for this function is the ‘listOfModels’ class S3 object to be plotted, but there are a number of optional parameters that allow the plot to be customized. In the plot shown above, the logical parameter orderDecreasing has been set to TRUE so that the plot, generated from the bottom up, shows the models in decreasing order of validationMetric. For a complete list of optional parameters for this function, refer to the help files.
Since smaller values of Gamma Deviance.validation are better, this plot shows the worst model at the bottom and the best model at the top. The identities of these models are most conveniently obtained by first converting listOfAmesModels into a dataframe, using the as.data.frame generic function mentioned above, which also makes it easier to inspect specific things such as model metrics:
modelFrame <- as.data.frame(listOfAmesModels)
head(modelFrame[, c("modelType", "validationMetric")])
## modelType
## 1 eXtreme Gradient Boosted Trees Regressor (Gamma Loss)
## 2 AVG Blender
## 3 eXtreme Gradient Boosted Trees Regressor (Gamma Loss)
## 4 eXtreme Gradient Boosted Trees Regressor (Gamma Loss)
## 5 Light Gradient Boosting on ElasticNet Predictions (Gamma Loss)
## 6 Light Gradient Boosted Trees Regressor with Early Stopping (Gamma Loss)
## validationMetric
## 1 0.01535
## 2 0.01538
## 3 0.01539
## 4 0.01549
## 5 0.01549
## 6 0.01572
It is interesting to note that this single best model, which is fairly complex in structure, actually outperforms the blender model (the second model in the barplot above), formed by averaging the best individual project models. This behavior is unusual, since the blender models usually achieve at least a small performance advantage over the component models on which they are based. In fact, since the individual component models may be regarded as constrained versions of the blender models (e.g., as a weighted average with all of the weight concentrated on one component), the training set performance can never be worse for a blender than it is for its components, but this need not be true of validation set performance, as this example demonstrates.
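The argument about training versus validation performance can be checked on simulated data. The sketch below uses made-up predictions from two noisy components and a weighted-average blender tuned on a grid; it is not the blending procedure used by the DataRobot engine. Because the grid includes the weights 0 and 1, each component is a special case of the blend, so the blend's training error cannot exceed that of the better component, while nothing forces the same ordering on the validation rows.

```r
set.seed(1)
n <- 200
truth <- rnorm(n)
# Two imperfect component "models": the truth plus independent noise.
predA <- truth + rnorm(n, sd = 0.5)
predB <- truth + rnorm(n, sd = 0.7)

train <- 1:100
valid <- 101:200
mse <- function(pred, idx) mean((pred[idx] - truth[idx])^2)

# Weighted-average blender: pick the weight on a grid that includes 0 and 1,
# so each component on its own is a special case of the blend.
weights <- seq(0, 1, by = 0.01)
blend <- function(w) w * predA + (1 - w) * predB
trainMse <- vapply(weights, function(w) mse(blend(w), train), numeric(1))
bestW <- weights[which.min(trainMse)]

trainBlend <- min(trainMse)
trainBest <- min(mse(predA, train), mse(predB, train))
validBlend <- mse(blend(bestW), valid)
validBest <- min(mse(predA, valid), mse(predB, valid))

# Compare the blend with the better single component on each split.
c(trainBlend = trainBlend, trainBestComponent = trainBest,
  validBlend = validBlend, validBestComponent = validBest)
```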
Or you can see the worst models:
## modelType
## 8 eXtreme Gradient Boosted Trees Regressor (Gamma Loss)
## 9 RandomForest Regressor
## 10 Generalized Additive2 Model (Gamma Loss)
## 11 Elastic-Net Regressor (L2 / Gamma Deviance)
## 12 Elastic-Net Regressor (mixing alpha=0.5 / Gamma Deviance)
## 13 Keras Slim Residual Neural Network Regressor using Training Schedule (1 Layer: 64 Units)
## validationMetric
## 8 0.01669
## 9 0.01772
## 10 0.01885
## 11 0.02282
## 12 0.02338
## 13 0.03272
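The best and worst entries can also be picked out programmatically. The sketch below rebuilds a small dataframe from the validation scores printed in the two listings above (model 7 appears in neither listing, so it is left out) and sorts it; with the full modelFrame, the same ordering logic would apply to its validationMetric column.

```r
# Validation scores copied from the two listings above; model 7 is not
# shown in either listing, so it is omitted here.
modelScores <- data.frame(
  rank = c(1:6, 8:13),
  validationMetric = c(0.01535, 0.01538, 0.01539, 0.01549, 0.01549, 0.01572,
                       0.01669, 0.01772, 0.01885, 0.02282, 0.02338, 0.03272)
)

# Smaller Gamma Deviance is better, so sort in ascending order.
sortedScores <- modelScores[order(modelScores$validationMetric), ]
head(sortedScores, 1)  # best model
tail(sortedScores, 1)  # worst model

# Ratio of the worst to the best validation score.
ratio <- max(modelScores$validationMetric) / min(modelScores$validationMetric)
ratio
```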
It is also important to note that several of the models appear to be identical, based on their modelType values, but they exhibit different performances. This is most obvious from the four models labelled “eXtreme Gradient Boosted” but it is also true of six other modelType values, each of which appears two or three times in the plot, generally with different values for Gamma Deviance.validation. In fact, these models are not identical, but differ in the preprocessing applied to them, or in other details.
In most cases, these differences may be seen by examining the expandedModel values from modelFrame. For example, let's get all the different models that use the Elastic-Net Regressor:
## [1] "Light Gradient Boosting on ElasticNet Predictions (Gamma Loss)::Numeric Data Cleansing::Smooth Ridit Transform::Elastic-Net Regressor (L2 / Gamma Deviance)"
## [2] "Elastic-Net Regressor (L2 / Gamma Deviance)::Geospatial Location Converter::Smooth Ridit Transform::Numeric Data Cleansing"
## [3] "Elastic-Net Regressor (mixing alpha=0.5 / Gamma Deviance)::Geospatial Location Converter::Smooth Ridit Transform::Missing Values Imputed"
In particular, note that the modelType value appears at the beginning of the expandedModel character string, which is then followed by any preprocessing applied in fitting the model. Thus, comparing the elements of the list shown above, we can see that they differ in their preprocessing steps and primary ML model. In the case of the Light Gradient Boosting model, the difference is that the Elastic-Net model is a preprocessing step whose predictions are fed into a different model.
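The structure of these strings can be made explicit by splitting them on the "::" separator. The sketch below uses the three expandedModel values printed above, copied verbatim; the variable names are chosen for illustration.

```r
# The three expandedModel strings shown above.
expandedModel <- c(
  "Light Gradient Boosting on ElasticNet Predictions (Gamma Loss)::Numeric Data Cleansing::Smooth Ridit Transform::Elastic-Net Regressor (L2 / Gamma Deviance)",
  "Elastic-Net Regressor (L2 / Gamma Deviance)::Geospatial Location Converter::Smooth Ridit Transform::Numeric Data Cleansing",
  "Elastic-Net Regressor (mixing alpha=0.5 / Gamma Deviance)::Geospatial Location Converter::Smooth Ridit Transform::Missing Values Imputed"
)

# All three mention the Elastic-Net Regressor somewhere in the string.
usesElasticNet <- grepl("Elastic-Net", expandedModel, fixed = TRUE)

# Split on "::"; the first piece is the modelType, the rest are the other steps.
pieces <- strsplit(expandedModel, "::", fixed = TRUE)
modelType <- vapply(pieces, function(p) p[1], character(1))
otherSteps <- lapply(pieces, function(p) p[-1])

modelType
otherSteps
```

In the first string the Elastic-Net Regressor shows up among the other steps rather than as the modelType, which is the distinction drawn in the text.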
The generation of model predictions uses Predict:
bestModel <- GetRecommendedModel(project,
                                 type = RecommendedModelType$RecommendedForDeployment)
bestPredictions <- Predict(bestModel, Ames)
GetRecommendedModel gives us the best model without having to go through the metrics manually. We can see in this case what the model is:
## [1] "eXtreme Gradient Boosted Trees Regressor (Gamma Loss)"
How good are our predictions? The plot below shows predicted versus observed Sale_Price values for this model. If the predictions were perfect, all of these points would lie on the dashed red equality line. The relatively narrow scatter of most points around this reference line suggests that this model is performing reasonably well for most of the dataset, with a few significant exceptions.
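A plot of this kind can be drawn with base graphics. The helper below is a hypothetical sketch, not necessarily the code used for the vignette's figure; it assumes the observed values and the predictions are numeric vectors of equal length.

```r
# Hypothetical helper: scatter of predicted against observed values with a
# dashed red equality line; returns the correlation invisibly.
plot_predicted_vs_observed <- function(observed, predicted) {
  limits <- range(c(observed, predicted))
  plot(observed, predicted, xlim = limits, ylim = limits,
       xlab = "Observed Sale_Price", ylab = "Predicted Sale_Price")
  abline(a = 0, b = 1, lty = 2, col = "red")
  invisible(cor(observed, predicted))
}

# With the objects from the text, this would be called as:
# plot_predicted_vs_observed(Ames$Sale_Price, bestPredictions)
```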
But which features most drive our predictions? Which features are most important to the model? For this, we turn to feature impact.
## featureName impactNormalized impactUnnormalized redundantWith
## 1 Gr_Liv_Area 1.0000000 0.010288891 NA
## 2 Year_Built 0.9732372 0.010013532 NA
## 3 Total_Bsmt_SF 0.9005380 0.009265537 NA
## 4 Year_Remod_Add 0.7162664 0.007369587 NA
## 5 geometry 0.5108179 0.005255749 NA
## 6 Garage_Cars 0.3961717 0.004076167 NA
We can now see that Gr_Liv_Area (above grade living area) and Year_Built (the year of construction) are the two best predictors of our target Sale_Price (most recent home sale price). Normalized impact gives us a ratio of the impact of the feature relative to the top feature, whereas unnormalized impact is the actual impact statistic.
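The relationship between the two impact columns can be reproduced from the table above. This sketch copies the unnormalized values by hand and divides each by the largest one; it only illustrates the arithmetic and does not query the modeling engine.

```r
# Unnormalized impact values copied from the table above.
impact <- data.frame(
  featureName = c("Gr_Liv_Area", "Year_Built", "Total_Bsmt_SF",
                  "Year_Remod_Add", "geometry", "Garage_Cars"),
  impactUnnormalized = c(0.010288891, 0.010013532, 0.009265537,
                         0.007369587, 0.005255749, 0.004076167)
)

# Normalized impact: each value divided by the impact of the top feature.
impact$impactNormalized <-
  impact$impactUnnormalized / max(impact$impactUnnormalized)
impact
```

Up to rounding, the computed column matches the impactNormalized values in the table.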
This note has presented a general introduction to the datarobot R package, describing and illustrating its most important functions. To keep this summary to a manageable length, no attempt has been made to describe all of the package’s capabilities; for a more detailed discussion, refer to the help files.