--- title: "Introduction to Prediction Explanations" author: "Peter Hurford, Colin Priest, Sergey Yurgenson, Thakur Raj Anand" date: "`r Sys.Date()`" output: rmarkdown::html_vignette: fig_caption: yes vignette: > %\VignetteIndexEntry{Using Prediction Explanations} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- A few questions always asked by business leaders after seeing the results of highly accurate machine learning models are as follows - Are machine learning models interpretable and transparent? - How can the results of the model be used to develop a business strategy? - Can the predictions from the model be used to explain to the regulators why something was rejected or accepted based on model prediction? DataRobot does provide many diagnostics like partial dependence, feature impact, and prediction explanations to answer the above questions and using those diagnostics predictions can be converted to prescriptions for the business. In this vignette we would be covering prediction explanations. Partial dependence has been covered in detail in the companion vignette "Interpreting Predictive Models Using Partial Dependence Plots". ## Introduction The DataRobot modeling engine is a commercial product that supports the rapid development and evaluation of a large number of different predictive models from a single data source. The open-source R package datarobot allows users of the DataRobot modeling engine to interact with it from R, creating new modeling projects, examining model characteristics, and generating predictions from any of these models for a specified dataset. This vignette illustrates how to interact with DataRobot using **datarobot** package, build models, make prediction using a model and then use prediction explanations to explain why a model is predicting high or low. Prediction explanations can be used to answer the questions mentioned earlier. ## Load the useful libraries Let's load **datarobot** and other useful packages ```{r results = "asis", message = FALSE, warning = FALSE} library(httr) library(knitr) library(data.table) ``` ## Connecting to DataRobot To access the DataRobot modeling engine, it is necessary to establish an authenticated connection, which can be done in one of two ways. In both cases, the necessary information is an **endpoint** - the URL address of the specific DataRobot server being used - and a **token**, a previously validated access token. **token** is unique for each DataRobot modeling engine account and can be accessed using the DataRobot webapp in the account profile section. **endpoint** depends on DataRobot modeling engine installation (cloud-based, on-prem...) you are using. Contact your DataRobot admin for endpoint to use. The **endpoint** for DataRobot cloud accounts is ```https://app.datarobot.com/api/v2``` The first access method uses a YAML configuration file with these two elements - labeled **token** and **endpoint** - located at $HOME/.config/datarobot/drconfig.yaml. If this file exists when the datarobot package is loaded, a connection to the DataRobot modeling engine is automatically established. It is also possible to establish a connection using this YAML file via the ConnectToDataRobot function, by specifying the configPath parameter. The second method of establishing a connection to the DataRobot modeling engine is to call the function ConnectToDataRobot with the **endpoint** and **token** parameters. ```{r results = 'asis', message = FALSE, warning = FALSE, eval = FALSE} library(datarobot) endpoint <- "https:///api/v2" apiToken <- "" ConnectToDataRobot(endpoint = endpoint, token = apiToken) ``` ## Data We would be using a sample dataset related to credit scoring open sourced by LendingClub. Below is the information related to the variables. ```{r echo = FALSE, results = "asis", message = FALSE, warning = FALSE} Lending <- fread("https://s3.amazonaws.com/datarobot_public_datasets/10K_Lending_Club_Loans.csv") EDA <- t(summary(Lending)) kable(EDA, longtable = TRUE, booktabs = TRUE, row.names = TRUE) ``` ## Divide data into train and test and setup the project Let's divide our data in train and test. We can use train data to create a datarobot project using **StartProject** function and test data to make predictions and generate prediction explanations. Detailed explanation about creating projects was described in the vignette , “Introduction to the DataRobot R Package.” The specific sequence used here was: ```{r results = "asis", message = FALSE, warning = FALSE, eval = FALSE} target <- "is_bad" projectName <- "Credit Scoring" set.seed(1111) split <- sample(nrow(Lending), round(0.9 * nrow(Lending)), replace = FALSE) train <- Lending[split,] test <- Lending[-split,] project <- StartProject(dataSource = train, projectName = projectName, target = target, workerCount = "max", wait = TRUE) ``` Once the modeling process has completed, the ListModels function returns an S3 object of class “listOfModels” that characterizes all of the models in a specified DataRobot project. ```{r results = "asis", message = FALSE, warning = FALSE, eval = FALSE} results <- as.data.frame(ListModels(project)) kable(head(results), longtable = TRUE, booktabs = TRUE, row.names = TRUE) ``` ```{r echo = FALSE, results = "asis", message = FALSE, warning = FALSE} results <- readRDS("PredictionExplanationsModelResults.rds") kable(head(results), longtable = TRUE, booktabs = TRUE, row.names = TRUE) ``` ## Generating Model Predictions Let's look at some model predictions. The generation of model predictions uses the `Predict` function: ```{r results = "asis", message = FALSE, warning = FALSE, eval = FALSE} bestModel <- GetRecommendedModel(project) bestPredictions <- Predict(bestModel, test, type = "probability") testPredictions <- data.frame(original = test$is_bad, prediction = bestPredictions) kable(head(testPredictions), longtable = TRUE, booktabs = TRUE, row.names = TRUE) ``` ```{r echo = FALSE, results = "asis", message = FALSE, warning = FALSE} testPredictions <- readRDS("PredictionExplanationsTestPredictions.rds") kable(head(testPredictions), longtable = TRUE, booktabs = TRUE, row.names = TRUE) ``` ## Calculate Prediction Explanations For each prediction, DataRobot provides an ordered list of explanations; the number of explanations is based on the setting. Each explanation is a feature from the dataset and its corresponding value, accompanied by a qualitative indicator of the explanation’s strength—strong (+++), medium (++), or weak (+) positive or negative (-) influence. There are three main inputs you can set for DataRobot to use when computing prediction explanations 1. `maxExplanations`: the Number of explanations for each predictions. Default is 3. 2. `thresholdLow`: Probability threshold below which DataRobot should calculate prediction explanations. 3. `thresholdHigh`: Probability threshold above which DataRobot should calculate prediction explanations. ```{r results = "asis", message = FALSE, warning = FALSE, eval = FALSE} explanations <- GetPredictionExplanations(bestModel, test, maxExplanations = 3, thresholdLow = 0.25, thresholdHigh = 0.75) kable(head(explanations), longtable = TRUE, booktabs = TRUE, row.names = TRUE) ``` ```{r echo = FALSE, results = "asis", message = FALSE, warning = FALSE} explanations <- readRDS("PredictionExplanations.rds") kable(head(explanations), longtable = TRUE, booktabs = TRUE, row.names = TRUE) ``` From the example above, you could answer "Why did the model give one of the customers a 97% probability of defaulting?" Top explanation explains that **purpose_cat** of loan was "credit card small business"" and we can also see in above example that whenever model is predicting high probability of default, **purpose_cat** is related to small business. Some notes on explanations: - If the data points are very similar, the explanations can list the same rounded up values. - It is possible to have a explanation state of MISSING if a “missing value” was important in making the prediction. - Typically, the top explanations for a prediction have the same direction as the outcome, but it’s possible that with interaction effects or correlations among variables a explanation could, for instance, have a strong positive impact on a negative prediction. ## Adjusted Predictions in Prediction Explanations In some projects -- such as insurance projects -- the prediction adjusted by exposure is more useful to look at than just raw prediction. For example, the raw prediction (e.g. claim counts) is divided by exposure (e.g. time) in the project with exposure column. The adjusted prediction provides insights with regard to the predicted claim counts per unit of time. To include that information, set `excludeAdjustedPredictions` to False in correspondent method calls. ```{r results = "asis", message = FALSE, warning = FALSE, eval = FALSE} explanations <- GetPredictionExplanations(bestModel, test, maxExplanations = 3, thresholdLow = 0.25, thresholdHigh = 0.75, excludeAdjustedPredictions = FALSE) kable(head(explanations), longtable = TRUE, booktabs = TRUE, row.names = TRUE) ``` ```{r echo = FALSE, results = "asis", message = FALSE, warning = FALSE} explanations <- readRDS("PredictionExplanationsExposure.rds") kable(head(explanations), longtable = TRUE, booktabs = TRUE, row.names = TRUE) ``` ## Summary This note has described the **Prediction Explanations** which are useful for understanding why model is predicting high or low predictions for a specific case. DataRobot also provides qualitative strength of each explanation. **Prediction Explanations** can be used in developing good business strategy by taking prescriptions based on the explanations which are responsible for high or low predictions. They are also useful in explaining the actions taken based on the model predictions to regulatory or compliance department within an organization.