--- title: "Introduction to Lift Chart, ROC Curve and Word Cloud" author: "Peter Hurford, Thakur Raj Anand, Chester Ismay" date: "`r Sys.Date()`" output: rmarkdown::html_vignette: fig_caption: yes vignette: > %\VignetteIndexEntry{Advanced Model Insights} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- In V2.7 release of DataRobot API, the following model insights have been added: - Lift Chart - ROC Curve - Word Cloud Insights provided by Lift Chart and ROC Curves are helpful in checking the performance of machine learning models. Word clouds are helpful for understanding useful words and phrases generated after applying different NLP techniques to unstructured data. We will explore each one of these in detail. ## Connecting to DataRobot To access the DataRobot modeling engine, it is necessary to establish an authenticated connection, which can be done in one of two ways. In both cases, the necessary information is an **endpoint**, the URL address of the specific DataRobot server being used and a **token**, a previously validated access token. **token** is unique for each DataRobot modeling engine account and can be accessed using the DataRobot webapp in the account profile section. **endpoint** depends on DataRobot modeling engine installation (cloud-based vs. on-premise) you are using. Contact your DataRobot admin for information on which endpoint to use if you do not know. The **endpoint** for DataRobot cloud accounts is `https://app.datarobot.com/api/v2`. The first access method uses a YAML configuration file with these two elements - labeled **token** and **endpoint** - located at `$HOME/.config/datarobot/drconfig.yaml`. If this file exists when the datarobot package is loaded, a connection to the DataRobot modeling engine is automatically established during `library(datarobot)`. It is also possible to establish a connection using this YAML file via the `ConnectToDataRobot()` function, by specifying the `configPath` parameter. The second method of establishing a connection to the DataRobot modeling engine is to call the function ConnectToDataRobot with the **endpoint** and **token** parameters. ```{r results = "asis", message = FALSE, warning = FALSE, eval = FALSE} library(datarobot) ConnectToDataRobot(endpoint = "http:///api/v2", token = "") ``` ## Data We will be using the [Lending Club dataset](https://s3.amazonaws.com/datarobot_public_datasets/10K_Lending_Club_Loans.csv), a sample dataset related to credit scoring open-sourced by LendingClub. We can create a project with this dataset like this: ```{r results = "asis", message = FALSE, warning = FALSE, eval = FALSE} lendingClubURL <- "https://s3.amazonaws.com/datarobot_public_datasets/10K_Lending_Club_Loans.csv" project <- StartProject(dataSource = lendingClubURL, projectName = "AdvancedModelInsightsVignette", mode = "auto", target = "is_bad", workerCount = "max", wait = TRUE) ``` Once the modeling process has completed, the `ListModels` function returns an S3 object of class `listOfModels` that characterizes all of the models in a specified DataRobot project. It is important to use `WaitforAutopilot` before calling `ListModels`, as the function will return only a partial list (and a warning) if the autopilot is not yet complete. ```{r results = "asis", message = FALSE, warning = FALSE, eval = FALSE} results <- as.data.frame(ListModels(project)) saveRDS(results, "resultsModelInsights.rds") library(knitr) kable(head(results), longtable = TRUE, booktabs = TRUE, row.names = TRUE) ``` ```{r echo = FALSE, results = "asis", message = FALSE, warning = FALSE} results <- readRDS("resultsModelInsights.rds") library(knitr) kable(head(results), longtable = TRUE, booktabs = TRUE, row.names = TRUE) ``` ## Lift Chart Lift chart data can be retrieved for a specific data partition (validation, cross-validation, or holdout) or for all the data partitions using `GetLiftChart` and `ListLiftCharts`. To retrieve the data for holdout partition, it needs to be unlocked first. Let's retrieve the validation partition data for top model using `GetLiftChart`. The `GetLiftChart` function returns data for validation partition by default. We can retrieve data for specific data partition by passing value to source parameter in `GetLiftChart`. ```{r results = "asis", message = FALSE, warning = FALSE, eval = FALSE} project <- GetProject("5eed0d790ef80408ae212f09") allModels <- ListModels(project) saveRDS(allModels, "modelsModelInsights.rds") modelFrame <- as.data.frame(allModels) metric <- modelFrame$validationMetric if (project$metric %in% c('AUC', 'Gini Norm')) { bestIndex <- which.max(metric) } else { bestIndex <- which.min(metric) } bestModel <- allModels[[bestIndex]] bestModel$modelType ``` ```{r echo = FALSE, results = "asis", message = FALSE, warning = FALSE} allModels <- readRDS("modelsModelInsights.rds") bestModel <- allModels[[1]] bestModel$modelType ``` This selects a Gradient Boosted Greedy Trees Classifier with Early Stopping model. The lift chart data we retrieve from the server includes the mean of the model prediction and the mean of the actual target values, sorted by the prediction values in ascending order and split into up to 60 bins. ```{r results = "asis", message = FALSE, warning = FALSE, eval = FALSE} lc <- GetLiftChart(bestModel) saveRDS(lc, "liftChartModelInsights.rds") head(lc) ``` ```{r echo = FALSE, results = "asis", message = FALSE, warning = FALSE} lc <- readRDS("liftChartModelInsights.rds") head(lc) ``` ```{r results = "asis", message = FALSE, warning = FALSE, eval = FALSE} ValidationLiftChart <- GetLiftChart(bestModel, source = "validation") dr_dark_blue <- "#08233F" dr_blue <- "#1F77B4" dr_orange <- "#FF7F0E" # Function to plot lift chart library(data.table) LiftChartPlot <- function(ValidationLiftChart, bins = 10) { if (60 %% bins == 0) { ValidationLiftChart$bins <- rep(seq(bins), each = 60 / bins) ValidationLiftChart <- data.table(ValidationLiftChart) ValidationLiftChart[, actual := mean(actual), by = bins] ValidationLiftChart[, predicted := mean(predicted), by = bins] unique(ValidationLiftChart[, -"binWeight"]) } else { "Please provide bins less than 60 and divisor of 60" } } LiftChartData <- LiftChartPlot(ValidationLiftChart) saveRDS(LiftChartData, "LiftChartDataVal.rds") par(bg = dr_dark_blue) plot(LiftChartData$Actual, col = dr_orange, pch = 20, type = "b", main = "Lift Chart", xlab = "Bins", ylab = "Value") lines(LiftChartData$Predicted, col = dr_blue, pch = 20, type = "b") ``` ```{r echo = FALSE, results = "asis", message = FALSE, warning = FALSE} # dr_dark_blue <- "#08233F" # dr_blue <- "#1F77B4" # dr_orange <- "#FF7F0E" # LiftChartData <- readRDS("LiftChartDataVal.rds") # par(bg = dr_dark_blue) # plot(LiftChartData$Actual, col = dr_orange, pch = 20, type = "b", # main = "Lift Chart", xlab = "Bins", ylab = "Value") # lines(LiftChartData$Predicted, col = dr_blue, pch = 20, type = "b") knitr::include_graphics("liftChartValidation.png") ``` All the available lift chart data can be retrieved using `ListLiftCharts`. Here is an example retrieving data for all the available partitions, followed by plotting the cross validation partition: ```{r results = "asis", message = FALSE, warning = FALSE, eval = FALSE} AllLiftChart <- ListLiftCharts(bestModel) LiftChartData <- LiftChartPlot(AllLiftChart[["crossValidation"]]) saveRDS(LiftChartData, "LiftChartDataCV.rds") par(bg = dr_dark_blue) plot(LiftChartData$Actual, col = dr_orange, pch = 20, type = "b", main = "Lift Chart", xlab = "Bins", ylab = "Value") lines(LiftChartData$Predicted, col = dr_blue, pch = 20, type = "b") ``` ```{r echo = FALSE, results = "asis", message = FALSE, warning = FALSE} # LiftChartData <- readRDS("LiftChartDataCV.rds") # par(bg = dr_dark_blue) # plot(LiftChartData$Actual, col = dr_orange, pch = 20, type = "b", # main = "Lift Chart", xlab = "Bins", ylab = "Value") # lines(LiftChartData$Predicted, col = dr_blue, pch = 20, type = "b") knitr::include_graphics("liftChartCrossValidation.png") ``` We can also plot the lift chart using `ggplot2`: ```{r, eval = TRUE} library(ggplot2) lc$actual <- lc$actual / lc$binWeight lc$predicted <- lc$predicted / lc$binWeight lc <- lc[order(lc$predicted), ] lc$binWeight <- NULL lc <- data.frame(value = c(lc$actual, lc$predicted), variable = c(rep("Actual", length(lc$actual)), rep("Predicted", length(lc$predicted))), id = rep(seq_along(lc$actual), 2)) ggplot(lc) + geom_line(aes(x = id, y = value, color = variable)) ``` ## ROC Curve Data The receiver operating characteristic curve, or ROC curve, is a graphical plot that illustrates the performance of a binary classifier system as its discrimination threshold is varied. The curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings. ROC curve data can be generated for a specific data partition (validation, cross validation, or holdout) or for all the data partition using `GetRocCurve` and `ListRocCurves`. To retrieve ROC curve information use `GetRocCurve`: ```{r results = "asis", message = FALSE, warning = FALSE, eval = FALSE} roc <- GetRocCurve(bestModel) saveRDS(roc, "ROCCurveModelInsights.rds") ``` ```{r echo = FALSE, results = "asis", message = FALSE, warning = FALSE} lc <- readRDS("ROCCurveModelInsights.rds") ``` You can then plot the results: ```{r results = "asis", message = FALSE, warning = FALSE, eval = FALSE} dr_dark_blue <- "#08233F" dr_roc_green <- "#03c75f" ValidationRocCurve <- GetRocCurve(bestModel) ValidationRocPoints <- ValidationRocCurve[["rocPoints"]] saveRDS(ValidationRocPoints, "ValidationRocPoints.rds") par(bg = dr_dark_blue, xaxs = "i", yaxs = "i") plot(ValidationRocPoints$falsePositiveRate, ValidationRocPoints$truePositiveRate, main = "ROC Curve", xlab = "False Positive Rate (Fallout)", ylab = "True Positive Rate (Sensitivity)", col = dr_roc_green, ylim = c(0,1), xlim = c(0,1), pch = 20, type = "b") ``` ```{r echo = FALSE, results = "asis", message = FALSE, warning = FALSE} dr_dark_blue <- "#08233F" dr_roc_green <- "#03c75f" ValidationRocPoints <- readRDS("ValidationRocPoints.rds") par(bg = dr_dark_blue, xaxs = "i", yaxs = "i") plot(ValidationRocPoints$falsePositiveRate, ValidationRocPoints$truePositiveRate, main = "ROC Curve", xlab = "False Positive Rate (Fallout)", ylab = "True Positive Rate (Sensitivity)", col = dr_roc_green, ylim = c(0, 1), xlim = c(0, 1), pch = 20, type = "b") ``` All the available ROC curve data can be retrieved using `ListRocCurves`. Here again is an example to retrieve data for all the available partitions, followed by plotting the cross validation partition: ```{r results = "asis", message = FALSE, warning = FALSE, eval = FALSE} AllRocCurve <- ListRocCurves(bestModel) CrossValidationRocPoints <- AllRocCurve[['crossValidation']][['rocPoints']] saveRDS(CrossValidationRocPoints, 'CrossValidationRocPoints.rds') par(bg = dr_dark_blue, xaxs = "i", yaxs = "i") plot(CrossValidationRocPoints$falsePositiveRate, CrossValidationRocPoints$truePositiveRate, main = "ROC Curve", xlab = "False Positive Rate (Fallout)", ylab = "True Positive Rate (Sensitivity)", col = dr_roc_green, ylim = c(0, 1), xlim = c(0, 1), pch = 20, type = "b") ``` ```{r echo = FALSE, results = "asis", message = FALSE, warning = FALSE} CrossValidationRocPoints <- readRDS("CrossValidationRocPoints.rds") par(bg = dr_dark_blue, xaxs = "i", yaxs = "i") plot(CrossValidationRocPoints$falsePositiveRate, CrossValidationRocPoints$truePositiveRate, main = "ROC Curve", xlab = "False Positive Rate (Fallout)", ylab = "True Positive Rate (Sensitivity)", col = dr_roc_green, ylim = c(0, 1), xlim = c(0, 1), pch = 20, type = "b") ``` You can also plot the ROC curve using `ggplot2`: ```{r, eval = TRUE} ggplot( ValidationRocPoints, aes(x = falsePositiveRate, y = truePositiveRate) ) + geom_line() ``` #### Threshold operations You can get the recommended threshold value with maximal F1 score. That is the same threshold that is preselected in DataRobot when you open the "ROC curve" tab. ```{r, eval = TRUE} threshold <- ValidationRocPoints$threshold[which.max(ValidationRocPoints$f1Score)] ``` You can also estimate metrics for different threshold values. This will produce the same results as updating the threshold on the DataRobot "ROC curve" tab. ```{r, eval = FALSE} ValidationRocPoints[ValidationRocPoints$threshold == tail(Filter(function(x) x > threshold, ValidationRocPoints$threshold), 1), ] ``` ## Word Cloud The word cloud is a type of insight available for some text-processing models for datasets containing text columns. You can get information about how the appearance of each ngram (word or sequence of words) in the text field affects the predicted target value. This example will show you how to obtain word cloud data and visualize it, similar to how DataRobot visualizes the word cloud in the "Model Insights" tab interface. The visualization example here uses the `modelwordcloud` package. Now let's find our word cloud: ```{r, results = "asis", message = FALSE, warning = FALSE, eval = FALSE} # Find word-based models by looking for "word" modelType wordModels <- allModels[grep("Word", lapply(allModels, `[[`, "modelType"))] wordModel <- wordModels[[1]] # Get word cloud wordCloud <- GetWordCloud(project, wordModel$modelId) saveRDS(wordCloud, "wordCloudModelInsights.rds") ``` ```{r, echo = FALSE, results = "asis", message = FALSE, warning = FALSE} library(modelwordcloud) wordCloud <- readRDS("wordCloudModelInsights.rds") ``` Now we plot it! ```{r color-specs, include = FALSE, eval = FALSE} colors <- c( colormap::colormap(c("#255FEC", "#2DBEF9")), colormap::colormap( c("#FFAC9D", "#D80909"), reverse = TRUE ) ) saveRDS(colors, "colors.rds") ``` ```{r, warning = FALSE, eval = TRUE} # Remove stop words wordCloud <- wordCloud[!wordCloud$isStopword, ] # Specify colors similar to what DataRobot produces for # a wordcloud in Insights colors <- readRDS("colors.rds") # Make word cloud suppressWarnings( wordcloud(words = wordCloud$ngram, freq = wordCloud$frequency, coefficients = wordCloud$coefficient, colors = colors, scale = c(3, 0.3)) ) ```