---
title: "ROCR: visualizing classifier performance in R"
output: rmarkdown::html_vignette
author: Tobias Sing, Oliver Sander, Niko Beerenwinkel, Thomas Lengauer
abstract:
  ROCR is a package for evaluating and visualizing the performance of scoring 
  classifiers in the statistical language R. It features over 25 performance 
  measures that can be freely combined to create two-dimensional performance 
  curves. Standard methods for investigating trade-offs between specific 
  performance measures are available within a uniform framework, including 
  receiver operating characteristic (ROC) graphs, precision/recall plots, lift 
  charts and cost curves. ROCR integrates tightly with R's powerful graphics 
  capabilities, thus allowing for highly adjustable plots. Being equipped with 
  only three commands and reasonable default values for optional parameters, 
  ROCR combines flexibility with ease of usage.
vignette: >
  %\VignetteIndexEntry{ROCR}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
bibliography: references.bibtex
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

# Introduction

```{r setup}
library(ROCR)
```

Pattern classification has become a central tool in bioinformatics, offering
rapid insights into large data sets [[@Baldi2001]](#References). While one area
of our work involves predicting phenotypic properties of HIV-1 from genotypic
information
[[@Beerenwinkel2002;@Beerenwinkel2003;@Sing04learningmixtures]](#References),
scoring or ranking predictors are also vital in a wide range of other biological
problems. Examples include microarray analysis (e.g. prediction of tissue
condition based on gene expression), protein structural and functional
characterization (remote homology detection, prediction of post-translational
modifications and molecular function annotation based on sequence or structural
motifs), genome annotation (gene finding and splice site identification),
protein–ligand interactions (virtual screening and molecular docking) and
structure–activity relationships (predicting bioavailability or toxicity of drug
compounds). In many of these cases, considerable class skew, class-specific
misclassification costs, and extensive noise due to variability in experimental
assays complicate predictive modelling. Thus, careful predictor validation is
compulsory.

```{r, echo = FALSE, results = 'asis'}
table <- data.frame(group = c("Contingency ratios",
                              "Discrete covariation measures",
                              "Information retrieval measures",
                              "Performance in ROC space",
                              "Absolute scoring performance",
                              "Cost measures"),
                    measure = c("error rate, accuracy, sensitivity, specificity, true/false positive rate, fallout, miss, precision, recall, negative predictive value, prediction-conditioned fallout/miss.",
                                "Phi/Matthews correlation coefficient, mutual information, Chi-squared test statistic, odds ratio",
                                "F-measure, lift, precision-recall break-even point",
                                "ROC convex hull, area under the ROC curve",
                                "calibration error, mean cross-entropy, root mean-squared error",
                                "expected cost, explicit cost"))
knitr::kable(table,
             caption = "***Table 1:**Performance measures in the ROCR package*",
             col.names = c("",""),
             align = "l")
```

The real-valued output of scoring classifiers is turned into a binary class
decision by choosing a cutoff. As no cutoff is optimal according to all possible
performance criteria, cutoff choice involves a trade-off among different
measures. Typically, a trade-off between a pair of criteria (e.g. sensitivity
versus specificity) is visualized as a cutoff-parametrized curve in the plane
spanned by the two measures. Popular examples of such trade-off visualizations
include receiver operating characteristic (ROC) graphs, sensitivity/specificity
curves, lift charts and precision/recall plots. [@Fawcett2004](#References)
provides a general introduction into evaluating scoring classifiers with a focus
on ROC graphs.

Although functions for drawing ROC graphs are provided by the Bioconductor
project (http://www.bioconductor.org) or by the machine learning package Weka
(http://www.cs.waikato.ac.nz/ml), for example, no comprehensive evaluation suite
is available to date. ROCR is a flexible evaluation package for R
(https://www.r-project.org), a statistical language that is widely used in
biomedical data analysis. Our tool allows for creating cutoff-parametrized
performance curves by freely combining two out of more than 25 performance
measures (Table 1). Curves from different cross-validation or bootstrapping runs
can be averaged by various methods. Standard deviations, standard errors and box
plots are available to summarize the variability across the runs. The
parametrization can be visualized by printing cutoff values at the corresponding
curve positions, or by coloring the curve according to the cutoff. All
components of a performance plot are adjustable using a flexible mechanism for
dispatching optional arguments. Despite this flexibility, ROCR is easy to use,
with only three commands and reasonable default values for all optional
parameters.

In the example below, we will briefly introduce ROCR's three
commands—prediction, performance and plot—applied to a 10-fold cross-validation
set of predictions and corresponding class labels from a study on predicting HIV
coreceptor usage from the sequence of the viral envelope protein. After loading
the dataset, a prediction object is created from the raw predictions and class
labels.

```{r}
data(ROCR.hiv)
predictions <- ROCR.hiv$hiv.svm$predictions
labels <- ROCR.hiv$hiv.svm$labels
pred <- prediction(predictions, labels)
pred
```

Performance measures or combinations thereof are computed by invoking the
performance method on this prediction object. The resulting performance object
can be visualized using the method plot. For example, an ROC curve that trades
off the rate of true positives against the rate of false positives is obtained
as follows:

```{r, fig.asp=1, fig.width=5, fig.align='center'}
perf <- performance(pred, "tpr", "fpr")
perf
plot(perf,
     avg="threshold",
     spread.estimate="boxplot")
```

The optional parameter avg selects a particular form of performance curve
averaging across the validation runs; the visualization of curve variability is
determined with the parameter spread.estimate.

```{r, echo=FALSE, results='asis', fig.asp=0.35, fig.width=7, fig.align='center',fig.cap="***Fig 1:** Visualizations of classifier performance (HIV coreceptor usage data): (a) receiver operating characteristic (ROC) curve; (b) peak accuracy across a range of cutoffs; (c) absolute difference between empirical and predicted rate of positives for windowed cutoff ranges, in order to evaluate how well the scores are calibrated as probability estimates. Owing to the probabilistic interpretation, cutoffs need to be in the interval [0,1], in contrast to other performance plots. (d) Score density estimates for the negative (solid) and positive (dotted) class.*"}
data(ROCR.hiv)
pp.unnorm <- ROCR.hiv$hiv.svm$predictions
ll <- ROCR.hiv$hiv.svm$labels

# normalize predictions to 0..1
v <- unlist(pp.unnorm)
pp <- lapply(pp.unnorm, function(run) {approxfun(c(min(v), max(v)), c(0,1))(run)})

par(mfrow=c(1,4))
pred<- prediction(pp, ll)
perf <- performance(pred, "tpr", "fpr")

plot(perf, avg= "threshold", colorize=TRUE, lwd= 3,
     coloraxis.at=seq(0,1,by=0.2),)
plot(perf, col="gray78", add=TRUE)
plot(perf, avg= "threshold", colorize=TRUE, colorkey=FALSE,lwd= 3,,add=TRUE)
mtext(paste0("(a)"), side = 3, adj = 0.01,line = 1)

perf <- performance(pred, "acc")
plot(perf, avg= "vertical", spread.estimate="boxplot", lwd=3,col='blue',
     show.spread.at= seq(0.1, 0.9, by=0.1),)
mtext(paste0("(b)"), side = 3, adj = 0.01,line = 1)


plot(performance(pred, "cal", window.size= 10),
     avg="vertical",)
mtext(paste0("(c)"), side = 3, adj = 0.01,line = 1)

plot(0,0,type="n", xlim= c(0,1), ylim=c(0,7),
     xlab="Cutoff", ylab="Density",)
mtext(paste0("(d)"), side = 3, adj = 0.01,line = 1)
for (runi in 1:length(pred@predictions)) {
  lines(density(pred@predictions[[runi]][pred@labels[[runi]]=="-1"]), col= "red")
  lines(density(pred@predictions[[runi]][pred@labels[[runi]]=="1"]), col="green")
}
```

Issuing `demo(ROCR)` starts a demonstration of further graphical capabilities of
ROCR. The command `help(package=ROCR)` points to the available help pages. In
particular, a complete list of available performance measures can be obtained
via help(performance). A reference manual can be downloaded from the ROCR
website.

In conclusion, ROCR is a comprehensive tool for evaluating scoring classifiers
and producing publication-quality figures. It allows for studying the
intricacies inherent to many biological datasets and their implications on
classifier performance.

## Additional examples

Below you can find many additional examples of ROCR's features of performance 
measurement and the possibilites in plotting. However, this only a first 
taste. For more examples, please run `demo(ROCR)` and make sure the plotting
deminsions are big enough.

### ROC curves, Precision/Recall graphs and more ...

```{r, fig.asp=1, fig.width=5, fig.align='center'}
perf <- performance(pred, "tpr", "fpr")
plot(perf,
     avg= "threshold",
     colorize=TRUE,
     lwd= 3,
     main= "With ROCR you can produce standard plots\nlike ROC curves ...")
plot(perf,
     lty=3,
     col="grey78",
     add=TRUE)
```


```{r, fig.asp=1, fig.width=5, fig.align='center'}
perf <- performance(pred, "prec", "rec")
plot(perf,
     avg= "threshold",
     colorize=TRUE,
     lwd= 3,
     main= "... Precision/Recall graphs ...")
plot(perf,
     lty=3,
     col="grey78",
     add=TRUE)
```


```{r, fig.asp=1, fig.width=5, fig.align='center'}
perf <- performance(pred, "sens", "spec")
plot(perf,
     avg= "threshold",
     colorize=TRUE,
     lwd= 3,
     main="... Sensitivity/Specificity plots ...")
plot(perf,
     lty=3,
     col="grey78",
     add=TRUE)
```


```{r, fig.asp=1, fig.width=5, fig.align='center'}
perf <- performance(pred, "lift", "rpp")
plot(perf,
     avg= "threshold",
     colorize=TRUE,
     lwd= 3,
     main= "... and Lift charts.")
plot(perf,
     lty=3,
     col="grey78",
     add=TRUE)
```

### Averaging over multiple predictions

Multiple batches of predictions can be analyzed at the same time.

```{r}
data(ROCR.xval)
predictions <- ROCR.xval$predictions
labels <- ROCR.xval$labels
length(predictions)
```

```{r}
pred <- prediction(predictions, labels)
perf <- performance(pred,'tpr','fpr')
```

This can be used for plotting averages using the `avg` argument.

```{r, fig.asp=1, fig.width=5, fig.align='center'}
plot(perf,
     colorize=TRUE,
     lwd=2,
     main='ROC curves from 10-fold cross-validation')
```

```{r, fig.asp=1, fig.width=5, fig.align='center'}
plot(perf,
     avg='vertical',
     spread.estimate='stderror',
     lwd=3,main='Vertical averaging + 1 standard error',
     col='blue')
```

```{r, fig.asp=1, fig.width=5, fig.align='center'}
plot(perf,
     avg='horizontal',
     spread.estimate='boxplot',
     lwd=3,
     main='Horizontal averaging + boxplots',
     col='blue')
```

```{r, fig.asp=1, fig.width=5, fig.align='center'}
plot(perf,
     avg='threshold',
     spread.estimate='stddev',
     lwd=2,
     main='Threshold averaging + 1 standard deviation',
     colorize=TRUE)
```

### Cutoff stacking

```{r, fig.asp=1, fig.width=6, fig.align='center'}
plot(perf,
     print.cutoffs.at=seq(0,1,by=0.2),
     text.cex=0.8,
     text.y=lapply(as.list(seq(0,0.5,by=0.05)), function(x) { rep(x,length(perf@x.values[[1]])) } ),
     col= as.list(terrain.colors(10)),
     text.col= as.list(terrain.colors(10)), 
     points.col= as.list(terrain.colors(10)), 
     main= "Cutoff stability")
```

### Combination of performance measures

Performance measures can be combined freely.

```{r}
perf <- performance(pred,"pcmiss","lift")
```

```{r, fig.asp=1, fig.width=5, fig.align='center'}
plot(perf,
     colorize=TRUE,
     print.cutoffs.at=seq(0,1,by=0.1),
     text.adj=c(1.2,1.2),
     avg="threshold",
     lwd=3,
     main= "You can freely combine performance measures ...")
```


# Acknowledgement

Work at MPI supported by EU NoE BioSapiens (LSHG-CT-2003-503265).

<a name="References"></a>

# References