---
title: "Umpire 2.0: Clinically Realistic Simulations"
author: "Kevin R. Coombes and Caitlin E. Coombes"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Umpire 2.0}
  %\VignetteKeywords{Umpire, simulations, mixed type data, clinical data}
  %\VignetteDepends{Umpire}
  %\VignettePackage{Umpire}
  %\VignetteEngine{knitr::rmarkdown}
---
# Introduction
Version 2.0 of the Ultimate Microarray Prediction, Inference, and
Reality Engine (Umpire) extends the functions of the Umpire 1.0 R
package to allow researchers to simulate realistic, mixed-type,
clinical data. Statisticians, computer scientists, and clinical
informaticians who develop and improve methods to analyze clinical data
from a variety of contexts (including clinical trials, population
cohorts, and electronic medical record sources) recognize that it is
difficult to evaluate methods on real data where "ground truth" is unknown.
Frequently, they turn to simulations where they can control the
underlying structure, which can result in simulations that are too
simplistic to reflect complex clinical data realities. Clinical
measurements on patients may be treated as independent, in spite of
the elaborate correlation structures that arise in networks, pathways,
organ systems, and syndromes in real biology. Further, the researcher
finds limited tools at her disposal to facilitate simulation of binary,
categorical, or mixed data at this representative level of biological
complexity.
In this vignette, we describe a workflow with the Umpire package
to simulate biologically realistic, mixed-type clinical data.
As usual, we start by loading the package:
```{r}
library(Umpire)
```
# Simulating Mixed-Type Clinical Data
Because we are going to run simulations, we set the seed of the random
number generator so that the results are reproducible.
```{r seed}
set.seed(84503)
```
## Model Subtypes and Survival
The simulation workflow begins by simulating complex, correlated,
continuous data with known "ground truth" by instantiating a
ClinicalEngine. We simulate 20 features and 4 clusters of
unequal size. The ClinicalEngine generates subtypes (clusters) with
known "ground truth" through an implementation of the Umpire 1.0
CancerModel and CancerEngine.
```{r}
ce <- ClinicalEngine(20, 4, isWeighted = TRUE)
summary(ce)
```
Note that the prevalences are not equal; when you use isWeighted = TRUE,
they are chosen from a Dirichlet distribution.
Note also that the summary function describes the object as a
CancerEngine, since the same underlying structure is used to
implement a ClinicalEngine.
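As a rough illustration of what drawing prevalences from a Dirichlet
distribution means, the base R sketch below normalizes independent gamma
draws, which is a standard way to sample from a Dirichlet. The concentration
parameters in `alpha` are hypothetical; the values that Umpire uses
internally are not stated in this vignette.

```{r}
# Hypothetical concentration parameters, one per subtype (an assumption,
# not Umpire's internal setting).
alpha <- rep(2, 4)
# Normalizing independent Gamma(alpha_k, 1) draws yields one Dirichlet draw.
g <- rgamma(length(alpha), shape = alpha, rate = 1)
prev <- g / sum(g)
prev
sum(prev)  # prevalences sum to 1, up to floating-point error
```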
Now we confirm that the model expects to produce the 20 features that we
requested. It will do so using 10 "components", where each component consists
of a pair of correlated features.
```{r nrow}
nrow(ce)
nComponents(ce)
```
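To make the idea of a component as a pair of correlated features concrete,
here is a minimal base R sketch that generates one such pair for 300
patients. It is not how the ClinicalEngine builds its components; the
correlation `rho` and the use of a Cholesky factor are assumptions chosen
only for illustration.

```{r}
rho <- 0.7   # hypothetical within-pair correlation (an assumption)
Sigma <- matrix(c(1, rho, rho, 1), nrow = 2)
# chol() returns an upper-triangular R with t(R) %*% R equal to Sigma, so
# t(R) %*% z has covariance Sigma when z is standard normal.
z <- matrix(rnorm(2 * 300), nrow = 2)   # 2 features x 300 patients
pair <- t(chol(Sigma)) %*% z
dim(pair)
cor(pair[1, ], pair[2, ])   # close to rho, up to sampling error
```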
## Simulate Raw Data
The ClinicalEngine is used to simulate the raw, base dataset.
```{r}
dset <- rand(ce, 300)
```
Data are simulated as a list with two objects: simulated
data and associated clinical information, including
"ground truth" subtype membership and survival data (outcome, length of
followup, and occurrence of event of interest within the followup period).
```{r}
class(dset)
names(dset)
summary(dset$clinical)
```
The raw data are simulated as a matrix of continuous values.
```{r}
class(dset$data)
dim(dset$data)
```
## Apply Clinically Realistic Noise
The user may add further additive noise to the raw data. The
ClinicalNoiseModel simulates additive noise for each
feature _f_ and patient _i_ as a normal distribution
$E_{fi} \sim N(0, \tau)$, where the standard deviation $\tau$
is itself drawn from a gamma distribution with hyperparameters shape
and scale, $\tau \sim Gamma(shape, scale)$. Thus, the ClinicalNoiseModel
generates many features with low noise (such as a tightly calibrated
laboratory test) and some features with high noise (such as
a blood pressure measured by hand and manually entered into the
medical record). The user may apply default parameters or individual
parameters. Next, the ClinicalNoiseModel is applied to
blur the previously simulated data. The default model
below generates a low overall level of additive noise.
```{r}
cnm <- ClinicalNoiseModel(nrow(ce@localenv$eng), shape = 1.02, scale = 0.05)
summary(cnm)
noisy <- blur(cnm, dset$data)
```
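The two-level noise model described above can be sketched in base R without
the package. This is only an illustration under stated assumptions: it reads
$\tau$ as one standard deviation per feature, reuses the shape and scale
values from the chunk above, and uses hypothetical names (`nFeat`, `nPat`,
`tau`, `E`). It does not show how `blur` is implemented.

```{r}
nFeat <- 20
nPat <- 300
# One standard deviation per feature: tau_f ~ Gamma(shape, scale).
tau <- rgamma(nFeat, shape = 1.02, scale = 0.05)
# E_fi ~ N(0, tau_f). rnorm() recycles tau down each column of the
# features-by-patients matrix, so row f always uses tau[f].
E <- matrix(rnorm(nFeat * nPat, mean = 0, sd = tau), nrow = nFeat)
summary(tau)               # most values are small, a few are larger
summary(apply(E, 1, sd))   # empirical row SDs track tau
```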
## Simulate Mixed-Type Data
Umpire 2.0 allows the simulation of binary, nominal,
and ordinal data from raw, continuous data in variable, user-defined
mixtures. The user defines prevalences, summing to 1, of binary,
continuous, and categorical data in the desired final mixture.
For categorical features, the user may tune the percent of categorical
data desired to be nominal and the range of the number of categories to be
simulated.
The data simulated above by the ClinicalEngine and
ClinicalNoiseModel take rows (not columns) as features, following
an omics convention. Thus, by default, when generating data,
rows are treated as features and columns as patients. The makeDataTypes
method transposes its results to a data frame where the columns are features
and the rows are patients. This transposition fits better with the
conventions used for clinical data, and it also supports the ability to store
different kinds of (mixed-type) data in different columns.
```{r}
dt <- makeDataTypes(dset$data,
pCont = 1/3, pBin = 1/3, pCat = 1/3,
pNominal = 0.5, range = 3:9,
inputRowsAreFeatures = TRUE)
names(dt)
```
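The reason for the transposition can be seen in a toy example that does not
use the package. A matrix holds a single type, whereas a data frame can hold
a different type in each column. The object names and the median cut below
are hypothetical and are not taken from makeDataTypes.

```{r}
# Toy features-by-patients matrix: 3 features (rows), 5 patients (columns).
m <- matrix(rnorm(15), nrow = 3,
            dimnames = list(paste0("F", 1:3), paste0("P", 1:5)))
# Transpose so that patients are rows, then store as a data frame.
toy <- as.data.frame(t(m))
# Dichotomize F2 at its median (an arbitrary, illustrative cut); the
# other columns stay continuous.
toy$F2 <- factor(toy$F2 > median(toy$F2),
                 levels = c(FALSE, TRUE), labels = c("low", "high"))
str(toy)
```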
The makeDataTypes function returns a list containing two objects:
binned, a data.frame of mixed-type data, and cutpoints.
```{r}
class(dt$binned)
dim(dt$binned)
summary(dt$binned)
```
The cutpoints contain a record, for each feature, of data
type, break points, and labels. Here are two examples of the kind of
information stored for a cutpoint.
```{r}
dt$cutpoints[[1]]
dt$cutpoints[[5]]
```
And here is an overview of the number of features of each type.
```{r}
cp <- dt$cutpoints
type <- sapply(cp, function(X) { X$Type })
table(type)
```
The cutpoints should be saved for downstream use in
the MixedTypeEngine.
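To see why stored break points and labels are useful, consider the base R
sketch below, which reapplies one saved record to new continuous values with
`cut`. The record `rec` and its field names are hypothetical and only mimic
the kind of information described above; they are not necessarily the
structure that makeDataTypes produces.

```{r}
# Hypothetical cutpoint record; field names are illustrative assumptions.
rec <- list(type = "ordinal",
            breaks = c(-Inf, -0.5, 0.5, Inf),
            labels = c("low", "mid", "high"))
newvals <- c(-1.2, 0.1, 0.7, -0.4, 2.3)
# Reusing the stored breaks and labels bins new data on the same scale.
rebinned <- cut(newvals, breaks = rec$breaks, labels = rec$labels,
                ordered_result = (rec$type == "ordinal"))
rebinned
table(rebinned)
```

Because the breaks are fixed in advance, any later dataset is binned with
exactly the same categories as the first one.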
## The MixedTypeEngine
The many parameters defining a simulated data mixture can be stored as
a single MixedTypeEngine for downstream use to easily generate
future datasets with the same simulation parameters.
The MixedTypeEngine stores the following components for re-implementation:
1. The ClinicalEngine, including parameters for generating the subtype pattern and survival model.
2. The ClinicalNoiseModel.
3. The cutpoints generated by makeDataTypes.
```{r}
mte <- MixedTypeEngine(ce,
noise = cnm,
cutpoints = dt$cutpoints)
summary(mte)
```
With rand, the user can easily generate new data sets with
the same simulation parameters.
```{r}
dset2 <- rand(mte, 20)
class(dset2)
summary(dset2$data)
summary(dset2$clinical)
```
By using the keepall argument of the function, you can keep the
intermediate datasets produced by the rand method.
```{r}
dset3 <- rand(mte, 25, keepall = TRUE)
class(dset3)
names(dset3)
```
The raw and noisy elements have the rows as (future clinical)
features and the columns as patients/samples.
```{r raw}
dim(dset3$raw)
summary(t(dset3$raw))
dim(dset3$noisy)
summary(t(dset3$noisy))
```
Noisy data arise by adding simulated noise to the raw data.
```{r, fig.cap="Raw and noisy data."}
plot(dset3$raw[5,], dset3$noisy[5,], xlab = "Raw", ylab = "Noisy", pch=16)
```
The binned element has columns as features and rows as samples.
Binned data arise by applying cut points to noisy data.
```{r fig.cap = "Noisy and binned data."}
dim(dset3$binned)
summary(dset3$binned)
plot(dset3$binned[,5], dset3$noisy[5,], xlab = "Binned", ylab = "Noisy")
```