---
title: "Umpire 2.0: Clinically Realistic Simulations"
author: "Kevin R. Coombes and Caitlin E. Coombes"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Umpire 2.0}
  %\VignetteKeywords{Umpire, simulations, mixed type data, clinical data}
  %\VignetteDepends{Umpire}
  %\VignettePackage{Umpire}
  %\VignetteEngine{knitr::rmarkdown}
---

# Introduction
Version 2.0 of the Ultimate Microarray Prediction, Inference, and Reality Engine (Umpire) extends the functions of the Umpire 1.0 R package to allow researchers to simulate realistic, mixed-type, clinical data. Statisticians, computer scientists, and clinical informaticians who develop and improve methods to analyze clinical data from a variety of contexts (including clinical trials, population cohorts, and electronic medical record sources) recognize that it is difficult to evaluate methods on real data where "ground truth" is unknown. Frequently, they turn to simulations where they can control the underlying structure, but the resulting simulations are often too simplistic to reflect the complex realities of clinical data. Clinical measurements on patients may be treated as independent, in spite of the elaborate correlation structures that arise in networks, pathways, organ systems, and syndromes in real biology. Further, the researcher finds limited tools at her disposal to facilitate simulation of binary, categorical, or mixed data at this representative level of biological complexity.

In this vignette, we describe a workflow with the Umpire package to simulate biologically realistic, mixed-type clinical data. As usual, we start by loading the package:
```{r}
library(Umpire)
```

# Simulating Mixed-Type Clinical Data
Since we are going to run simulations, we should set the seed of the random number generator so that the results are reproducible.

```{r seed}
set.seed(84503)
```

## Model Subtypes and Survival
The simulation workflow begins by instantiating a ClinicalEngine, which simulates complex, correlated, continuous data with known "ground truth". We simulate 20 features and 4 clusters of unequal size. The ClinicalEngine generates subtypes (clusters) with known "ground truth" through an implementation of the Umpire 1.0 CancerModel and CancerEngine.
```{r}
ce <- ClinicalEngine(20, 4, isWeighted = TRUE)
summary(ce)
```
Note that the prevalences are not equal; when you use isWeighted = TRUE, they are chosen from a Dirichlet distribution. Note also that the summary function describes the object as a CancerEngine, since the same underlying structure is used to implement a ClinicalEngine.

Now we confirm that the model expects to produce the 20 features that we requested. It will do so using 10 "components", where each component consists of a pair of correlated features.
```{r nrow}
nrow(ce)
nComponents(ce)
```

## Simulate Raw Data
The ClinicalEngine is used to simulate the raw, base dataset.
```{r}
dset <- rand(ce, 300)
```
Data are simulated as a list with two objects: simulated data and associated clinical information, including "ground truth" subtype membership and survival data (outcome, length of followup, and occurrence of event of interest within the followup period).
```{r}
class(dset)
names(dset)
summary(dset$clinical)
```
The raw data are simulated as a matrix of continuous values.
```{r}
class(dset$data)
dim(dset$data)
```

## Apply Clinically Realistic Noise
The user may add further additive noise to the raw data. The ClinicalNoiseModel simulates additive noise for each feature _f_ and patient _i_ from a normal distribution $E_{fi} \sim N(0, \tau)$, where the standard deviation $\tau$ varies across features according to a gamma distribution $\tau \sim Gamma(shape, scale)$ whose shape and scale are hyperparameters.
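
The two-level noise model just described can be sketched in base R, independently of Umpire. This is only an illustration under stated assumptions: the variable names are hypothetical, rgamma and rnorm stand in for whatever the package does internally, and $\tau$ is treated as a per-feature standard deviation, as in the text.

```{r noise-sketch}
# Illustrative sketch only; not Umpire's internal implementation.
nFeatures <- 20    # hypothetical: number of features (rows)
nPatients <- 300   # hypothetical: number of patients (columns)

# One standard deviation per feature: tau ~ Gamma(shape, scale).
tau <- rgamma(nFeatures, shape = 1.02, scale = 0.05)

# Noise E_fi ~ N(0, tau) for feature f and patient i.
# rnorm() recycles sd, and matrix() fills column by column,
# so every entry in row f is drawn with standard deviation tau[f].
E <- matrix(rnorm(nFeatures * nPatients, mean = 0, sd = tau),
            nrow = nFeatures, ncol = nPatients)

# Distribution of the per-feature noise levels.
summary(tau)
```

With these values, the mean of $\tau$ is shape × scale = 0.051.
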
Thus, the ClinicalNoiseModel generates many features with low noise (such as a tightly calibrated laboratory test) and some features with high noise (such as a blood pressure measured by hand and manually entered into the medical record). The user may accept the default parameters or supply their own.

Next, the ClinicalNoiseModel is applied to blur the previously simulated data. The default model below generates a low overall level of additive noise.
```{r}
cnm <- ClinicalNoiseModel(nrow(ce@localenv$eng), shape = 1.02, scale = 0.05)
summary(cnm)
noisy <- blur(cnm, dset$data)
```

## Simulate Mixed-Type Data
Umpire 2.0 allows the simulation of binary, nominal, and ordinal data from raw, continuous data in variable, user-defined mixtures. The user defines prevalences, summing to 1, of binary, continuous, and categorical data in the desired final mixture. For categorical features, the user may tune the fraction of categorical features that should be nominal and the range of the number of categories to be simulated.

The data simulated above by the ClinicalEngine and ClinicalNoiseModel take rows (not columns) as features, following an omics convention. Thus, by default, when generating data, rows are treated as features and columns as patients. The makeDataTypes method transposes its results to a data frame where the columns are features and the rows are patients. This transposition not only fits better with the conventions used for clinical data, but also supports the ability to store different kinds of (mixed-type) data in different columns.
```{r}
dt <- makeDataTypes(dset$data,
                    pCont = 1/3, pBin = 1/3, pCat = 1/3,
                    pNominal = 0.5, range = 3:9,
                    inputRowsAreFeatures = TRUE)
names(dt)
```
The makeDataTypes function generates a list containing two objects: a data.frame of mixed-type data...
```{r}
class(dt$binned)
dim(dt$binned)
summary(dt$binned)
```
The cutpoints contain a record, for each feature, of data type, break points, and labels.
Here are two examples of the kind of information stored for a cutpoint.
```{r}
dt$cutpoints[[1]]
dt$cutpoints[[5]]
```
And here is an overview of the number of features of each type.
```{r}
cp <- dt$cutpoints
type <- sapply(cp, function(X) { X$Type })
table(type)
```
The cutpoints should be saved for downstream use in the MixedTypeEngine.

## The MixedTypeEngine
The many parameters defining a simulated data mixture can be stored as a single MixedTypeEngine, which makes it easy to generate future datasets with the same simulation parameters. The MixedTypeEngine stores the following components for re-implementation:

1. The ClinicalEngine, including parameters for generating the subtype pattern and survival model.
2. The ClinicalNoiseModel.
3. The cutpoints generated by makeDataTypes.

```{r}
mte <- MixedTypeEngine(ce,
                       noise = cnm,
                       cutpoints = dt$cutpoints)
summary(mte)
```
With rand, the user can easily generate new data sets with the same simulation parameters.
```{r}
dset2 <- rand(mte, 20)
class(dset2)
summary(dset2$data)
summary(dset2$clinical)
```

By using the keepall argument of the rand function, you can keep the intermediate datasets that it produces.
```{r}
dset3 <- rand(mte, 25, keepall = TRUE)
class(dset3)
names(dset3)
```

The raw and noisy elements have the rows as (future clinical) features and the columns as patients/samples.
```{r raw}
dim(dset3$raw)
# transpose so that summary() reports one column per feature
summary(t(dset3$raw))
dim(dset3$noisy)
summary(t(dset3$noisy))
```
Noisy data arise by adding simulated noise to the raw data.
```{r, fig.cap="Raw and noisy data."}
plot(dset3$raw[5,], dset3$noisy[5,], xlab = "Raw", ylab = "Noisy", pch=16)
```

The binned element has columns as features and rows as samples. Binned data arise by applying cut points to noisy data.
```{r fig.cap = "Noisy and binned data."}
dim(dset3$binned)
summary(dset3$binned)
plot(dset3$binned[,5], dset3$noisy[5,], xlab = "Binned", ylab = "Noisy")
```
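
To make the binning step concrete, the following sketch applies hypothetical cut points to a simulated continuous vector using the base R function cut. The break points and labels are invented for illustration and are not taken from dt$cutpoints; the procedure Umpire uses to choose them may differ.

```{r binning-sketch}
# Illustrative sketch only; breaks and labels are hypothetical.
x <- rnorm(25)   # stand-in for one noisy continuous feature

# Binary feature: split at the median.
binary <- cut(x, breaks = c(-Inf, median(x), Inf),
              labels = c("Low", "High"))

# Ordinal feature: three ordered categories split at the tertiles.
ordinal <- cut(x, breaks = c(-Inf, quantile(x, c(1/3, 2/3)), Inf),
               labels = c("A", "B", "C"), ordered_result = TRUE)

table(binary)
table(ordinal)
```

Storing the breaks and labels alongside the data type is what allows the same discretization to be reapplied to newly simulated data.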