Introduction

Version 2.0 of the Ultimate Microarray Prediction, Inference, and Reality Engine (Umpire) extends the functions of the Umpire 1.0 R package to allow researchers to simulate realistic, mixed-type, clinical data. Statisticians, computer scientists, and clinical informaticians who develop and improve methods to analyze clinical data from a variety of contexts (including clinical trials, population cohorts, and electronic medical record sources) recognize that it is difficult to evaluate methods on real data where “ground truth” is unknown. Frequently, they turn to simulations where the can control the underlying structure, which can result in simulations which are too simplistic to reflect complex clinical data realities. Clinical measurements on patients may be treated as independent, in spite of the elaborate correlation structures that arise in networks, pathways, organ systems, and syndromes in real biology. Further, the researcher finds limited tools at her disposal to facilitate simulation of binary, categorical, or mixed data at this representative level of biological complexity.

In this vignette, we describe a workflow with the Umpire package to simulate biologically realistic, mixed-type clinical data.

As usual, we start by loading the package:

library(Umpire)

Simulating Mixed-Type Clinical Data

Since we are going to run simulations, for reproducibility purposes, we should set the seed of the random number generator.

set.seed(84503)

Model Subtypes and Survival

The simulation workflow begins by simulating complex, correlated, continuous data with known “ground truth” by instantiating a ClinicalEngine. We simulate 20 features and 4 clusters of unequal size. The ClinicalEngine generates subtypes (clusters) with known “ground truth” through an implementation of the Umpire 1.0 CancerModel and CancerEngine.

ce <- ClinicalEngine(20, 4, isWeighted = TRUE)
summary(ce)

## A 'CancerEngine' using the cancer model:
## --------------
## Clinical Simulation Model (Raw), a CancerModel object constructed via:
##    CancerModel(name = "Clinical Simulation Model (Raw)", nPossible = NP, 
##     nPattern = nClusters, HIT = hitfn, SURV = SURV, OUT = OUT, 
##     survivalModel = survivalModel, prevalence = Prevalence(isWeighted, 
##     nClusters)) 
## 
## Pattern prevalences:
## [1] 0.3173905 0.1315154 0.1265533 0.4245408
## 
## Survival effects:
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -0.63510 -0.40700 -0.24181 -0.21561 -0.01502  0.21166 
## 
## Outcome effects:
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -0.228756 -0.080970 -0.035408 -0.009743  0.116928  0.124044 
## --------------
## 
## Base expression given by:
## An Engine with 10 components.
## 
## Altered expression given by:
## An Engine with 10 components.

Note that the prevalences are not equal; when you use isweighted = TRUE, they are chosen from a Dirichlet distribution. Note also that the summary function describes the object as a CancerEngine, since the same underlying structure is used to implement a ClinicalEngine.

Now we confirm that the model expects to produce the 20 features that we requested. It will do so using 10 “components”, where each component consists of a pair of correlated features.

nrow(ce)

## [1] 20

nComponents(ce)

## [1] 10

Simulate Raw Data

The ClinicalEngine is used to simulate the raw, base dataset.

dset <- rand(ce, 300)

Data are simulated as a list with two objects: simulated data and associated clinical information, including “ground truth” subtype membership and survival data (outcome, length of followup, and occurrence of event of interest within the followup period).

class(dset)

## [1] "list"

names(dset)

## [1] "clinical" "data"

summary(dset$clinical)

##  CancerSubType   Outcome         LFU          Event        
##  Min.   :1.000   Bad :151   Min.   : 0.00   Mode :logical  
##  1st Qu.:1.000   Good:149   1st Qu.:19.75   FALSE:197      
##  Median :3.000              Median :31.50   TRUE :103      
##  Mean   :2.717              Mean   :33.81                  
##  3rd Qu.:4.000              3rd Qu.:49.00                  
##  Max.   :4.000              Max.   :71.00

The raw data are simulated as a matrix of continuous values.

class(dset$data)

## [1] "matrix" "array"

dim(dset$data)

## [1]  20 300

Apply Clinically Realistic Noise

The user may add further additive noise to the raw data. The ClinicalNoiseModel simulates additive noise for each feature f and patient i as a normal distribution \(E_{fi} \sim N(0, \tau)\) , where the standard deviation \(\tau\) varies with a hyperparameter along the gamma distribution \(\tau \sim Gamma(shape, scale)\). Thus, the ClinicalNoiseModel generates many features with low noise (such as a tightly calibrated laboratory test) and some features with high noise (such as a blood pressure measured by hand and manually entered into the medical record.) The user may apply default parameters or individual parameters. Next, the ClinicalNoiseModel is applied to blur the previously simulated data. The default model below generates a low overall level of additive noise.

cnm <- ClinicalNoiseModel(nrow(ce@localenv$eng), shape = 1.02, scale = 0.05)
summary(cnm)

## A 'NoiseModel' with:
##   additive offset = 0
##   additive scale distributed as:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00601 0.03036 0.05852 0.06958 0.09856 0.18595 
##   multiplicative scale = 0

noisy <- blur(cnm, dset$data)

Simulate Mixed-Type Data

Umpire 2.0 allows the simulation of binary, nominal, and ordinal data from raw, continuous data in variable, user-defined mixtures. The user defines prevalences, summing to 1, of binary, continuous, and categorical data in the desired final mixture. For categorical features, the user may tune the percent of categorical data desired to be nominal and the range of the number of categories to be simulated.

The data simulated above by the ClinicalEngine and ClinicalNoiseModel takes rows (not columns) as features, as an omics convention. Thus, by default, when generating data, rows are treated as features and columns as patients. The makeDataTypes method transposes its results to a data frame where the columns are features and the rows are patients. This transposition both fits better with the conventions used for clinical data, but also supports the ability to store different kinds of (mixed-type) data in different columns.

dt <- makeDataTypes(dset$data,
                   pCont = 1/3, pBin = 1/3, pCat = 1/3,
                   pNominal = 0.5, range = 3:9,
                   inputRowsAreFeatures = TRUE)
names(dt)

## [1] "binned"    "cutpoints"

The makeDataTypes function generates a list containing two objects: a data.frame of mixed-type data…

class(dt$binned)

## [1] "data.frame"

dim(dt$binned)

## [1] 300  20

summary(dt$binned)

##  V1           V2              V3               V4               V5      
##  R:52   Min.   :5.191   Min.   :0.0000   Min.   :0.0000   Min.   :0.00  
##  S:55   1st Qu.:6.490   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.00  
##  T:32   Median :6.971   Median :0.0000   Median :0.0000   Median :1.00  
##  U:59   Mean   :6.965   Mean   :0.4133   Mean   :0.4067   Mean   :0.66  
##  V:50   3rd Qu.:7.400   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.00  
##  W:52   Max.   :8.792   Max.   :1.0000   Max.   :1.0000   Max.   :1.00  
##                                                                         
##        V6              V7              V8              V9            V10    
##  Min.   :1.923   Min.   :5.454   Min.   :3.563   Min.   :0.00   X      :53  
##  1st Qu.:3.578   1st Qu.:6.566   1st Qu.:4.902   1st Qu.:0.00   V      :48  
##  Median :4.024   Median :7.071   Median :5.378   Median :0.00   Y      :38  
##  Mean   :4.054   Mean   :7.065   Mean   :5.300   Mean   :0.19   S      :36  
##  3rd Qu.:4.502   3rd Qu.:7.540   3rd Qu.:5.679   3rd Qu.:0.00   T      :36  
##  Max.   :5.795   Max.   :8.859   Max.   :6.873   Max.   :1.00   W      :34  
##                                                                 (Other):55  
##  V11         V12             V13            V14           V15          V16     
##  R:40   Min.   :2.797   Min.   :0.00   Min.   :0.0   G      :55   Min.   :0.0  
##  S:48   1st Qu.:4.086   1st Qu.:0.00   1st Qu.:0.0   E      :42   1st Qu.:0.0  
##  T:57   Median :4.460   Median :0.00   Median :0.0   F      :41   Median :1.0  
##  U:39   Mean   :4.440   Mean   :0.07   Mean   :0.1   D      :38   Mean   :0.7  
##  V:56   3rd Qu.:4.825   3rd Qu.:0.00   3rd Qu.:0.0   A      :34   3rd Qu.:1.0  
##  W:60   Max.   :5.798   Max.   :1.00   Max.   :1.0   C      :33   Max.   :1.0  
##                                                      (Other):57                
##       V17             V18        V19         V20       
##  Min.   :2.606   Min.   :4.993   A:52   Min.   :2.356  
##  1st Qu.:4.053   1st Qu.:6.608   B:30   1st Qu.:4.796  
##  Median :4.528   Median :7.179   C:40   Median :5.529  
##  Mean   :4.515   Mean   :7.123   D:38   Mean   :5.505  
##  3rd Qu.:4.912   3rd Qu.:7.608   E:47   3rd Qu.:6.190  
##  Max.   :6.610   Max.   :8.738   F:48   Max.   :8.570  
##                                  G:45

The cutpoints contain a record, for each feature, of data type, break points, and labels. Here are two examples of the kind of information stored for a cutpoint.

dt$cutpoints[[1]]

## $breaks
##        0% 10.55853% 28.88869% 48.54916% 66.07294% 82.94035%      100% 
##      -Inf  4.055293  4.728019  5.149966  5.491690  5.900582       Inf 
## 
## $labels
## [1] "T" "S" "U" "W" "V" "R"
## 
## $Type
## [1] "nominal"

dt$cutpoints[[5]]

## $breaks
## [1]    -Inf 7.35767     Inf
## 
## $labels
## [1] 0 1
## 
## $Type
## [1] "symmetric binary"

And here is an overview of the number of features of each type.

cp <- dt$cutpoints
type <- sapply(cp, function(X) { X$Type })
table(type)

## type
## asymmetric binary        continuous           nominal           ordinal 
##                 1                 8                 3                 2 
##  symmetric binary 
##                 6

The cupoitns should be saved for downstream use in the MixedTypeEngine.

The MixedTypeEngine

The many parameters defining a simulated data mixture can be stored as a single MixedTypeEngine for downstream use to easily generate future datasets with the same simulation parameters.

The MixedTypeEngine stores the following components for re-implementation:

The ClinicalEngine, including parameters for generating the subtype pattern and survival model.
The ClinicalNoiseModel.
The cutpoints generated by makeDataTypes.

mte <- MixedTypeEngine(ce,
                       noise = cnm,
                       cutpoints = dt$cutpoints)
summary(mte)

## A 'MixedTypeEngine' (MTE) based on:
## A 'CancerEngine' using the cancer model:
## --------------
## Clinical Simulation Model (Raw), a CancerModel object constructed via:
##    CancerModel(name = "Clinical Simulation Model (Raw)", nPossible = NP, 
##     nPattern = nClusters, HIT = hitfn, SURV = SURV, OUT = OUT, 
##     survivalModel = survivalModel, prevalence = Prevalence(isWeighted, 
##     nClusters)) 
## 
## Pattern prevalences:
## [1] 0.3173905 0.1315154 0.1265533 0.4245408
## 
## Survival effects:
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -0.63510 -0.40700 -0.24181 -0.21561 -0.01502  0.21166 
## 
## Outcome effects:
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -0.228756 -0.080970 -0.035408 -0.009743  0.116928  0.124044 
## --------------
## 
## Base expression given by:
## An Engine with 10 components.
## 
## Altered expression given by:
## An Engine with 10 components.
## 
## ---------------
## The MTE uses the following noise model:
## A 'NoiseModel' with:
##   additive offset = 0
##   additive scale distributed as:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00601 0.03036 0.05852 0.06958 0.09856 0.18595 
##   multiplicative scale = 0
## ---------------
## The MTE simulates clinical data of these types:

## 
## asymmetric binary        continuous           nominal           ordinal 
##                 1                 8                 3                 2 
##  symmetric binary 
##                 6

With rand, the user can easily generate new data sets with the same simulation parameters.

dset2 <- rand(mte, 20)
class(dset2)

## [1] "list"

summary(dset2$data)

## Length  Class   Mode 
##      0   NULL   NULL

summary(dset2$clinical)

##  CancerSubType Outcome        LFU          Event        
##  Min.   :1.0   Bad :14   Min.   : 7.00   Mode :logical  
##  1st Qu.:2.0   Good: 6   1st Qu.:18.75   FALSE:13       
##  Median :3.5             Median :37.00   TRUE :7        
##  Mean   :2.9             Mean   :37.35                  
##  3rd Qu.:4.0             3rd Qu.:54.75                  
##  Max.   :4.0             Max.   :70.00

By using the keepal argument othe function, you can keep the intermediate datasets produced by the rand method.

dset3 <- rand(mte, 25, keepall = TRUE)
class(dset3)

## [1] "list"

names(dset3)

## [1] "raw"      "clinical" "noisy"    "binned"

The raw and noisy elements have the rows as (future clinical) features and the columns as patients/samples.

dim(dset3$raw)

## [1] 20 25

summary(t(dset3$raw))

##        V1              V2              V3              V4        
##  Min.   :4.028   Min.   :5.816   Min.   :1.901   Min.   : 8.537  
##  1st Qu.:4.535   1st Qu.:6.560   1st Qu.:2.348   1st Qu.: 9.466  
##  Median :5.425   Median :6.903   Median :3.227   Median :10.173  
##  Mean   :5.344   Mean   :6.967   Mean   :3.393   Mean   :10.087  
##  3rd Qu.:5.893   3rd Qu.:7.295   3rd Qu.:4.194   3rd Qu.:10.607  
##  Max.   :7.708   Max.   :8.818   Max.   :5.475   Max.   :11.558  
##        V5              V6              V7              V8       
##  Min.   :6.180   Min.   :2.825   Min.   :6.297   Min.   :4.460  
##  1st Qu.:6.920   1st Qu.:3.395   1st Qu.:6.818   1st Qu.:4.967  
##  Median :7.402   Median :3.797   Median :7.160   Median :5.058  
##  Mean   :7.366   Mean   :3.894   Mean   :7.207   Mean   :5.194  
##  3rd Qu.:7.798   3rd Qu.:4.177   3rd Qu.:7.574   3rd Qu.:5.448  
##  Max.   :8.676   Max.   :5.618   Max.   :8.578   Max.   :6.076  
##        V9             V10             V11             V12       
##  Min.   :5.222   Min.   :3.932   Min.   :4.832   Min.   :3.785  
##  1st Qu.:6.182   1st Qu.:4.690   1st Qu.:6.019   1st Qu.:4.342  
##  Median :6.739   Median :5.504   Median :6.374   Median :4.566  
##  Mean   :6.704   Mean   :5.418   Mean   :6.354   Mean   :4.762  
##  3rd Qu.:7.039   3rd Qu.:5.760   3rd Qu.:6.648   3rd Qu.:5.125  
##  Max.   :8.556   Max.   :6.929   Max.   :8.148   Max.   :6.479  
##       V13             V14             V15             V16        
##  Min.   :6.050   Min.   :6.125   Min.   :4.136   Min.   : 8.014  
##  1st Qu.:7.135   1st Qu.:6.816   1st Qu.:5.638   1st Qu.: 8.626  
##  Median :7.417   Median :7.232   Median :5.940   Median : 9.244  
##  Mean   :7.434   Mean   :7.261   Mean   :6.088   Mean   : 9.071  
##  3rd Qu.:7.879   3rd Qu.:7.779   3rd Qu.:6.558   3rd Qu.: 9.471  
##  Max.   :8.551   Max.   :8.156   Max.   :7.543   Max.   :10.093  
##       V17             V18             V19             V20       
##  Min.   :3.173   Min.   :6.379   Min.   :6.508   Min.   :3.353  
##  1st Qu.:4.296   1st Qu.:6.945   1st Qu.:6.940   1st Qu.:5.250  
##  Median :4.825   Median :7.320   Median :7.200   Median :5.704  
##  Mean   :4.745   Mean   :7.391   Mean   :7.217   Mean   :5.511  
##  3rd Qu.:5.095   3rd Qu.:7.713   3rd Qu.:7.461   3rd Qu.:5.980  
##  Max.   :6.312   Max.   :8.652   Max.   :7.832   Max.   :6.405

dim(t(dset3$noisy))

## [1] 25 20

summary(dset3$noisy)

##        V1               V2               V3               V4        
##  Min.   : 2.287   Min.   : 2.790   Min.   : 3.404   Min.   : 2.941  
##  1st Qu.: 5.653   1st Qu.: 5.500   1st Qu.: 5.874   1st Qu.: 4.558  
##  Median : 6.501   Median : 6.554   Median : 6.646   Median : 6.218  
##  Mean   : 6.526   Mean   : 6.454   Mean   : 6.631   Mean   : 5.930  
##  3rd Qu.: 7.464   3rd Qu.: 7.159   3rd Qu.: 7.208   3rd Qu.: 6.799  
##  Max.   :10.206   Max.   :10.271   Max.   :11.206   Max.   :10.230  
##        V5               V6               V7              V8       
##  Min.   : 3.935   Min.   : 2.921   Min.   :3.263   Min.   :3.144  
##  1st Qu.: 5.416   1st Qu.: 5.379   1st Qu.:5.225   1st Qu.:5.209  
##  Median : 6.396   Median : 6.525   Median :6.327   Median :6.762  
##  Mean   : 6.584   Mean   : 6.472   Mean   :6.236   Mean   :6.488  
##  3rd Qu.: 7.621   3rd Qu.: 7.611   3rd Qu.:7.206   3rd Qu.:7.431  
##  Max.   :10.188   Max.   :10.608   Max.   :9.419   Max.   :9.205  
##        V9             V10              V11             V12       
##  Min.   :4.166   Min.   : 3.269   Min.   :3.477   Min.   :3.911  
##  1st Qu.:5.557   1st Qu.: 5.045   1st Qu.:5.453   1st Qu.:4.915  
##  Median :6.553   Median : 5.996   Median :6.502   Median :6.594  
##  Mean   :6.560   Mean   : 6.344   Mean   :6.422   Mean   :6.518  
##  3rd Qu.:7.577   3rd Qu.: 7.538   3rd Qu.:7.424   3rd Qu.:7.928  
##  Max.   :9.531   Max.   :10.452   Max.   :9.666   Max.   :9.389  
##       V13             V14              V15              V16       
##  Min.   :3.563   Min.   : 1.941   Min.   : 2.194   Min.   :2.342  
##  1st Qu.:4.991   1st Qu.: 4.889   1st Qu.: 5.451   1st Qu.:5.049  
##  Median :6.358   Median : 6.504   Median : 6.011   Median :6.771  
##  Mean   :6.305   Mean   : 6.221   Mean   : 6.317   Mean   :6.308  
##  3rd Qu.:7.340   3rd Qu.: 7.357   3rd Qu.: 7.125   3rd Qu.:7.561  
##  Max.   :9.488   Max.   :10.614   Max.   :11.498   Max.   :9.884  
##       V17              V18              V19              V20       
##  Min.   : 2.422   Min.   : 3.298   Min.   : 2.968   Min.   :3.418  
##  1st Qu.: 4.938   1st Qu.: 4.864   1st Qu.: 5.692   1st Qu.:5.108  
##  Median : 6.343   Median : 6.580   Median : 6.632   Median :6.509  
##  Mean   : 6.094   Mean   : 6.343   Mean   : 6.614   Mean   :6.304  
##  3rd Qu.: 7.371   3rd Qu.: 7.227   3rd Qu.: 7.567   3rd Qu.:7.192  
##  Max.   :10.057   Max.   :10.199   Max.   :11.299   Max.   :9.389  
##       V21              V22             V23             V24       
##  Min.   : 2.162   Min.   :4.072   Min.   :3.012   Min.   :3.273  
##  1st Qu.: 5.056   1st Qu.:5.482   1st Qu.:4.889   1st Qu.:5.086  
##  Median : 6.289   Median :6.592   Median :6.095   Median :6.165  
##  Mean   : 6.327   Mean   :6.533   Mean   :6.270   Mean   :6.166  
##  3rd Qu.: 7.420   3rd Qu.:7.702   3rd Qu.:7.600   3rd Qu.:6.810  
##  Max.   :11.418   Max.   :9.413   Max.   :9.530   Max.   :9.474  
##       V25        
##  Min.   : 2.118  
##  1st Qu.: 4.941  
##  Median : 6.422  
##  Mean   : 6.176  
##  3rd Qu.: 7.322  
##  Max.   :10.860

Noisy data arises by adding simulated noise to the raw data.

plot(dset3$raw[5,], dset3$noisy[5,], xlab = "Raw", ylab = "Noisy", pch=16)

Raw and noisy data.

The binned element has columns as features and rows as samples. Binned data arises by applying cut points to noisy data.

dim(dset3$binned)

## [1] 25 20

summary(dset3$binned)

##  V1          V2        V3     V4     V5           V6              V7       
##  R:6   Min.   :5.878   0:12   0:11   0:12   Min.   :2.921   Min.   :6.165  
##  S:7   1st Qu.:6.477   1:13   1:14   1:13   1st Qu.:3.418   1st Qu.:6.821  
##  T:2   Median :6.918                        Median :3.895   Median :7.197  
##  U:2   Mean   :6.959                        Mean   :3.892   Mean   :7.199  
##  V:5   3rd Qu.:7.247                        3rd Qu.:4.306   3rd Qu.:7.658  
##  W:3   Max.   :8.810                        Max.   :5.739   Max.   :8.577  
##                                                                            
##        V8        V9          V10    V11        V12        V13    V14   
##  Min.   :4.482   0:19   T      :6   R:5   Min.   :3.578   0:22   0:22  
##  1st Qu.:4.925   1: 6   R      :5   S:3   1st Qu.:4.252   1: 3   1: 3  
##  Median :5.103          X      :4   T:2   Median :4.619                
##  Mean   :5.195          S      :3   U:4   Mean   :4.771                
##  3rd Qu.:5.443          V      :2   V:7   3rd Qu.:5.154                
##  Max.   :6.113          W      :2   W:4   Max.   :6.379                
##                         (Other):3                                      
##       V15    V16         V17             V18        V19        V20       
##  D      :5   0: 8   Min.   :3.156   Min.   :6.304   A:3   Min.   :3.360  
##  F      :4   1:17   1st Qu.:4.293   1st Qu.:6.895   B:2   1st Qu.:5.263  
##  G      :4          Median :4.838   Median :7.368   C:4   Median :5.701  
##  A      :3          Mean   :4.748   Mean   :7.388   D:6   Mean   :5.496  
##  B      :3          3rd Qu.:5.105   3rd Qu.:7.768   E:6   3rd Qu.:5.971  
##  H      :3          Max.   :6.333   Max.   :8.630   F:4   Max.   :6.361  
##  (Other):3                                          G:0

plot(dset3$binned[,5], dset3$noisy[5,], xlab = "Binned", ylab = "Noisy")

Noisy and binned data.

- Introduction
- Simulating Mixed-Type Clinical Data