Title: | Simulating Realistic Gene Expression and Clinical Data |
---|---|
Description: | The Ultimate Microrray Prediction, Reality and Inference Engine (UMPIRE) is a package to facilitate the simulation of realistic microarray data sets with links to associated outcomes. See Zhang and Coombes (2012) <doi:10.1186/1471-2105-13-S13-S1>. Version 2.0 adds the ability to simulate realistic mixed-typed clinical data. |
Authors: | Kevin R. Coombes, Jiexin Zhang, Caitlin E. Coombes |
Maintainer: | Kevin R. Coombes <[email protected]> |
License: | Apache License (== 2.0) |
Version: | 2.0.10 |
Built: | 2025-01-08 06:57:40 UTC |
Source: | CRAN |
addControl
is a generic function used to add a control group to
a simulated patient cohort. Implementations exists for a
CancerModel
and for a CancerEngine
.
## S4 method for signature 'ANY' addControl(object, fraction = 0.5, ...) ## S4 method for signature 'CancerModel' addControl(object, fraction = 0.5, ...) ## S4 method for signature 'CancerEngine' addControl(object, fraction = 0.5, ...)
## S4 method for signature 'ANY' addControl(object, fraction = 0.5, ...) ## S4 method for signature 'CancerModel' addControl(object, fraction = 0.5, ...) ## S4 method for signature 'CancerEngine' addControl(object, fraction = 0.5, ...)
object |
an object to which adding a contrl group is desired |
fraction |
a real number between zero oand one; the fraction of the final cohort that shoudl consist of controls |
... |
additional arguments; not yet used |
Returns a new object of the same class as its first argument.
Kevin R. Coombes [email protected],
alterMean
and alterSD
are generic functions used to alter
means or standard deviations, respectively, based on the input object.
Each generic functions invokes different
methods
which depend on the class
of the
first argument.
## S4 method for signature 'ANY' alterMean(object, TRANSFORM, ...) ## S4 method for signature 'ANY' alterSD(object, TRANSFORM, ...)
## S4 method for signature 'ANY' alterMean(object, TRANSFORM, ...) ## S4 method for signature 'ANY' alterSD(object, TRANSFORM, ...)
object |
an object for which altering mean or standard deviation is desired |
TRANSFORM |
function that returns its transformed input |
... |
additional arguments affecting the specific transformation performed |
The form of the value returned by alterMean
or alterSD
depends on the class of its argument. See the documentation of the
particular methods for details of what is produced by that method.
Kevin R. Coombes [email protected], Jiexin Zhang [email protected],
Provides tools to create a CancerEngine with block correlation structure. Also makes it possible to simulate paired clinical and gene expression data with this block structure.
BlockHyperParameters(nExtraBlocks = 100, meanBlockSize = 100, sigmaBlockSize = 30, minBlockSize = 5, mu0 = 6, sigma0 = 1.5, rate = 28.11, shape = 44.25, p.cor = 0.6, wt.cor = 5) makeBlockStructure(cm, hyperp, xform = normalOffset, ...)
BlockHyperParameters(nExtraBlocks = 100, meanBlockSize = 100, sigmaBlockSize = 30, minBlockSize = 5, mu0 = 6, sigma0 = 1.5, rate = 28.11, shape = 44.25, p.cor = 0.6, wt.cor = 5) makeBlockStructure(cm, hyperp, xform = normalOffset, ...)
cm |
object of class |
hyperp |
object of class |
nExtraBlocks |
integer scalar specifying number of blocks not
involved in the "hit" structure defined by the |
meanBlockSize |
numeric scalar specifying mean number of genes in a correlated block |
sigmaBlockSize |
numeric scalar specifying standard deviation of the number of genes in a correlated block |
minBlockSize |
integer scalar specifying minimal number of genes in a correlated block |
mu0 |
numeric scalar specifying expected mean expression level of a gene on the log scale |
sigma0 |
numeric scalar specifying standard deviation of the mean expression level of a gene on the log scale |
rate |
numeric scalar specifying one of the gamma parameters |
shape |
numeric scalar specifying one of the gamma parameters |
p.cor |
numeric scalar specifying expected correlation within each block |
wt.cor |
numeric scalar specifying weight given to the expected block correlation |
xform |
A function that will be passed to the |
... |
extra arguments that wil be passed back to the
|
Our standard model for gene expression in a homogeneous sample assumes
that the overall correlation matrix is block diagonal. Correlation
between genes in different blocks is assumed to be zero. Correlation
for genes in the same block is assumed to be a constant, but different
correlation constants can be used in different blocks. The actual
correlations are assumed to arise from a beta distribution of the form
Beta(pw, (1-p)w), where p=p.cor
and w=wt.cor
are two of the
hyperparameters.
The number of blocks is determined jointly by the CancerModel
,
cm
, and the hyperparameter nExtraBlocks
. The size of a
block is assumed to arise from a normal distribution with mean given
by meanBlockSize
and standard deviaion given by
sigmaBlockSize
. To avoid accidentally assigning non-postive
block sizes, this distribution is truncated below by
minBlockSize
.
The expression of each gene is assumed to come from a log-normal
distribution with parameters describing the per-gene mean ()
and standard deviation (
) n the log scale. These
parameters, in turn, are assumed to come from hyperdistributions.
Specifically, we assume that
comes from a normal distribution
with mean
mu0
and standard deviation sigma0
. We also
assume that comes from an inverse gamma distribution with
parameters
rate
and shape
.
The BlockHyperParameters
class simply bundles the parameters for
this model into a single structure. The default values are consistent
with data we have seen from several Affymetrix microarray studies.
The makeBlockStructure
function takes a CancerModel
and
a BlockHyperParameters
object as arguments and produces a
CancerEngine
object. The rand
method for
this class can be used to generate matched clinical data (with the
structure defined by the CancerModel
object) and gene
expression data with the specified block correlation structure.
The BlockHyperParameters
generator returns an object of class
BlockHyperParameters
.
The function makeBlockStructure
returns an object of the
CancerEngine
class.
Although objects of the class can be created by a direct call to
new, the preferred method is to use the
BlockHyperParameters
generator function.
nExtraBlocks
:An integer; the number of blocks not involved in
the "hit" structure defined by the CancerModel
.
meanBlockSize
:A real number; the mean number of genes in a correlated block.
sigmaBlockSize
:A real number; standard deviation of the number of genes in a correlated block.
minBlockSize
:An integer; the minimal number of genes in a correlated block.
mu0
:A real number; the expected mean expression level of a gene on the log scale.
sigma0
:A real number; the standard deviation of the mean expression level of a gene on the log scale.
rate
:Gamma parameter; see details.
shape
:Gamma parameter; see details.
p.cor
:A real number; the expected correlation within each block.
wt.cor
:A real number; the weight given to the expected block correlation.
There are no special methods defind for this class.
Kevin R. Coombes [email protected],
showClass("BlockHyperParameters") sm <- SurvivalModel(baseHazard = 1/3, units = 52, unitName = "weeks") cm <- CancerModel("myModel", nPossible = 10, nPattern = 5, survivalModel = sm) hyper <- BlockHyperParameters() engine <- makeBlockStructure(cm, hyper) outcome <- rand(engine, 100) summary(outcome$clinical) dim(outcome$data)
showClass("BlockHyperParameters") sm <- SurvivalModel(baseHazard = 1/3, units = 52, unitName = "weeks") cm <- CancerModel("myModel", nPossible = 10, nPattern = 5, survivalModel = sm) hyper <- BlockHyperParameters() engine <- makeBlockStructure(cm, hyper) outcome <- rand(engine, 100) summary(outcome$clinical) dim(outcome$data)
blur
is a generic function used to add noise to a signal as defined
by various objects. The generic function invokes different methods
which depend on the class
of the first argument.
## S4 method for signature 'ANY' blur(object, x, ...)
## S4 method for signature 'ANY' blur(object, x, ...)
object |
an object from which adding noise to its signal is desired |
x |
matrix containing signal to be affected |
... |
additional arguments affecting the noise addition |
The form of the value returned by blur
depends on the
class of its argument. See the documentation of the particular methods
for details of what is produced by that method.
Kevin R. Coombes [email protected],
A CancerEngine combines a CancerModel (which defines the combinatorics of hits that produce cancer subtypes) with a pair of gene expression Engines that can be used to simulate microarray data depending on the presence or absence of different hits.
CancerEngine(cm, base, altered) ## S4 method for signature 'CancerEngine' summary(object, ...) ## S4 method for signature 'CancerEngine' nrow(x) ## S4 method for signature 'CancerEngine' nComponents(object, ...) ## S4 method for signature 'CancerEngine' rand(object, n, ...) ClinicalEngine(nFeatures, nClusters, isWeighted, bHyp = NULL, survivalModel = NULL, SURV = function(n) rnorm(n, 0, 0.3), OUT = function(n) rnorm(n, 0, 0.3))
CancerEngine(cm, base, altered) ## S4 method for signature 'CancerEngine' summary(object, ...) ## S4 method for signature 'CancerEngine' nrow(x) ## S4 method for signature 'CancerEngine' nComponents(object, ...) ## S4 method for signature 'CancerEngine' rand(object, n, ...) ClinicalEngine(nFeatures, nClusters, isWeighted, bHyp = NULL, survivalModel = NULL, SURV = function(n) rnorm(n, 0, 0.3), OUT = function(n) rnorm(n, 0, 0.3))
cm |
object of class |
base |
character string giving the name of an |
altered |
character string giving the name of an |
object |
object of class |
x |
object of class |
n |
numeric scalar representing number of samples to be simulated |
... |
extra arguments for generic routines |
nFeatures |
an integer; the number of simulated clinical features |
nClusters |
an integer; the number of simulated clusters or subtypes |
isWeighted |
a logical value; used to determine whether the prevalence of subtypes is equal (unweighted) or unequal (weighted). |
bHyp |
an object of the class |
survivalModel |
an object of the |
SURV |
a function; the same as used in a |
OUT |
a function; the same as used in a |
Although objects of the class can be created by a direct call to
new, the preferred method is to use the
CancerEngine
or ClinicalEngine
generator functions.
cm
:A CancerModel
object.
base
:Object of class "character" giving the name of
an Engine
or EngineWithActivity
.
Represents the expected gene expression in the absence of any hits.
altered
:Object of class "character" giving the name of
an Engine
or EngineWithActivity
.
Represents the expected gene expression in the presence of hits.
localenv
:Object of class "environment"
; used
in the internal implementation.
signature(object = "CancerEngine")
: returns a list
containing two named components, clinical
and data
, The
clinical element is a data frame generated from the underlying
CancerModel
, and the data element is a matrix generated by the
gene expression engines, altered by the appropriate "hits" present in
each simulated individual.
signature(object = "CancerEngine")
: print
summaries of the underlying cancer models and gene expression engines
in the cancer engine.
Both the CancerEngine
and ClinicalEngine
constructors return objects of the CancerEngine
class. The only
practical differences are the default parameters set by the constructors.
The primary change is in the CancerModel
slot. The
CancerEngine
expects you to construct your own CancerModel
explicitly before producing a CancerEngine
.
By contrast, the ClinicalEngine
expects to use many fewer
features, so gives a simpler interface and uses its parameters to construct the
CancerModel
for you. The "HIT" function will use 2 hits per subtype
when there are fewer than 15 features, 3 hits between 15 and 45 features,
and the older default of 5 hits when there are more than 45 features. The
total number of possible hits is set equal to the number of features (N) when
there are fewer than 12 features, approximately N/3 up to 50 features,
approximately N/5 up to 100, and is then locked at 20. If isWeighted
is TRUE
, subtype prevalences are chosen from a Dirichlet
distribution. Otherwise, each subtype is equally likely.
The nrow
and nComponents
methods both return
non-negative integer values describing the number of rows (features)
and the number of components of the underlying gene expression or
clinical data Engines. The rand
method returns a matrix with
n
columns of smiulated data.
Kevin R. Coombes [email protected], Caitlin E. Coombes [email protected]
Zhang J, Coombes KR.
Sources of variation in false discovery rate estimation include
sample size, correlation, and inherent differences between groups.
BMC Bioinformatics. 2012; 13 Suppl 13:S1.
showClass("CancerEngine") set.seed(391629) ## Set up survival outcome; baseline is exponential sm <- SurvivalModel(baseHazard=1/5, accrual=5, followUp=1) ## Build a CancerModel with 6 subtypes nBlocks <- 20 # number of possible hits cm <- CancerModel(name="cansim", nPossible=nBlocks, nPattern=6, OUT = function(n) rnorm(n, 0, 1), SURV= function(n) rnorm(n, 0, 1), survivalModel=sm) ## Include 100 blocks/pathways that are not hit by cancer nTotalBlocks <- nBlocks + 100 ## Assign values to hyperparameters ## block size blockSize <- round(rnorm(nTotalBlocks, 100, 30)) ## log normal mean hypers mu0 <- 6 sigma0 <- 1.5 ## log normal sigma hypers rate <- 28.11 shape <- 44.25 ## block corr p <- 0.6 w <- 5 ## Set up the baseline Engine rho <- rbeta(nTotalBlocks, p*w, (1-p)*w) base <- lapply(1:nTotalBlocks, function(i) { bs <- blockSize[i] co <- matrix(rho[i], nrow=bs, ncol=bs) diag(co) <- 1 mu <- rnorm(bs, mu0, sigma0) sigma <- matrix(1/rgamma(bs, rate=rate, shape=shape), nrow=1) covo <- co *(t(sigma) %*% sigma) MVN(mu, covo) }) eng <- Engine(base) ## Alter the means if there is a hit altered <- alterMean(eng, normalOffset, delta=0, sigma=1) ## Build the CancerEngine using character strings object <- CancerEngine(cm, "eng", "altered") ## Or build it using the actual Engine components ob <- CancerEngine(cm, eng, altered) summary(object) summary(ob) ## Simulate the data dset <- rand(object, 20) summary(dset$clinical) summary(dset$data[, 1:3])
showClass("CancerEngine") set.seed(391629) ## Set up survival outcome; baseline is exponential sm <- SurvivalModel(baseHazard=1/5, accrual=5, followUp=1) ## Build a CancerModel with 6 subtypes nBlocks <- 20 # number of possible hits cm <- CancerModel(name="cansim", nPossible=nBlocks, nPattern=6, OUT = function(n) rnorm(n, 0, 1), SURV= function(n) rnorm(n, 0, 1), survivalModel=sm) ## Include 100 blocks/pathways that are not hit by cancer nTotalBlocks <- nBlocks + 100 ## Assign values to hyperparameters ## block size blockSize <- round(rnorm(nTotalBlocks, 100, 30)) ## log normal mean hypers mu0 <- 6 sigma0 <- 1.5 ## log normal sigma hypers rate <- 28.11 shape <- 44.25 ## block corr p <- 0.6 w <- 5 ## Set up the baseline Engine rho <- rbeta(nTotalBlocks, p*w, (1-p)*w) base <- lapply(1:nTotalBlocks, function(i) { bs <- blockSize[i] co <- matrix(rho[i], nrow=bs, ncol=bs) diag(co) <- 1 mu <- rnorm(bs, mu0, sigma0) sigma <- matrix(1/rgamma(bs, rate=rate, shape=shape), nrow=1) covo <- co *(t(sigma) %*% sigma) MVN(mu, covo) }) eng <- Engine(base) ## Alter the means if there is a hit altered <- alterMean(eng, normalOffset, delta=0, sigma=1) ## Build the CancerEngine using character strings object <- CancerEngine(cm, "eng", "altered") ## Or build it using the actual Engine components ob <- CancerEngine(cm, eng, altered) summary(object) summary(ob) ## Simulate the data dset <- rand(object, 20) summary(dset$clinical) summary(dset$data[, 1:3])
A CancerModel
object contains a number of pieces of information
representing an abstract, heterogeneous collection of cancer patients.
It can be used to simulate patient outcome data linked to hit classes.
CancerModel(name, nPossible, nPattern, HIT = function(n) 5, SURV = function(n) rnorm(n, 0, 0.3), OUT = function(n) rnorm(n, 0, 0.3), survivalModel=NULL, prevalence=NULL) nPatterns(object) nPossibleHits(object) nHitsPerPattern(object) outcomeCoefficients(object) survivalCoefficients(object) ## S4 method for signature 'CancerModel' ncol(x) ## S4 method for signature 'CancerModel' nrow(x) ## S4 method for signature 'CancerModel' rand(object, n, balance = FALSE, ...) ## S4 method for signature 'CancerModel' summary(object, ...)
CancerModel(name, nPossible, nPattern, HIT = function(n) 5, SURV = function(n) rnorm(n, 0, 0.3), OUT = function(n) rnorm(n, 0, 0.3), survivalModel=NULL, prevalence=NULL) nPatterns(object) nPossibleHits(object) nHitsPerPattern(object) outcomeCoefficients(object) survivalCoefficients(object) ## S4 method for signature 'CancerModel' ncol(x) ## S4 method for signature 'CancerModel' nrow(x) ## S4 method for signature 'CancerModel' rand(object, n, balance = FALSE, ...) ## S4 method for signature 'CancerModel' summary(object, ...)
name |
character string specifying name given to this model |
object , x
|
object of class |
nPossible |
integer scalar specifying number of potential hits relevant to the kind of cancer being modeled |
nPattern |
integer scalar specifying number of different cancer subtypes |
HIT |
function that generates non-negative integers from a discrete distribution. Used to determine the number of hits actually present in each cancer subtype. |
SURV |
function that generates real numbers from a continuous distributions. Used in simulations to select the coefficients associated with each hit in Cox proportional hazards models. |
OUT |
function that generates real numbers from a continuous distributions. Used in simulations to select the coefficients associated with each hit in logistic models of a binary outcome. |
survivalModel |
object of class |
prevalence |
optional numeric vector of relative prevalences of cancer subtypes |
n |
numeric scalar specifying quantity of random numbers |
balance |
logical scalar specifying how patients should be simulated |
... |
extra arguments for generic routines |
The rand
method is the most important method for objects of this
class. It returns a data frame with four columns: the
CancerSubType
(as an integer that indexes into the
hitPattern
slot of the object), a binary Outcome
that
takes on values "Bad"
or "Good"
, an LFU
column
with censored survival times, and a logical Event
column that
describes whether the simulated survival event has occurred.
The rand
method for the CancerModel
class adds an extra
logical parameter, balance
, to the signature specified by the
default method. If balance = FALSE
(the default), then
patients are simulated based on the prevalence
defined as part
of the model. If balance = TRUE
, then patients are simulated
with equal numbers in each hit pattern class, ordered by the hit
pattern class.
The CancerModel
function is used to contruct and return an object of
the CancerModel
class.
The ncol
and nrow
functions return integers with the size of
the matrix of hit patterns.
The rand
method returns data frame with four columns:
CancerSubType |
integer index into object's 'hitPattern' slot |
Outcome |
outcomes with values "Bad" or "Good" |
LFU |
censored survival times |
Event |
has simulated survival event has occurred? |
Although objects of the class can be created by a direct call to
new, the preferred method is to use the
CancerModel
generator function.
name
:Object of class "character"
hitPattern
:Object of class "matrix"
survivalBeta
:Object of class "numeric"
containing the coeffieicents associated with each hit in a
Cox proportional hazards model of survival.
outcomeBeta
:Object of class "numeric"
containing the coefficients associated with each hit in a logistic
model to predict a binary outcome.
prevalence
:Object of class "numeric"
containing the prevalence of each cancer subtype.
survivalModel
:Object of class "survivalModel"
containing parameters used to simualte survival times.
call
:object of class "call"
recording the
function call used to initialize the object.
signature(x = "CancerModel")
: ...
signature(x = "CancerModel")
: ...
signature(object = "CancerModel")
: ...
signature(object = "CancerModel")
: ...
Kevin R. Coombes [email protected],
Zhang J, Coombes KR.
Sources of variation in false discovery rate estimation include
sample size, correlation, and inherent differences between groups.
BMC Bioinformatics. 2012; 13 Suppl 13:S1.
showClass("CancerModel") set.seed(391629) # set up survival outcome; baseline is exponential sm <- SurvivalModel(baseHazard=1/5, accrual=5, followUp=1) # now build a CancerModel with 6 subtypes cm <- CancerModel(name="cansim", nPossible=20, nPattern=6, OUT = function(n) rnorm(n, 0, 1), SURV= function(n) rnorm(n, 0, 1), survivalModel=sm) # simulate 100 patients clinical <- rand(cm, 100) summary(clinical)
showClass("CancerModel") set.seed(391629) # set up survival outcome; baseline is exponential sm <- SurvivalModel(baseHazard=1/5, accrual=5, followUp=1) # now build a CancerModel with 6 subtypes cm <- CancerModel(name="cansim", nPossible=20, nPattern=6, OUT = function(n) rnorm(n, 0, 1), SURV= function(n) rnorm(n, 0, 1), survivalModel=sm) # simulate 100 patients clinical <- rand(cm, 100) summary(clinical)
A ClinicalNoiseModel
represents the additional human and measurement
noise that is layered on top of any biological variabilty when measuring
clinical variables.
ClinicalNoiseModel(nFeatures, shape = 1.02, scale = 0.05/shape)
ClinicalNoiseModel(nFeatures, shape = 1.02, scale = 0.05/shape)
nFeatures |
An integer; the number of additive scale parameters to sample from the gamma distribution. |
shape |
The |
scale |
The |
We model both additive and multiplicative noise, so that the observed
expression of clinical variable c in sample i is given by:
, where Y_ci = observed expression,
S_ci = true biological signal.
In the ClinicalNoiseModel (as opposed to the
NoiseModel
),
we model the additive noise as ,
without multiplicative noise or an additive bias/offset in the clinical model.
The standard deviation hyperparameters of the additive noise tau
is modeled by the gamma distribution
An object of class NoiseModel
.
Kevin R. Coombes [email protected], Caitlin E. Coombes [email protected]
showClass("NoiseModel") ## generate a ClinicalEngine with 20 features and 4 clusters ce <- ClinicalEngine(20, 4, TRUE) ## generate 300 simulated patients set.seed(194718) dset <- rand(ce, 300) cnm <- ClinicalNoiseModel(nrow(ce@localenv$eng), shape=2, scale=0.1) cnm noisy <- blur(cnm, dset$data) hist(noisy)
showClass("NoiseModel") ## generate a ClinicalEngine with 20 features and 4 clusters ce <- ClinicalEngine(20, 4, TRUE) ## generate 300 simulated patients set.seed(194718) dset <- rand(ce, 300) cnm <- ClinicalNoiseModel(nrow(ce@localenv$eng), shape=2, scale=0.1) cnm noisy <- blur(cnm, dset$data) hist(noisy)
The Engine
class is a tool (i.e., an algorithm) used to simulate
vectors of gene expression from some underlying distribution.
Engine(components) ## S4 method for signature 'Engine' nComponents(object, ...) ## S4 method for signature 'Engine' alterMean(object, TRANSFORM, ...) ## S4 method for signature 'Engine' alterSD(object, TRANSFORM, ...) ## S4 method for signature 'Engine' nrow(x) ## S4 method for signature 'Engine' rand(object, n, ...) ## S4 method for signature 'Engine' summary(object, ...)
Engine(components) ## S4 method for signature 'Engine' nComponents(object, ...) ## S4 method for signature 'Engine' alterMean(object, TRANSFORM, ...) ## S4 method for signature 'Engine' alterSD(object, TRANSFORM, ...) ## S4 method for signature 'Engine' nrow(x) ## S4 method for signature 'Engine' rand(object, n, ...) ## S4 method for signature 'Engine' summary(object, ...)
components |
object of class |
object , x
|
object of class |
TRANSFORM |
function takes as its input a vector of mean expression or standard deviation and returns a transformed vector that can be used to alter the appropriate slot of the object. |
n |
numeric scalar representing number of samples to be simulated |
... |
extra arguments for generic or plotting routines |
In most cases, an engine object is an instantiation of a more general family or class that we call an ABSTRACT ENGINE. Every abstract engine is an ordered list of components, which can also be thought of as an engine with parameters. We instantiate an engine by binding all the free parameters of an abstract engine to actual values. Note that partial binding (of a subset of the free parameters) produces another abstract engine.
For all practical purposes, a COMPONENT should be viewed as an irreducible abstract engine. Every component must have an IDENTIFIER that is unique within the context of its enclosing abstract engine. The identifer may be implicitly taken to be the position of the component in the ordered list.
We interpret an Engine
as the gene expression generator for a
homogenous population; effects of cancer on gene expression are modeled
at a higher level.
The Engine
generator returns an object of class Engine
.
The alterMean
method returns an object of class Engine
with
altered mean.
The alterSD
method returns an object of class Engine
with
altered standard deviation.
The nrow
method returns the number of genes (i.e, the length of the
vector) the Engine
object will generate.
The rand
method returns matrix representing the
expressions of
nrow(Engine)
genes and n
samples.
The summary
method prints out the number of components included
in the Engine
object.
The nComponents
method returns the number of components in the
Engine
object.
Objects can be created by calls of the form new("Engine",
components=components)
, or use the Engine
generator function.
Every engine is an ordered list of components, which generates a contiguous
subvector of the total vector of gene expression.
Takes an object of class
Engine
, loops over the components in the Engine
, alters
the mean as defined by TRANSFORM
function, and returns a modified
object of class Engine
.
Takes an object of class
Engine
, loops over the components in the Engine
, alters
the standard deviation as defined by TRANSFORM
function, and
returns a modified object of class Engine
.
Counts the total number of genes (i.e, the
length of the vector the Engine
will generate).
Generates matrix
representing gene expressions of
n
samples following the
underlying distribution captured in the object of Engine
.
Prints out the number of components included
in the object of Engine
.
Kevin R. Coombes [email protected], Jiexin Zhang [email protected],
Zhang J, Coombes KR.
Sources of variation in false discovery rate estimation include
sample size, correlation, and inherent differences between groups.
BMC Bioinformatics. 2012; 13 Suppl 13:S1.
showClass("Engine") nComp <- 10 nGenes <- 100 comp <- list() for (i in 1:nComp) { comp[[i]] <- IndependentNormal(rnorm(nGenes/nComp, 6, 1.5), 1/rgamma(nGenes/nComp, 44, 28)) } myEngine <- Engine(comp) nrow(myEngine) nComponents(myEngine) summary(myEngine) myData <- rand(myEngine, 5) dim(myData) summary(myData) OFFSET <- 2 myEngine.alterMean <- alterMean(myEngine, function(x){x+OFFSET}) myData.alterMean <- rand(myEngine.alterMean, 5) summary(myData.alterMean) SCALE <- 2 myEngine.alterSD <- alterSD(myEngine, function(x){x*SCALE}) myData.alterSD <- rand(myEngine.alterSD, 5) summary(myData.alterSD)
showClass("Engine") nComp <- 10 nGenes <- 100 comp <- list() for (i in 1:nComp) { comp[[i]] <- IndependentNormal(rnorm(nGenes/nComp, 6, 1.5), 1/rgamma(nGenes/nComp, 44, 28)) } myEngine <- Engine(comp) nrow(myEngine) nComponents(myEngine) summary(myEngine) myData <- rand(myEngine, 5) dim(myData) summary(myData) OFFSET <- 2 myEngine.alterMean <- alterMean(myEngine, function(x){x+OFFSET}) myData.alterMean <- rand(myEngine.alterMean, 5) summary(myData.alterMean) SCALE <- 2 myEngine.alterSD <- alterSD(myEngine, function(x){x*SCALE}) myData.alterSD <- rand(myEngine.alterSD, 5) summary(myData.alterSD)
The EngineWithActivity
is used to set some components in the object
of class Engine
to be transcriptionally inactive and transform the
expression data to appropriate logarithmic scale.
EngineWithActivity(active, components, base=2) ## S4 method for signature 'EngineWithActivity' rand(object, n, ...) ## S4 method for signature 'EngineWithActivity' summary(object, ...)
EngineWithActivity(active, components, base=2) ## S4 method for signature 'EngineWithActivity' rand(object, n, ...) ## S4 method for signature 'EngineWithActivity' summary(object, ...)
active |
logical vector with length equal to number of components specifying whether each component should be transcriptionally active, or a numeric scalar specifying the probability for a component to be active |
components |
list where each element contains the parameters for the underlying distribution that the gene expression follows |
base |
numeric scalar specifying the logarithmic scale to which the data should be transformed |
object |
object of class |
n |
number of samples to be simulated |
... |
extra arguments for generic routines |
An ENGINE WITH ACTIVITY allows for the possibility that some components (or genes) in an expression engine (or tissue) might be transcriptionally inactive. Thus, the true biological signal S_gi should really be viewed as a mixture:
where delta_0 = a point mass at zero; T_gi = a random variable supported on the positive real line; z_g ~ Binom(pi) defines the activity state (1 = on, 0 = off)
The rand
method for an EngineWithActivity is a little bit
tricky, since we do two things at once. First, we use the
base
slot to exponentiate the random variables generated by
the underlying Engine on the log scale. We treat base = 0
as
a special case, which means that we should continue to work on
the scale of the Engine. Second, we mask any inactive component
by replacing the generated values with 0
.
Note that this is terribly inefficient if we only have a single homogeneous population, since we generate a certain amount of data only to throw it away. The power comes when we allow cancer disregulation to turn a block on or off, when the underlying data reappears.
The EngineWithActivity
generator returns an object of class
EngineWithActivity
.
The rand
method returns gene
expression matrix with the inactive components being masked by
0
.
The summary
method prints out the total number of components and
the number of active components in the object of EngineWithActivity
.
Although objects of the class can be created by a direct call to
new, the preferred method is to use the
EngineWithActivity
generator function.
active
:logical vector specifying whether each component should be transcriptionally active or not
base
:numeric scalar specifying the logarithmic scale
components
:list specifying the parameters of the underlying distribution
Class Engine
, directly.
Generates nrow(EngineWithActivity)*n matrix
representing gene expressions of n
samples, and the
transcriptionally inactive components are masked by 0
.
Prints out the total number of
components and the number of active components in the object
of EngineWithActivity
.
Kevin R. Coombes [email protected], Jiexin Zhang [email protected],
showClass("EngineWithActivity") nComponents <- 10 nGenes <- 100 active <- 0.7 comp <- list() for (i in 1:nComponents) { comp[[i]] <- IndependentNormal(rnorm(nGenes/nComponents, 6, 1.5), 1/rgamma(nGenes/nComponents, 44, 28)) } myEngine <- EngineWithActivity(active, comp, 2) summary(myEngine) myData <- rand(myEngine, 5) dim(myData)
showClass("EngineWithActivity") nComponents <- 10 nGenes <- 100 active <- 0.7 comp <- list() for (i in 1:nComponents) { comp[[i]] <- IndependentNormal(rnorm(nGenes/nComponents, 6, 1.5), 1/rgamma(nGenes/nComponents, 44, 28)) } myEngine <- EngineWithActivity(active, comp, 2) summary(myEngine) myData <- rand(myEngine, 5) dim(myData)
The IndependentLogNormal
class is a tool used to generate gene
expressions that follow log normal distribution, because the true expression
value follows log normal in our model.
IndependentLogNormal(logmu,logsigma) ## S4 method for signature 'IndependentLogNormal' alterMean(object, TRANSFORM, ...) ## S4 method for signature 'IndependentLogNormal' alterSD(object, TRANSFORM, ...) ## S4 method for signature 'IndependentLogNormal' nrow(x) ## S4 method for signature 'IndependentLogNormal' rand(object, n, ...) ## S4 method for signature 'IndependentLogNormal' summary(object, ...)
IndependentLogNormal(logmu,logsigma) ## S4 method for signature 'IndependentLogNormal' alterMean(object, TRANSFORM, ...) ## S4 method for signature 'IndependentLogNormal' alterSD(object, TRANSFORM, ...) ## S4 method for signature 'IndependentLogNormal' nrow(x) ## S4 method for signature 'IndependentLogNormal' rand(object, n, ...) ## S4 method for signature 'IndependentLogNormal' summary(object, ...)
logmu |
numeric vector specifying the mean expression values on the logarithmic scale. |
logsigma |
numeric vector specifying the standard deviation of the gene expression values on the logarithmic scale |
object , x
|
object of class |
TRANSFORM |
function that takes a vector of mean expression or standard deviation and returns a transformed vector that can be used to alter the appropriate slot of the object. |
n |
numeric scalar specifying number of samples to be simulated |
... |
extra arguments for generic or plotting routines |
Although objects of the class can be created by a direct call to
new, the preferred method is to use the
IndependentLogNormal
generator function.
logmu
:numeric vector containing the mean expression values on the logarithmic scale
logsigma
:numeric vector containing the standard deviation of the gene expression values on the logarithmic scale
Returns the number of genes (i.e, the length of the
logmu
vector).
Generates nrow(IndependentLogNormal)*n matrix
representing gene expressions of n samples following log normal
distribution captured in the object of IndependentLogNormal
.
Prints out the number of independent log
normal random variables in the object of IndependentLogNormal
.
Kevin R. Coombes [email protected], Jiexin Zhang [email protected],
Engine
,
IndependentNormal
,
MVN
showClass("IndependentLogNormal") nGenes <- 20 logmu <- rnorm(nGenes, 6, 1) logsigma <- 1/rgamma(nGenes, rate=14, shape=6) ln <- IndependentLogNormal(logmu, logsigma) nrow(ln) summary(ln) if (any(logmu - ln@logmu)) { print('means do not match') } else { print('means verified') } if (any(logsigma - ln@logsigma)) { print('standard deviations do not match') } else { print('sd verified') } x <- rand(ln, 1000) print(dim(x)) print(paste("'ln' should be valid:", validObject(ln))) ln@logsigma <- 1:3 # now we break it print(paste("'ln' should not be valid:", validObject(ln, test=TRUE))) tmp.sd <- sqrt(apply(log(x), 1, var)) plot(tmp.sd, logsigma) tmp.mu <- apply(log(x), 1, mean) plot(tmp.mu, logmu) rm(nGenes, logmu, logsigma, ln, x, tmp.mu, tmp.sd)
showClass("IndependentLogNormal") nGenes <- 20 logmu <- rnorm(nGenes, 6, 1) logsigma <- 1/rgamma(nGenes, rate=14, shape=6) ln <- IndependentLogNormal(logmu, logsigma) nrow(ln) summary(ln) if (any(logmu - ln@logmu)) { print('means do not match') } else { print('means verified') } if (any(logsigma - ln@logsigma)) { print('standard deviations do not match') } else { print('sd verified') } x <- rand(ln, 1000) print(dim(x)) print(paste("'ln' should be valid:", validObject(ln))) ln@logsigma <- 1:3 # now we break it print(paste("'ln' should not be valid:", validObject(ln, test=TRUE))) tmp.sd <- sqrt(apply(log(x), 1, var)) plot(tmp.sd, logsigma) tmp.mu <- apply(log(x), 1, mean) plot(tmp.mu, logmu) rm(nGenes, logmu, logsigma, ln, x, tmp.mu, tmp.sd)
The IndependentNormal
class is a tool used to generate gene
expressions that follow independent normal distribution.
IndependentNormal(mu,sigma) ## S4 method for signature 'IndependentNormal' alterMean(object, TRANSFORM, ...) ## S4 method for signature 'IndependentNormal' alterSD(object, TRANSFORM, ...) ## S4 method for signature 'IndependentNormal' nrow(x) ## S4 method for signature 'IndependentNormal' rand(object, n, ...) ## S4 method for signature 'IndependentNormal' summary(object, ...)
IndependentNormal(mu,sigma) ## S4 method for signature 'IndependentNormal' alterMean(object, TRANSFORM, ...) ## S4 method for signature 'IndependentNormal' alterSD(object, TRANSFORM, ...) ## S4 method for signature 'IndependentNormal' nrow(x) ## S4 method for signature 'IndependentNormal' rand(object, n, ...) ## S4 method for signature 'IndependentNormal' summary(object, ...)
mu |
numeric vector specifying the mean expression values |
sigma |
numeric vector specifying the standard deviation of the gene expression values |
object , x
|
object of class |
TRANSFORM |
function that takes a vector of mean expression or standard deviation and returns a transformed vector that can be used to alter the appropriate slot of the object. |
n |
numeric scalar specifying number of samples to be simulated |
... |
extra arguments for generic or plotting routines |
Note that we typically work on expression value with its logarithm to some appropriate base. That is, the independent normal should be used on the logarithmic scale in order to construct the engine.
Objects can be created by using the IndependentNormal
generator
function. The object of class IndependentNormal
contains the mean
and standard deviation for the normal distribution
mu
:see corresponding argument above
sigma
:see corresponding argument above
Takes an object of class
IndependentNormal
, loops over the mu
slot, alters
the mean as defined by TRANSFORM
function, and returns an
object of class IndependentNormal
with altered mu
.
Takes an object of class
IndependentNormal
, loops over the sigma
slot, alters
the standard deviation as defined by TRANSFORM
function, and
returns an object of class IndependentNormal
with altered
sigma
.
Returns the number of genes (i.e, the length of the
mu
vector).
Generates
matrix representing gene expressions of
n
samples following the
normal distribution captured in the object of IndependentNormal
.
Prints out the number of independent normal
random variables in the object of IndependentNormal
.
Kevin R. Coombes [email protected], Jiexin Zhang [email protected],
Engine
,
IndependentLogNormal
,
MVN
showClass("IndependentNormal") nGenes <- 20 mu <- rnorm(nGenes, 6, 1) sigma <- 1/rgamma(nGenes, rate=14, shape=6) ind <- IndependentNormal(mu, sigma) nrow(ind) summary(ind) if (any(mu - ind@mu)) { print('means do not match') } else { print('means verified') } if (any(sigma - ind@sigma)) { print('standard deviations do not match') } else { print('sd verified') } x <- rand(ind, 3) print(dim(x)) print(summary(x)) print(paste("'ind' should be valid:", validObject(ind))) ind@sigma <- 1:3 # now we break it print(paste("'ind' should not be valid:", validObject(ind, test=TRUE))) rm(nGenes, mu, sigma, ind, x)
showClass("IndependentNormal") nGenes <- 20 mu <- rnorm(nGenes, 6, 1) sigma <- 1/rgamma(nGenes, rate=14, shape=6) ind <- IndependentNormal(mu, sigma) nrow(ind) summary(ind) if (any(mu - ind@mu)) { print('means do not match') } else { print('means verified') } if (any(sigma - ind@sigma)) { print('standard deviations do not match') } else { print('sd verified') } x <- rand(ind, 3) print(dim(x)) print(summary(x)) print(paste("'ind' should be valid:", validObject(ind))) ind@sigma <- 1:3 # now we break it print(paste("'ind' should not be valid:", validObject(ind, test=TRUE))) rm(nGenes, mu, sigma, ind, x)
The makeDataTypes function allows the user to discretize a continuous data set generated from an engine of any type into binary, nominal, ordinal, or mixed-type data.
makeDataTypes(dataset, pCont, pBin, pCat, pNominal = 0.5, range = c(3, 9), exact = FALSE, inputRowsAreFeatures = TRUE) getDataTypes(object) getDaisyTypes(object)
makeDataTypes(dataset, pCont, pBin, pCat, pNominal = 0.5, range = c(3, 9), exact = FALSE, inputRowsAreFeatures = TRUE) getDataTypes(object) getDaisyTypes(object)
dataset |
A matrix or dataframe of continuous data. |
pCont |
Non-negative, numeric probability equal to or between 0 and 1 describing desired percent frequency of continuous features. |
pBin |
Non-negative, numeric probability equal to or between 0 and 1 describing desired percent frequency of binary features. |
pCat |
Non-negative, numeric probability equal to or between 0 and 1 describing desired percent frequency of categorical features. |
pNominal |
Non-negative, numeric probability equal to or between 0 and 1 describing desired percent frequency of categorical features to be simulated as nominal. |
range |
A set of integers whose minimum and maximum determine the range of the number of levels to be associated with simulated categorical factors. |
exact |
A logical value; should the parameters |
inputRowsAreFeatures |
Logical value indicating if features are to be simulated as rows (TRUE) or columns (FALSE). |
object |
An object of the |
The makeDataTypes
function is a critical step in the
construction of a MixedTypeEngine
, which is an object
that is used to simulate clinical mixed-type data instead of gene
expression data. The standard process, as illustrated in the example,
involves (1) creating a ClinicalEngine
, (2) generating a
random data set from that engine, (3) adding noise, (4) setting the
data types, and finally (5) creating the MixedTypeEngine
.
The main types of data (continuous, binary, or categorical) are
randomly assigned using the probability parameters pCont
,
pBin
, and pCat
. To choose the splitting point for the
binary features, we compute the bimodality index (Wang et al.). If that
is significant, we split the data halfway between the two
modes. Otherwise, we choose a random split point between 5% and
35%. Categorical data is also randomly assigned to ordinal or nominal
based on the pNominal
parameter. The number of levels is
uniformly selected in the specified range
, and the fraction of
data points assigned to each level is selected from a Dirichlet
distribution with alpha = c(20, 20, ..., 20).
A list containing:
binned |
An object of class data.frame of simulated, mixed-type data. |
cutpoints |
An object of class "list". For each feature, data type, labels (if data are
binary or categorical), and break points (if data are binary or categorical)
are stored. Cutpoints can be passed to a |
The getDataTypes
function returns a character vector listing
the type of data for each feature in a MixedTypeEngine
. the
getDaisyTypes
function converts this vector to a list suitable
for input to the daisy
function in the cluster
package.
If pCont
, pBin
, and pCat
do not sum to 1, they will be rescaled.
By default, categorical variables are simulated as a mixture of nominal
and ordinal data. The remaining categorical features described by
pNominal
are simulated as ordinal.
Kevin R. Coombes [email protected], Caitlin E. Coombes [email protected]
Wang J, Wen S, Symmans WF, Pusztai L, Coombes KR.
The bimodality index: A criterion for discovering and ranking bimodal signatures from cancer gene expression profiling data.
Cancer Inform 2009; 7:199-216.
## Create a ClinicalEngine with 4 clusters ce <- ClinicalEngine(20, 4, TRUE) ## Generate continuous data set.seed(194718) dset <- rand(ce, 300) ## Add noise before binning mixed type data cnm <- ClinicalNoiseModel(nrow(ce@localenv$eng)) # default noisy <- blur(cnm, dset$data) ## Set the data mixture dt <- makeDataTypes(dset$data, 1/3, 1/3, 1/3, 0.3) cp <- dt$cutpoints type <- sapply(cp, function(X) { X$Type }) table(type) summary(dt$binned) ## Use the pieces from above to create an MTE. mte <- MixedTypeEngine(ce, noise = cnm, cutpoints = dt$cutpoints) ## and generate some data with the same data types and cutpoints R <- rand(mte, 20) summary(R)
## Create a ClinicalEngine with 4 clusters ce <- ClinicalEngine(20, 4, TRUE) ## Generate continuous data set.seed(194718) dset <- rand(ce, 300) ## Add noise before binning mixed type data cnm <- ClinicalNoiseModel(nrow(ce@localenv$eng)) # default noisy <- blur(cnm, dset$data) ## Set the data mixture dt <- makeDataTypes(dset$data, 1/3, 1/3, 1/3, 0.3) cp <- dt$cutpoints type <- sapply(cp, function(X) { X$Type }) table(type) summary(dt$binned) ## Use the pieces from above to create an MTE. mte <- MixedTypeEngine(ce, noise = cnm, cutpoints = dt$cutpoints) ## and generate some data with the same data types and cutpoints R <- rand(mte, 20) summary(R)
A MixedTypeEngine combines a ClinicalEngine (which defines the combinatorics
of hits and block hyperparameters that determine cluster identities and behavior),
a stored ClinicalNoiseModel
, and cutpoints for generating mixed type
data generated by makeDataTypes
into an object that can be used to
re-generate downstream datasets with shared parameters.
MixedTypeEngine(ce, noise, cutpoints) ## S4 method for signature 'MixedTypeEngine' rand(object, n, keepall = FALSE, ...) ## S4 method for signature 'MixedTypeEngine' summary(object, ...)
MixedTypeEngine(ce, noise, cutpoints) ## S4 method for signature 'MixedTypeEngine' rand(object, n, keepall = FALSE, ...) ## S4 method for signature 'MixedTypeEngine' summary(object, ...)
ce |
Object of class |
noise |
Object of class |
cutpoints |
a list with the properties of the |
object |
object of class |
n |
a non-negative integer |
keepall |
a logical value |
... |
additional arguments for generic functions. |
The MixedTypeEngine is a device for a parameter set used to generate a simulated set of clinical data which can be used to store these parameters and to generate related datasets downstream. Building a MixedTypeEngine requires many parameters. You can supply these parameters in mutliple steps:
Construct a ClinicalEngine
.
Contruct a ClinicalNoiseModel
.
Use rand
rand to generate a "raw" data set from the
ClinicalEngine
.
Use blur
to add noise to the raw data.
Feed the noisy data into makeDataTypes
to generate a
mixed-type dataset, with cut points.
Pass the ClinicalEngine
, ClinicalNoiseModel
, and
cutpoints into the MixedTypeEngine
constructor.
The alternative method is to pass the parameters for Steps 1, 2, and 5
directly into the MixedTypeEngine
directly, as lists, and it will
carry out steps 3-5 automatically. Note, however, that instead of
passing a dataset
to be used by the makeDataTypes
function,
you instead set the value of N
to the desired number of patients
used during construction. Also, if you use the explicit steps, you
can save the intermediate data sets that are generated. If you simply
pass all of the parameters to the constructor, those intermediate data
sets are discarded, and you must generate a new data set using
rand
.
Objects can be created by a direct call to
new, though using the constructor
MixedTypeEngine
is preferred.
Generates matrix
representing clinical features of
n
patients following the
underlying distribution, noise, and data discretization pattern
captured in the object of MixedTypeEngine
. If keepall
== TRUE
, it reurns a list containing a data frame named
clinical
and three data matrices called raw
,
noisy
, and binned
. If keepall == FALSE
, then
noly the clinical
and binned
components are returned.
Prints a summary of the object.
Kevin R. Coombes [email protected], Caitlin E. Coombes [email protected]
Engine
CancerModel
CancerEngine
ClinicalNoiseModel
makeDataTypes
## Generate a Clinical Engine of continuous data ## with clusters generated from variation on the base CancerEngine ce <- ClinicalEngine(20, 4, TRUE) summary(ce) ## Generate an initial data set set.seed(194718) dset <- rand(ce, 300) class(dset) names(dset) summary(dset$clinical) dim(dset$data) ## Add noise before binning mixed type data cnm <- ClinicalNoiseModel(nrow(ce@localenv$eng)) # default noisy <- blur(cnm, dset$data) ## Set the data mixture dt <- makeDataTypes(dset$data, 1/3, 1/3, 1/3, 0.3) ## Store the cutpoints cp <- dt$cutpoints ## Use the pieces from above to create an MTE. mte <- MixedTypeEngine(ce, noise = cnm, cutpoints = dt$cutpoints) ## Use the MTE rand method to generate ## multiple data sets with the same parameters R <- rand(mte, 20) summary(R) S <- rand(mte, 20) summary(S)
## Generate a Clinical Engine of continuous data ## with clusters generated from variation on the base CancerEngine ce <- ClinicalEngine(20, 4, TRUE) summary(ce) ## Generate an initial data set set.seed(194718) dset <- rand(ce, 300) class(dset) names(dset) summary(dset$clinical) dim(dset$data) ## Add noise before binning mixed type data cnm <- ClinicalNoiseModel(nrow(ce@localenv$eng)) # default noisy <- blur(cnm, dset$data) ## Set the data mixture dt <- makeDataTypes(dset$data, 1/3, 1/3, 1/3, 0.3) ## Store the cutpoints cp <- dt$cutpoints ## Use the pieces from above to create an MTE. mte <- MixedTypeEngine(ce, noise = cnm, cutpoints = dt$cutpoints) ## Use the MTE rand method to generate ## multiple data sets with the same parameters R <- rand(mte, 20) summary(R) S <- rand(mte, 20) summary(S)
The MVN
class is a tool used to generate gene expressions that
follow multivariate normal distribution.
MVN(mu, Sigma, tol = 1e-06) covar(object) correl(object) ## S4 method for signature 'MVN' alterMean(object, TRANSFORM, ...) ## S4 method for signature 'MVN' alterSD(object, TRANSFORM, ...) ## S4 method for signature 'MVN' nrow(x) ## S4 method for signature 'MVN' rand(object, n, ...) ## S4 method for signature 'MVN' summary(object, ...)
MVN(mu, Sigma, tol = 1e-06) covar(object) correl(object) ## S4 method for signature 'MVN' alterMean(object, TRANSFORM, ...) ## S4 method for signature 'MVN' alterSD(object, TRANSFORM, ...) ## S4 method for signature 'MVN' nrow(x) ## S4 method for signature 'MVN' rand(object, n, ...) ## S4 method for signature 'MVN' summary(object, ...)
mu |
numeric vector representing k-dimensional means |
Sigma |
numeric k-by-k covariance matrix containing the measurement of the linear coupling between every pair of random vectors |
tol |
numeric scalar roundoff error that will be tolerated when assessing the singularity of the covariance matrix |
object , x
|
object of class |
TRANSFORM |
function that takes a vector of mean expression or standard deviation and returns a transformed vector that can be used to alter the appropriate slot of the object. |
n |
numeric scalar representing number of samples to be simulated |
... |
extra arguments for generic or plotting routines |
The implementation of MVN
class is designed for efficiency when
generating new samples, since we expect to do this several times.
Basically, this class separates the mvrnorm
function from the
MASS package into several steps. The computationally expensive step
(when the dimension is large) is the eigenvector decomposition of the
covariance matrix. This step is performed at construction and the
pieces are stored in the object. The rand
method for MVN
objects contains the second half of the mvrnorm
function.
Note that we typically work on expression values after taking the logarithm to some appropriate base. That is, the multivariate normal should be used on the logarithmic scale in order to contruct an engine.
The alterMean
method for an MVN
simply replaces the appropriate slot by
the transformed vector. The alterSD
method for an MVN
is trickier,
because of the way the data is stored. In order to have some hope of getting
this correct, we work in the space of the covariance matrix, Sigma.
If we let R denote the correlation matrix and let Delta be the
diagonal matrix whose entries are the individual standard deviations, then
.
So, we can change the standard deviations by replacing Delta in this
product. We then construct a new
MVN
object with the old mean vector
and the new covariance matrix.
The covar
and correl
functions, respectively, calculate
the covariance matrix and correlation matrix that underly the
covariance matrix for the objects of class MVN
. We have four
assertions as shown below, which are tested in the examples section:
covar
should return the same matrix that was used
in the function call to construct the MVN
object.
After applying an alterMean
method, the
covariance matrix is unchanged.
The diagonal of the correlation matrix consists of all ones.
After applying an alterMean
or an alterSD
method, the
correlation matrix is unchanged.
Although objects of the class can be created by a direct call to
new, the preferred method is to use the
MVN
generator function.
mu
:numeric vector containing the k-dimensional means
lambda
:numeric vector containing the square roots of the eigenvalues of the covariance matrix
half
:numeric matrix with dimensions whose
columns contain the eigenvectors of the covariance matrix
Takes an object of class
MVN
, loops over the mu
slot, alters the mean as defined
by TRANSFORM function, and returns an object of class MVN
with
altered mu
.
Takes an object of class
MVN
, works on the diagonal matrix of the covariance matrix, alters
the standard deviation as defined by TRANSFORM function, and reconstructs
an object of class MVN
with the old mu
and
reconstructed covariance matrix.
Returns the number of genes (i.e, the length of the
mu
vector).
Generates matrix
representing gene expressions of
n
samples following the
multivariate normal distribution captured in the object of MVN
.
Prints out the number of
multivariate normal random variables in the object of MVN
.
Returns the covariance matrix of the object of
class MVN
.
Returns the correlation matrix of the object of
class MVN
.
Kevin R. Coombes [email protected], Jiexin Zhang [email protected],
showClass("MVN") ## Not run: tolerance <- 1e-10 ## Create a random orthogonal 2x2 matrix a <- runif(1) b <- sqrt(1-a^2) X <- matrix(c(a, b, -b, a), 2, 2) ## Now choose random positive squared-eigenvalues Lambda2 <- diag(rev(sort(rexp(2))), 2) ## Construct a covariance matrix Y <- t(X) ## Use the MVN constructor marvin <- MVN(c(0,0), Y) ## Check the four assertions print(paste('Tolerance for assertion checking:', tolerance)) print(paste('Covar assertion 1:', all(abs(covar(marvin) - Y) < tolerance))) mar2 <- alterMean(marvin, normalOffset, delta=3) print(paste('Covar assertion 2:', all(abs(covar(marvin) - covar(mar2)) < tolerance))) print(paste('Correl assertion 1:', all(abs(diag(correl(marvin)) - 1) < tolerance))) mar3 <- alterSD(marvin, function(x) 2*x) print(paste('Correl assertion 1:', all(abs(correl(marvin) - correl(mar2)) < tolerance))) rm(a, b, X, Lambda2, Y, marvin, mar2, mar3) ## End(Not run)
showClass("MVN") ## Not run: tolerance <- 1e-10 ## Create a random orthogonal 2x2 matrix a <- runif(1) b <- sqrt(1-a^2) X <- matrix(c(a, b, -b, a), 2, 2) ## Now choose random positive squared-eigenvalues Lambda2 <- diag(rev(sort(rexp(2))), 2) ## Construct a covariance matrix Y <- t(X) ## Use the MVN constructor marvin <- MVN(c(0,0), Y) ## Check the four assertions print(paste('Tolerance for assertion checking:', tolerance)) print(paste('Covar assertion 1:', all(abs(covar(marvin) - Y) < tolerance))) mar2 <- alterMean(marvin, normalOffset, delta=3) print(paste('Covar assertion 2:', all(abs(covar(marvin) - covar(mar2)) < tolerance))) print(paste('Correl assertion 1:', all(abs(diag(correl(marvin)) - 1) < tolerance))) mar3 <- alterSD(marvin, function(x) 2*x) print(paste('Correl assertion 1:', all(abs(correl(marvin) - correl(mar2)) < tolerance))) rm(a, b, X, Lambda2, Y, marvin, mar2, mar3) ## End(Not run)
nComponents
is a generic function, analogous to 'nrow' or
'ncol', that reports the number of components of an “engine“.
## S4 method for signature 'ANY' nComponents(object, ...)
## S4 method for signature 'ANY' nComponents(object, ...)
object |
an object from which the number of components is desired |
... |
additional arguments affecting the number of components produced |
Returns a nonnegative integer (scalar).
Kevin R. Coombes [email protected], Caitlin E. Coombes [email protected]
A NoiseModel
represents the additional machine noise that is layered
on top of any biological variabilty when measuring the gene expression in a
set of samples.
NoiseModel(nu, tau, phi) ## S4 method for signature 'NoiseModel' blur(object, x, ...) ## S4 method for signature 'NoiseModel' summary(object, ...)
NoiseModel(nu, tau, phi) ## S4 method for signature 'NoiseModel' blur(object, x, ...) ## S4 method for signature 'NoiseModel' summary(object, ...)
nu |
The mean value for the additive noise |
tau |
The standard deviation for the additive noise |
phi |
The standard deviation for the multiplicative noise. Note that
the mean of the multiplicative noise is set to |
object |
object of class |
x |
The data matrix containing true signal from the gene expression |
... |
extra arguments affecting the blur method applied |
We model both additive and multiplicative noise, so that the observed
expression of gene g in sample i is given by:
, where Y_gi = observed expression,
S_gi = true biological signal,
H_gi ~ N(0, phi) defines the multiplicative noise, and
E_gi ~ N(nu,tau) defines the additive noise.
Note that we allow a systematic offset/bias in the additive noise model.
Adds and multiplies random noise to the
data matrix x
containing the true signal from the gene expression.
Prints a summary of the object.
Kevin R. Coombes [email protected], Jiexin Zhang [email protected],
Zhang J, Coombes KR.
Sources of variation in false discovery rate estimation include
sample size, correlation, and inherent differences between groups.
BMC Bioinformatics. 2012; 13 Suppl 13:S1.
showClass("NoiseModel") nComp <- 10 nGenes <- 100 comp <- list() for (i in 1:nComp){ comp[[i]] <- IndependentLogNormal(rnorm(nGenes/nComp, 6, 1.5), 1/rgamma(nGenes/nComp, 44, 28)) } myEngine <- Engine(comp) myData <- rand(myEngine, 5) summary(myData) nu <- 10 tau <- 20 phi <- 0.1 nm <- NoiseModel(nu, tau, phi) realData <- blur(nm, myData) summary(realData)
showClass("NoiseModel") nComp <- 10 nGenes <- 100 comp <- list() for (i in 1:nComp){ comp[[i]] <- IndependentLogNormal(rnorm(nGenes/nComp, 6, 1.5), 1/rgamma(nGenes/nComp, 44, 28)) } myEngine <- Engine(comp) myData <- rand(myEngine, 5) summary(myData) nu <- 10 tau <- 20 phi <- 0.1 nm <- NoiseModel(nu, tau, phi) realData <- blur(nm, myData) summary(realData)
These functions are useful for simulating data that compares a homogeneous "cancer" group to a homogeneous "normal" group of samples.
NormalVsCancerModel(nBlocks, survivalModel=NULL, name="NormalVsCancer") NormalVsCancerEngine(nBlocks, hyperp)
NormalVsCancerModel(nBlocks, survivalModel=NULL, name="NormalVsCancer") NormalVsCancerEngine(nBlocks, hyperp)
nBlocks |
scalar integer representing number of correlated blocks that are differentially expressed between cancer and normal |
survivalModel |
a |
name |
character string specifying name of the object being created |
hyperp |
object of class |
The simplest simulation model assumes that we are comparing two homogeneous groups.
Kevin R. Coombes [email protected], Jiexin Zhang [email protected],
BlockHyperParameters
,
CancerEngine
,
CancerModel
nvc <- NormalVsCancerModel(10) summary(nvc) rand(nvc, 10) rand(nvc, 10, balance=TRUE) engine <- NormalVsCancerEngine(10) dset <- rand(engine, 10, balance=TRUE)
nvc <- NormalVsCancerModel(10) summary(nvc) rand(nvc, 10) rand(nvc, 10, balance=TRUE) engine <- NormalVsCancerEngine(10) dset <- rand(engine, 10, balance=TRUE)
rand
is a generic function used to produce random vectors from the
distribution defined by various objects. The generic function invokes particular
methods
which depend on the class
of the first
argument.
## S4 method for signature 'ANY' rand(object, n, ...)
## S4 method for signature 'ANY' rand(object, n, ...)
object |
an object from which random numbers from a distribution is desired |
n |
numeric scalar specifying quantity of random numbers |
... |
additional arguments affecting the random numbers produced |
The form of the value returned by rand
depends on the
class of its argument. See the documentation of the particular methods
for details of what is produced by that method.
Kevin R. Coombes [email protected],
A SurvivalModel
class represents the information for simulating
survival times for each patient.
SurvivalModel(baseHazard = 1/5, accrual = 5, followUp = 1, units = 12, unitName = "months") ## S4 method for signature 'SurvivalModel' rand(object, n, beta = NULL, ...)
SurvivalModel(baseHazard = 1/5, accrual = 5, followUp = 1, units = 12, unitName = "months") ## S4 method for signature 'SurvivalModel' rand(object, n, beta = NULL, ...)
baseHazard |
numeric scalar describing the underlying hazard rate at baseline levels of covariates |
accrual |
numeric scalar representing number of patient accrual years |
followUp |
numeric scalar representing length of follow up, in years |
units |
numeric scalar representing number of units per year where
units are specified by |
unitName |
character string representing the |
object |
object of class |
n |
numeric scalar specifying quantity of random numbers |
beta |
numeric vector specifying beta parameters for patients |
... |
extra arguments for generic routines |
The SurvivalModel
generator returns an object of class
SurvivalModel
.
The rand
method returns a data.frame with components:
LFU |
time to event |
Event |
whether the data is censored |
Although objects of the class can be created by a direct call to
new, the preferred method is to use the
SurvivalModel
generator function.
baseHazard
:see corresponding argument above
accrual
:see corresponding argument above
followUp
:see corresponding argument above
units
:see corresponding argument above
unitName
:see corresponding argument above
Simulate survival data for n
patients given beta
.
Kevin R. Coombes [email protected], Jiexin Zhang [email protected],
Zhang J, Coombes KR.
Sources of variation in false discovery rate estimation include
sample size, correlation, and inherent differences between groups.
BMC Bioinformatics. 2012; 13 Suppl 13:S1.
showClass("SurvivalModel") sm <- SurvivalModel() ## Generate data from base model outcome <- rand(sm, 100) summary(outcome) ## Generate data from five classes with different beta parameters beta <- rep(rnorm(5, 0, 2), each = 20) outcome <- rand(sm, 100, beta = beta) summary(outcome)
showClass("SurvivalModel") sm <- SurvivalModel() ## Generate data from base model outcome <- rand(sm, 100) summary(outcome) ## Generate data from five classes with different beta parameters beta <- rep(rnorm(5, 0, 2), each = 20) outcome <- rand(sm, 100, beta = beta) summary(outcome)
normalOffset
is a function that can be used as the
TRANSFORM
argument in an alterMean
operation, which adds
an offset to each value in the mean where the offset is chosen from a
normal distribution.
invGammaMultiple
is a function that can be used as the
TRANSFORM
argument in an alterSD
operation, which multiplies
each standard deviation by a positive value chosen from an inverse gamma
distribution with parameters shape
and scale
.
normalOffset(x, delta, sigma) invGammaMultiple(x, shape, rate)
normalOffset(x, delta, sigma) invGammaMultiple(x, shape, rate)
x |
numeric vector of mean expression or standard deviation defined in the object |
delta , sigma
|
numeric vector used as |
shape , rate
|
numeric vector used as |
normalOffset
returns a new vector, TO each element of which is added
aN offset chosen from a normal distribution with parameters mean
and sd
.
invGammaMultiple
returns a new vector, each element of which is
multiplied by a positive value chosen from an inverse gamma distribution
with parameters shape
and scale
.
Kevin R. Coombes [email protected], Jiexin Zhang [email protected],
nComp <- 10 nGenes <- 100 comp <- list() for (i in 1:nComp) { comp[[i]] <- IndependentNormal(rnorm(nGenes/nComp, 6, 1.5), 1/rgamma(nGenes/nComp, 44, 28)) } myEngine <- Engine(comp) nrow(myEngine) nComponents(myEngine) summary(myEngine) myData <- rand(myEngine, 5) dim(myData) summary(myData) MEAN <- 2 SD <- 2 myEngine.alterMean <- alterMean(myEngine, function(x) normalOffset(x, MEAN, SD)) myData.alterMean <- rand(myEngine.alterMean, 5) summary(myData.alterMean) RATE <- 1 SHAPE <- 2 myEngine.alterSD <- alterSD(myEngine, function(x) invGammaMultiple(x, SHAPE, RATE)) myData.alterSD <- rand(myEngine.alterSD, 5) summary(myData.alterSD)
nComp <- 10 nGenes <- 100 comp <- list() for (i in 1:nComp) { comp[[i]] <- IndependentNormal(rnorm(nGenes/nComp, 6, 1.5), 1/rgamma(nGenes/nComp, 44, 28)) } myEngine <- Engine(comp) nrow(myEngine) nComponents(myEngine) summary(myEngine) myData <- rand(myEngine, 5) dim(myData) summary(myData) MEAN <- 2 SD <- 2 myEngine.alterMean <- alterMean(myEngine, function(x) normalOffset(x, MEAN, SD)) myData.alterMean <- rand(myEngine.alterMean, 5) summary(myData.alterMean) RATE <- 1 SHAPE <- 2 myEngine.alterSD <- alterSD(myEngine, function(x) invGammaMultiple(x, SHAPE, RATE)) myData.alterSD <- rand(myEngine.alterSD, 5) summary(myData.alterSD)