Main function TriadSim
TriadSim
is the main function to perform the
simulations. The example function call below simulates genotype data for
1000 case-parent triads for 4 chromosomes (chromsomes 1, 8 17, 20) under
a genetic main effect scenario with a baseline disease prevalence of
P0=0.001 and genetic relative risks of 1.5 and 2 for carrying the first
and the second pathway respectively. This function call will write
output files in PLINK. The output file names and path to the directory
are given by the parameter “out.put.file” and the chromosome number.
Each set (.bim, .bed and .fam files) of PLINK files contain genotype
data for one chromosome for all simulated samples. The name of the file
is the concatenation of the value of the input parameter “out.put.file”
and chromosome number. For example, if “out.put.file” is set to be
“triad”, the names of the output files will be triad1, triad8, triad17
and triad20 for our example. See R package documentation for more
details.
library(TriadSim)
m.file <- file.path(system.file(package = "TriadSim"),'extdata/pop1_4chr_mom')
f.file <- file.path(system.file(package = "TriadSim"),'extdata/pop1_4chr_dad')
k.file <- file.path(system.file(package = "TriadSim"),'extdata/pop1_4chr_kid')
input.plink.file <- c(m.file, f.file, k.file)
TriadSim(input.plink.file, out.put.file=file.path(tempdir(),'triad'), fr.desire=0.05,pathways=list(1:4,5:8),n.ped=1000, N.brk=3, target.snp=NA,P0=0.001,is.OR=FALSE,risk.exposure= 1,risk.pathway.unexposed=c(1.5, 2), risk.pathway.exposed=c(1.5, 2), is.case=TRUE, e.fr=NA, pop1.frac=NA, P0.ratio=1,rcmb.rate, no_cores=1)
## [1] 21 118 121 140 155 168 218 383
## coercing object of mode numeric to SnpMatrix
## Writing FAM file to /tmp/RtmpK6ezNw/triad1.fam
## Writing extended MAP file to /tmp/RtmpK6ezNw/triad1.bim
## Writing BED file to /tmp/RtmpK6ezNw/triad1.bed (SNP-major mode)
## coercing object of mode numeric to SnpMatrix
## Writing FAM file to /tmp/RtmpK6ezNw/triad8.fam
## Writing extended MAP file to /tmp/RtmpK6ezNw/triad8.bim
## Writing BED file to /tmp/RtmpK6ezNw/triad8.bed (SNP-major mode)
## coercing object of mode numeric to SnpMatrix
## Writing FAM file to /tmp/RtmpK6ezNw/triad17.fam
## Writing extended MAP file to /tmp/RtmpK6ezNw/triad17.bim
## Writing BED file to /tmp/RtmpK6ezNw/triad17.bed (SNP-major mode)
## coercing object of mode numeric to SnpMatrix
## Writing FAM file to /tmp/RtmpK6ezNw/triad20.fam
## Writing extended MAP file to /tmp/RtmpK6ezNw/triad20.bim
## Writing BED file to /tmp/RtmpK6ezNw/triad20.bed (SNP-major mode)
The following call simulates a quantitative trait (by setting “qtl=T”). The function will create 4 sets of plink files, one for each chromosome.
TriadSim(input.plink.file, file.path(tempdir(),‘qtl’), fr.desire=0.3,pathways=list(1:4,5:8),n.ped=1000, N.brk=3, target.snp=NA,P0=0.001,is.OR=FALSE,risk.exposure= 1,risk.pathway.unexposed=c(0.5, 1), risk.pathway.exposed=c(0.5, 1), is.case=TRUE, e.fr=NA, pop1.frac=NA, P0.ratio=1,rcmb.rate, no_cores=1, qtl=T)
The following call simulates a scenario that involves gene-environment interaction. The relative risk for the exposure main effect is 1.2. The relative risks for carrying the first and second pathway SNPs are 1.5 and 2 respectively for the exposed individuals and are 1 and 1 for the unexposed individuals.
TriadSim(input.plink.file, file.path(tempdir(),‘gxe’), fr.desire=0.3,pathways=list(1:4,5:8),n.ped=1000, N.brk=3, target.snp=NA,P0=0.001,is.OR=FALSE,risk.exposure= 1.2,risk.pathway.unexposed=c(1,1), risk.pathway.exposed=c(1.5, 2), is.case=TRUE, e.fr=0.3, pop1.frac=NA, P0.ratio=1,rcmb.rate, no_cores=1, qtl=FALSE)
The following call simulates a stratified scenario that involves gene-environment interaction. The risk parameters are the same as the scenario above. The “input.plink.file”” is a list of two character vectors. Each vector contains three character strings giving the directory and basename of the PLINK files in one subpopulation.The subpopulations are equally sized (pop1.frac=0.5). The baseline disease prevalence (disease prevalence in the unexposed who carries 0 copy of the risk pathway) is 0.001 in the first subpopulation while that in the second subpopulation is 0.003 (0.001*3). The exposure prevalence in the two subpopulations are 0.1 and 0.3 respectively.
library(TriadSim) m.file <- file.path(system.file(package = “TriadSim”),‘extdata/pop1_4chr_mom’) f.file <- file.path(system.file(package = “TriadSim”),‘extdata/pop1_4chr_dad’) k.file <- file.path(system.file(package = “TriadSim”),‘extdata/pop1_4chr_kid’) m.file2 <- file.path(system.file(package = “TriadSim”),‘extdata/pop2_4chr_mom’) f.file2 <- file.path(system.file(package = “TriadSim”),‘extdata/pop2_4chr_dad’) k.file2 <- file.path(system.file(package = “TriadSim”),‘extdata/pop2_4chr_kid’) input.plink.file2 <- list(c(m.file, f.file, k.file),c(m.file2, f.file2, k.file2))
TriadSim(input.plink.file2, out.put.file=file.path(tempdir(),‘stratified’) , fr.desire=0.3,pathways=list(1:4,5:8),n.ped=1000, N.brk=3, target.snp=NA,P0=0.001,is.OR=FALSE,risk.exposure= 1.2,risk.pathway.unexposed=c(1,1), risk.pathway.exposed=c(1.5, 2), is.case=TRUE, e.fr=c(0.1, 0.3), pop1.frac=0.5, P0.ratio=3,rcmb.rate, no_cores=1)
To simulate case-control data the function needs to be called twice, calls to simulate cases (is.case=TRUE) and controls (is.case=FALSE) respectively. The script below calls the function to simulate 1000 cases and 1000 controls and writes genotypes of the cases and controls into seperate sets of PLINK files.
For example, use the following to create cases:
TriadSim(input.plink.file,file.path(tempdir(),‘case’) , fr.desire=0.05,pathways=list(1:4,5:8),n.ped=1000, N.brk=3, target.snp=NA,P0=0.001,is.OR=TRUE,risk.exposure= 1,risk.pathway.unexposed=c(1.5, 2), risk.pathway.exposed=c(1.5, 2), is.case=TRUE, e.fr=NA, pop1.frac=NA, P0.ratio=1,rcmb.rate, no_cores=1)
And use the following to create controls: TriadSim(input.plink.file, file.path(tempdir(),‘ctrl’), fr.desire=0.05,pathways=list(1:4,5:8),n.ped=1000, N.brk=3, target.snp=NA,P0=0.001,is.OR=TRUE,risk.exposure= 1,risk.pathway.unexposed=c(1.5, 2), risk.pathway.exposed=c(1.5, 2), is.case=FALSE, e.fr=NA, pop1.frac=NA, P0.ratio=1,rcmb.rate, no_cores=1)
Some additional details
The source data may contain genotyping errors that cause non-Mendelian inheritance patterns. For these non-Mendelian families, genotypes of the three individuals in the family will be set to missing at the corresponding SNPs. We assume nonpaternity and adoption have both been ruled out in QC for the source data.
This function requires at least two pathway SNPs, eithe two SNPs in the one pathway or two pathways each involving one SNP. If the users are interested in a single SNP scenario one can trick the function by setting the number of pathway to 2, each with a single SNP in the pathway but only the SNP in the first pathway carries a risk while that in the second pathway does not change risk. For example:
TriadSim(input.plink.file, file.path(tempdir(),‘singleSNP’), fr.desire=0.05,pathways=list(1,2),n.ped=1000, N.brk=3, target.snp=NA,P0=0.001,is.OR=FALSE,risk.exposure= 1,risk.pathway.unexposed=c(1.5, 1), risk.pathway.exposed=c(1.5, 1), is.case=TRUE, e.fr=NA, pop1.frac=NA, P0.ratio=1,rcmb.rate, no_cores=1)
##Facility functions
The following set of functions is provided in case users want to have more control over the simulation parameters. They are called by the main function to generate simulations. Users do not need to call them directly.
###pick_target.snp
Users can manually pick the target
SNPs in the pathway or use the facility function
pick_target.snp
to pick the set of target SNPs in the
pathway(s) based on a desired allele frequency. The example below uses
the example files that come with the package to select 8 SNPs with
allele frequencies close to 0.05. The function returns the selected
target SNPs by giving the row numbers (i.e., the order) of the
corresponding SNPs among all the SNPs in the associated “bim” file. For
example a return of “1084 2044 3285 4016 5117 6067 7077 8187” means the
SNPs at rows 1084 2044 3285 4016 5117 6067 7077 8187 are selected to be
the target SNPs in the pathway.
m.file <- file.path(system.file(package = "TriadSim"),'extdata/pop1_4chr_mom')
f.file <- file.path(system.file(package = "TriadSim"),'extdata/pop1_4chr_dad')
picked.target <- pick_target.snp(c(m.file,f.file),0.05, 8)
## [1] 21 118 121 140 155 168 218 383
## Target SNPs picked: 21 118 121 140 155 168 218 383
###get.target.geno
The function
get.target.geno
retrieves genotypes of the target SNPs and
returns the genotypes of the triads in a list of three elements: the
observed genotypes of the mothers from family 1 to family n repeated
twice, genotypes of the fathers from family 1 to family n repeated twice
and genotypes of children from family 1 to n followed by (stacking on
top of) genotypes of the complements in the same family order.
target.snp <- picked.target[[2]]
m.file <- file.path(system.file(package = "TriadSim"),'extdata/pop1_4chr_mom')
f.file <- file.path(system.file(package = "TriadSim"),'extdata/pop1_4chr_dad')
k.file <- file.path(system.file(package = "TriadSim"),'extdata/pop1_4chr_kid')
target.geno <- get.target.geno(c(m.file,f.file,k.file), target.snp,snp.all2)
The output target.geno is a list of three elements, each being a matrix of genotypes
To increase diversity, TriadSim introduces break points at each
chromosome, selecting them independently for each triad being simulated.
The break points can be picked manually or using the function
get.brks
. The function tends to pick the break points at
recombination hotspots if such data are passed in as an input parameter
rcmb.rate.
found.brks <- get.brks(N.brk=3,n.ped=1000, snp.all2, target.snp,rcmb.rate=rcmb.rate)
breaks <- found.brks[[1]]
family.position <- found.brks[[2]]
This function returns a list of two items. The first is a 1000 x 17
matrix of integers showing where the chromosomal breaks are to take
place (in terms of the order of the SNPs in the PLINK files) for each
individual in the simulated trios. Each chromosome has 3 breaks, adding
to that is the number of breaks between chromosomes, i.e., 3, and the
first and the last SNPs, and this is where the 17 comes from. Here 1000
denotes the number of triads in the simulated data as defined by the
n.ped
input parameter.
## [1] 1000 19
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14]
## [1,] 0 46 120 129 154 161 173 194 233 262 297 309 324 341
## [2,] 0 35 119 137 147 161 173 183 246 264 297 300 326 337
## [3,] 0 35 118 133 143 162 173 204 233 262 297 306 329 350
## [4,] 0 23 120 130 147 161 173 197 249 264 297 300 326 337
## [5,] 0 40 118 123 146 161 173 183 250 267 297 300 326 338
## [6,] 0 95 118 136 151 162 173 183 250 257 297 311 317 337
## [,15] [,16] [,17] [,18] [,19]
## [1,] 355 369 377 398 412
## [2,] 355 369 385 398 412
## [3,] 355 372 377 408 412
## [4,] 355 369 377 408 412
## [5,] 355 372 376 407 412
## [6,] 355 369 388 408 412
The second one is a 1000 x 8 matrix showing the chromosomal segments out of which each target SNP is selected for each simulated trio.
## [1] 1000 8
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## [1,] 1 2 3 4 5 6 8 17
## [2,] 1 2 3 4 5 6 8 16
## [3,] 1 2 3 4 5 6 8 17
## [4,] 1 2 3 4 5 6 8 17
## [5,] 1 2 3 4 5 6 8 17
## [6,] 1 2 3 4 5 6 8 16