TriadSim Vignette

TriadSim is a package that can simulate genotypes for case-parent triads, case-control, and quantitative trait samples with realistic linkage diequilibrium structure and allele frequency distribution. For studies of epistasis one can simulate models that involve specific SNPs at specific sets of loci, which we will refer to as “pathways”. TriadSim generates genotype data by resampling triad genotypes from existing data. It takes genotypes in PLINK format as the input files. The genotypes for the mothers, fathers, and children are in separate files. The mothers, fathers, and children must be from the same set of triad families although the ordering of the families can be different for the three sets of data. After reading in the genotypes, a sorting step will reorder the families so that the three individuals in each family can realign.

Main function TriadSim

TriadSim is the main function to perform the simulations. The example function call below simulates genotype data for 1000 case-parent triads for 4 chromosomes (chromsomes 1, 8 17, 20) under a genetic main effect scenario with a baseline disease prevalence of P0=0.001 and genetic relative risks of 1.5 and 2 for carrying the first and the second pathway respectively. This function call will write output files in PLINK. The output file names and path to the directory are given by the parameter “out.put.file” and the chromosome number. Each set (.bim, .bed and .fam files) of PLINK files contain genotype data for one chromosome for all simulated samples. The name of the file is the concatenation of the value of the input parameter “out.put.file” and chromosome number. For example, if “out.put.file” is set to be “triad”, the names of the output files will be triad1, triad8, triad17 and triad20 for our example. See R package documentation for more details.

library(TriadSim)
 m.file <- file.path(system.file(package = "TriadSim"),'extdata/pop1_4chr_mom')
 f.file <- file.path(system.file(package = "TriadSim"),'extdata/pop1_4chr_dad')
 k.file <- file.path(system.file(package = "TriadSim"),'extdata/pop1_4chr_kid')
 input.plink.file <- c(m.file, f.file, k.file)

 TriadSim(input.plink.file, out.put.file=file.path(tempdir(),'triad'), fr.desire=0.05,pathways=list(1:4,5:8),n.ped=1000, N.brk=3, target.snp=NA,P0=0.001,is.OR=FALSE,risk.exposure= 1,risk.pathway.unexposed=c(1.5, 2), risk.pathway.exposed=c(1.5, 2), is.case=TRUE, e.fr=NA, pop1.frac=NA, P0.ratio=1,rcmb.rate, no_cores=1)
## [1]  21 118 121 140 155 168 218 383
## coercing object of mode  numeric  to SnpMatrix
## Writing FAM file to /tmp/RtmpK6ezNw/triad1.fam 
## Writing extended MAP file to /tmp/RtmpK6ezNw/triad1.bim 
## Writing BED file to /tmp/RtmpK6ezNw/triad1.bed (SNP-major mode)
## coercing object of mode  numeric  to SnpMatrix
## Writing FAM file to /tmp/RtmpK6ezNw/triad8.fam 
## Writing extended MAP file to /tmp/RtmpK6ezNw/triad8.bim 
## Writing BED file to /tmp/RtmpK6ezNw/triad8.bed (SNP-major mode)
## coercing object of mode  numeric  to SnpMatrix
## Writing FAM file to /tmp/RtmpK6ezNw/triad17.fam 
## Writing extended MAP file to /tmp/RtmpK6ezNw/triad17.bim 
## Writing BED file to /tmp/RtmpK6ezNw/triad17.bed (SNP-major mode)
## coercing object of mode  numeric  to SnpMatrix
## Writing FAM file to /tmp/RtmpK6ezNw/triad20.fam 
## Writing extended MAP file to /tmp/RtmpK6ezNw/triad20.bim 
## Writing BED file to /tmp/RtmpK6ezNw/triad20.bed (SNP-major mode)

The following call simulates a quantitative trait (by setting “qtl=T”). The function will create 4 sets of plink files, one for each chromosome.

TriadSim(input.plink.file, file.path(tempdir(),‘qtl’), fr.desire=0.3,pathways=list(1:4,5:8),n.ped=1000, N.brk=3, target.snp=NA,P0=0.001,is.OR=FALSE,risk.exposure= 1,risk.pathway.unexposed=c(0.5, 1), risk.pathway.exposed=c(0.5, 1), is.case=TRUE, e.fr=NA, pop1.frac=NA, P0.ratio=1,rcmb.rate, no_cores=1, qtl=T)

The following call simulates a scenario that involves gene-environment interaction. The relative risk for the exposure main effect is 1.2. The relative risks for carrying the first and second pathway SNPs are 1.5 and 2 respectively for the exposed individuals and are 1 and 1 for the unexposed individuals.

TriadSim(input.plink.file, file.path(tempdir(),‘gxe’), fr.desire=0.3,pathways=list(1:4,5:8),n.ped=1000, N.brk=3, target.snp=NA,P0=0.001,is.OR=FALSE,risk.exposure= 1.2,risk.pathway.unexposed=c(1,1), risk.pathway.exposed=c(1.5, 2), is.case=TRUE, e.fr=0.3, pop1.frac=NA, P0.ratio=1,rcmb.rate, no_cores=1, qtl=FALSE)

The following call simulates a stratified scenario that involves gene-environment interaction. The risk parameters are the same as the scenario above. The “input.plink.file”” is a list of two character vectors. Each vector contains three character strings giving the directory and basename of the PLINK files in one subpopulation.The subpopulations are equally sized (pop1.frac=0.5). The baseline disease prevalence (disease prevalence in the unexposed who carries 0 copy of the risk pathway) is 0.001 in the first subpopulation while that in the second subpopulation is 0.003 (0.001*3). The exposure prevalence in the two subpopulations are 0.1 and 0.3 respectively.

library(TriadSim) m.file <- file.path(system.file(package = “TriadSim”),‘extdata/pop1_4chr_mom’) f.file <- file.path(system.file(package = “TriadSim”),‘extdata/pop1_4chr_dad’) k.file <- file.path(system.file(package = “TriadSim”),‘extdata/pop1_4chr_kid’) m.file2 <- file.path(system.file(package = “TriadSim”),‘extdata/pop2_4chr_mom’) f.file2 <- file.path(system.file(package = “TriadSim”),‘extdata/pop2_4chr_dad’) k.file2 <- file.path(system.file(package = “TriadSim”),‘extdata/pop2_4chr_kid’) input.plink.file2 <- list(c(m.file, f.file, k.file),c(m.file2, f.file2, k.file2))

TriadSim(input.plink.file2, out.put.file=file.path(tempdir(),‘stratified’) , fr.desire=0.3,pathways=list(1:4,5:8),n.ped=1000, N.brk=3, target.snp=NA,P0=0.001,is.OR=FALSE,risk.exposure= 1.2,risk.pathway.unexposed=c(1,1), risk.pathway.exposed=c(1.5, 2), is.case=TRUE, e.fr=c(0.1, 0.3), pop1.frac=0.5, P0.ratio=3,rcmb.rate, no_cores=1)

To simulate case-control data the function needs to be called twice, calls to simulate cases (is.case=TRUE) and controls (is.case=FALSE) respectively. The script below calls the function to simulate 1000 cases and 1000 controls and writes genotypes of the cases and controls into seperate sets of PLINK files.

For example, use the following to create cases:

TriadSim(input.plink.file,file.path(tempdir(),‘case’) , fr.desire=0.05,pathways=list(1:4,5:8),n.ped=1000, N.brk=3, target.snp=NA,P0=0.001,is.OR=TRUE,risk.exposure= 1,risk.pathway.unexposed=c(1.5, 2), risk.pathway.exposed=c(1.5, 2), is.case=TRUE, e.fr=NA, pop1.frac=NA, P0.ratio=1,rcmb.rate, no_cores=1)

And use the following to create controls: TriadSim(input.plink.file, file.path(tempdir(),‘ctrl’), fr.desire=0.05,pathways=list(1:4,5:8),n.ped=1000, N.brk=3, target.snp=NA,P0=0.001,is.OR=TRUE,risk.exposure= 1,risk.pathway.unexposed=c(1.5, 2), risk.pathway.exposed=c(1.5, 2), is.case=FALSE, e.fr=NA, pop1.frac=NA, P0.ratio=1,rcmb.rate, no_cores=1)

Some additional details

The source data may contain genotyping errors that cause non-Mendelian inheritance patterns. For these non-Mendelian families, genotypes of the three individuals in the family will be set to missing at the corresponding SNPs. We assume nonpaternity and adoption have both been ruled out in QC for the source data.

This function requires at least two pathway SNPs, eithe two SNPs in the one pathway or two pathways each involving one SNP. If the users are interested in a single SNP scenario one can trick the function by setting the number of pathway to 2, each with a single SNP in the pathway but only the SNP in the first pathway carries a risk while that in the second pathway does not change risk. For example:

TriadSim(input.plink.file, file.path(tempdir(),‘singleSNP’), fr.desire=0.05,pathways=list(1,2),n.ped=1000, N.brk=3, target.snp=NA,P0=0.001,is.OR=FALSE,risk.exposure= 1,risk.pathway.unexposed=c(1.5, 1), risk.pathway.exposed=c(1.5, 1), is.case=TRUE, e.fr=NA, pop1.frac=NA, P0.ratio=1,rcmb.rate, no_cores=1)

##Facility functions

The following set of functions is provided in case users want to have more control over the simulation parameters. They are called by the main function to generate simulations. Users do not need to call them directly.

###pick_target.snp Users can manually pick the target SNPs in the pathway or use the facility function pick_target.snp to pick the set of target SNPs in the pathway(s) based on a desired allele frequency. The example below uses the example files that come with the package to select 8 SNPs with allele frequencies close to 0.05. The function returns the selected target SNPs by giving the row numbers (i.e., the order) of the corresponding SNPs among all the SNPs in the associated “bim” file. For example a return of “1084 2044 3285 4016 5117 6067 7077 8187” means the SNPs at rows 1084 2044 3285 4016 5117 6067 7077 8187 are selected to be the target SNPs in the pathway.

 m.file <- file.path(system.file(package = "TriadSim"),'extdata/pop1_4chr_mom')
 f.file <- file.path(system.file(package = "TriadSim"),'extdata/pop1_4chr_dad')
 picked.target <- pick_target.snp(c(m.file,f.file),0.05, 8)
## [1]  21 118 121 140 155 168 218 383
 cat('Target SNPs picked:',picked.target[[2]],'\n')
## Target SNPs picked: 21 118 121 140 155 168 218 383

###get.target.geno The function get.target.geno retrieves genotypes of the target SNPs and returns the genotypes of the triads in a list of three elements: the observed genotypes of the mothers from family 1 to family n repeated twice, genotypes of the fathers from family 1 to family n repeated twice and genotypes of children from family 1 to n followed by (stacking on top of) genotypes of the complements in the same family order.

target.snp <- picked.target[[2]]
 m.file <- file.path(system.file(package = "TriadSim"),'extdata/pop1_4chr_mom')
 f.file <- file.path(system.file(package = "TriadSim"),'extdata/pop1_4chr_dad')
 k.file <- file.path(system.file(package = "TriadSim"),'extdata/pop1_4chr_kid')
target.geno <- get.target.geno(c(m.file,f.file,k.file), target.snp,snp.all2)

The output target.geno is a list of three elements, each being a matrix of genotypes

To increase diversity, TriadSim introduces break points at each chromosome, selecting them independently for each triad being simulated. The break points can be picked manually or using the function get.brks. The function tends to pick the break points at recombination hotspots if such data are passed in as an input parameter rcmb.rate.

found.brks <- get.brks(N.brk=3,n.ped=1000, snp.all2, target.snp,rcmb.rate=rcmb.rate)
breaks <- found.brks[[1]]
family.position <- found.brks[[2]] 

This function returns a list of two items. The first is a 1000 x 17 matrix of integers showing where the chromosomal breaks are to take place (in terms of the order of the SNPs in the PLINK files) for each individual in the simulated trios. Each chromosome has 3 breaks, adding to that is the number of breaks between chromosomes, i.e., 3, and the first and the last SNPs, and this is where the 17 comes from. Here 1000 denotes the number of triads in the simulated data as defined by the n.ped input parameter.

dim(breaks)
## [1] 1000   19
head(breaks)
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14]
## [1,]    0   46  120  129  154  161  173  194  233   262   297   309   324   341
## [2,]    0   35  119  137  147  161  173  183  246   264   297   300   326   337
## [3,]    0   35  118  133  143  162  173  204  233   262   297   306   329   350
## [4,]    0   23  120  130  147  161  173  197  249   264   297   300   326   337
## [5,]    0   40  118  123  146  161  173  183  250   267   297   300   326   338
## [6,]    0   95  118  136  151  162  173  183  250   257   297   311   317   337
##      [,15] [,16] [,17] [,18] [,19]
## [1,]   355   369   377   398   412
## [2,]   355   369   385   398   412
## [3,]   355   372   377   408   412
## [4,]   355   369   377   408   412
## [5,]   355   372   376   407   412
## [6,]   355   369   388   408   412

The second one is a 1000 x 8 matrix showing the chromosomal segments out of which each target SNP is selected for each simulated trio.

dim(family.position)
## [1] 1000    8
head(family.position)
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## [1,]    1    2    3    4    5    6    8   17
## [2,]    1    2    3    4    5    6    8   16
## [3,]    1    2    3    4    5    6    8   17
## [4,]    1    2    3    4    5    6    8   17
## [5,]    1    2    3    4    5    6    8   17
## [6,]    1    2    3    4    5    6    8   16