MAPITR Introduction – Simulated Data

This is a simple introductory vignette to the MAPITR package. In this introductory example, we will use a simulated dataset to run MAPITR and explore some basic output. The dataset is provided with the package.

Loading MAPITR and input data

For this introductory vignette, we will be using MAPITR_SimData_Phenotype, MAPITR_SimData_Genotypes, MAPITR_SimData_Pathways, and MAPITR_SimData_PCs. These are simulated datasets created for the purposes of this vignette. To load MAPITR and the datasets, run the following:

library("MAPITR")

data(MAPITR_SimData_Phenotype, MAPITR_SimData_Genotypes, MAPITR_SimData_Pathways, MAPITR_SimData_PCs)

First, we’ll take a look at the formats of our simulated input data:

MAPITR_SimData_Genotypes[1:10,1:10]
##    V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
## 1   1  2  0  1  1  0  1  0  0   2
## 2   2  0  1  2  1  2  1  1  1   2
## 3   0  0  1  2  1  1  1  1  1   0
## 4   1  2  1  0  1  2  2  0  2   1
## 5   1  0  1  2  0  0  2  2  1   1
## 6   2  1  1  1  2  0  0  0  2   1
## 7   0  1  1  1  1  1  0  1  0   0
## 8   1  1  1  1  1  2  0  0  2   1
## 9   0  2  1  0  2  2  2  1  2   0
## 10  2  2  0  2  1  0  0  1  1   0

Here you should see the first ten rows and columns of our genotype matrix. The genotype matrix is n x p, so the rows are our n individuals and the columns are our p SNPs. SNP genotypes, the entries of each column, are assumed to be 0, 1, or 2 – so note, diploid information is assumed to have been collapsed into a single value for each SNP. For example, the typical diploid setup of 0/0, 0/1, and 1/1, which represents the three possible combinations a given pair of reference (0) and alternative (1) alleles can have across both chromosomal copies in an individual, is now instead represented as 0 (0/0), 1 (0/1), or 2 (1/1). One way to derive a genotype matrix with this type of encoding is to use PLINK’s --recodeAD output option and remove all the *_HET columns afterwards.

Also note that the genotype matrix cannot have any missing genotypes (ie 0% rate of genotype missingness). This is necessary due to how the genetic relatedness matrices (GRMs), required components for the MAPITR variance component model, are constructed. For each entry in the genotype matrix, the only acceptable values are 0, 1, and 2.

head(MAPITR_SimData_Phenotype)
##           V1
## 1 -0.4518125
## 2 -0.4628268
## 3  0.4655829
## 4 -0.3667877
## 5  0.5072770
## 6 -0.0278133

Here you should see the first ten values of our input phenotype. MAPITR only runs on one phenotype at a time currently, so the main function only requires a single vector of phenotypic values as input.

head(MAPITR_SimData_Pathways)
##         V1
## 1 Pathway1
## 2 Pathway2
## 3 Pathway3
## 4 Pathway4
## 5 Pathway5
##                                                                                                                                                                                                                                                                                                                                                                                                          V2
## 1          663,259,220,149,567,514,624,661,14,142,476,394,711,68,565,138,326,689,665,490,674,360,536,515,521,147,172,459,465,240,435,317,703,643,316,503,359,492,534,177,38,717,7,146,28,496,430,370,89,135,96,200,518,441,57,383,500,315,630,171,302,523,437,418,384,25,82,284,247,444,213,362,429,653,740,527,745,389,342,162,81,153,163,563,325,634,616,633,116,434,337,594,405,367,622,548,228,51,78,36
## 2            649,232,196,550,455,727,457,290,365,184,176,555,454,420,59,377,556,245,543,298,266,629,233,283,437,367,237,351,328,182,456,49,742,171,424,558,494,165,427,686,728,253,662,641,174,10,381,199,294,459,387,687,32,297,678,30,55,27,142,39,87,694,160,252,270,552,128,11,206,732,91,545,413,617,132,219,306,511,661,522,688,636,280,4,58,105,26,483,704,682,595,708,500,46,498,744,519,47,325,339
## 3 311,750,370,280,240,551,146,107,425,582,391,643,507,134,660,349,683,305,430,296,639,414,510,463,110,396,18,353,154,288,451,676,651,401,302,725,516,552,276,377,571,71,468,424,435,259,84,498,350,329,187,91,323,713,557,696,330,320,304,174,429,658,131,732,648,337,408,100,244,400,104,432,623,630,485,169,540,80,471,23,520,261,334,410,197,192,355,398,217,325,533,178,326,542,549,181,439,625,592,493
## 4           17,242,560,104,96,95,406,191,266,411,230,320,311,545,591,589,386,408,227,45,616,462,414,596,196,538,226,623,82,15,66,442,500,232,173,435,564,144,693,667,608,29,563,11,28,100,61,669,463,369,221,516,267,629,401,354,138,180,670,576,33,514,429,622,148,651,277,107,535,729,181,436,733,128,119,198,206,460,372,709,485,335,428,171,495,203,554,459,42,642,351,660,605,127,14,53,416,137,41,510
## 5             41,553,729,696,322,38,458,684,210,450,411,457,94,185,297,494,206,104,577,539,251,274,141,8,481,124,743,734,40,45,631,705,595,397,306,440,342,525,739,575,167,360,611,348,189,50,282,357,530,133,582,464,142,722,615,151,218,288,376,154,620,727,418,541,569,265,139,284,109,533,459,605,88,136,312,359,334,82,340,486,316,66,9,391,742,404,690,91,71,456,623,536,442,273,15,329,663,373,603,5

Here you should see a matrix with two columns. The first column should list the names of each pathway being analyzed. The second column should be a list of comma-separated column indices, representing the set of SNPs belonging to each associated pathway. These indices will be used by MAPITR to extract the proper SNPs per pathway from the input genotype matrix to conduct the marginal epistasis analysis. Note that the values for this second column are not the SNP IDs (eg rsID#s) or column headers (if your genotype matrix has them), they are the specific column locations in the input genotype matrix for each SNP.

Running MAPITR

Now to run MAPITR, we use the following code:

MAPITR_Output <- MAPITR(MAPITR_SimData_Genotypes, MAPITR_SimData_Phenotype, MAPITR_SimData_Pathways)

And to see the output MAPITR() provides, we can run the following:

head(MAPITR_Output$Results)
##   Pathways    pValues        Est        PVE
## 1 Pathway1 0.00511314 1.09663702 1.32243728
## 2 Pathway2 0.31118083 0.40971253 0.45740085
## 3 Pathway3 0.33131510 0.36199437 0.43933281
## 4 Pathway4 0.96134690 0.01706105 0.02017861
## 5 Pathway5 0.94181499 0.02552678 0.03176702

The output shown from MAPITR_Output$Results should be a matrix with four columns. The first column should be the list of pathway names being analyzed (same order as input file), the second column should be the MAPITR p-values for each pathway, the third column should be the estimate of the epistatic GRM variance component (gl in equation 3 of the MAPITR paper) for each pathway, and the fourth column should be the proportion of variance explained (PVE) for the phenotype by each pathway. Note that PVE estimates here are for pathways that generally contain overlapping variants – this impacts our interpretation of PVE. When single variant models are used in complex trait analysis, which is often the case, PVE can theoretically sum to 1 across all variants. A priori it is unclear what PVE should theoretically sum to across multiple pathways due to this common scenario where variants are shared across pathway definitions.

In this example, there should be one pathway that has a p-value below the Bonferonni-corrected threshold of .01 (.05 / 5, the number of pathways tested):

MAPITR_Output$Results[MAPITR_Output$Results[,2] < .01,]
##   Pathways    pValues      Est      PVE
## 1 Pathway1 0.00511314 1.096637 1.322437

Also note this is a simulated dataset made to run quickly for this vignette – you will very likely need many more individuals and SNPs to identify an epistatic signal, and PVE should never be > 1 in real data.

Incorporating covariates

Lastly, we will go over how to incorporate covariates into this test. Our main recommendation is that most covariates should be directly regressed out from the phenotype prior to MAPITR analysis. In the MAPITR paper this is how we correct for sex, age, and assessment center. The main MAPITR function also automatically regresses out the additive effects of each pathway from the phenotype. However, we also provide an option to remove the effects of a covariate from both the phenotype and genotypes – specificially this is the orthogonality projection matrix, H, used to produce equation 3 from in the MAPITR paper. This projection matrix is applied to both sides of the model, thus allowing an effect to removed both from the phenotype and the GRMs. By default a column of ones is included in this matrix to act as a y-intercept term. However, any additional covariates can be appended to this matrix. We currently include top principal components into this matrix to help remove more subtle population stratification effects from the GRMs.

To include the top ten principal components of our simulated dataset into the MAPITR analysis, we can run the following:

MAPITR_SimData_PCs[1:5,1:10]
##             PC1        PC2        PC3        PC4        PC5        PC6
## [1,]  0.1127320 -0.4486358  0.6656510 -3.0188042 1.54672672 -0.9981512
## [2,]  1.9343189 -0.2892368 -1.3136277  0.4662708 0.04351503 -0.1170024
## [3,]  1.9908207 -4.1118289  0.5600524 -0.4923170 1.26330957  0.4864907
## [4,] -0.1753511 -0.8886352 -1.2132968  1.0306922 0.26797451 -2.0537700
## [5,] -1.4277754 -1.2106821  0.5611386 -0.9433904 1.01580351 -3.1516246
##             PC7        PC8        PC9      PC10
## [1,]  0.2063473 -0.8389474  1.3046610 1.8209930
## [2,] -0.2457400  0.5114055  2.1700065 0.9436350
## [3,]  2.0332192  0.9221459  1.7075565 0.4946276
## [4,]  1.0631471  0.9807124 -0.6004786 0.1891073
## [5,]  0.7364327 -0.3032539 -1.3603440 2.8166732
MAPITR_Output2 <- MAPITR(MAPITR_SimData_Genotypes, MAPITR_SimData_Phenotype, MAPITR_SimData_Pathways, Covariates=MAPITR_SimData_PCs)
MAPITR_Output2$Results[MAPITR_Output2$Results[,2] < .01,]
##   Pathways     pValues      Est      PVE
## 1 Pathway1 0.002208826 1.212191 1.479455

And after doing so, we find that including top principal components does further help refine our epistatic signal in this simulated dataset.

Implementing OpenMP

OpenMP versions of the MAPITR code is available. If possible, it is recommended to use OpenMP versions since it will allow the code to run more quickly. Note that OpenMP must already be installed on your system however to run these versions of the functions.

To run OpenMP versions of MAPITR, use the same commands as before but with the inclusion of the following option: OpenMP = TRUE. Examples are shown below:

MAPITR_Output1 <- MAPITR(MAPITR_SimData_Genotypes, MAPITR_SimData_Phenotype, MAPITR_SimData_Pathways, OpenMP=TRUE)
MAPITR_Output2 <- MAPITR(MAPITR_SimData_Genotypes, MAPITR_SimData_Phenotype, MAPITR_SimData_Pathways, Covariates=MAPITR_SimData_PCs, OpenMP=TRUE)

For further details on the MAPITR model, please see the associated manuscript.