This is a simple introductory
vignette to the MAPITR
package. In this introductory
example, we will use a simulated dataset to run MAPITR
and
explore some basic output. The dataset is provided with the package.
MAPITR
and input dataFor this introductory vignette, we will be using
MAPITR_SimData_Phenotype
,
MAPITR_SimData_Genotypes
,
MAPITR_SimData_Pathways
, and
MAPITR_SimData_PCs
. These are simulated datasets created
for the purposes of this vignette. To load MAPITR
and the
datasets, run the following:
library("MAPITR")
data(MAPITR_SimData_Phenotype, MAPITR_SimData_Genotypes, MAPITR_SimData_Pathways, MAPITR_SimData_PCs)
First, we’ll take a look at the formats of our simulated input data:
## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
## 1 1 2 0 1 1 0 1 0 0 2
## 2 2 0 1 2 1 2 1 1 1 2
## 3 0 0 1 2 1 1 1 1 1 0
## 4 1 2 1 0 1 2 2 0 2 1
## 5 1 0 1 2 0 0 2 2 1 1
## 6 2 1 1 1 2 0 0 0 2 1
## 7 0 1 1 1 1 1 0 1 0 0
## 8 1 1 1 1 1 2 0 0 2 1
## 9 0 2 1 0 2 2 2 1 2 0
## 10 2 2 0 2 1 0 0 1 1 0
Here you should see the first ten rows and columns of our genotype
matrix. The genotype matrix is n x p, so the rows are our n individuals
and the columns are our p SNPs. SNP genotypes, the entries of each
column, are assumed to be 0, 1, or 2 – so note, diploid information is
assumed to have been collapsed into a single value for each SNP. For
example, the typical diploid setup of 0/0, 0/1, and 1/1, which
represents the three possible combinations a given pair of reference (0)
and alternative (1) alleles can have across both chromosomal copies in
an individual, is now instead represented as 0 (0/0), 1 (0/1), or 2
(1/1). One way to derive a genotype matrix with this type of encoding is
to use PLINK’s --recodeAD
output option and remove all the
*_HET
columns afterwards.
Also note that the genotype matrix cannot have any missing genotypes
(ie 0% rate of genotype missingness). This is necessary due to how the
genetic relatedness matrices (GRMs), required components for the
MAPITR
variance component model, are constructed. For each
entry in the genotype matrix, the only acceptable values are 0, 1, and
2.
## V1
## 1 -0.4518125
## 2 -0.4628268
## 3 0.4655829
## 4 -0.3667877
## 5 0.5072770
## 6 -0.0278133
Here you should see the first ten values of our input phenotype.
MAPITR
only runs on one phenotype at a time currently, so
the main function only requires a single vector of phenotypic values as
input.
## V1
## 1 Pathway1
## 2 Pathway2
## 3 Pathway3
## 4 Pathway4
## 5 Pathway5
## V2
## 1 663,259,220,149,567,514,624,661,14,142,476,394,711,68,565,138,326,689,665,490,674,360,536,515,521,147,172,459,465,240,435,317,703,643,316,503,359,492,534,177,38,717,7,146,28,496,430,370,89,135,96,200,518,441,57,383,500,315,630,171,302,523,437,418,384,25,82,284,247,444,213,362,429,653,740,527,745,389,342,162,81,153,163,563,325,634,616,633,116,434,337,594,405,367,622,548,228,51,78,36
## 2 649,232,196,550,455,727,457,290,365,184,176,555,454,420,59,377,556,245,543,298,266,629,233,283,437,367,237,351,328,182,456,49,742,171,424,558,494,165,427,686,728,253,662,641,174,10,381,199,294,459,387,687,32,297,678,30,55,27,142,39,87,694,160,252,270,552,128,11,206,732,91,545,413,617,132,219,306,511,661,522,688,636,280,4,58,105,26,483,704,682,595,708,500,46,498,744,519,47,325,339
## 3 311,750,370,280,240,551,146,107,425,582,391,643,507,134,660,349,683,305,430,296,639,414,510,463,110,396,18,353,154,288,451,676,651,401,302,725,516,552,276,377,571,71,468,424,435,259,84,498,350,329,187,91,323,713,557,696,330,320,304,174,429,658,131,732,648,337,408,100,244,400,104,432,623,630,485,169,540,80,471,23,520,261,334,410,197,192,355,398,217,325,533,178,326,542,549,181,439,625,592,493
## 4 17,242,560,104,96,95,406,191,266,411,230,320,311,545,591,589,386,408,227,45,616,462,414,596,196,538,226,623,82,15,66,442,500,232,173,435,564,144,693,667,608,29,563,11,28,100,61,669,463,369,221,516,267,629,401,354,138,180,670,576,33,514,429,622,148,651,277,107,535,729,181,436,733,128,119,198,206,460,372,709,485,335,428,171,495,203,554,459,42,642,351,660,605,127,14,53,416,137,41,510
## 5 41,553,729,696,322,38,458,684,210,450,411,457,94,185,297,494,206,104,577,539,251,274,141,8,481,124,743,734,40,45,631,705,595,397,306,440,342,525,739,575,167,360,611,348,189,50,282,357,530,133,582,464,142,722,615,151,218,288,376,154,620,727,418,541,569,265,139,284,109,533,459,605,88,136,312,359,334,82,340,486,316,66,9,391,742,404,690,91,71,456,623,536,442,273,15,329,663,373,603,5
Here you should see a matrix with two columns. The first column
should list the names of each pathway being analyzed. The second column
should be a list of comma-separated column indices, representing the set
of SNPs belonging to each associated pathway. These indices will be used
by MAPITR
to extract the proper SNPs per pathway from the
input genotype matrix to conduct the marginal epistasis analysis. Note
that the values for this second column are not the SNP IDs (eg rsID#s)
or column headers (if your genotype matrix has them), they are the
specific column locations in the input genotype matrix for each SNP.
MAPITR
Now to run MAPITR
, we use the following code:
MAPITR_Output <- MAPITR(MAPITR_SimData_Genotypes, MAPITR_SimData_Phenotype, MAPITR_SimData_Pathways)
And to see the output MAPITR()
provides, we can run the
following:
## Pathways pValues Est PVE
## 1 Pathway1 0.00511314 1.09663702 1.32243728
## 2 Pathway2 0.31118083 0.40971253 0.45740085
## 3 Pathway3 0.33131510 0.36199437 0.43933281
## 4 Pathway4 0.96134690 0.01706105 0.02017861
## 5 Pathway5 0.94181499 0.02552678 0.03176702
The output shown from MAPITR_Output$Results
should be a
matrix with four columns. The first column should be the list of pathway
names being analyzed (same order as input file), the second column
should be the MAPITR
p-values for each pathway, the third
column should be the estimate of the epistatic GRM variance component
(gl in
equation 3 of the MAPITR
paper) for each pathway, and the fourth
column should be the proportion of variance explained (PVE) for the
phenotype by each pathway. Note that PVE estimates here are for pathways
that generally contain overlapping variants – this impacts our
interpretation of PVE. When single variant models are used in complex
trait analysis, which is often the case, PVE can theoretically sum to 1
across all variants. A priori it is unclear what PVE should
theoretically sum to across multiple pathways due to this common
scenario where variants are shared across pathway definitions.
In this example, there should be one pathway that has a p-value below the Bonferonni-corrected threshold of .01 (.05 / 5, the number of pathways tested):
## Pathways pValues Est PVE
## 1 Pathway1 0.00511314 1.096637 1.322437
Also note this is a simulated dataset made to run quickly for this vignette – you will very likely need many more individuals and SNPs to identify an epistatic signal, and PVE should never be > 1 in real data.
Lastly, we will go over how to incorporate covariates into this test.
Our main recommendation is that most covariates should be directly
regressed out from the phenotype prior to MAPITR
analysis.
In the MAPITR
paper this
is how we correct for sex, age, and assessment center. The main
MAPITR
function also automatically regresses out the
additive effects of each pathway from the phenotype. However, we also
provide an option to remove the effects of a covariate from both the
phenotype and genotypes – specificially this is the orthogonality
projection matrix, H, used to produce equation 3 from in the
MAPITR
paper. This
projection matrix is applied to both sides of the model, thus allowing
an effect to removed both from the phenotype and the GRMs. By default a
column of ones is included in this matrix to act as a y-intercept term.
However, any additional covariates can be appended to this matrix. We
currently include top principal components into this matrix to help
remove more subtle population stratification effects from the GRMs.
To include the top ten principal components of our simulated dataset
into the MAPITR
analysis, we can run the following:
## PC1 PC2 PC3 PC4 PC5 PC6
## [1,] 0.1127320 -0.4486358 0.6656510 -3.0188042 1.54672672 -0.9981512
## [2,] 1.9343189 -0.2892368 -1.3136277 0.4662708 0.04351503 -0.1170024
## [3,] 1.9908207 -4.1118289 0.5600524 -0.4923170 1.26330957 0.4864907
## [4,] -0.1753511 -0.8886352 -1.2132968 1.0306922 0.26797451 -2.0537700
## [5,] -1.4277754 -1.2106821 0.5611386 -0.9433904 1.01580351 -3.1516246
## PC7 PC8 PC9 PC10
## [1,] 0.2063473 -0.8389474 1.3046610 1.8209930
## [2,] -0.2457400 0.5114055 2.1700065 0.9436350
## [3,] 2.0332192 0.9221459 1.7075565 0.4946276
## [4,] 1.0631471 0.9807124 -0.6004786 0.1891073
## [5,] 0.7364327 -0.3032539 -1.3603440 2.8166732
MAPITR_Output2 <- MAPITR(MAPITR_SimData_Genotypes, MAPITR_SimData_Phenotype, MAPITR_SimData_Pathways, Covariates=MAPITR_SimData_PCs)
MAPITR_Output2$Results[MAPITR_Output2$Results[,2] < .01,]
## Pathways pValues Est PVE
## 1 Pathway1 0.002208826 1.212191 1.479455
And after doing so, we find that including top principal components does further help refine our epistatic signal in this simulated dataset.
OpenMP versions of the MAPITR
code is available. If
possible, it is recommended to use OpenMP versions since it will allow
the code to run more quickly. Note that OpenMP must already be installed
on your system however to run these versions of the functions.
To run OpenMP versions of MAPITR
, use the same commands
as before but with the inclusion of the following option:
OpenMP = TRUE
. Examples are shown below:
MAPITR_Output1 <- MAPITR(MAPITR_SimData_Genotypes, MAPITR_SimData_Phenotype, MAPITR_SimData_Pathways, OpenMP=TRUE)
MAPITR_Output2 <- MAPITR(MAPITR_SimData_Genotypes, MAPITR_SimData_Phenotype, MAPITR_SimData_Pathways, Covariates=MAPITR_SimData_PCs, OpenMP=TRUE)
For further details on the MAPITR
model, please see the
associated manuscript.