---
title: "MAPITR Introduction -- Simulated Data"
author: "Michael C Turchin"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{MAPITR Introduction -- Simulated Data}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

This is a simple introductory vignette to the `MAPITR` package. In this introductory example, we will use a simulated dataset to run `MAPITR` and explore some basic output. The dataset is provided with the package.

## Loading `MAPITR` and input data

For this introductory vignette, we will be using `MAPITR_SimData_Phenotype`, `MAPITR_SimData_Genotypes`, `MAPITR_SimData_Pathways`, and `MAPITR_SimData_PCs`. These are simulated datasets created for the purposes of this vignette. To load `MAPITR` and the datasets, run the following:

```{r}
library("MAPITR")

data(MAPITR_SimData_Phenotype, MAPITR_SimData_Genotypes, MAPITR_SimData_Pathways, MAPITR_SimData_PCs)
```

First, we'll take a look at the formats of our simulated input data:

```{r}
MAPITR_SimData_Genotypes[1:10,1:10]
```

Here you should see the first ten rows and columns of our genotype matrix. The genotype matrix is n x p, so the rows are our n individuals and the columns are our p SNPs. SNP genotypes, the entries of each column, are assumed to be 0, 1, or 2 -- so note, diploid information is assumed to have been collapsed into a single value for each SNP. For example, the typical diploid setup of 0/0, 0/1, and 1/1, which represents the three possible combinations a given pair of reference (0) and alternative (1) alleles can have across both chromosomal copies in an individual, is now instead represented as 0 (0/0), 1 (0/1), or 2 (1/1). One way to derive a genotype matrix with this type of encoding is to use PLINK's `--recodeAD` output option and remove all the `*_HET` columns afterwards.

Also note that the genotype matrix cannot have any missing genotypes (ie 0% rate of genotype missingness). This is necessary due to how the genetic relatedness matrices (GRMs), required components for the `MAPITR` variance component model, are constructed. For each entry in the genotype matrix, the only acceptable values are 0, 1, and 2. 

```{r}
head(MAPITR_SimData_Phenotype)
```

Here you should see the first ten values of our input phenotype. `MAPITR` only runs on one phenotype at a time currently, so the main function only requires a single vector of phenotypic values as input. 


```{r}
head(MAPITR_SimData_Pathways)
```

Here you should see a matrix with two columns. The first column should list the names of each pathway being analyzed. The second column should be a list of comma-separated column indices, representing the set of SNPs belonging to each associated pathway. These indices will be used by `MAPITR` to extract the proper SNPs per pathway from the input genotype matrix to conduct the marginal epistasis analysis. Note that the values for this second column are not the SNP IDs (eg rsID#s) or column headers (if your genotype matrix has them), they are the specific column locations in the input genotype matrix for each SNP.

## Running `MAPITR`

Now to run `MAPITR`, we use the following code:

```{r}
MAPITR_Output <- MAPITR(MAPITR_SimData_Genotypes, MAPITR_SimData_Phenotype, MAPITR_SimData_Pathways)
```

And to see the output `MAPITR()` provides, we can run the following:

```{r}
head(MAPITR_Output$Results)
```

The output shown from `MAPITR_Output$Results` should be a matrix with four columns. The first column should be the list of pathway names being analyzed (same order as input file), the second column should be the `MAPITR` p-values for each pathway, the third column should be the estimate of the epistatic GRM variance component ($g_l$ in equation 3 of the `MAPITR` [paper][MAPITR-paper]) for each pathway, and the fourth column should be the proportion of variance explained (PVE) for the phenotype by each pathway. Note that PVE estimates here are for pathways that generally contain overlapping variants -- this impacts our interpretation of PVE. When single variant models are used in complex trait analysis, which is often the case, PVE can theoretically sum to 1 across all variants. *A priori* it is unclear what PVE should theoretically sum to across multiple pathways due to this common scenario where variants are shared across pathway definitions.

In this example, there should be one pathway that has a p-value below the Bonferonni-corrected threshold of .01 (.05 / 5, the number of pathways tested):

```{r}
MAPITR_Output$Results[MAPITR_Output$Results[,2] < .01,]
```

Also note this is a simulated dataset made to run quickly for this vignette -- you will very likely need many more individuals and SNPs to identify an epistatic signal, and PVE should never be > 1 in real data.

## Incorporating covariates

Lastly, we will go over how to incorporate covariates into this test. Our main recommendation is that most covariates should be directly regressed out from the phenotype prior to `MAPITR` analysis. In the `MAPITR` [paper][MAPITR-paper] this is how we correct for sex, age, and assessment center. The main `MAPITR` function also automatically regresses out the additive effects of each pathway from the phenotype. However, we also provide an option to remove the effects of a covariate from both the phenotype and genotypes -- specificially this is the orthogonality projection matrix, H, used to produce equation 3 from in the `MAPITR` [paper][MAPITR-paper]. This projection matrix is applied to both sides of the model, thus allowing an effect to removed both from the phenotype and the GRMs. By default a column of ones is included in this matrix to act as a y-intercept term. However, any additional covariates can be appended to this matrix. We currently include top principal components into this matrix to help remove more subtle population stratification effects from the GRMs.

To include the top ten principal components of our simulated dataset into the `MAPITR` analysis, we can run the following: 
```{r}
MAPITR_SimData_PCs[1:5,1:10]
MAPITR_Output2 <- MAPITR(MAPITR_SimData_Genotypes, MAPITR_SimData_Phenotype, MAPITR_SimData_Pathways, Covariates=MAPITR_SimData_PCs)
MAPITR_Output2$Results[MAPITR_Output2$Results[,2] < .01,]
```

And after doing so, we find that including top principal components does further help refine our epistatic signal in this simulated dataset.

## Implementing OpenMP

OpenMP versions of the `MAPITR` code is available. If possible, it is recommended to use OpenMP versions since it will allow the code to run more quickly. Note that OpenMP must already be installed on your system however to run these versions of the functions. 

To run OpenMP versions of `MAPITR`, use the same commands as before but with the inclusion of the following option: `OpenMP = TRUE`. Examples are shown below:
```
MAPITR_Output1 <- MAPITR(MAPITR_SimData_Genotypes, MAPITR_SimData_Phenotype, MAPITR_SimData_Pathways, OpenMP=TRUE)
MAPITR_Output2 <- MAPITR(MAPITR_SimData_Genotypes, MAPITR_SimData_Phenotype, MAPITR_SimData_Pathways, Covariates=MAPITR_SimData_PCs, OpenMP=TRUE)
```


For further details on the `MAPITR` model, please see the associated [manuscript][MAPITR-paper]. 

[MAPITR-paper]: http://biorxiv.org/