---
title: "GESE package vignette"
author: "Dandi Qiao"
date: "`r Sys.Date()`"
output: 
  pdf_document:
    toc: true
    toc_depth: 2
vignette: >
  %\VignetteIndexEntry{An R package for Gene-based Segregation test (GESE)}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---


#1 Introduction
This tutorial briefly introduces the functions provided by the GESE package, using the example data included in the package. The paper describing this method is "Qiao, D. Lange, C., Laird, N.M., Won, S.,  Hersh, C.P., et al. 2017. Gene-based segregation method for identifying rare variants for family-based sequencing studies. Genet Epidemiol 41(4):309-319". We have also implemented a simple pipeline to compute the GESE p-values using this R library, which is described in detail at [http://scholar.harvard.edu/dqiao/gese](http://scholar.harvard.edu/dqiao/gese).


We can load the library using:

```{r}
library(GESE)
```

#2 Example data

A randomly simulated example data is included in the package. This data includes 198 sequenced subjects in 50 families. Only 2 genes with 10 variants each are included for these subjects. With real sequencing data, the first step is to filter down to the rare and functionally important variants since we hypothesize that there are rare causal variants of large effects for Mendelian diseases, or Mendelian-subtypes of complex diseases. One way is to filter the study data and the reference data using MAF < 0.1% and CADD score^[Kircher, M, Witten, DM, Jain, P, O’Roak, BJ, Cooper, GM, and Shendure, J. A general framework for estimating the relative pathogenicity of human genetic variants. Nature genetics 2014; 46(3):310–315.] > 15 on LoF (and missense) variants. Other annotation tools such as SIFT and Polyphen2 could also be helpful in filtering down to the functional variants.

After the filtering step, we can load the variant data of the sequenced subjects. This data frame is in the PLINK raw format (with the default header generated by PLINK). We need to make sure that the genotype is the number of the minor alleles in the corresponding population.
```{r}
data(dataRaw)
dim(dataRaw)
dataRaw[1:10,]
length(unique(dataRaw$FID))
```

The corresponding map file with variant information can be loaded next. The order of the variants needs to match the order in the dataRaw file above.
```{r}
data(mapInfo)
mapInfo[1:3, ]
dim(mapInfo)
```

The complete pedigree information for the 50 families can be loaded next:
```{r}
data(pednew)
dim(pednew)
pednew[1:5,]
```

The complete variant information for the corresponding population can be loaded next. This reference database is used to compute the background variation in the corresponding population.
```{r}
data(database)
dim(database)
database[1:5,]
```

#3 Functions

The main function for obtaining gene-based segregation information and (unweighted and weighted) segregation tests is `GESE`. Other functions that may be helpful in analyzing family-based sequencing data will be discussed too.

## 3.1 Compute variant-based and gene-based segregation information for different mode of inheritance.
One major function in this package for computing segregation information for different mode of inheritance is `getSegInfo`. This function returns the variant-based and gene-based segregation information for each family and the total number of segregating families for three different mode of inheritance: dominant, recessive, and compound heterozygous.

For example, segregation in the family with dominant mode of inheritance means the variant is present in all the cases in the family, and absent in all the controls in the family. Therefore variants that are segregating in multiple families are more likely to be causal.

Since it is likely that different rare variants are influencing the disease susceptibility in different families, collapsing the variants into genes may give us more power to detect the causal genes. Therefore we also need to compute the total number of families that are segregating in each gene.

For recessive mode of inheritance, segregation means two copies of the alternative alleles are present in all the cases in the family, and less than two copies in all the controls of the family (`varSeg`). This information can also be collapsed for each gene (`geneSeg`). 

For compound heterozygous (CH) mode of inheritance, segregation at two variants means the alternative alleles are present at both loci for all the cases in the family, and absent in at least one locus in all the controls in the family (`varSeg`). This information can be collapsed for each gene, where only pairs of variants in the same gene are considred (`geneSeg`). This can also be collapsed for each pair of genes, where pairs of variants from each of the two genes are considered (`genePairSeg`).

The data frame `varSeg` is a matrix containing logical value (TRUE or FALSE) for each variant (each row) and each family (each column). The first column is the variant ID. The last column `numSegFam` is the total number of families the variant is segregating in. The data frame `geneSeg` is also a matrix containing logical values, for each gene and each family. TRUE value means that at least one variant in this gene is segregating in the family.

We demonstrate the use of this function below.

To compute segregation information for domiant mode of inheritance:
```{r, results="hide"}
results <- getSegInfo(pednew,  dataRaw, mapInfo, mode="dominant")
results
```

```{r, echo=FALSE}
results
```

To compute segregation information for recessive mode of inheritance:
```{r, results="hide"}
results <- getSegInfo(pednew,  dataRaw, mapInfo)
results
```

```{r, echo=FALSE}
results
```


To compute segregation information for compound heterozygous mode of inheritance:
```{r, results="hide"}
results <- getSegInfo(pednew,  dataRaw, mapInfo, mode="CH")
results
```

```{r, echo=FALSE}
results
```
The `geneSeg` or `genePairSeg` return NA values, because there is no pair of variant in any gene, or gene pair that is segregating in any family, with compound heterozygous mode of inheritance.

Alternatively, we can compute segregation with dominant mode of inheritance wthout computing any probabilities using the `GESE` function and specify `onlySeg` to be TRUE:
```{r, results="hide"}
results <- GESE(pednew, database, 1000000, dataRaw, mapInfo, threshold=1e-5,
                onlySeg=TRUE)
results
```

```{r, echo=FALSE}
results
```
Similarly, the `segregation` value returned is a data matrix, where a logical value (TRUE or FALSE) is returned for each gene(each row) and each family (column names). The last column `numSegFam` sums each row, which gives the total number of pedigrees each gene segregates in. The families were first trimmed to satisfy the assumption of GESE, so that only one founder case is present for each pedigree.

The `varSeg` value returned is a similar data matrix, but returns a logical value for each variant (each row) and each family (column names).

## 3.2 Compute gene-based segregation test.
The gene-based segregation information may be helpful, however, it does not take into account the different family structure among the families, nor the different genetic background of different genes. The gene-based segregation test combines this information by approximating the marginal probability of segregation events among the families. 

To compute the p-value for this test, there are a few steps in the process. 

- First, as we mentioned, we need to make sure that only variants that are rare and functionally important are included in the data, to satisfy the assumptions of the method. For example, we used MAF < 0.1%, CADD score > 15, LoF and missense variants as filtering criterion for the Boston Early-Onset COPD data, described in the paper. Since we are looking for extremely rare variant of large effect for the disease, such filtering is reasonable. 

  To ensure that only one founder case is present for each family, we will trim the pedigrees to keep only the founder case that is related to most other non-founder cases if necessary.

- Second, we need to compute the conditional probabilities that a variant segregates in the family conditional on that variant in present in one of the founders.

- Third, we need to compute the marginal probability that at least one variant in the gene segregates in the family.

- Fourth, we need to compute the final p-value using the marginal segregation probabilities computed above. 

These steps can be done using the GESE function in one call:

```{r, results="hide"}
results2 <- GESE(pednew, database, 1000000, dataRaw, mapInfo, threshold=1e-5)
results2
```

```{r, echo=FALSE}
results2
```

This call computes the GESE p-value using resampling with 1e5 times of simulations (so the smallest p-value is 1e-5). There are several other matrices returned by this call, such as the conditional segregation probability (`condSegProb`) - the conditional probability of segregation event in the family conditioning on that one of the (most recent common) founders introduced the variant to the family; the gene-based segregation probability (`segProbGene`) - the marginal probability that any variant in the gene is segregating in the family; and the variant-based and gene-based segregation matrices (`varSeg` and `segregation`).

The most useful results containing the GESE p-value here is:
```{r}
results2$results
```


## 3.3 Compute the weighted gene-based segregation test.

We can also incorporate additional information, such as disease severity of the cases in the families, in the weighting of the families. We can also compute the p-value for such weighted test, using resampling.

Suppose we have a disease severity measure of the cases in the families, and we are using such weighting of families for all the genes.

```{r, results="hide"}
### creating weights for the families (same weights for all genes)
fams <- unique(dataRaw$FID)
nfam = length(fams)
temp = runif(nfam)
famWeight <- temp/sum(temp)
weightFam = data.frame(FID=fams, weight=famWeight)
results3 <- GESE(pednew, database, 1000000, dataRaw, mapInfo, threshold=1e-5,
             familyWeight=weightFam)
results3$results
```

```{r, echo=FALSE}
results3$results
```


## 3.4. Other useful methods

There are a few public methods that were used in the GESE test and may also be useful in other contexts.

## 3.4.1. Trimming the pedigree file

This method `trim_oneLineage` accepts the complete pedigree information and selects only one lineage per pedigree. The lineage is selected such that the maximum possible number of sequenced subects (cases) are included in the selected lineage. Other family-based methods, such as PVAAST, also requires one lineage in each pedigree only. 

The method `trim_unrelated` deals with families with multiple founder cases, which violates our assumption that only one founder introduced the causal variant. It removes the minimal number of founder cases so that the pedigree does not violate the assumption of the method.

## 3.4.2. Compute the conditional segregating probability

The method `condSegProbF` computes the probability that the variant is segregating in the family given that it is introducted by the most recent common ancestors of the cases in the family.