---
title: "Introduction to snpsettest"
author: "Jaehyun Joo"
date: "`r format(Sys.time(), '%d %B, %Y')`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Introduction to snpsettest}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
data.table::setDTthreads(1)
```

This vignette shows you how to perform gene-based association tests using GWAS
summary statistics in which sets of SNPs are defined by genes.

***

## GWAS summary statistics

The **snpsettest** requires SNP-level p-values to perform gene-based association
tests.

```{r GWAS summary}
library(snpsettest)

# Check an example of GWAS summary file (included in this package)
head(exGWAS, 3)
```

## Reference data

To infer the relationships among SNPs, the **snpsettest** package requires a
reference data set. The GWAS genotype data itself can be used as the reference
data (If the GWAS cohort is large, it is impractical to use genotype data of all
individuals. It would be sufficient to randomly select 1,000 unrelated
individuals for inferring pairwise LD correlations among common SNPs).
Otherwise, you could use publicly available data, such as the 1000 Genomes
(please see the companion vignette for [processing the 1000 Genomes
data](reference_1000Genomes.html)). This package accepts PLINK 1 binary files
(.bed, .bim, .fam) as an input. We can use `read_reference_bed` to read them
into R.

```{r reference data}
# Path to .bed file
bfile <- system.file("extdata", "example.bed", package = "snpsettest")

# Read a .bed file using bed.matrix-class in gaston package
# Genotypes are retrieved on demand to manage large-scale genotype data
x <- read_reference_bed(bfile, verbose = FALSE)
```

## Harmonize GWAS summary to the reference data

Pre-processing of GWAS summary data is required because the sets of variants
available in a particular GWAS might be poorly matched to the variants in
reference data. SNP matching can be performed using `harmonize_sumstats` either
1) by SNP ID or 2) by chromosome code, base-pair position, and allele codes,
while taking into account reference allele swap and possible strand flips.

```{r harmonization}
# Harmonize by SNP IDs
hsumstats1 <- harmonize_sumstats(exGWAS, x)

# Harmonize by genomic position and allele codes
# Reference allele swap will be taken into account (e.g., A/C match C/A)
hsumstats2 <- harmonize_sumstats(exGWAS, x, match_by_id = FALSE)

# Check matching entries by flipping allele codes (e.g., A/C match T/G)
# Ambiguous SNPs will be excluded from harmonization
hsumstats3 <- harmonize_sumstats(exGWAS, x, match_by_id = FALSE, check_strand_flip = TRUE)
```

## Map SNPs to genes

To perform gene-based association tests, it is necessary to annotate SNPs onto
their neighboring genes. Mapping SNPs to genes (or genomic regions) can be
achieved by `map_snp_to_genes` with gene start/end information.

```{r snp2gene}
# Check gene information from the GENCODE project (included in this package)
head(gene.curated.GRCh37, 3)

# Map SNPs to genes
snp_sets <- map_snp_to_gene(hsumstats1, gene.curated.GRCh37)
str(snp_sets$sets[1:5])

# Allows a certain kb window before/after the gene to be included for SNP mapping
snp_sets_50kb <- map_snp_to_gene(
  hsumstats1, gene.curated.GRCh37, 
  extend_start = 50, extend_end = 50 # default is 20kb
)
```

## Perform gene-based association tests

Once we have SNP sets for genes, `snpset_test` can be used to perform
gene-based association tests.

```{r tests}
# Perform gene-based association tests for the first 5 genes
res <- snpset_test(hsumstats1, x, snp_sets$sets[1:5])

# Show output
res
```