---
title: "Processing 1000 Genomes data for set-based association tests"
author: "Jaehyun Joo"
date: "`r format(Sys.time(), '%d %B, %Y')`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Processing 1000 Genomes data for set-based association tests}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

## Overview

A set-based association test in the **snpsettest** package requires a reference
data set to infer pairwise linkage disequilibrium (LD) values between a set of
variants. This vignette shows you how to use 1000 Genomes data as the reference
data for set-based association tests.

## Prerequisites

[PLINK 2.0](https://www.cog-genomics.org/plink/2.0/) is required to process the
1000 Genomes dataset. The 1000 Genomes phase 3 dataset (GRCh37) is available in
PLINK2 binary format at [PLINK 2.0
Resources](https://www.cog-genomics.org/plink/2.0/resources#1kg_phase3). To download files, 

```{bash 1000 Genomes download, eval = FALSE}
# The links in here may be changed in future
# "-O" to specify output file name
wget -O all_phase3.psam "https://www.dropbox.com/s/yozrzsdrwqej63q/phase3_corrected.psam?dl=1"
wget -O all_phase3.pgen.zst "https://www.dropbox.com/s/afvvf1e15gqzsqo/all_phase3.pgen.zst?dl=1"
wget -O all_phase3.pvar.zst "https://www.dropbox.com/s/op9osq6luy3pjg8/all_phase3.pvar.zst?dl=1"

# Decompress pgen.zst to pgen
plink2 --zst-decompress all_phase3.pgen.zst > all_phase3.pgen
```

## Choose an appropriate population

Patterns of LD could vary among racial/ethnic groups, and thus, it may be
necessary to choose an appropriate population. For example, if your GWAS is
based on European descent, you may want to keep **EUR** samples as described in
the "all_phase3.psam" file.

```{bash sub population, eval = FALSE}
# "vzs" modifier to directly operate with pvar.zst
# "--chr 1-22" excludes all variants not on the listed chromosomes
# "--output-chr 26" uses numeric chromosome codes
# "--max-alleles 2": PLINK 1 binary does not allow multi-allelic variants
# "--rm-dup" removes duplicate-ID variants
# "--set-missing-var-id" replaces missing IDs with a pattern
plink2 --pfile all_phase3 vzs \
       --chr 1-22 \
       --output-chr 26 \
       --max-alleles 2 \
       --rm-dup exclude-mismatch \
       --set-missing-var-ids '@_#_$1_$2' \
       --make-pgen \
       --out all_phase3_autosomes

# Prepare sub-population filter file
awk 'NR == 1 || $5 == "EUR" {print $1}' all_phase3.psam > EUR_1kg_samples.txt

# Generate sub-population fileset
plink2 --pfile all_phase3_autosomes \
       --keep EUR_1kg_samples.txt \
       --make-pgen \
       --out EUR_phase3_autosomes
```

## Convert the 1000 Genomes data to PLINK 1 binary format

The **snpsettest** package uses PLINK 1 binary files to read them into R. The
PLINK2 binary fileset (pgen/pvar/psam) can be easily converted to PLINK 1 binary
fileset (bed/bim/fam).

```{bash pgen2bed, eval = FALSE}
# pgen to bed
# "--maf 0.005" remove most monomorphic SNPs 
# (still may have some when all samples are heterozyguous -> maf=0.5)
plink2 --pfile EUR_phase3_autosomes \
       --maf 0.005 \
       --make-bed \
       --out EUR_phase3_autosomes
       
# Split bed/bim/fam by chromosome
for i in {1..22}
do plink2 --bfile EUR_phase3_autosomes --chr $i --make-bed --out EUR_phase3_chr$i
done
```

For the **snpsettest** package, it is better to split your reference data by
chromosome and run set-based association tests with per-chromosome binary files.
For instance, when you perform set-based association tests for genes on
chromosome 1, you don't have to load genotype data for other chromosomes into R.
Intuitively, as more redundant SNPs are included in the reference data, your
tests will get (often significantly) slower and consume more memory.

## Read PLINK 1 binary fileset to R session

This package uses a bed.matrix-class from the
[**gaston**](https://CRAN.R-project.org/package=gaston) package to attach
genotype data to R session. Genotypes are retrieved on demand to manage
large-scale genotype data.

```{r toR, eval = FALSE}
library(snpsettest)

# Read chromosome 1 bed/bim/fam files
x <- read_reference_bed("/path/to/EUR_phase3_chr1.bed")
```