---
title: "episcan"
author: "Beibei Jiang"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
bibliography: episcan.bib
vignette: >
  %\VignetteIndexEntry{episcan}
  %\VignetteEngine{knitr::knitr}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
options(stringsAsFactors = FALSE)
```

**episcan** provides efficient methods to scan for pairwise epistasis in both case-control and quantitative studies. It is suitable for genome-wide interaction studies (GWIS) because it splits the computation into manageable chunks. The epistasis methods used by **episcan** are adapted from two published papers [@epiblaster; @epiHSIC].

## Installation

```{r, eval = FALSE, error = FALSE, results='hide'}
# the package name must be quoted
install.packages("episcan")
```

## Sample implementation

```{r, echo = TRUE, results='hide'}
# load package
library(episcan)
```

### Small dataset

First, we generate a small genotype dataset (`geno`) with 100 subjects and 100 variables (e.g., SNPs), as well as a case-control phenotype (`p`).

```{r, echo = TRUE, results='markup'}
set.seed(321)
# 100 x 100 genotype matrix coded 0/1/2
geno <- matrix(sample(0:2, size = 10000, replace = TRUE, prob = c(0.5, 0.3, 0.2)),
               ncol = 100)
dimnames(geno) <- list(row = paste0("IND", 1:nrow(geno)),
                       col = paste0("rs", 1:ncol(geno)))
# 60 controls followed by 40 cases
p <- c(rep(0, 60), rep(1, 40))
geno[1:5, 1:8]
```

To use **episcan**, simply start with the main function `episcan`. There are three mandatory parameters to be set by the user: genotype data, phenotype data and phenotype category ("case-control" or "quantitative"). Since the data simulated above is not normalized yet, we need to set `scale = TRUE`. When an integer is passed to the parameter `chunksize`, the genotype data is split into chunks of that size during the calculation. For the example above (using `geno`), `chunksize = 20` means that each chunk has 20 variables (variants) and that there are 5 chunks in total.
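
The chunk arithmetic described above can be sketched in base R. This is only an illustration of how 100 variant columns fall into groups of 20; the names `n_variants`, `chunk_id` and `chunks` are hypothetical and are not taken from the internals of **episcan**.

```{r, echo = TRUE, results = 'markup'}
# Illustrative sketch only: not episcan's internal implementation.
n_variants <- 100   # number of columns in `geno`
chunksize  <- 20    # same value as passed to episcan()

# Assign each column index to a chunk number (1, 1, ..., 2, 2, ...)
chunk_id <- ceiling(seq_len(n_variants) / chunksize)

# Group the column indices by chunk number
chunks <- split(seq_len(n_variants), chunk_id)

length(chunks)       # number of chunks
range(chunks[[1]])   # first and last column index in the first chunk
```

If `chunksize` does not divide the number of variants evenly, this grouping simply leaves a shorter final chunk.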

In most cases, the result of an epistasis analysis is huge because of the large number of variable (variant) combinations. To reduce the size of the result file, a practical option is to set a threshold for the statistical test (`zpthres`) as an output cut-off.

```{r, echo = TRUE, results='markup'}
episcan(geno1 = geno,
        pheno = p,
        phetype = "case-control",
        outfile = "episcan_1geno_cc",
        suffix = ".txt",
        zpthres = 0.9,
        chunksize = 20,
        scale = TRUE)
```

The result of `episcan` is stored in the specified file ("episcan_1geno_cc.txt"). Let's take a look:

```{r, echo = TRUE, results='markup'}
result <- read.table("episcan_1geno_cc.txt",
                     header = TRUE,
                     stringsAsFactors = FALSE)
head(result)
```

### Big dataset

In a genome-wide epistasis study, it is usual to have millions of variables (variants). Analyzing data of this size is very time-consuming. The common approach is to parallelize the task and run the subtasks with High Performance Computing (HPC) techniques, e.g., on a cluster. By splitting the genotype data per chromosome, the huge epistasis analysis task can be divided into relatively small tasks. If the initial task covers only 22 chromosomes, splitting and considering all combinations of chromosomes gives 253 ((1+22)\*22/2) subtasks: 231 between-chromosome pairs plus 22 within-chromosome analyses. `episcan` supports two genotype inputs, passed through the parameters `geno1` and `geno2`.

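
The subtask count for the per-chromosome split can be checked by enumerating the chromosome pairs. The sketch below uses only base R; the data frame `pairs` and its columns `chr1` and `chr2` are hypothetical names chosen for illustration, and how each row is dispatched to a cluster is left open.

```{r, echo = TRUE, results = 'markup'}
# Illustrative sketch only: enumerate unordered chromosome pairs,
# including each chromosome paired with itself.
chromosomes <- 1:22

# All ordered combinations, then keep one orientation of each pair
pairs <- expand.grid(chr1 = chromosomes, chr2 = chromosomes)
pairs <- pairs[pairs$chr1 <= pairs$chr2, ]

nrow(pairs)                       # total number of subtasks
sum(pairs$chr1 == pairs$chr2)     # within-chromosome subtasks
sum(pairs$chr1 <  pairs$chr2)     # between-chromosome subtasks
```

Each row describes one subtask. A between-chromosome row would map onto one `episcan` call with the two chromosomes supplied as `geno1` and `geno2`, as in the two-input example in this section. Running a within-chromosome row with `geno1` alone is an assumption based on the single-input example in the small-dataset section.
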
```{r, echo = TRUE, results='markup'}
# simulate data
geno1 <- matrix(sample(0:2, size = 10000, replace = TRUE, prob = c(0.5, 0.3, 0.2)),
                ncol = 100)
geno2 <- matrix(sample(0:2, size = 20000, replace = TRUE, prob = c(0.4, 0.3, 0.3)),
                ncol = 200)
dimnames(geno1) <- list(row = paste0("IND", 1:nrow(geno1)),
                        col = paste0("rs", 1:ncol(geno1)))
dimnames(geno2) <- list(row = paste0("IND", 1:nrow(geno2)),
                        col = paste0("exm", 1:ncol(geno2)))
p <- rnorm(100)

# scan epistasis
episcan(geno1 = geno1,
        geno2 = geno2,
        pheno = p,
        phetype = "quantitative",
        outfile = "episcan_2geno_quant",
        suffix = ".txt",
        zpthres = 0.6,
        chunksize = 50,
        scale = TRUE)
```

## References