episcan provides efficient methods to scan for pairwise epistasis in both case-control study and quantitative studies. It is suitable for genome-wide interaction studies (GWIS) by splitting the computation into manageable chunks. The epistasis methods used by episcan are adjusted from two published papers (Kam-Thong, Czamara, et al. 2011; Kam-Thong, Pütz, et al. 2011).
First, we generate a small genotype dataset (geno
) with
sample size of 100 subjects and 100 variables (e.g., SNPs) as well as a
case-control phenotype (p
).
set.seed(321)
geno <- matrix(sample(0:2, size = 10000,
replace = TRUE, prob = c(0.5, 0.3, 0.2)), ncol = 100)
dimnames(geno) <- list(row = paste0("IND", 1:nrow(geno)),
col = paste0("rs", 1:ncol(geno)))
p <- c(rep(0, 60), rep(1, 40))
geno[1:5, 1:8]
#> col
#> row rs1 rs2 rs3 rs4 rs5 rs6 rs7 rs8
#> IND1 2 0 0 0 0 1 0 2
#> IND2 2 0 0 0 1 0 1 2
#> IND3 0 0 0 2 2 1 0 0
#> IND4 0 2 1 0 0 0 1 1
#> IND5 0 0 2 0 1 0 0 0
To use episcan, simply start with the main function
episcan
. There are three mandatory parameters to be set by
the user: genotype data, phenotype data and phenotype category
(“case-control” or “quantitative”). Since the data simulated above is
not normalized yet, we need to set scale = TRUE
. By passing
an integer number to parameter chunksize
, the genotype data
will be split into several chunks of that size during the calculation.
For the example above (using geno
),
chunksize = 20
means each chunk has 20 variables(variants)
and the total number of chunks is 5. Moreover, in most cases, the result
of epistasis analysis is huge due to the large number of the variable
(variants) combinations. To reduce the size of the result file, setting
a threshold of the statistical test (zpthres
) to have an
output cut-off is a practical option.
episcan(geno1 = geno,
pheno = p,
phetype = "case-control",
outfile = "episcan_1geno_cc",
suffix = ".txt",
zpthres = 0.9,
chunksize = 20,
scale = TRUE)
#> p-value threshold of Z test for output: 0.9
#> set chunksize: 20
#> [1] "episcan starts:"
#> [1] "Fri Dec 6 06:48:00 2024"
#> [1] "1 chunk loop: Fri Dec 6 06:48:00 2024"
#> [1] "2 chunk loop: Fri Dec 6 06:48:00 2024"
#> [1] "3 chunk loop: Fri Dec 6 06:48:00 2024"
#> [1] "4 chunk loop: Fri Dec 6 06:48:00 2024"
#> [1] "5 chunk loop: Fri Dec 6 06:48:00 2024"
#> [1] "epiblaster calculation is over!"
#> [1] "Fri Dec 6 06:48:00 2024"
The result of episcan
is stored in the specified file
(“episcan_1geno_cc.txt”). Let’s take a look:
result <- read.table("episcan_1geno_cc.txt",
header = TRUE,
stringsAsFactors = FALSE)
head(result)
#> SNP1 SNP2 Zscore ZP
#> 1 rs1 rs2 1.9376903 0.05266102
#> 2 rs1 rs3 1.3806888 0.16737465
#> 3 rs2 rs3 -1.6262266 0.10390145
#> 4 rs1 rs4 0.4553033 0.64889105
#> 5 rs2 rs4 -0.5337472 0.59351648
#> 6 rs3 rs4 0.6192260 0.53576750
In a genome-wide level epistasis study, it is usual to have millions
of variables (variants). Analyzing such big data is super
time-consuming. The common appoach is to parallelize the task and run
the subtasks with High Performance Computing (HPC) techniques, e.g., on
a cluster. By splitting genotype data per chromosome, the huge epistasis
analysis task can be divided into relatively small tasks. If only 22
chromosomes exist in the initial task, there are 253 ((1+22)*22/2)
subtasks after splitting and considering all the combinations of the
chromosomes. episcan
supports two inputs of genotype data
by simply passing information to “geno1” and “geno2”.
# simulate data
geno1 <- matrix(sample(0:2, size = 10000,
replace = TRUE, prob = c(0.5, 0.3, 0.2)), ncol = 100)
geno2 <- matrix(sample(0:2, size = 20000,
replace = TRUE, prob = c(0.4, 0.3, 0.3)), ncol = 200)
dimnames(geno1) <- list(row = paste0("IND", 1:nrow(geno1)),
col = paste0("rs", 1:ncol(geno1)))
dimnames(geno2) <- list(row = paste0("IND", 1:nrow(geno2)),
col = paste0("exm", 1:ncol(geno2)))
p <- rnorm(100)
# scan epistasis
episcan(geno1 = geno1,
geno2 = geno2,
pheno = p,
phetype = "quantitative",
outfile = "episcan_2geno_quant",
suffix = ".txt",
zpthres = 0.6,
chunksize = 50,
scale = TRUE)
#> p-value threshold of Z test for output: 0.6
#> set chunksize: 50
#> [1] "episcan starts:"
#> [1] "Fri Dec 6 06:48:00 2024"
#> [1] "1 chunk loop: Fri Dec 6 06:48:00 2024"
#> [1] "2 chunk loop: Fri Dec 6 06:48:00 2024"
#> [1] "epiHSIC calculation is over!"
#> [1] "Fri Dec 6 06:48:00 2024"