Title: | Scan Pairwise Epistasis |
---|---|
Description: | Searching genomic interactions with linear/logistic regression in a high-dimensional dataset is a time-consuming task. This package provides some efficient ways to scan epistasis in genome-wide interaction studies (GWIS). Both case-control status (binary outcome) and quantitative phenotype (continuous outcome) are supported (the main references: 1. Kam-Thong, T., D. Czamara, K. Tsuda, K. Borgwardt, C. M. Lewis, A. Erhardt-Lehmann, B. Hemmer, et al. (2011). <doi:10.1038/ejhg.2010.196>. 2. Kam-Thong, T., B. Pütz, N. Karbalai, B. Müller-Myhsok, and K. Borgwardt. (2011). <doi:10.1093/bioinformatics/btr218>.) |
Authors: | Beibei Jiang <[email protected]> and Benno Pütz <[email protected]> |
Maintainer: | Beibei Jiang <[email protected]> |
License: | GPL (>= 2) |
Version: | 0.0.1 |
Built: | 2024-12-06 06:48:02 UTC |
Source: | CRAN |
Check the chunk size whether it is over the given number of variables(vaiants) in genotype data. If yes, reset the chunk size equal to the number of variables(vaiants).
checkchunksize(c, m, n = NULL, ...)
checkchunksize(c, m, n = NULL, ...)
c |
an integer indicating the set chunk size. |
m |
an integer indicating the number of variables(vaiants) in |
n |
an integer indicating the number of variables(vaiants) in |
... |
not used. |
an integer indicating the chunk size
set.seed(123) geno1 <- matrix(sample(0:2, size = 1000, replace = TRUE, prob = c(0.5, 0.3, 0.2)), ncol = 10) geno2 <- matrix(sample(0:2, size = 2000, replace = TRUE, prob = c(0.4, 0.3, 0.3)), ncol = 20) # if chunk size is smaller, there is no problem chunksize <- 10 checkchunksize(chunksize, ncol(geno1)) # if chunk size is bigger than the number of columns in genotype input, # set chunk size equal to the number of columns in genotype input chuksize <- 12 checkchunksize(chunksize, ncol(geno1)) # if chunk size is bigger than the number of columns of geno1 and geno2, # set chunk size equal to the minima nunumber of columns of geno1 and geno2 chunksize <- 50 checkchunksize(chunksize, ncol(geno1), ncol(geno2))
set.seed(123) geno1 <- matrix(sample(0:2, size = 1000, replace = TRUE, prob = c(0.5, 0.3, 0.2)), ncol = 10) geno2 <- matrix(sample(0:2, size = 2000, replace = TRUE, prob = c(0.4, 0.3, 0.3)), ncol = 20) # if chunk size is smaller, there is no problem chunksize <- 10 checkchunksize(chunksize, ncol(geno1)) # if chunk size is bigger than the number of columns in genotype input, # set chunk size equal to the number of columns in genotype input chuksize <- 12 checkchunksize(chunksize, ncol(geno1)) # if chunk size is bigger than the number of columns of geno1 and geno2, # set chunk size equal to the minima nunumber of columns of geno1 and geno2 chunksize <- 50 checkchunksize(chunksize, ncol(geno1), ncol(geno2))
test with one genotype inputCalculate the difference of correlation coefficents between cases and controls,
conduct test for the differences (values) and choose variant pairs with the significance below the given threshold for output.
epiblaster1geno(geno, pheno, chunk = 1000, zpthres = 1e-05, outfile = "NONE", suffix = ".txt", ...)
epiblaster1geno(geno, pheno, chunk = 1000, zpthres = 1e-05, outfile = "NONE", suffix = ".txt", ...)
geno |
is the normalized genotype data. It can be a matrix or a dataframe, or a big.matrix object (from bigmemory. The columns contain the information of variables and the rows contain the information of samples. |
pheno |
a vector containing the binary phenotype information (case/control). The values are either 0 (control) or 1 (case). |
chunk |
is the number of variants in each chunk. Default: 1000. |
zpthres |
is the significance threshold to select variant pairs for output. Default is 1e-6. |
outfile |
is the base of out filename. Default: 'NONE'. |
suffix |
is the suffix of out filename. Default: '.txt'. |
... |
not used. |
null
# simulate some data set.seed(123) geno1 <- matrix(sample(0:2, size = 1000, replace = TRUE, prob = c(0.5, 0.3, 0.2)), ncol = 10) dimnames(geno1) <- list(row = paste0("IND", 1:nrow(geno1)), col = paste0("rs", 1:ncol(geno1))) p1 <- c(rep(0, 60), rep(1, 40)) # normalized data geno1 <- scale(geno1) # one genotype with case-control phenotype epiblaster1geno(geno = geno1, pheno = p1, outfile = "episcan_1geno_cc", suffix = ".txt", zpthres = 0.9, chunk = 10) # take a look at the result res <- read.table("episcan_1geno_cc.txt", header = TRUE, stringsAsFactors = FALSE) head(res)
# simulate some data set.seed(123) geno1 <- matrix(sample(0:2, size = 1000, replace = TRUE, prob = c(0.5, 0.3, 0.2)), ncol = 10) dimnames(geno1) <- list(row = paste0("IND", 1:nrow(geno1)), col = paste0("rs", 1:ncol(geno1))) p1 <- c(rep(0, 60), rep(1, 40)) # normalized data geno1 <- scale(geno1) # one genotype with case-control phenotype epiblaster1geno(geno = geno1, pheno = p1, outfile = "episcan_1geno_cc", suffix = ".txt", zpthres = 0.9, chunk = 10) # take a look at the result res <- read.table("episcan_1geno_cc.txt", header = TRUE, stringsAsFactors = FALSE) head(res)
Calculate the difference of correlation coeficents between cases and controls, conduct Z test for the differences (values) and choose variant pairs with the significance below the given threshold for output.
epiblaster2genos(geno1, geno2, pheno, chunk = 1000, zpthres = 1e-05, outfile = "NONE", suffix = ".txt", ...)
epiblaster2genos(geno1, geno2, pheno, chunk = 1000, zpthres = 1e-05, outfile = "NONE", suffix = ".txt", ...)
geno1 |
is the first normalized genotype data. It can be a matrix or a dataframe, or a big.matrix object from bigmemory. The columns contain the information of variables and the rows contain the information of samples. |
geno2 |
is the second normalized genotype data. It can be a matrix or a dataframe, or a big.matrix object from bigmemory. The columns contain the information of variables and the rows contain the information of samples. |
pheno |
a vector containing the binary phenotype information (case/control). The values are either 0 (control) or 1 (case). |
chunk |
is the number of variants in each chunk. |
zpthres |
is the significance threshold to select variant pairs for output. Default is 1e-6. |
outfile |
is the prefix of out filename. |
suffix |
is the suffix of out filename. |
... |
not used. |
null
# simulate some data set.seed(123) geno1 <- matrix(sample(0:2, size = 1000, replace = TRUE, prob = c(0.5, 0.3, 0.2)), ncol = 10) geno2 <- matrix(sample(0:2, size = 2000, replace = TRUE, prob = c(0.4, 0.3, 0.3)), ncol = 20) dimnames(geno1) <- list(row = paste0("IND", 1:nrow(geno1)), col = paste0("rs", 1:ncol(geno1))) dimnames(geno2) <- list(row = paste0("IND", 1:nrow(geno2)), col = paste0("exm", 1:ncol(geno2))) p1 <- c(rep(0, 60), rep(1, 40)) # normalized data geno1 <- scale(geno1) geno2 <- scale(geno2) # two genotypes with quantitative phenotype epiblaster2genos(geno1 = geno1, geno2 = geno2, pheno = p1, outfile = "episcan_2geno_cc", suffix = ".txt", zpthres = 0.9, chunk = 10) # take a look at the result res <- read.table("episcan_2geno_cc.txt", header = TRUE, stringsAsFactors = FALSE) head(res)
# simulate some data set.seed(123) geno1 <- matrix(sample(0:2, size = 1000, replace = TRUE, prob = c(0.5, 0.3, 0.2)), ncol = 10) geno2 <- matrix(sample(0:2, size = 2000, replace = TRUE, prob = c(0.4, 0.3, 0.3)), ncol = 20) dimnames(geno1) <- list(row = paste0("IND", 1:nrow(geno1)), col = paste0("rs", 1:ncol(geno1))) dimnames(geno2) <- list(row = paste0("IND", 1:nrow(geno2)), col = paste0("exm", 1:ncol(geno2))) p1 <- c(rep(0, 60), rep(1, 40)) # normalized data geno1 <- scale(geno1) geno2 <- scale(geno2) # two genotypes with quantitative phenotype epiblaster2genos(geno1 = geno1, geno2 = geno2, pheno = p1, outfile = "episcan_2geno_cc", suffix = ".txt", zpthres = 0.9, chunk = 10) # take a look at the result res <- read.table("episcan_2geno_cc.txt", header = TRUE, stringsAsFactors = FALSE) head(res)
Calculate HSIC values
epiHSIC(A = NULL, B = NULL, P = NULL, ...)
epiHSIC(A = NULL, B = NULL, P = NULL, ...)
A |
is one matrix. |
B |
is one matrix. |
P |
is "phenoype", a vector. |
... |
not used. |
a matrix
Beibei Jiang [email protected]
# simulate some data set.seed(123) geno1 <- matrix(sample(0:2, size = 1000, replace = TRUE, prob = c(0.5, 0.3, 0.2)), ncol = 10) geno2 <- matrix(sample(0:2, size = 2000, replace = TRUE, prob = c(0.4, 0.3, 0.3)), ncol = 20) dimnames(geno1) <- list(row = paste0("IND", 1:nrow(geno1)), col = paste0("rs", 1:ncol(geno1))) dimnames(geno2) <- list(row = paste0("IND", 1:nrow(geno2)), col = paste0("exm", 1:ncol(geno2))) epiHSIC(A = scale(geno1), B = scale(geno2), P = rnorm(100))
# simulate some data set.seed(123) geno1 <- matrix(sample(0:2, size = 1000, replace = TRUE, prob = c(0.5, 0.3, 0.2)), ncol = 10) geno2 <- matrix(sample(0:2, size = 2000, replace = TRUE, prob = c(0.4, 0.3, 0.3)), ncol = 20) dimnames(geno1) <- list(row = paste0("IND", 1:nrow(geno1)), col = paste0("rs", 1:ncol(geno1))) dimnames(geno2) <- list(row = paste0("IND", 1:nrow(geno2)), col = paste0("exm", 1:ncol(geno2))) epiHSIC(A = scale(geno1), B = scale(geno2), P = rnorm(100))
Calculate the significance of epistasis according the definition of HSIC, conduct test for HSIC values and
choose variant pairs with the significance below the given threshold for output.
epiHSIC1geno(geno = NULL, pheno, chunk = 1000, zpthres = 1e-05, outfile = "NONE", suffix = ".txt", ...)
epiHSIC1geno(geno = NULL, pheno, chunk = 1000, zpthres = 1e-05, outfile = "NONE", suffix = ".txt", ...)
geno |
is the normalized genotype data. It can be a matrix or a dataframe, or a big.matrix object from bigmemory. The columns contain the information of variables and the rows contain the information of samples. |
pheno |
is a vector containing the normalized phenotype information. |
chunk |
is the number of variants in each chunk. |
zpthres |
is is the significance threshold to select variant pairs for output. Default is 1e-6. |
outfile |
is the basename of out filename. |
suffix |
is the suffix of out filename. |
... |
not used. |
null
Beibei Jiang [email protected]
# simulate some data set.seed(123) geno1 <- matrix(sample(0:2, size = 1000, replace = TRUE, prob = c(0.5, 0.3, 0.2)), ncol = 10) dimnames(geno1) <- list(row = paste0("IND", 1:nrow(geno1)), col = paste0("rs", 1:ncol(geno1))) p2 <- rnorm(100, mean = 5, sd = 10) # normalized data geno1 <- scale(geno1) p2 <- as.vector(unlist(scale(p2))) # one genotypes with quantitative phenotype epiHSIC1geno(geno = geno1, pheno = p2, outfile = "episcan_1geno_quant", suffix = ".txt", zpthres = 0.9, chunk = 10) # take a look at the result res <- read.table("episcan_1geno_quant.txt", header = TRUE, stringsAsFactors = FALSE) head(res)
# simulate some data set.seed(123) geno1 <- matrix(sample(0:2, size = 1000, replace = TRUE, prob = c(0.5, 0.3, 0.2)), ncol = 10) dimnames(geno1) <- list(row = paste0("IND", 1:nrow(geno1)), col = paste0("rs", 1:ncol(geno1))) p2 <- rnorm(100, mean = 5, sd = 10) # normalized data geno1 <- scale(geno1) p2 <- as.vector(unlist(scale(p2))) # one genotypes with quantitative phenotype epiHSIC1geno(geno = geno1, pheno = p2, outfile = "episcan_1geno_quant", suffix = ".txt", zpthres = 0.9, chunk = 10) # take a look at the result res <- read.table("episcan_1geno_quant.txt", header = TRUE, stringsAsFactors = FALSE) head(res)
Calculate the significance of epistasis according the definition of HSIC, conduct Z test for HSIC values and choose variant pairs with the significance below the given threshold for output.
epiHSIC2genos(geno1 = NULL, geno2 = NULL, pheno = NULL, chunk = 1000, zpthres = 1e-05, outfile = "NONE", suffix = ".txt", ...)
epiHSIC2genos(geno1 = NULL, geno2 = NULL, pheno = NULL, chunk = 1000, zpthres = 1e-05, outfile = "NONE", suffix = ".txt", ...)
geno1 |
is the first normalized genotype data. It can be a matrix or a dataframe, or a big.matrix object from bigmemory. The columns contain the information of variables and the rows contain the information of samples. |
geno2 |
is the second normalized genotype data. It can be a matrix or a dataframe, or a big.matrix object from bigmemory. The columns contain the information of variables and the rows contain the information of samples. |
pheno |
is a vector containing the normalized phenotype information. |
chunk |
is the number of variants in each chunk. |
zpthres |
is the significance threshold for cut-off output of the variant pairs. |
outfile |
is the basename of out filename. |
suffix |
is the suffix of out filename. |
... |
not used |
null
# simulate some data set.seed(123) n1 <- 10; n2 <- 15; rows <- 10 geno1 <- matrix(sample(0:2, size = n1*rows, replace = TRUE, prob = c(0.5, 0.3, 0.2)), ncol = n1) geno2 <- matrix(sample(0:2, size = n2*rows, replace = TRUE, prob = c(0.4, 0.3, 0.3)), ncol = n2) dimnames(geno1) <- list(row = paste0("IND", 1:nrow(geno1)), col = paste0("rs", 1:ncol(geno1))) dimnames(geno2) <- list(row = paste0("IND", 1:nrow(geno2)), col = paste0("exm", 1:ncol(geno2))) p2 <- rnorm(rows, mean = 5, sd = 10) # normalized data geno1 <- scale(geno1) geno2 <- scale(geno2) p2 <- as.vector(unlist(scale(p2))) # two genotypes with quantitative phenotype epiHSIC2genos(geno1 = geno1, geno2 = geno2, pheno = p2, outfile = "episcan_2geno_quant", suffix = ".txt", zpthres = 0.9, chunk = 10) # take a look at the result res <- read.table("episcan_2geno_quant.txt", header = TRUE, stringsAsFactors = FALSE) head(res)
# simulate some data set.seed(123) n1 <- 10; n2 <- 15; rows <- 10 geno1 <- matrix(sample(0:2, size = n1*rows, replace = TRUE, prob = c(0.5, 0.3, 0.2)), ncol = n1) geno2 <- matrix(sample(0:2, size = n2*rows, replace = TRUE, prob = c(0.4, 0.3, 0.3)), ncol = n2) dimnames(geno1) <- list(row = paste0("IND", 1:nrow(geno1)), col = paste0("rs", 1:ncol(geno1))) dimnames(geno2) <- list(row = paste0("IND", 1:nrow(geno2)), col = paste0("exm", 1:ncol(geno2))) p2 <- rnorm(rows, mean = 5, sd = 10) # normalized data geno1 <- scale(geno1) geno2 <- scale(geno2) p2 <- as.vector(unlist(scale(p2))) # two genotypes with quantitative phenotype epiHSIC2genos(geno1 = geno1, geno2 = geno2, pheno = p2, outfile = "episcan_2geno_quant", suffix = ".txt", zpthres = 0.9, chunk = 10) # take a look at the result res <- read.table("episcan_2geno_quant.txt", header = TRUE, stringsAsFactors = FALSE) head(res)
Genomic interaction analysis with EPIBLASTER or epistasis-oriented Hilbert–Schmidt Independence Criterion (HSIC).
episcan(geno1, geno2 = NULL, pheno = NULL, phetype = c("case-control", "quantitative"), outfile = "episcan", suffix = ".txt", zpthres = 1e-06, chunksize = 1000, scale = TRUE, ...)
episcan(geno1, geno2 = NULL, pheno = NULL, phetype = c("case-control", "quantitative"), outfile = "episcan", suffix = ".txt", zpthres = 1e-06, chunksize = 1000, scale = TRUE, ...)
geno1 |
a data.frame or matrix of the first genotype data. |
geno2 |
optional. A data.frame or matrix of the second genotype data. |
pheno |
a vector (named or not). If not provided, the value of |
phetype |
character string. Either "case-control" or "quantitative". |
outfile |
output file name. Default is "episcan". |
suffix |
suffix for output file. Default is ".txt". The final result will be stored in |
zpthres |
is the significance threshold to select variant pairs for output. Default is 1e-6. |
chunksize |
the number of variants in each chunk. |
scale |
a logical value to define wheter the input data needs to be normalized. Default is TRUE which means, by default, all the genotype data will be normalized and if the phetype is "quantitative", the phenotype will also be normalized. |
... |
not used. |
null
Beibei Jiang [email protected]
Kam-Thong, T., D. Czamara, K. Tsuda, K. Borgwardt, C. M. Lewis, A. Erhardt-Lehmann, B. Hemmer, et al. 2011. "EPIBLASTER-Fast Exhaustive Two-Locus Epistasis Detection Strategy Using Graphical Processing Units." Journal Article. European Journal of Human Genetics 19 (4): 465–71. https://doi.org/10.1038/ejhg.2010.196.
Kam-Thong, T., B. Pütz, N. Karbalai, B. Müller-Myhsok, and K. Borgwardt. 2011. "Epistasis Detection on Quantitative Phenotypes by Exhaustive Enumeration Using GPUs." Journal Article. Bioinformatics 27 (13): i214–21. https://doi.org/10.1093/bioinformatics/btr218.
# simulate some data set.seed(123) geno1 <- matrix(sample(0:2, size = 1000, replace = TRUE, prob = c(0.5, 0.3, 0.2)), ncol = 10) geno2 <- matrix(sample(0:2, size = 2000, replace = TRUE, prob = c(0.4, 0.3, 0.3)), ncol = 20) dimnames(geno1) <- list(row = paste0("IND", 1:nrow(geno1)), col = paste0("rs", 1:ncol(geno1))) dimnames(geno2) <- list(row = paste0("IND", 1:nrow(geno2)), col = paste0("exm", 1:ncol(geno2))) p1 <- c(rep(0, 60), rep(1, 40)) p2 <- rnorm(100) # one genotype with case-control phenotype episcan(geno1 = geno1, geno2 = NULL, pheno = p1, phetype = "case-control", outfile = "episcan_1geno_cc", suffix = ".txt", zpthres = 0.9, chunksize = 10, scale = TRUE) # take a look at the result res <- read.table("episcan_1geno_cc.txt", header = TRUE, stringsAsFactors = FALSE) head(res) # two genotypes with quantitative phenotype episcan(geno1 = geno1, geno2 = geno2, pheno = p2, phetype = "quantitative", outfile = "episcan_2geno_quant", suffix = ".txt", zpthres = 0.9, chunksize = 10, scale = TRUE)
# simulate some data set.seed(123) geno1 <- matrix(sample(0:2, size = 1000, replace = TRUE, prob = c(0.5, 0.3, 0.2)), ncol = 10) geno2 <- matrix(sample(0:2, size = 2000, replace = TRUE, prob = c(0.4, 0.3, 0.3)), ncol = 20) dimnames(geno1) <- list(row = paste0("IND", 1:nrow(geno1)), col = paste0("rs", 1:ncol(geno1))) dimnames(geno2) <- list(row = paste0("IND", 1:nrow(geno2)), col = paste0("exm", 1:ncol(geno2))) p1 <- c(rep(0, 60), rep(1, 40)) p2 <- rnorm(100) # one genotype with case-control phenotype episcan(geno1 = geno1, geno2 = NULL, pheno = p1, phetype = "case-control", outfile = "episcan_1geno_cc", suffix = ".txt", zpthres = 0.9, chunksize = 10, scale = TRUE) # take a look at the result res <- read.table("episcan_1geno_cc.txt", header = TRUE, stringsAsFactors = FALSE) head(res) # two genotypes with quantitative phenotype episcan(geno1 = geno1, geno2 = geno2, pheno = p2, phetype = "quantitative", outfile = "episcan_2geno_quant", suffix = ".txt", zpthres = 0.9, chunksize = 10, scale = TRUE)
Fast calculation of correlation matrix on CPU (the idea is from WGCNA fast function for pearson correlations)
getcor(A = NULL, B = NULL, method = "pearson", ...)
getcor(A = NULL, B = NULL, method = "pearson", ...)
A |
is a matrix or data.frame. |
B |
is a matrix or data.frame. |
method |
a character string indicating which correlation coefficient is to be computed. Current version only supports "pearson" correlation. |
... |
not used. |
correlation matrix
Beibei Jiang [email protected]
set.seed(123) A <- matrix(rnorm(100, mean = 5, sd = 10), ncol = 10) B <- matrix(rnorm(200, mean = 10, sd = 100), ncol = 20) C <- getcor(A, B)
set.seed(123) A <- matrix(rnorm(100, mean = 5, sd = 10), ncol = 10) B <- matrix(rnorm(200, mean = 10, sd = 100), ncol = 20) C <- getcor(A, B)
For proper use of this function it will return the set of variant indices
corresponding to the idx
-th chunk of size chunk
in n
variants, taking
care of the case that the last chunk might have less than n
elements.
If used with an idx
-value outside the possible chunks (i.e., negative or
larger than ceiling(n/chunk)
) an empty vector (numeric(0)
) is returned.
ithChunk(idx, n, chunk = 1000)
ithChunk(idx, n, chunk = 1000)
idx |
chunk index (which chunk, first is 1) |
n |
total number of variants |
chunk |
desired chunksize |
index range into variants for chunk idx
(see details)
Write out the result of epistasis analysis. Z score matrix is not a symmetric matrix.
WriteSnpPairs(Zmatrix, indexArr, outfile = "NONE", ...)
WriteSnpPairs(Zmatrix, indexArr, outfile = "NONE", ...)
Zmatrix |
is the Z score matrix (non-symmetric matrix). |
indexArr |
is the index of Zmarix whose z score is over the given |
outfile |
is the SNP pairs file for the second stage. |
... |
not used. |
null
Beibei Jiang [email protected]
Write out the result of epistasis analysis. Z score matrix is a symmetric matrix.
WriteSnpPairs_sym(Zmatrix, indexArr, outfile = "NONE", ...)
WriteSnpPairs_sym(Zmatrix, indexArr, outfile = "NONE", ...)
Zmatrix |
is the Z score matrix (symmetric matrix). |
indexArr |
is the index of Zmarix whose z score is over the given |
outfile |
is the SNP pairs file for the second stage. |
... |
not used. |
null
Beibei Jiang [email protected]
Convert Z score to correponding p-values
ZtoP(z.score, ...)
ZtoP(z.score, ...)
z.score |
Z-score(s) (either scalar or vector). |
... |
not used. |
corresponding -value(s).
Due to the IEEE number limits of representing doubles,
any score over 37.51929999999999765 will be converted to a
-value of 1e-309.
Beibei Jiang [email protected] and Benno Pütz [email protected]