Title: | A Set-Based Association Test using GWAS Summary Statistics |
---|---|
Description: | The goal of 'snpsettest' is to provide simple tools that perform set-based association tests (e.g., gene-based association tests) using GWAS (genome-wide association study) summary statistics. A set-based association test in this package is based on the statistical model described in VEGAS (versatile gene-based association study), which combines the effects of a set of SNPs accounting for linkage disequilibrium between markers. This package uses a different approach from the original VEGAS implementation to compute set-level p values more efficiently, as described in <https://github.com/HimesGroup/snpsettest/wiki/Statistical-test-in-snpsettest>. |
Authors: | Jaehyun Joo [aut, cre], Blanca Himes [aut] |
Maintainer: | Jaehyun Joo <[email protected]> |
License: | GPL (>= 3) |
Version: | 0.1.2 |
Built: | 2024-12-04 07:16:54 UTC |
Source: | CRAN |
An example file of GWAS summary statistics
exGWAS
exGWAS
Data frame with columns
SNP ID.
chromosome.
base-pair position.
allele codes.
p value.
head(exGWAS)
head(exGWAS)
Human gene information was extracted from the GENCODE release 19. This data only contains 'KNOWN' status genes with the following gene biotypes: protein-coding, Immunoglobulin (Ig) variable chain and T-cell receptor (TcR) genes.
gene.curated.GRCh37
gene.curated.GRCh37
Data frame with columns
SNP ID.
chromosome.
genomic start location (1-based).
genomic end location.
genomic strand.
gene symbols mapped to the GENCODE genes.
gene biotypes in the GENCODE genes.
https://www.gencodegenes.org/human/release_19.html
head(gene.curated.GRCh37)
head(gene.curated.GRCh37)
Human gene information was extracted from the GENCODE release 37. This data only contains genes with the following gene biotypes: protein-coding, Immunoglobulin (Ig) variable chain and T-cell receptor (TcR) genes.
gene.curated.GRCh38
gene.curated.GRCh38
Data frame with columns
SNP ID.
chromosome.
genomic start location (1-based).
genomic end location.
genomic strand.
gene symbols mapped to the GENCODE genes.
gene biotypes in the GENCODE genes.
https://www.gencodegenes.org/human/release_37.html
head(gene.curated.GRCh38)
head(gene.curated.GRCh38)
Finds an intersection of variants between GWAS summary and reference data.
harmonize_sumstats( sumstats, x, match_by_id = TRUE, check_strand_flip = FALSE, return_indice = FALSE )
harmonize_sumstats( sumstats, x, match_by_id = TRUE, check_strand_flip = FALSE, return_indice = FALSE )
sumstats |
A data frame with two columns: "id" and "pvalue".
If
|
x |
A |
match_by_id |
If |
check_strand_flip |
Only applies when |
return_indice |
Only applied when |
Pre-processing of GWAS summary data is required because the sets of variants available in a particular GWAS might be poorly matched to the variants in reference data. SNP matching can be performed either 1) by SNP ID or 2) by chromosome code, base-pair position, and allele codes, while taking into account possible strand flips and reference allele swap. For matched entries, the SNP IDs in GWAS summary data are replaced with the ones in the reference data.
A data frame with columns: "id", "chr", "pos", "A1", "A2" and
"pvalue". If return_indice = TRUE
, the data frame includes additional
columns key_
, swapped_
, and flipped_
. key_
is "chr_pos_A1_A2" in
sumstat
(the original input before harmonization). swapped_
contains a
logical vector indicating reference allele swap. flipped_
contains a
logical vector indicating strand flip.
## GWAS summary statistics head(exGWAS) ## Load reference genotype data bfile <- system.file("extdata", "example.bed", package = "snpsettest") x <- read_reference_bed(path = bfile) ## Harmonize by SNP IDs hsumstats1 <- harmonize_sumstats(exGWAS, x) ## Harmonize by genomic position and allele codes ## Reference allele swap will be taken into account hsumstats2 <- harmonize_sumstats(exGWAS, x, match_by_id = FALSE) ## Check matching entries by flipping allele codes ## Ambiguous SNPs will be excluded from harmonization hsumstats3 <- harmonize_sumstats(exGWAS, x, match_by_id = FALSE, check_strand_flip = TRUE)
## GWAS summary statistics head(exGWAS) ## Load reference genotype data bfile <- system.file("extdata", "example.bed", package = "snpsettest") x <- read_reference_bed(path = bfile) ## Harmonize by SNP IDs hsumstats1 <- harmonize_sumstats(exGWAS, x) ## Harmonize by genomic position and allele codes ## Reference allele swap will be taken into account hsumstats2 <- harmonize_sumstats(exGWAS, x, match_by_id = FALSE) ## Check matching entries by flipping allele codes ## Ambiguous SNPs will be excluded from harmonization hsumstats3 <- harmonize_sumstats(exGWAS, x, match_by_id = FALSE, check_strand_flip = TRUE)
Annotate SNPs onto their neighboring genes (or arbitrary genomic regions) to perform set-based association tests.
map_snp_to_gene( info_snp, info_gene, extend_start = 20L, extend_end = 20L, only_sets = FALSE )
map_snp_to_gene( info_snp, info_gene, extend_start = 20L, extend_end = 20L, only_sets = FALSE )
info_snp |
A data frame with columns: "id", "chr", and "pos".
|
info_gene |
A data frame with columns: "gene.id", "chr", "start", and "end".
If a gene has multiple intervals, SNPs mapped to any of them will be merged into a single set. Please assign unique IDs if you don't want this behavior. |
extend_start |
A single non-negative integer, allowing for a certain kb window before the gene to be included. Default is 20 (= 20kb). |
extend_end |
A single non-negative integer, allowing for a certain kb window after the gene to be included. Default is 20 (= 20kb). |
only_sets |
If |
A nested list containing following components:
sets: a named list where each index represents a separate set of SNPs
map: a data frame containing SNP mapping information
## GWAS summary statistics head(exGWAS) ## Gene information data head(gene.curated.GRCh37) ## Map SNPs to genes snp_sets <- map_snp_to_gene(exGWAS, gene.curated.GRCh37) ## Better to use harmonized GWAS data for gene mapping bfile <- system.file("extdata", "example.bed", package = "snpsettest") x <- read_reference_bed(path = bfile) hsumstats <- harmonize_sumstats(exGWAS, x) snp_sets <- map_snp_to_gene(hsumstats, gene.curated.GRCh37)
## GWAS summary statistics head(exGWAS) ## Gene information data head(gene.curated.GRCh37) ## Map SNPs to genes snp_sets <- map_snp_to_gene(exGWAS, gene.curated.GRCh37) ## Better to use harmonized GWAS data for gene mapping bfile <- system.file("extdata", "example.bed", package = "snpsettest") x <- read_reference_bed(path = bfile) hsumstats <- harmonize_sumstats(exGWAS, x) snp_sets <- map_snp_to_gene(hsumstats, gene.curated.GRCh37)
Create a bed.matrix
object from a .bed file. The function expects
.fam and .bim files under the same directory. See gaston::read.bed.matrix
for more details.
read_reference_bed(path, ...)
read_reference_bed(path, ...)
path |
A path to the .bed file |
... |
Further arguments used in gaston::read.bed.matrix |
A gaston::bed.matrix object with a Z-standardized genotype matrix
## Get a path to the example .bed file bfile <- system.file("extdata", "example.bed", package = "snpsettest") ## Read a .bed file x <- read_reference_bed(path = bfile)
## Get a path to the example .bed file bfile <- system.file("extdata", "example.bed", package = "snpsettest") ## Read a .bed file x <- read_reference_bed(path = bfile)
Perform set-based association tests between multiple sets of SNPs and a phenotype using GWAS summary statistics. If the function encounters missing genotypes in the reference data, they will be imputed with genotype means.
snpset_test(hsumstats, x, snp_sets, method = c("saddle", "davies"))
snpset_test(hsumstats, x, snp_sets, method = c("saddle", "davies"))
hsumstats |
A data frame processed by |
x |
A |
snp_sets |
A named list where each index represents a separate set of SNPs. |
method |
A method to compute a set-level p value. "saddle" uses Kuonen's saddlepoint approximation (1999) and "davies" uses the algorithm of Davies (1980). When "davies" method failed to produce a meaningful result, "saddle" method is used as a fallback. Default is "saddle". |
A data.table with columns: "set.id", "pvalue", "n.snp", "top.snp.id" and "top.snp.pvalue"
set.id = a name of SNP set
tstat = a test statistic
pvalue = a set-level p value
n.snp = the number of SNPs used in a test
top.snp.id = SNP ID with the smallest p-value within a set of SNPs
top.snp.pvalue = The smallest p-value within a set of SNPs
Kuonen, D. Saddlepoint Approximations for Distributions of Quadratic Forms in Normal Variables. Biometrika 86, 929–935 (1999).
Davies, R. B. Algorithm AS 155: The Distribution of a Linear Combination of Chi-Square Random Variables. Journal of the Royal Statistical Society. Series C (Applied Statistics) 29, 323–333 (1980).
## GWAS summary statistics head(exGWAS) ## Load reference genotype data bfile <- system.file("extdata", "example.bed", package = "snpsettest") x <- read_reference_bed(path = bfile) ## GWAS harmonization with reference data hsumstats <- harmonize_sumstats(exGWAS, x) ## Perform a set-based test with an arbitrary SNP set snpset_test(hsumstats, x, list(test = c("SNP_880", "SNP_1533", "SNP_4189"))) ## Gene information data head(gene.curated.GRCh37) ## Map SNPs to genes snp_sets <- map_snp_to_gene(hsumstats, gene.curated.GRCh37) ## Perform gene-based association tests out <- snpset_test(hsumstats, x, snp_sets$sets)
## GWAS summary statistics head(exGWAS) ## Load reference genotype data bfile <- system.file("extdata", "example.bed", package = "snpsettest") x <- read_reference_bed(path = bfile) ## GWAS harmonization with reference data hsumstats <- harmonize_sumstats(exGWAS, x) ## Perform a set-based test with an arbitrary SNP set snpset_test(hsumstats, x, list(test = c("SNP_880", "SNP_1533", "SNP_4189"))) ## Gene information data head(gene.curated.GRCh37) ## Map SNPs to genes snp_sets <- map_snp_to_gene(hsumstats, gene.curated.GRCh37) ## Perform gene-based association tests out <- snpset_test(hsumstats, x, snp_sets$sets)