Title: | The Knockoff Inference Using Summary Statistics |
---|---|
Description: | Functions for multiple knockoff inference using summary statistics, e.g. Z-scores. The knockoff inference is a general procedure for controlling the false discovery rate (FDR) when performing variable selection. This package provides a procedure which performs knockoff inference without ever constructing individual knockoffs (GhostKnockoff). It additionally supports multiple knockoff inference for improved stability and reproducibility. Moreover, it supports meta-analysis of multiple overlapping studies. |
Authors: | Zihuai He <[email protected]> |
Maintainer: | Zihuai He <[email protected]> |
License: | GPL-3 |
Version: | 0.1.0 |
Built: | 2024-11-15 06:53:30 UTC |
Source: | CRAN |
This function calculates Q-values to select variants associated with the outcome, given the feature importance scores
GhostKnockoff.filter(T_0,T_k)
GhostKnockoff.filter(T_0,T_k)
T_0 |
A p*1 matrix. The feature importance score for the original variants. |
T_k |
A p*M matrix where M is the number of hypothetical knockoffs. The knockoff feature importance scores. |
q |
A vector of length p, where each element is a Q-value. Variants with q <= target FDR will be selected. |
kappa |
A vector of length p. Multiple knockoff statistics kappa. |
tau |
A vector of length p. Multiple knockoff statistics tau. |
# We use genetic data as an example library(GhostKnockoff) # load example vcf file from package "seqminer", this serves as the reference panel vcf.filename = system.file("vcf/1000g.phase1.20110521.CFH.var.anno.vcf.gz", package = "seqminer") ## this is how the actual genotype matrix from package "seqminer" looks like example.G <- t(readVCFToMatrixByRange(vcf.filename, "1:196621007-196716634",annoType='')[[1]]) example.G <- example.G[,apply(example.G,2,sd)!=0] example.G <- example.G[,1:100] # compute correlation among variants cor.G<-matrix(as.numeric(corpcor::cor.shrink(example.G)), nrow=ncol(example.G)) # fit null model fit.prelim<-GhostKnockoff.prelim(cor.G,M=5,method='asdp',max.size=500) ### if only one study is involved Zscore_0<-as.matrix(rnorm(nrow(cor.G))) # hypothetical Z-scores Zscore_0<-Zscore_0+rbinom(nrow(cor.G),size=2,0.1) # set causal n.study<-c(5000) # knockoff scores for each block, this can be run in parallel too GK.stat<-GhostKnockoff.fit(Zscore_0,n.study,fit.prelim,gamma=1,weight.study=NULL) # knockoff filter GK.filter<-GhostKnockoff.filter(GK.stat$T_0,GK.stat$T_k) GK.filter$q ### if multiple studies are involved, for a meta-analysis # compute study correlation Zscore_0<-cbind(rnorm(nrow(cor.G)),rnorm(nrow(cor.G))) # hypothetical Z-scores Zscore_0<-Zscore_0+rbinom(nrow(cor.G),size=2,0.1) # set causal cor.study<-GhostKnockoff.GetCorStudy(Zscore_0,fit.prelim) # note that all steps above can be run in parallel for nonoverlapping blocks of the genome. # Then the overall study correlation can be computed by averaging cor.study of all blocks. # compute optimal weights and study dependency using overall study correlaton n.study<-c(5000,7500) Meta.prelim<-GhostKnockoff.prelim.Meta(cor.study, n.study) gamma<-Meta.prelim$gamma weight.study<-Meta.prelim$w_opt # knockoff scores for each block, this can be run in parallel too GK.stat<-GhostKnockoff.fit(Zscore_0,n.study,fit.prelim,gamma=gamma,weight.study=weight.study) # knockoff filter GK.filter<-GhostKnockoff.filter(GK.stat$T_0,GK.stat$T_k) GK.filter$q
# We use genetic data as an example library(GhostKnockoff) # load example vcf file from package "seqminer", this serves as the reference panel vcf.filename = system.file("vcf/1000g.phase1.20110521.CFH.var.anno.vcf.gz", package = "seqminer") ## this is how the actual genotype matrix from package "seqminer" looks like example.G <- t(readVCFToMatrixByRange(vcf.filename, "1:196621007-196716634",annoType='')[[1]]) example.G <- example.G[,apply(example.G,2,sd)!=0] example.G <- example.G[,1:100] # compute correlation among variants cor.G<-matrix(as.numeric(corpcor::cor.shrink(example.G)), nrow=ncol(example.G)) # fit null model fit.prelim<-GhostKnockoff.prelim(cor.G,M=5,method='asdp',max.size=500) ### if only one study is involved Zscore_0<-as.matrix(rnorm(nrow(cor.G))) # hypothetical Z-scores Zscore_0<-Zscore_0+rbinom(nrow(cor.G),size=2,0.1) # set causal n.study<-c(5000) # knockoff scores for each block, this can be run in parallel too GK.stat<-GhostKnockoff.fit(Zscore_0,n.study,fit.prelim,gamma=1,weight.study=NULL) # knockoff filter GK.filter<-GhostKnockoff.filter(GK.stat$T_0,GK.stat$T_k) GK.filter$q ### if multiple studies are involved, for a meta-analysis # compute study correlation Zscore_0<-cbind(rnorm(nrow(cor.G)),rnorm(nrow(cor.G))) # hypothetical Z-scores Zscore_0<-Zscore_0+rbinom(nrow(cor.G),size=2,0.1) # set causal cor.study<-GhostKnockoff.GetCorStudy(Zscore_0,fit.prelim) # note that all steps above can be run in parallel for nonoverlapping blocks of the genome. # Then the overall study correlation can be computed by averaging cor.study of all blocks. # compute optimal weights and study dependency using overall study correlaton n.study<-c(5000,7500) Meta.prelim<-GhostKnockoff.prelim.Meta(cor.study, n.study) gamma<-Meta.prelim$gamma weight.study<-Meta.prelim$w_opt # knockoff scores for each block, this can be run in parallel too GK.stat<-GhostKnockoff.fit(Zscore_0,n.study,fit.prelim,gamma=gamma,weight.study=weight.study) # knockoff filter GK.filter<-GhostKnockoff.filter(GK.stat$T_0,GK.stat$T_k) GK.filter$q
Generate original and knockoff feature importance scores given original Z-scores from multiple studies.
GhostKnockoff.fit(Zscore_0,n.study,fit.prelim,gamma=1,weight.study=NULL)
GhostKnockoff.fit(Zscore_0,n.study,fit.prelim,gamma=1,weight.study=NULL)
Zscore_0 |
A p*K Z-score matrix, where p is the number of variants and K is the number of studies. Variants not observed in the study should be coded as NA. |
n.study |
A vector of length K, where each element is the study sample size. |
fit.prelim |
The output of function "GhostKnockoff.prelim". |
gamma |
The estimated study dependency. See functon "GhostKnockoff.GetGamma". |
weight.study |
The weights to combine multiple studies. The default is based on sample size. The optimal weights can be estimated using function "GhostKnockoff.prelim.Meta". |
GK.Zscore_0 |
A vector of length p, where each element is weighted combination of original Z-scores |
GK.Zscore_k |
A p*M matrix, where each column is a knockoff copy of GK.Zscore_0. |
T_0 |
Feature importance score of original data: square of GK.Zscore_0. |
T_k |
Feature importance score of knockoff copies: square of GK.Zscore_k. |
# We use genetic data as an example library(GhostKnockoff) # load example vcf file from package "seqminer", this serves as the reference panel vcf.filename = system.file("vcf/1000g.phase1.20110521.CFH.var.anno.vcf.gz", package = "seqminer") ## this is how the actual genotype matrix from package "seqminer" looks like example.G <- t(readVCFToMatrixByRange(vcf.filename, "1:196621007-196716634",annoType='')[[1]]) example.G <- example.G[,apply(example.G,2,sd)!=0] example.G <- example.G[,1:100] # compute correlation among variants cor.G<-matrix(as.numeric(corpcor::cor.shrink(example.G)), nrow=ncol(example.G)) # fit null model fit.prelim<-GhostKnockoff.prelim(cor.G,M=5,method='asdp',max.size=500) ### if only one study is involved Zscore_0<-as.matrix(rnorm(nrow(cor.G))) # hypothetical Z-scores Zscore_0<-Zscore_0+rbinom(nrow(cor.G),size=2,0.1) # set causal n.study<-c(5000) # knockoff scores for each block, this can be run in parallel too GK.stat<-GhostKnockoff.fit(Zscore_0,n.study,fit.prelim,gamma=1,weight.study=NULL) ### if multiple studies are involved, for a meta-analysis # compute study correlation Zscore_0<-cbind(rnorm(nrow(cor.G)),rnorm(nrow(cor.G))) # hypothetical Z-scores Zscore_0<-Zscore_0+rbinom(nrow(cor.G),size=2,0.1) # set causal cor.study<-GhostKnockoff.GetCorStudy(Zscore_0,fit.prelim) # note that all steps above can be run in parallel for nonoverlapping blocks of the genome. # Then the overall study correlation can be computed by averaging cor.study of all blocks. # compute optimal weights and study dependency using overall study correlaton n.study<-c(5000,7500) Meta.prelim<-GhostKnockoff.prelim.Meta(cor.study, n.study) gamma<-Meta.prelim$gamma weight.study<-Meta.prelim$w_opt # knockoff scores for each block, this can be run in parallel too GK.stat<-GhostKnockoff.fit(Zscore_0,n.study,fit.prelim,gamma=gamma,weight.study=weight.study)
# We use genetic data as an example library(GhostKnockoff) # load example vcf file from package "seqminer", this serves as the reference panel vcf.filename = system.file("vcf/1000g.phase1.20110521.CFH.var.anno.vcf.gz", package = "seqminer") ## this is how the actual genotype matrix from package "seqminer" looks like example.G <- t(readVCFToMatrixByRange(vcf.filename, "1:196621007-196716634",annoType='')[[1]]) example.G <- example.G[,apply(example.G,2,sd)!=0] example.G <- example.G[,1:100] # compute correlation among variants cor.G<-matrix(as.numeric(corpcor::cor.shrink(example.G)), nrow=ncol(example.G)) # fit null model fit.prelim<-GhostKnockoff.prelim(cor.G,M=5,method='asdp',max.size=500) ### if only one study is involved Zscore_0<-as.matrix(rnorm(nrow(cor.G))) # hypothetical Z-scores Zscore_0<-Zscore_0+rbinom(nrow(cor.G),size=2,0.1) # set causal n.study<-c(5000) # knockoff scores for each block, this can be run in parallel too GK.stat<-GhostKnockoff.fit(Zscore_0,n.study,fit.prelim,gamma=1,weight.study=NULL) ### if multiple studies are involved, for a meta-analysis # compute study correlation Zscore_0<-cbind(rnorm(nrow(cor.G)),rnorm(nrow(cor.G))) # hypothetical Z-scores Zscore_0<-Zscore_0+rbinom(nrow(cor.G),size=2,0.1) # set causal cor.study<-GhostKnockoff.GetCorStudy(Zscore_0,fit.prelim) # note that all steps above can be run in parallel for nonoverlapping blocks of the genome. # Then the overall study correlation can be computed by averaging cor.study of all blocks. # compute optimal weights and study dependency using overall study correlaton n.study<-c(5000,7500) Meta.prelim<-GhostKnockoff.prelim.Meta(cor.study, n.study) gamma<-Meta.prelim$gamma weight.study<-Meta.prelim$w_opt # knockoff scores for each block, this can be run in parallel too GK.stat<-GhostKnockoff.fit(Zscore_0,n.study,fit.prelim,gamma=gamma,weight.study=weight.study)
This function computes correlation among studies given Z-scores and the output of GhostKnockoff.prelim.
GhostKnockoff.GetCorStudy(Zscore_0, fit.prelim)
GhostKnockoff.GetCorStudy(Zscore_0, fit.prelim)
Zscore_0 |
A p*K Z-score matrix, where p is the number of variants and K is the number of studies. Variants not observed in the study should be coded as NA. |
fit.prelim |
The output of function "GhostKnockoff.prelim". |
cor.study |
The correlation among studies. |
# We use genetic data as an example library(GhostKnockoff) # load example vcf file from package "seqminer", this serves as the reference panel vcf.filename = system.file("vcf/1000g.phase1.20110521.CFH.var.anno.vcf.gz", package = "seqminer") ## this is how the actual genotype matrix from package "seqminer" looks like example.G <- t(readVCFToMatrixByRange(vcf.filename, "1:196621007-196716634",annoType='')[[1]]) example.G <- example.G[,apply(example.G,2,sd)!=0] example.G <- example.G[,1:100] # compute correlation among variants cor.G<-matrix(as.numeric(corpcor::cor.shrink(example.G)), nrow=ncol(example.G)) # fit null model fit.prelim<-GhostKnockoff.prelim(cor.G,M=5,method='asdp',max.size=500) # compute study correlation Zscore_0<-cbind(rnorm(nrow(cor.G)),rnorm(nrow(cor.G))) # hypothetical Z-scores Zscore_0<-Zscore_0+rbinom(nrow(cor.G),size=2,0.1) # set causal cor.study<-GhostKnockoff.GetCorStudy(Zscore_0,fit.prelim)
# We use genetic data as an example library(GhostKnockoff) # load example vcf file from package "seqminer", this serves as the reference panel vcf.filename = system.file("vcf/1000g.phase1.20110521.CFH.var.anno.vcf.gz", package = "seqminer") ## this is how the actual genotype matrix from package "seqminer" looks like example.G <- t(readVCFToMatrixByRange(vcf.filename, "1:196621007-196716634",annoType='')[[1]]) example.G <- example.G[,apply(example.G,2,sd)!=0] example.G <- example.G[,1:100] # compute correlation among variants cor.G<-matrix(as.numeric(corpcor::cor.shrink(example.G)), nrow=ncol(example.G)) # fit null model fit.prelim<-GhostKnockoff.prelim(cor.G,M=5,method='asdp',max.size=500) # compute study correlation Zscore_0<-cbind(rnorm(nrow(cor.G)),rnorm(nrow(cor.G))) # hypothetical Z-scores Zscore_0<-Zscore_0+rbinom(nrow(cor.G),size=2,0.1) # set causal cor.study<-GhostKnockoff.GetCorStudy(Zscore_0,fit.prelim)
This function does the preliminary data management and pre-computes matrices for GhostKnockoff inference, given pre-computed correlation matrix of the variants (e.g. using reference panel in genetic studies). The output will be passed to the other functions.
GhostKnockoff.prelim(cor.G,M=5,method='asdp',max.size=500)
GhostKnockoff.prelim(cor.G,M=5,method='asdp',max.size=500)
cor.G |
The pre-computed correlation matrix of the variants. |
M |
Hypothetical number of knockoffs per variant. The default is 5 for multiple knockoff inference. |
method |
Either "sdp" or "asdp" (default: "asdp"). This determines the method that will be used to minimize the correlation between the original variables and the knockoffs. |
max.size |
The maximum number in each cluster in the "asdp" method. The default is 500. It will be ignored for "sdp". |
It returns a list that will be passed to function GhostKnockoff.fit().
# We use genetic data as an example library(GhostKnockoff) # load example vcf file from package "seqminer", this serves as the reference panel vcf.filename = system.file("vcf/1000g.phase1.20110521.CFH.var.anno.vcf.gz", package = "seqminer") ## this is how the actual genotype matrix from package "seqminer" looks like example.G <- t(readVCFToMatrixByRange(vcf.filename, "1:196621007-196716634",annoType='')[[1]]) example.G <- example.G[,apply(example.G,2,sd)!=0] example.G <- example.G[,1:100] # compute correlation among variants cor.G<-matrix(as.numeric(corpcor::cor.shrink(example.G)), nrow=ncol(example.G)) # fit null model GhostKnockoff.prelim(cor.G,M=5,method='asdp',max.size=500)
# We use genetic data as an example library(GhostKnockoff) # load example vcf file from package "seqminer", this serves as the reference panel vcf.filename = system.file("vcf/1000g.phase1.20110521.CFH.var.anno.vcf.gz", package = "seqminer") ## this is how the actual genotype matrix from package "seqminer" looks like example.G <- t(readVCFToMatrixByRange(vcf.filename, "1:196621007-196716634",annoType='')[[1]]) example.G <- example.G[,apply(example.G,2,sd)!=0] example.G <- example.G[,1:100] # compute correlation among variants cor.G<-matrix(as.numeric(corpcor::cor.shrink(example.G)), nrow=ncol(example.G)) # fit null model GhostKnockoff.prelim(cor.G,M=5,method='asdp',max.size=500)
This function compute study dependency gamma and the optimal weights to combine multiple studies.
GhostKnockoff.prelim.Meta(cor.study, n.study)
GhostKnockoff.prelim.Meta(cor.study, n.study)
cor.study |
The correlation among studies. |
n.study |
A vector of length K, where each element is the study sample size. |
w_opt |
Optimal weights to combine multiple studies. |
gamma |
study dependency. |
# We use genetic data as an example library(GhostKnockoff) # load example vcf file from package "seqminer", this serves as the reference panel vcf.filename = system.file("vcf/1000g.phase1.20110521.CFH.var.anno.vcf.gz", package = "seqminer") ## this is how the actual genotype matrix from package "seqminer" looks like example.G <- t(readVCFToMatrixByRange(vcf.filename, "1:196621007-196716634",annoType='')[[1]]) example.G <- example.G[,apply(example.G,2,sd)!=0] example.G <- example.G[,1:100] # compute correlation among variants cor.G<-matrix(as.numeric(corpcor::cor.shrink(example.G)), nrow=ncol(example.G)) # fit null model fit.prelim<-GhostKnockoff.prelim(cor.G,M=5,method='asdp',max.size=500) # compute study correlation Zscore_0<-cbind(rnorm(nrow(cor.G)),rnorm(nrow(cor.G))) # hypothetical Z-scores Zscore_0<-Zscore_0+rbinom(nrow(cor.G),size=2,0.1) # set causal cor.study<-GhostKnockoff.GetCorStudy(Zscore_0,fit.prelim) # compute optimal weights and study dependency n.study<-c(5000,7500) Meta.prelim<-GhostKnockoff.prelim.Meta(cor.study, n.study)
# We use genetic data as an example library(GhostKnockoff) # load example vcf file from package "seqminer", this serves as the reference panel vcf.filename = system.file("vcf/1000g.phase1.20110521.CFH.var.anno.vcf.gz", package = "seqminer") ## this is how the actual genotype matrix from package "seqminer" looks like example.G <- t(readVCFToMatrixByRange(vcf.filename, "1:196621007-196716634",annoType='')[[1]]) example.G <- example.G[,apply(example.G,2,sd)!=0] example.G <- example.G[,1:100] # compute correlation among variants cor.G<-matrix(as.numeric(corpcor::cor.shrink(example.G)), nrow=ncol(example.G)) # fit null model fit.prelim<-GhostKnockoff.prelim(cor.G,M=5,method='asdp',max.size=500) # compute study correlation Zscore_0<-cbind(rnorm(nrow(cor.G)),rnorm(nrow(cor.G))) # hypothetical Z-scores Zscore_0<-Zscore_0+rbinom(nrow(cor.G),size=2,0.1) # set causal cor.study<-GhostKnockoff.GetCorStudy(Zscore_0,fit.prelim) # compute optimal weights and study dependency n.study<-c(5000,7500) Meta.prelim<-GhostKnockoff.prelim.Meta(cor.study, n.study)