Title: | Gene-Based Segregation Test |
---|---|
Description: | Implements the gene-based segregation test (GESE) and the weighted GESE test for identifying genes with causal variants of large effect in family-based sequencing data. The methods are described in Qiao, D., Lange, C., Laird, N.M., Won, S., Hersh, C.P., et al. (2017). Gene-based segregation method for identifying rare variants for family-based sequencing studies. Genet Epidemiol 41(4):309-319. <DOI:10.1002/gepi.22037>. More details can be found at <http://scholar.harvard.edu/dqiao/gese>. |
Authors: | Dandi Qiao, Michael H. Cho |
Maintainer: | Dandi Qiao <[email protected]> |
License: | GPL-2 |
Version: | 2.0.1 |
Built: | 2024-11-20 06:37:02 UTC |
Source: | CRAN |
The DESCRIPTION file:
Package: | GESE |
Type: | Package |
Title: | Gene-Based Segregation Test |
Version: | 2.0.1 |
Date: | 2017-05-17 |
Author: | Dandi Qiao, Michael H. Cho |
Maintainer: | Dandi Qiao <[email protected]> |
Description: | Implements the gene-based segregation test (GESE) and the weighted GESE test for identifying genes with causal variants of large effect in family-based sequencing data. The methods are described in Qiao, D., Lange, C., Laird, N.M., Won, S., Hersh, C.P., et al. (2017). Gene-based segregation method for identifying rare variants for family-based sequencing studies. Genet Epidemiol 41(4):309-319. <DOI:10.1002/gepi.22037>. More details can be found at <http://scholar.harvard.edu/dqiao/gese>. |
Depends: | kinship2 |
License: | GPL-2 |
NeedsCompilation: | yes |
Suggests: | knitr, rmarkdown |
VignetteBuilder: | knitr |
Packaged: | 2017-05-18 16:31:07 UTC; redaq |
Repository: | CRAN |
Date/Publication: | 2017-05-18 18:16:22 UTC |
Index of help topics:
GESE | Gene-Based Segregation Test |
GESE-internal functions | GESE package internal functions |
GESE-package | Gene-Based Segregation Test |
condSegProbF | Computes conditional segregation probability for the family |
dataRaw | dataRaw - a data frame containing the pedigree, phenotype and genotype information |
database | database file in example |
getSegInfo | Computes segregation information for different mode of inheritance. |
mapInfo | mafInfo - example data |
pednew | pednew - an example pedigree structure |
trim_oneLineage | Trims the pedigree structure to include one lineage only. |
trim_unrelated | Trims the pedigree structure to exclude multiple founder cases |
Further information is available in the following vignettes:
my-vignette |
An R package for Gene-based Segregation test (GESE) (source, pdf) |
The GESE package computes gene-based segregation tests (GESE and weighted GESE) for family-based sequencing data. The main functions are:
GESE
: computes gene-based segregation information and GESE test p-values (unweighted and weighted versions).
trim_oneLineage
: trims the pedigree so that, for any subject, only the paternal or the maternal family is included. A minimal set of sequenced subjects may be removed to ensure that each pedigree contains only one lineage.
trim_unrelated
: trims the pedigree so that only one founder case is kept for each pedigree, and removes pedigrees with no cases.
condSegProbF
: computes the conditional probability that a variant in the gene is segregating in the specified family, given that the variant is present in the family.
Dandi Qiao, Michael H. Cho
Maintainer: Dandi Qiao <[email protected]>
Qiao, D., Lange, C., Laird, N.M., Won, S., Hersh, C.P., et al. (2017). Gene-based segregation method for identifying rare variants for family-based sequencing studies. Genet Epidemiol 41(4):309-319. DOI:10.1002/gepi.22037.
http://scholar.harvard.edu/dqiao/gese
data(pednew)
data(mapInfo)
data(dataRaw)
data(database)
results <- GESE(pednew, database, 1000000, dataRaw, mapInfo, threshold=1e-2)
results
Computes the conditional probability that a variant is segregating in the family, given that the variant is present in one of the founders in the family.
condSegProbF(pedTemp, subjInfo)
pedTemp |
The data frame that includes the complete pedigree structure for the family |
subjInfo |
A data frame that contains the phenotype information for the sequenced subjects. It should include the columns FID, IID, and PHENOTYPE. |
Returns the conditional segregation probability of a variant in the family.
Dandi Qiao
Qiao, D., Lange, C., Laird, N.M., Won, S., Hersh, C.P., et al. (2017). Gene-based segregation method for identifying rare variants for family-based sequencing studies. Genet Epidemiol 41(4):309-319. DOI:10.1002/gepi.22037.
data(pednew)
data(mapInfo)
data(dataRaw)
data(database)
library(kinship2)
# Build kinship2 pedigree objects, one per family
pedigrees <- kinship2::pedigree(pednew$IID, pednew$faID, pednew$moID, pednew$sex, famid=pednew$FID)
# Columns 1, 2 and 6 of the .raw data are FID, IID and PHENOTYPE
subjects <- dataRaw[, c(1, 2, 6)]
condSegProbF(pedigrees['93'], subjects)
condSegProbF(pedigrees['412'], subjects)
# The same probabilities are returned by GESE() for all families
results2 <- GESE(pednew, database, 1000000, dataRaw, mapInfo, threshold=1e-2)
results2$condSegProb
A data frame containing the GENE and MAF information for the variants under consideration in the public reference database.
data("database")
A data frame of 20 observations on the following 3 variables.
SNP
a unique identifier for the variant
GENE
a character vector: gene name
MAF
a numeric vector: minor allele frequency of the variants in the reference database
A data frame containing the information for all the variants satisfying the same filtering criteria in the chosen reference genome. It should include at least three columns with these names: SNP (unique SNP ID), GENE (gene name), MAF (minor allele frequency for the variant in reference database for the corresponding population).
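The layout described above can be sketched in base R. The table below is a toy example: the SNP identifiers, gene names, and MAF values are made up, and `check_variant_info` is a hypothetical helper that is not part of the GESE package. The range check on MAF (between 0 and 1) is an assumption, not a documented requirement.

```r
# Toy reference table; all identifiers and frequencies are hypothetical.
variantInfo <- data.frame(
  SNP  = c("var1", "var2", "var3"),
  GENE = c("GENE_A", "GENE_A", "GENE_B"),
  MAF  = c(1e-4, 5e-4, 2e-3),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not a GESE function): checks the three required columns.
check_variant_info <- function(x) {
  required <- c("SNP", "GENE", "MAF")
  missing_cols <- setdiff(required, names(x))
  if (length(missing_cols) > 0) {
    stop("missing column(s): ", paste(missing_cols, collapse = ", "))
  }
  if (anyDuplicated(x$SNP) > 0) {
    stop("SNP identifiers must be unique")
  }
  # Assumed sanity check: frequencies are numeric, non-missing, and in [0, 1].
  if (!is.numeric(x$MAF) || any(is.na(x$MAF)) || any(x$MAF < 0 | x$MAF > 1)) {
    stop("MAF must be numeric, non-missing, and between 0 and 1")
  }
  invisible(TRUE)
}

check_variant_info(variantInfo)
```

A table that passes this check has the shape expected for the variantInformation argument of GESE; extra columns are allowed because only the three named columns are required.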
Randomly simulated data.
data(database)
A data frame that can be created from the .raw formatted file generated by PLINK.
data("dataRaw")
A data frame with 198 observations on the following 26 variables.
FID
Family ID
IID
Individual ID
PAT
Father ID
MAT
Mother ID
SEX
Sex
PHENOTYPE
Affection status
X1
Genotype for variant 1
X2
Genotype for variant 2
X3
Genotype for variant 3
X4
Genotype for variant 4
X5
Genotype for variant 5
X6
Genotype for variant 6
X7
Genotype for variant 7
X8
Genotype for variant 8
X9
Genotype for variant 9
X10
Genotype for variant 10
The number of rows equals the number of subjects in the data, and the number of columns equals the number of markers M + 6. The first six columns must have these names: FID (family ID), IID (individual ID), PAT (father ID), MAT (mother ID), SEX (sex), and PHENOTYPE (affection status). The remaining columns contain the genotypes for the variants listed in the corresponding mapInfo file. It is important that the genotypes are recoded with respect to the minor allele in the population. The affection status in this file is used as the phenotype.
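As a rough illustration of this layout, the sketch below reads a small .raw-style table with base R and checks the first six column names and the genotype coding. The file content, IDs, and variant column names are hypothetical, and the text does not say how variant column names must correspond to the SNP column of mapInfo, so that step is left out.

```r
# Toy .raw content; IDs and variant column names are hypothetical.
raw_text <- c(
  "FID IID PAT MAT SEX PHENOTYPE var1_A var2_T",
  "F1 1 0 0 1 1 0 0",
  "F1 2 0 0 2 2 1 0",
  "F1 3 1 2 1 2 1 NA"
)

# For a real file, replace `text = raw_text` with `file = "<your .raw file>"`.
dataPed <- read.table(text = raw_text, header = TRUE,
                      stringsAsFactors = FALSE,
                      colClasses = c(FID = "character", IID = "character"))

# The first six columns must carry these exact names.
first_six <- c("FID", "IID", "PAT", "MAT", "SEX", "PHENOTYPE")
stopifnot(identical(names(dataPed)[1:6], first_six))

# Number of markers M = total columns - 6.
n_markers <- ncol(dataPed) - 6

# Genotypes are minor-allele counts: 0, 1, 2, or missing.
genotypes <- as.matrix(dataPed[, -(1:6), drop = FALSE])
stopifnot(all(genotypes %in% c(0, 1, 2, NA)))
```

Reading FID and IID as character keeps numeric-looking IDs from being converted to integers, which matches the requirement elsewhere in this manual that individual IDs be of class character.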
data(dataRaw)
Computes the gene-based segregation information and tests for family-based sequencing data.
GESE(pednew, variantInformation, dbSize, dataPed, mapInfo, threshold = 1e-7, onlySeg = FALSE, familyWeight = NA )
pednew |
A data frame of the complete pedigree information for all families in the dataset. The required column names of this data frame include: FID (family ID), IID (individual ID, must be of class character), faID (father ID, NA if unavailable), moID (mother ID, NA if unavailable), and sex. |
variantInformation |
A data frame containing the information for all the variants satisfying the same filtering criteria in the chosen reference genome. It should include at least three columns with these names: SNP (unique SNP ID), GENE (gene name), MAF (minor allele frequency for the variant in reference database for the corresponding population). |
dbSize |
An integer indicating the sample size of the reference database used. |
dataPed |
A data frame in the |
mapInfo |
A data frame that contains at least two columns (required column names): variant ID (SNP) and gene name (GENE). The number of rows equals the number of SNPs/markers to be considered (M). |
threshold |
Specifies the precision needed to be reached for significant p-values. Default value is 1e-7. |
onlySeg |
TRUE if only the segregation information (number of pedigrees segregating in each gene) is needed; otherwise FALSE (the default), in which case the GESE p-values are also computed. |
familyWeight |
An optional data frame giving the weights for the families. If it is NA, no weighting scheme is used. Otherwise, its dimension is either (number of families) x (number of genes + 1) or (number of families) x 2. The first column should hold the family ID (column name FID). If the family weights are the same for all genes, the second column should hold the weight (column name "weight"); otherwise, the second and later columns should hold the weights for each gene (column names are the corresponding GENE names). |
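The two accepted shapes of familyWeight can be sketched as follows. The family IDs, gene names, and weight values are made up for illustration; the text gives no guidance on the scale of the weights, so the numbers carry no meaning beyond showing the layout.

```r
families <- c("F1", "F2", "F3")    # hypothetical family IDs
genes    <- c("GENE_A", "GENE_B")  # hypothetical gene names

# Shape 1: (number of families) x 2, one weight per family shared by all genes.
weightCommon <- data.frame(FID = families,
                           weight = c(1, 2, 1),
                           stringsAsFactors = FALSE)

# Shape 2: (number of families) x (number of genes + 1), one weight per
# family and gene; the columns after FID are named after the genes.
weightByGene <- data.frame(FID = families,
                           GENE_A = c(1, 2, 1),
                           GENE_B = c(1, 1, 3),
                           stringsAsFactors = FALSE)

stopifnot(identical(names(weightCommon), c("FID", "weight")))
stopifnot(identical(names(weightByGene), c("FID", genes)))
```

Either data frame would then be passed as `familyWeight = weightCommon` or `familyWeight = weightByGene` in the call to GESE.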
This is the main function in the GESE package. The gene-based segregation test (GESE) described in Qiao et al. (2017) is a segregation-based test that extends the work of Bureau et al. (2014) by computing the marginal probability of segregation events within a gene. The first step in this function is to trim the families so that only one lineage (the one with the largest possible number of cases) is included; that is, for any subject, only the paternal pedigree or the maternal pedigree is kept. In addition, if multiple founder cases are present, the function removes the smallest set of founders that are unrelated to most other sequenced subjects. The function then computes the gene-based segregation information and p-values for multiple families. If only the segregation information (number of families segregating in each gene) is needed, set onlySeg = TRUE. If family weights are to be used to boost power, assign them to the familyWeight parameter.
segregation |
a data frame containing the information about whether each gene is segregating in each family. The number of columns equals the number of families +3. The last column is the number of families the gene is segregating in. The number of rows equals the number of genes. Only this data frame and |
varSeg |
a data frame containing the information about whether each variant is segregating in each family. The number of columns equals the number of families +3. The last column is the number of families the variant is segregating in. The number of rows equals the number of variants. Only this data frame and |
results |
This is available when onlySeg = FALSE. The data frame contains the columns: GENE (gene name), obs_prob (the observed segregating probability for the gene), pvalue (gene-based p-value for GESE), numSim (the number of simulations used to compute the p-value if the resampling-based method is used), and N_seg (the number of families that are segregating in the gene). If familyWeight is not NA, obs_weight_stat (the observed weighted test statistic) and pvalue_weighted (the p-value for the weighted test statistic) are also returned. |
condSegProb |
A vector whose length equals the number of families. Each entry is the conditional probability that at least one variant in the gene is segregating in the family, given that at least one variant (among the set of variants considered) is present in the family. |
segProbGene |
A matrix of the segregating probability for the gene and for each family. This is a working matrix that could be used in other functions. |
Dandi Qiao
Qiao, D., Lange, C., Laird, N.M., Won, S., Hersh, C.P., et al. (2017). Gene-based segregation method for identifying rare variants for family-based sequencing studies. Genet Epidemiol 41(4):309-319. DOI:10.1002/gepi.22037.
http://scholar.harvard.edu/dqiao/gese
Bureau, A., Younkin, S.G., Parker, M.M., Bailey-Wilson, J.E., Marazita, M.L., et al. (2014). Inferring rare disease risk variants based on exact probabilities of sharing by multiple affected relatives. Bioinformatics 30, 2189-2196. DOI:10.1093/bioinformatics/btu198.
data(pednew)
data(mapInfo)
data(dataRaw)
data(database)
results <- GESE(pednew, database, 1000000, dataRaw, mapInfo, threshold=1e-3)
results
GESE package internal functions.
computeP_resampling
findIntermediateFounder
findMostRecentCommonFounder
findMostRecentCommonFounderControl
getFounder
getProb
getPvalue_resampling
getTranProb_dv
isRelated
oneSetSim
segProb
getProb_weight
Dandi Qiao
Maintainer: Dandi Qiao <[email protected]>
Qiao, D., Lange, C., Laird, N.M., Won, S., Hersh, C.P., et al. (2017). Gene-based segregation method for identifying rare variants for family-based sequencing studies. Genet Epidemiol 41(4):309-319. DOI:10.1002/gepi.22037.
Computes variant-based and gene-based segregation information for different modes of inheritance.
getSegInfo(pednew, dataPed, mapInfo, mode="recessive")
pednew |
A data frame of the complete pedigree information for all families in the dataset. The required column names of this data frame include: FID (family ID), IID (individual ID, must be of class character), faID (father ID, NA if unavailable), moID (mother ID, NA if unavailable), and sex. |
dataPed |
A data frame in the |
mapInfo |
A data frame of at least two columns (required column names): variant ID (SNP) and gene name (GENE). The number of rows equals the number of SNPs/markers to be considered (M). |
mode |
The mode of inheritance assumed when computing the segregation information. The options are "dominant", "recessive", and "CH" (compound heterozygous). The default value is "recessive". |
This function computes the segregation information for different modes of inheritance without computing the GESE test. The supported modes of inheritance are dominant, recessive, and compound heterozygous (CH). For the dominant mode, a variant is segregating if all the cases in the family carry at least one alternative allele (genotype X>0) and none of the controls in the family carries an alternative allele (X=0). For the recessive mode, a variant is segregating if all the cases in the family carry two alternative alleles (X=2) and all the controls in the family carry fewer than two alternative alleles (X=0 or X=1). For the compound heterozygous mode, a pair of variants is segregating if all the cases in the family carry at least one alternative allele at both positions (X1>0 and X2>0) and every control in the family lacks the alternative allele at one or both of the two positions (X1 = 0 or X2 = 0).
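The three rules can be written down for a single family as a short base R sketch. The helpers `segregates` and `segregates_CH` are hypothetical and are not the package's implementation. They assume complete genotypes (no missing values) and at least one case in the family, and they read the control condition for the compound heterozygous mode as given by the formula X1 = 0 or X2 = 0.

```r
# `x` holds alternative-allele counts (0, 1 or 2) for the sequenced members of
# one family at one variant; `is_case` flags the affected members.
segregates <- function(x, is_case, mode = c("dominant", "recessive")) {
  mode <- match.arg(mode)
  # A "carrier" is X > 0 under the dominant rule and X == 2 under the recessive rule.
  carrier <- if (mode == "dominant") x > 0 else x == 2
  # All cases must be carriers and no control may be a carrier.
  all(carrier[is_case]) && !any(carrier[!is_case])
}

# Compound heterozygous rule for a pair of variants with counts x1 and x2:
# every case carries both variants, every control lacks at least one of them.
segregates_CH <- function(x1, x2, is_case) {
  both <- x1 > 0 & x2 > 0
  all(both[is_case]) && !any(both[!is_case])
}

is_case <- c(TRUE, TRUE, FALSE)                  # two cases, one control
segregates(c(1, 2, 0), is_case, "dominant")      # TRUE
segregates(c(1, 2, 0), is_case, "recessive")     # FALSE: the first case has X = 1
segregates_CH(c(1, 1, 1), c(1, 2, 0), is_case)   # TRUE: the control lacks variant 2
```

Applying such a rule to every variant (or variant pair) and every family, then counting the families in which it holds, gives tables of the kind described under varSeg and geneSeg.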
varSeg |
For the dominant and recessive modes of inheritance, this is a data frame containing the information about whether each variant is segregating in each family. The number of columns equals the number of families +3. The last column is the number of families the variant is segregating in. The number of rows equals the number of variants. For the compound heterozygous mode of inheritance, this is a data frame containing the information about whether each pair of variants is segregating in each of the families. All pairs in the dataset are considered; a pair of variants that is not included in this data frame is not segregating in any family. |
geneSeg |
For the dominant and recessive modes of inheritance, this is a data frame containing the information about whether each gene is segregating in each family. The number of columns equals the number of families +3. The last column is the number of families the gene is segregating in. The number of rows equals the number of genes. For the compound heterozygous mode of inheritance, this is a data frame containing the information about whether any pair of variants in this gene is segregating in each of the families. The last column is the number of families in which any pair of variants in the gene is segregating. |
genePairSeg |
This data frame is returned only for compound heterozygous mode of inheritance. This considers any pair of genes in the data. It returns a data frame containing the information of whether any pair of variants, each in a different gene, is segregating in each of the families considered. Each row represents the information for each gene pair, summed over all possible pairs of variants in the two genes, one in each gene. |
Dandi Qiao
Qiao, D., Lange, C., Laird, N.M., Won, S., Hersh, C.P., et al. (2017). Gene-based segregation method for identifying rare variants for family-based sequencing studies. Genet Epidemiol 41(4):309-319. DOI:10.1002/gepi.22037.
data(pednew)
data(mapInfo)
data(dataRaw)
data(database)
result <- getSegInfo(pednew, dataRaw, mapInfo)
result$varSeg
result$geneSeg
result <- getSegInfo(pednew, dataRaw, mapInfo, mode="recessive")
result$varSeg
result$geneSeg
result <- getSegInfo(pednew, dataRaw, mapInfo, mode="CH")
result$varSeg
result$geneSeg
result$genePairSeg
A data frame containing the gene information for the variants in the study.
data("mapInfo")
A data frame of 20 observations on the following 2 variables.
GENE
The gene name
SNP
A unique SNP identifier
data(mapInfo)
A data frame of the complete pedigree structure for the families included.
data("pednew")
A data frame of 1700 observations on the following 26 variables.
FID
Family ID of class character
IID
Individual ID of class character
faID
Father ID, NA if missing
moID
Mother ID, NA if missing
sex
Sex, 1 for male, 2 for female and NA if missing.
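A pedigree table in this format might be assembled as in the sketch below. The nuclear family and all IDs are made up. The check that every listed parent also appears as an individual in the same family is an assumption about what a complete pedigree requires, not a documented rule of the package.

```r
# Hypothetical nuclear family: two founders (IDs 1 and 2) and two children.
ped <- data.frame(
  FID  = c("F1", "F1", "F1", "F1"),
  IID  = c("1", "2", "3", "4"),
  faID = c(NA, NA, "1", "1"),
  moID = c(NA, NA, "2", "2"),
  sex  = c(1, 2, 1, 2),          # 1 = male, 2 = female
  stringsAsFactors = FALSE
)

# Family and individual IDs are stored as character.
stopifnot(is.character(ped$FID), is.character(ped$IID))

# Assumed consistency check: each non-missing parent ID refers to an
# individual listed in the same family.
key    <- paste(ped$FID, ped$IID)
has_fa <- !is.na(ped$faID)
has_mo <- !is.na(ped$moID)
stopifnot(all(paste(ped$FID, ped$faID)[has_fa] %in% key),
          all(paste(ped$FID, ped$moID)[has_mo] %in% key))

# Founders are the individuals with both parents missing.
founders <- ped$IID[!has_fa & !has_mo]
```

Storing the IDs as character from the start avoids a later mismatch between numeric IDs in the pedigree and character IDs in the genotype data.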
data(pednew)
Trims the families to include only one lineage.
trim_oneLineage(seqSub, pednew)
seqSub |
A data frame that should include three columns, FID (family ID), IID (individual ID), and PHENOTYPE (affection status), for the sequenced subjects in the data. One example is the 1st, 2nd and 6th columns of the PLINK .raw format. |
pednew |
A data frame that includes the complete pedigree structure for all sequenced families in the dataset. The required column names of this data frame include: FID (family ID), IID (individual ID, must be of class character), faID (father ID, NA if unavailable), moID (mother ID, NA if unavailable), and sex. |
For each subject, only the maternal or the paternal family is included, since the rare variant should be present in only the related subjects. The lineage with the maximal set of sequenced cases will be used as the final pedigree.
pedInfoUpdate |
the complete pedigrees with only the paternal or maternal lineage |
seqSubjUpdate |
The sequenced subjects that are in the selected lineage are returned for the rest of the analysis. |
This function can also be used in other analyses of family-based data, for example as a pre-processing step for a PVAAST analysis.
Dandi Qiao
Qiao, D., Lange, C., Laird, N.M., Won, S., Hersh, C.P., et al. (2017). Gene-based segregation method for identifying rare variants for family-based sequencing studies. Genet Epidemiol 41(4):309-319. DOI:10.1002/gepi.22037.
data(pednew)
data(mapInfo)
data(dataRaw)
data(database)
subjects <- dataRaw[,c(1:2, 6)]
cat("Trimming the families...\n")
cat("Trimming step 1: keep only one lineage \n")
trim <- trim_oneLineage(seqSub=subjects, pednew)