Title: | Inference of Ploidy from (Genotyping-by-Sequencing) GBS Data |
---|---|
Description: | Functions for inference of ploidy from (Genotyping-by-sequencing) GBS data, including a function to infer allelic ratios and allelic proportions in a Bayesian framework. |
Authors: | Zachariah Gompert |
Maintainer: | Zachariah Gompert <[email protected]> |
License: | GPL-3 |
Version: | 1.0 |
Built: | 2024-12-10 06:46:13 UTC |
Source: | CRAN |
Functions for inference of ploidy from (Genotyping-by-sequencing) GBS data, including a function to infer allelic ratios and allelic proportions in a Bayesian framework.
The DESCRIPTION file:
Package: | gbs2ploidy |
Type: | Package |
Title: | Inference of Ploidy from (Genotyping-by-Sequencing) GBS Data |
Version: | 1.0 |
Date: | 2016-12-01 |
Author: | Zachariah Gompert |
Maintainer: | Zachariah Gompert <[email protected]> |
Depends: | R (>= 2.10), MASS, rjags |
Description: | Functions for inference of ploidy from (Genotyping-by-sequencing) GBS data, including a function to infer allelic ratios and allelic proportions in a Bayesian framework. |
License: | GPL-3 |
NeedsCompilation: | no |
Packaged: | 2016-12-01 19:46:38 UTC; zgompert |
Repository: | CRAN |
Date/Publication: | 2016-12-01 21:32:40 |
Config/pak/sysreqs: | jags |
Index of help topics:
dat Simulated allele counts estploidy Discriminate cytotypes using GBS data estprops Estimate allelic proportions gbs2ploidy-package Inference of Ploidy from (Genotyping-by-Sequencing) GBS Data
A typical analysis will begin by estimating allelic proportions using the estprops
function. This is done in a Bayesian framework and is the most computationally intensive part of the analysis (i.e., depending on the size of the data set, this might take a day or more). This function depends on rjags
, which means the user needs to install the stand-alone program JAGS
as well. Principal component analysis and discriminant analysis are then used to obtain cytotype assignment probabilities via the estploidy
function. This can be done with or without a training set of individuals with known ploidies.
Zachariah Gompert
Maintainer: Zachariah Gompert <[email protected]>
Gompert Z. and Mock K. (XXXX) Detection of individual ploidy levels with genotyping-by-sequencing (GBS) analysis. Molecular Ecology Resources, submitted.
## load a simulated data set data(dat) ## Not run: ## obtain posterior estimates of allelic proportions; short chains are used for ## the example, we recommend increasing this to at least 1000 MCMC steps with a ## 500 step burnin props<-estprops(cov1=t(dat[[1]]),cov2=t(dat[[2]]),mcmc.steps=20,mcmc.burnin=5, mcmc.thin=2) ## calculate observed heterozygosity and depth of coverage from the allele count ## data hx<-apply(is.na(dat[[1]]+dat[[2]])==FALSE,1,mean) dx<-apply(dat[[1]]+dat[[2]],1,mean,na.rm=TRUE) ## run estploidy without using known ploidy data pl<-estploidy(alphas=props,het=hx,depth=dx,train=FALSE,pl=NA,set=NA,nclasses=2, ids=dat[[3]],pcs=1:2) ## boxplots to visualize posterior assignment probabilities by true ploidy ## (which is known because these are simulated data) boxplot(pl$pp[,1] ~ dat[[3]],ylab="assignment probability",xlab="ploidy") ## run estploidy with a training data set with known ploidy; the data set is ## split into 100 individuals with known ploidy and 100 that are used for ## inference truep<-dat[[3]] trn<-sort(sample(1:200,100,replace=FALSE)) truep[-trn]<-NA plt<-estploidy(alphas=props,het=hx,depth=dx,train=TRUE,pl=truep,set=trn, nclasses=2,ids=dat[[3]],pcs=1:2) ## boxplots to visualize posterior assignment probabilities for individuals that ## were not part of the training set by true ploidy (which is known because ## these are simulated data) boxplot(plt$pp[,1] ~ dat[[3]][-trn],ylab="assignment probability",xlab="ploidy") ## End(Not run)
## load a simulated data set data(dat) ## Not run: ## obtain posterior estimates of allelic proportions; short chains are used for ## the example, we recommend increasing this to at least 1000 MCMC steps with a ## 500 step burnin props<-estprops(cov1=t(dat[[1]]),cov2=t(dat[[2]]),mcmc.steps=20,mcmc.burnin=5, mcmc.thin=2) ## calculate observed heterozygosity and depth of coverage from the allele count ## data hx<-apply(is.na(dat[[1]]+dat[[2]])==FALSE,1,mean) dx<-apply(dat[[1]]+dat[[2]],1,mean,na.rm=TRUE) ## run estploidy without using known ploidy data pl<-estploidy(alphas=props,het=hx,depth=dx,train=FALSE,pl=NA,set=NA,nclasses=2, ids=dat[[3]],pcs=1:2) ## boxplots to visualize posterior assignment probabilities by true ploidy ## (which is known because these are simulated data) boxplot(pl$pp[,1] ~ dat[[3]],ylab="assignment probability",xlab="ploidy") ## run estploidy with a training data set with known ploidy; the data set is ## split into 100 individuals with known ploidy and 100 that are used for ## inference truep<-dat[[3]] trn<-sort(sample(1:200,100,replace=FALSE)) truep[-trn]<-NA plt<-estploidy(alphas=props,het=hx,depth=dx,train=TRUE,pl=truep,set=trn, nclasses=2,ids=dat[[3]],pcs=1:2) ## boxplots to visualize posterior assignment probabilities for individuals that ## were not part of the training set by true ploidy (which is known because ## these are simulated data) boxplot(plt$pp[,1] ~ dat[[3]][-trn],ylab="assignment probability",xlab="ploidy") ## End(Not run)
dat
is a simulated data set meant to mimic GBS data. It is a list with three components. The first two components are N (number of individuals) by P (number of SNPs) matrixes with allele counts for the first and second allele at each locus, respectively. The third component is a numeric vector that gives the true ploidy for each individual (2 = diploid, 4 = tetraploid).
data("dat")
data("dat")
The format is: List of 3 $ : int [1:200, 1:10000] NA NA 7 4 NA NA NA NA NA NA ... $ : int [1:200, 1:10000] NA NA 2 5 NA NA NA NA NA NA ... $ : num [1:200] 4 2 4 2 2 2 2 2 2 2 ...
data(dat) str(dat)
data(dat) str(dat)
Use principal component analysis (PCA) and discriminant analysis (DA) to assign individuals to different cytotype groups based on GBS data. This function can be run with or without a training set of individuals with known cytotypes.
estploidy(alphas = NA, het = NA, depth = NA, train = FALSE, pl = NA, set = NA, nclasses = 2, ids = NA, pcs = 1:2)
estploidy(alphas = NA, het = NA, depth = NA, train = FALSE, pl = NA, set = NA, nclasses = 2, ids = NA, pcs = 1:2)
alphas |
a list generated by the |
het |
a numeric vector with the observed heterozygosity for each individual, that is the proportion of SNPs at which the individual was heterozygous. |
depth |
a numeric vector with the mean number of reads per SNP for each individual (all SNPs or just heterozygous SNPs). |
train |
a boolean specifying whether or not a training set with known ploidy should be used. |
pl |
a vector of known ploidies with one entry per individual (use ‘NA’ for individuals with unknown ploidy); only used if |
set |
indixes for the training set; only used if |
nclasses |
the number of cyotypes expected. |
ids |
names or other IDs for individuals. |
pcs |
a vector giving the PC to use for DA. |
Assignment probabilities to different cytotype groups are obtained using PCA and DA, as described in Gompert & Mock (XXXX). Residual hetorzygosity (from regressing het
on depth
) and point estimates (posterior medians) for allelic proportions are first ordinated using PCA (on the centered covariance matrix). The first pcs
PCs are retained for DA. If a training set is not provided, k-means clustering is used to obtain initial classifications (with nclasses
groups), which are then used for leave-one-out cross-validation with DA. In this case, PC loadings and discriminant scores should be evaluated to associate groups from k-means clustering with specific cytotypes (Gompert & Mock, XXXX). If a training set is provided, it is used to define groups for DA and to estimate assignment probabilities without k-means clustering.
estploidy
returns a list with three components:
pp |
A matrix with assignment probabilities for each individual (rows) to each group (columns); the first column gives the |
pcwghts |
A matrix with the variable loadings (PC weights) from the ordination of residual heterozygosity and allelic proportions. Columns correspond with PCs in ascending order (i.e., the PC with the largest eigenvalue is first). |
pcscrs |
A matrix of PC scores from the ordination of residual heterozygosity and allelic proportions. Columns correspond with PCs in ascending order (i.e., the PC with the largest eigenvalue is first). |
Zachariah Gompert
Gompert Z. and Mock K. (XXXX) Detection of individual ploidy levels with genotyping-by-sequencing (GBS) analysis. Molecular Ecology Resources, submitted.
## load a simulated data set data(dat) ## Not run: ## obtain posterior estimates of allelic proportions; short chains are used for ## the example, we recommend increasing this to at least 1000 MCMC steps with a ## 500 step burnin props<-estprops(cov1=t(dat[[1]]),cov2=t(dat[[2]]),mcmc.steps=20,mcmc.burnin=5, mcmc.thin=1) ## calculate observed heterozygosity and depth of coverage from the allele count ## data hx<-apply(is.na(dat[[1]]+dat[[2]])==FALSE,1,mean) dx<-apply(dat[[1]]+dat[[2]],1,mean,na.rm=TRUE) ## run estploidy without using known ploidy data pl<-estploidy(alphas=props,het=hx,depth=dx,train=FALSE,pl=NA,set=NA,nclasses=2, ids=dat[[3]],pcs=1:2) ## boxplots to visualize posterior assignment probabilities by true ploidy ## (which is known because these are simulated data) boxplot(pl$pp[,1] ~ dat[[3]],ylab="assignment probability",xlab="ploidy") ## run estploidy with a training data set with known ploidy; the data set is ## split into 100 individuals with known ploidy and 100 that are used for ## inference truep<-dat[[3]] trn<-sort(sample(1:200,100,replace=FALSE)) truep[-trn]<-NA plt<-estploidy(alphas=props,het=hx,depth=dx,train=TRUE,pl=truep,set=trn, nclasses=2,ids=dat[[3]],pcs=1:2) ## boxplots to visualize posterior assignment probabilities for individuals that ## were not part of the training set by true ploidy (which is known because ## these are simulated data) boxplot(plt$pp[,1] ~ dat[[3]][-trn],ylab="assignment probability",xlab="ploidy") ## End(Not run)
## load a simulated data set data(dat) ## Not run: ## obtain posterior estimates of allelic proportions; short chains are used for ## the example, we recommend increasing this to at least 1000 MCMC steps with a ## 500 step burnin props<-estprops(cov1=t(dat[[1]]),cov2=t(dat[[2]]),mcmc.steps=20,mcmc.burnin=5, mcmc.thin=1) ## calculate observed heterozygosity and depth of coverage from the allele count ## data hx<-apply(is.na(dat[[1]]+dat[[2]])==FALSE,1,mean) dx<-apply(dat[[1]]+dat[[2]],1,mean,na.rm=TRUE) ## run estploidy without using known ploidy data pl<-estploidy(alphas=props,het=hx,depth=dx,train=FALSE,pl=NA,set=NA,nclasses=2, ids=dat[[3]],pcs=1:2) ## boxplots to visualize posterior assignment probabilities by true ploidy ## (which is known because these are simulated data) boxplot(pl$pp[,1] ~ dat[[3]],ylab="assignment probability",xlab="ploidy") ## run estploidy with a training data set with known ploidy; the data set is ## split into 100 individuals with known ploidy and 100 that are used for ## inference truep<-dat[[3]] trn<-sort(sample(1:200,100,replace=FALSE)) truep[-trn]<-NA plt<-estploidy(alphas=props,het=hx,depth=dx,train=TRUE,pl=truep,set=trn, nclasses=2,ids=dat[[3]],pcs=1:2) ## boxplots to visualize posterior assignment probabilities for individuals that ## were not part of the training set by true ploidy (which is known because ## these are simulated data) boxplot(plt$pp[,1] ~ dat[[3]][-trn],ylab="assignment probability",xlab="ploidy") ## End(Not run)
This functions uses Markov chain Monte Carlo to obtain Bayesian estimates of allelic proportions, which denote that proportion of heterozygous GBS SNPs with different allelic ratios.
estprops(cov1 = NA, cov2 = NA, props = c(0.25, 0.33, 0.5, 0.66, 0.75), mcmc.nchain = 2, mcmc.steps = 10000, mcmc.burnin = 1000, mcmc.thin = 2)
estprops(cov1 = NA, cov2 = NA, props = c(0.25, 0.33, 0.5, 0.66, 0.75), mcmc.nchain = 2, mcmc.steps = 10000, mcmc.burnin = 1000, mcmc.thin = 2)
cov1 |
a P (number of SNPs) by N (number of individuals) matrix with read counts for the first allele (e.g., the non-reference allele). Numeric values should be provided for heterygous SNPs only, homozygous SNPs should be coded as missing data (i.e., ‘NA’). |
cov2 |
a P (number of SNPs) by N (number of individuals) matrix with read counts for second allele (e.g., the reference allele). Numeric values should be provided for heterygous SNPs only, homozygous SNPs should be coded as missing data (i.e., ‘NA’). |
props |
a vector containing valid allelic proportions given the expected cyotypes present in the sample. |
mcmc.nchain |
number of chains for MCMC. |
mcmc.steps |
number of post burnin iterations for each chain. |
mcmc.burnin |
number of iterations to discard from each chain as a burnin. |
mcmc.thin |
thinning interval for MCMC. |
Allelic proportions are inferred from the allele counts based on the Bayesian model described in Gompert \& Mock (XXXX). Please consult this publication for a detailed description of the model. Users can modify the vector of possible allelic proportions based on expectations for their data set. For example, true allelic proportions for diploids, triploids and tetraploids are 1:1 (0.5), 1:2 or 2:1 (0.33 or 0.66), and 1:3, 2:2, or 3:1 (0.25, 0.5, or 0.75), respectively.
estprops
returns a list with one component per individual. Components summarize the posterior distributions for allelic proportions. Rows correspond to different allelic proportions (as defined by ‘props’) and columns give the 2.5th, 25th, 50th, 75th, and 97.5th quantiles of the posterior distribution for each parameter.
Zachariah Gompert
Gompert Z. and Mock K. (XXXX) Detection of individual ploidy levels with genotyping-by-sequencing (GBS) analysis. Molecular Ecology Resources, submitted.
## load a simulated data set data(dat) ## Not run: ## obtain posterior estimates of allelic proportions; short chains are used for ## the example, we recommend increasing this to at least 1000 MCMC steps with a ## 500 step burnin props<-estprops(cov1=t(dat[[1]]),cov2=t(dat[[2]]),mcmc.steps=20,mcmc.burnin=5, mcmc.thin=1) ## plot point estimates and 95 ## allelic proportions for the first nine individuals par(mfrow=c(3,3)) for(i in 1:9){ plot(props[[i]][,3],ylim=c(0,1),axes=FALSE,xlab="ratios",ylab="proportions") axis(1,at=1:5,c("1:3","1:2","1:1","2:1","3:1")) axis(2) box() segments(1:5,props[[i]][,1],1:5,props[[i]][,5]) title(main=paste("true ploidy =",dat[[3]][i])) } ## End(Not run)
## load a simulated data set data(dat) ## Not run: ## obtain posterior estimates of allelic proportions; short chains are used for ## the example, we recommend increasing this to at least 1000 MCMC steps with a ## 500 step burnin props<-estprops(cov1=t(dat[[1]]),cov2=t(dat[[2]]),mcmc.steps=20,mcmc.burnin=5, mcmc.thin=1) ## plot point estimates and 95 ## allelic proportions for the first nine individuals par(mfrow=c(3,3)) for(i in 1:9){ plot(props[[i]][,3],ylim=c(0,1),axes=FALSE,xlab="ratios",ylab="proportions") axis(1,at=1:5,c("1:3","1:2","1:1","2:1","3:1")) axis(2) box() segments(1:5,props[[i]][,1],1:5,props[[i]][,5]) title(main=paste("true ploidy =",dat[[3]][i])) } ## End(Not run)