Title: | Computing f-Statistics and Building Admixture Graphs Based on Allele Count or Pool-Seq Read Count Data |
---|---|
Description: | Functions for the computation of F-, f- and D-statistics (e.g., Fst, hierarchical F-statistics, Patterson's F2, F3, F3*, F4 and D parameters) in population genomics studies from allele count or Pool-Seq read count data and for the fitting, building and visualization of admixture graphs. The package also includes several utilities to manipulate Pool-Seq data stored in standard format (e.g., such as 'vcf' files or 'rsync' files generated by the the 'PoPoolation' software) and perform conversion to alternative format (as used in the 'BayPass' and 'SelEstim' software). As of version 2.0, the package also includes utilities to manipulate standard allele count data (e.g., stored in TreeMix, BayPass and SelEstim format). |
Authors: | Mathieu Gautier [aut, cre] |
Maintainer: | Mathieu Gautier <[email protected]> |
License: | GPL (>= 2) |
Version: | 3.0.0 |
Built: | 2024-11-25 16:23:58 UTC |
Source: | CRAN |
Test all possible connection of a leaf to a graph with non-admixed and or admixed edges
add.leaf( x, leaf.to.add, fstats, only.test.non.admixed.edges = FALSE, only.test.admixed.edges = FALSE, verbose = TRUE, ... )
add.leaf( x, leaf.to.add, fstats, only.test.non.admixed.edges = FALSE, only.test.admixed.edges = FALSE, verbose = TRUE, ... )
x |
An object of class graph.params or fitted.graph (see details) |
leaf.to.add |
Name of the leaf to add |
fstats |
Object of class fstats that contains estimates of the fstats (see compute.fstats) |
only.test.non.admixed.edges |
If TRUE the function only test non.admixed edges (may be far faster) |
only.test.admixed.edges |
If TRUE the function only test admixed edges |
verbose |
If TRUE extra information is printed on the terminal |
... |
Some parameters to be passed the function fit.graph called internally |
The input object x needs to be of class graph.params (as generated by the function generate.graph.params) or fitted.graph (as generated by the function fit.graph or by the function add.leaf itself in the graphs.fit.res elements of the output list). This is to ensure that the matrix describing the structure of the graph (graph slot of these objects) is valid (note that it can be plotted for checks). Hence graph.params objects may have been generated without fstats information (that should be supplied independently to the add.leaf function to obtain information on the fstats involving the candidate leaf defined with the leaf.to.add argument). By default the function tests all the possible positions of a newly added edge connecting the candidate leaf to the graph with both non-admixed (including a new rooting with the candidate leaf as an outgroup) and admixed edges. If n_e is the the number of non-admixed edges of the original graph, the number of tested graphs for non-admixed edges equals n_e+1. The newly added node is named "N-"[name of the leaf to add] (or with more N if the name already exists). For admixed edges, the number of tested graphs equals n_e*(n_e-1)/2 and for a given tested graph, three nodes named "S-"[name of the leaf to add], "S1-"[name of the leaf to add] and "S2-"[name of the leaf to add] (or with more S if the name already exists) are added and the admixture proportions are named with a letter (A to Z depending on the number of admixed nodes already present in the graph).
A list with the following elements:
"n.graphs": The number of tested graphs
"fitted.graphs.list": a list of fitted.graph objects (indexed from 1 to n.graphs and in the same order as the list "graphs") containing the results of fitting of each graph.
"best.fitted.graph": The graph (object of class fitted.graph) with the minimal BIC (see function fit.graph) among all the graphs within fitted.graphs.list
"bic": a vector of the n.graphs BIC (indexed from 1 to n.graphs and in the same order as the "fitted.graphs.list" list) (see fit.graph details for the computation of the scores).
see fit.graph
and generate.graph.params
.
Compute the block-jackknife covariance between two stats
stat1 |
Vector of block-jackknife values for the first stat |
stat2 |
Vector of block-jackknife values for the second stat |
Compute the block-jackknife covariance between two stats with correction
Covariance values
#
#
Sugar to compute the sum of a stat per block
.block_sum(stat, snp_bj_id)
.block_sum(stat, snp_bj_id)
stat |
vector of n stat values |
snp_bj_id |
integer n-length vector with block index (from 0 to nblock-1) of the stat value |
Sugar to compute the sum of a stat per block
Return a vector of length nblocks containing the per-block sums of the input stat
#
#
Sugar to compute the sum of a stat per block defined by a range of SNPs (allow treating overlapping blocks)
.block_sum2(stat, snp_bj_id)
.block_sum2(stat, snp_bj_id)
stat |
vector of n stat values |
snp_bj_id |
integer matrix of dim nblocks x 2 giving for each block the start and end stat value index |
Sugar to compute the sum of a stat per block defined by a range of SNPs (allow treating overlapping blocks)
Return a vector of length nblocks containing the per-block sums of the input stat
#
#
Compare fitted f2, f3 and f4 f-statistics of an admixture graph with estimated ones
compare.fitted.fstats(fstats, fitted.graph, n.worst.stats = 5)
compare.fitted.fstats(fstats, fitted.graph, n.worst.stats = 5)
fstats |
Object of class fstats containing estimates of fstats (as obtained with compute.fstats) |
fitted.graph |
Object of class fitted graph (as obtained with fit.graph function). |
n.worst.stats |
The number of worst statistics to be displayed in the terminal |
Compare fitted and estimated f-statistics may allow identifying problematic edges on the graph.
A matrix with 3 columns for each test (row names of the matrix corresponding to the test):
The estimated f-statistics (mean across block-Jackknife samples)
The fitted f-statistics (obtained from the fitted grah parameters
A Z-score measuring the deviation of the fitted values from the estimated values in units of standard errors (i.e., Z=(fitted.value-target.value)/se(target.value))
See compute.fstats
and fit.graph
Compute the denominator of the Dstat for all quadruplet configuration and each block-jackknife block (if any) and overall SNPs (within or outside blocks)
.compute_blockDdenom(refcount, totcount, nblocks, block_id, verbose)
.compute_blockDdenom(refcount, totcount, nblocks, block_id, verbose)
refcount |
Matrix of nsnpxnpop with counts (genotype or reads) for the reference allele |
totcount |
Matrix of nsnpxnpop with total counts or read coverages |
nblocks |
Integer giving the number of block-jackknife blocs (may be 0 if no block-jackknife) |
block_id |
Integer vector of length nsnps with the (0-indexed) id of the block to which each SNP belongs (-1 for SNPs outside blocks) |
verbose |
Logical (if TRUE progression bar is printed on the terminal) |
Compute the denominator of the Dstat for all quadruplet configuration and each block-jackknife block (if any) and overall SNPs (within or outside blocks)
Return a matrix with nf4=(npops*(npops-1)/2)*((npops-2)*(npops-3)/2)/2 rows and nblocks+1 columns giving the mean Dstat-denominator (1-Q2ab)(1-Q2cd) for all quadruplet configuration and within each block-jackknife sample and over all SNPs (last column)
#
#
Compute all F3 from overall F2 values
.compute_F3fromF2(F2val, Hval, npops)
.compute_F3fromF2(F2val, Hval, npops)
F2val |
Numeric vector of length nF2=(npop*(npop-1))/2 with all pairwise F2 estimates |
Hval |
Numeric vector of length npop with all within pop heterozygosity estimates |
npops |
Integer giving the number of populations |
Compute F3 and F3star estimates from F2 (and heterozygosities)
Return a matrix of length nF3=npops*(npops-1)*(npops-2)/2 rows and 2 columns corresponding to the F3 and F3star estimates
#
#
Compute all F3 from F2 values obtained from each block-jackknife bloc
.compute_F3fromF2samples(blockF2, blockHet, npops, verbose)
.compute_F3fromF2samples(blockF2, blockHet, npops, verbose)
blockF2 |
Numeric Matrix with nF2=(npop*(npop-1))/2 rows and nblocks columns matrix containing pairwise-pop F2 estimates for each block-jackknife sample (l.o.o.) |
blockHet |
Numeric Matrix with npop rows and nblocks columns containing all within pop heterozygosity estimates for each block-jackknife sample (l.o.o.) |
npops |
Integer giving the number of populations |
verbose |
Logical (if TRUE progression bar is printed on the terminal) |
Compute F3 and F3star estimates and their s.e. based on block-jackknife estimates of all F2 (and heterozygosities)
Return a matrix with nF3=npops*(npops-1)*(npops-2)/2 rows and four columns corresponding to the mean and the s.e. of F3 and the mean and s.e. of F3star
#
#
Compute all F4 and Dstat from F2 values obtained from each block-jackknife bloc
.compute_F4DfromF2samples(blockF2, blockDenom, npops, verbose)
.compute_F4DfromF2samples(blockF2, blockDenom, npops, verbose)
blockF2 |
Numeric Matrix with nF2=(npop*(npop-1))/2 rows and nblocks columns matrix containing pairwise-pop F2 estimates for each block-jackknife sample (l.o.o.) |
blockDenom |
Numeric Matrix with nF4=(npops*(npops-1)/2)*((npops-2)*(npops-3)/2)/2 rows and nblocks containing the estimates of the denominator of Dstat (see compute_blockDdenom) for each block-jackknife sample (l.o.o.) |
npops |
Integer giving the number of populations |
verbose |
Logical (if TRUE progression bar is printed on the terminal) |
Compute F4 and D estimates and their s.e. based on block-jackknife estimates of all F2 (and heterozygosities)
Return a matrix with nF4=(npops*(npops-1)/2)*((npops-2)*(npops-3)/2)/2 rows and four columns corresponding to the mean and the s.e. of F4 and the mean and s.e. of Dstat
#
#
Compute all F4 from overall F2 and Q2 values
.compute_F4fromF2(F2val, npops)
.compute_F4fromF2(F2val, npops)
F2val |
Numeric vector of length nF2=(npop*(npop-1))/2 with all pairwise F2 estimates |
npops |
Integer giving the number of populations |
Compute F4 from F2 (and heterozygosities)
Return a vector of length nF4=(npops*(npops-1)/2) * ((npops-2)*(npops-3)/2) / 2 rows corresponding to all the F4 estimates for all possible configurations
#
#
Compute all F4 from F2 values obtained from each block-jackknife bloc
.compute_F4fromF2samples(blockF2, npops, verbose)
.compute_F4fromF2samples(blockF2, npops, verbose)
blockF2 |
Numeric Matrix with nF2=(npop*(npop-1))/2 rows and nblocks columns matrix containing pairwise-pop F2 estimates for each block-jackknife sample (l.o.o.) |
npops |
Integer giving the number of populations |
verbose |
Logical (if TRUE progression bar is printed on the terminal) |
Compute F4 estimates and their s.e. based on block-jackknife estimates of all F2 (and heterozygosities)
Return a matrix with nF4=(npops*(npops-1)/2) * ((npops-2)*(npops-3)/2) / 2 rows and two columns corresponding to the mean and the s.e. of F4 estimates for all possible configurations
#
#
Compute (uncorrected) 1-Q1 for each block-jackknife block (if any) and over all the SNPs (i.e., either within or outside blocks)
.compute_H1(refcount, totcount, nblocks, block_id, verbose)
.compute_H1(refcount, totcount, nblocks, block_id, verbose)
refcount |
Matrix of nsnpxnpop with counts (genotype or reads) for the reference allele |
totcount |
Matrix of nsnpxnpop with total counts or read coverages |
nblocks |
Integer giving the number of block-jackknife blocs (may be 0 if no block-jackknife) |
block_id |
Integer vector of length nsnps with the (0-indexed) id of the block to which each SNP belongs (-1 for SNPs outside blocks) |
verbose |
Logical (if TRUE progression bar is printed on the terminal) |
Compute all the (uncorrected) H1=1-Q1 for each block-jackknife block (if any) and overall SNPs (within or outside blocks). It is indeed more convenient to compute H1 (rather than Q1) to apply corrections afterwards within R function
Return a matrix with npops rows and nblocks+1 column giving the mean H1 of each pop within each block and for all SNPs (last column)
#
#
Compute all Q2 for each block-jackknife block (if any) and overall SNPs (within or outside blocks)
.compute_Q2(refcount, totcount, nblocks, block_id, verbose)
.compute_Q2(refcount, totcount, nblocks, block_id, verbose)
refcount |
Matrix of nsnpxnpop with counts (genotype or reads) for the reference allele |
totcount |
Matrix of nsnpxnpop with total counts or read coverages |
nblocks |
Integer giving the number of block-jackknife blocs (may be 0 if no block-jackknife) |
block_id |
Integer vector of length nsnps with the (0-indexed) id of the block to which each SNP belongs (-1 for SNPs outside blocks) |
verbose |
Logical (if TRUE progression bar is printed on the terminal) |
Compute all Q2 for each block-jackknife block (if any) and overall SNPs (within or outside blocks).
Return a matrix with npops*(npops-1)/2 and nblocks+1 column giving the mean Q2 of each pairwise pop comp. within each block and for all SNPs (last column)
#
#
Compute the Qmat matrix (error covariance between all F2 and F3 measures) from F2 block-jackknife estimates
.compute_QmatfromF2samples(blockF2, npops, verbose)
.compute_QmatfromF2samples(blockF2, npops, verbose)
blockF2 |
Numeric Matrix with nF2=(npop*(npop-1))/2 rows and nblocks columns matrix containing pairwise-pop F2 estimates for each block-jackknife sample (l.o.o.) |
npops |
Integer giving the number of populations |
verbose |
Logical (if TRUE progression bar is printed on the terminal) |
Compute the error covariance matrix Qmat (between all F2 and F3 measures) from F2 block-jackknife estimates (by recomuting all F3 for all blocks)
Return the (nF2+nF3)*(nF2+nF3) error covariance (symmetric) matrix
#
#
Compute SNP-specific MSG, MSP and nc used to derived the Anova estimator of Fst for allele count or read count data (Pool-Seq)
.compute_snpFstAov(refcount, totcount, hapsize, verbose)
.compute_snpFstAov(refcount, totcount, hapsize, verbose)
refcount |
Matrix of nsnpxnpop with counts (genotype or reads) for the reference allele |
totcount |
Matrix of nsnpxnpop with total counts or read coverages |
hapsize |
Vector of length npop giving the haploid size of each pool (if one element <=0, counts are interpreted as count data) |
verbose |
Logical (if TRUE progression bar is printed on the terminal) |
Compute SNP-specific Q1 and Q2 based on Anova estimator of Fst for allele count or read count data (Pool-Seq). For allele count data, the implemented estimator corresponds to that described in Weir, 1996 (eq. 5.2) For read (Pool-Seq) data, the implemented estimator corresponds to that described in Hivert et al., 2016
Return a nsnpsx3 matrix with SNP-specific MSG, MSP and nc
#
#
Compute SNP-specific MSI, MSP, MSG, nc, nc_p and nc_pp used to derived the Anova estimator of hier. Fst for allele count or read count data (Pool-Seq)
.compute_snpHierFstAov(refcount, totcount, hapsize, popgrpidx, verbose)
.compute_snpHierFstAov(refcount, totcount, hapsize, popgrpidx, verbose)
refcount |
Matrix of nsnpxnpop with counts (genotype or reads) for the reference allele |
totcount |
Matrix of nsnpxnpop with total counts or read coverages |
hapsize |
Vector of length npop giving the haploid size of each pool (if one element <=0, counts are interpreted as count data) |
popgrpidx |
Vector of length npop giving the index (coded from 0 to ngrp-1) of the group of origin |
verbose |
Logical (if TRUE progression bar is printed on the terminal) |
Compute SNP-specific MSI, MSP, MSG, nc, nc_p and nc_pp used to derived the Anova estimator of hier. Fst for allele count or read count data (Pool-Seq)
Return a nsnpsx6 matrix with SNP-specific MSI, MSP, MSG, nc, nc_p and nc_pp
#
#
Compute SNP-specific Q1 by averaging over all samples
.compute_snpQ1(refcount, totcount, weight, verbose)
.compute_snpQ1(refcount, totcount, weight, verbose)
refcount |
Matrix of nsnpxnpop with counts (genotype or reads) for the reference allele |
totcount |
Matrix of nsnpxnpop with total counts or read coverages |
weight |
Vector of length npop giving the weighting scheme (w=1 for allele count data and w=poolsize/(poolsize-1) for PoolSeq data) |
verbose |
Logical (if TRUE progression bar is printed on the terminal) |
Compute all the SNP-specific Q1 over all pop. samples (useful for Fst computation with method Identity).
Return a vector of length nsnps with SNP-specific Q1
#
#
Compute SNP-specific Q1 for one pop
.compute_snpQ1onepop(refcount, totcount, weight)
.compute_snpQ1onepop(refcount, totcount, weight)
refcount |
Vector of nsnp counts (genotype or reads) for the reference allele |
totcount |
Vector of nsnp total counts or read coverages |
weight |
Numeric (w=1 for allele count data and w=poolsize/(poolsize-1) for PoolSeq data) |
Compute SNP-specific Q1 for one pop. samples.
Return a vector of length nsnps with SNP-specific Q1
#
#
Compute SNP-specific Q1 over all samples using weighting averages of pop. Q1 (eq. A46 in Hivert et al., 2018)
.compute_snpQ1rw(refcount, totcount, weight, sampsize, readcount, verbose)
.compute_snpQ1rw(refcount, totcount, weight, sampsize, readcount, verbose)
refcount |
Matrix of nsnpxnpop with counts (genotype or reads) for the reference allele |
totcount |
Matrix of nsnpxnpop with total counts or read coverages |
weight |
Vector of length npop giving the weighting scheme (w=1 for allele count data and w=poolsize/(poolsize-1) for PoolSeq data) |
sampsize |
Vector of length npop giving the haploid sample size (not used for count data) |
readcount |
Logical (if TRUE PoolSeq data assumed i.e. weights depending on haploid size, otherwise weights depend on total counts) |
verbose |
Logical (if TRUE progression bar is printed on the terminal) |
Compute all the SNP-specific Q1 over all pop. samples using weighting averages of pop. Q1 as in eq. A46 of Hivert et al., 2018 (useful for Fst computation with method Identity).
Return a vector of length nsnps with SNP-specific Q1
#
#
Compute SNP-specific Q2 by averaging over all pairs of samples
.compute_snpQ2(refcount, totcount, pairs, verbose)
.compute_snpQ2(refcount, totcount, pairs, verbose)
refcount |
Matrix of nsnpxnpop with counts (genotype or reads) for the reference allele |
totcount |
Matrix of nsnpxnpop with total counts or read coverages |
pairs |
Matrix of npoppairsx2 giving the index for all the pairs of pops included in the computation |
verbose |
Logical (if TRUE progression bar is printed on the terminal) |
Compute all the SNP-specific Q2 over all pop. pairs (useful for Fst computation with method Identity).
Return a vector of length nsnps with SNP-specific Q2
#
#
Compute SNP-specific Q2 for a single pair of samples
.compute_snpQ2onepair(refcount1, refcount2, totcount1, totcount2)
.compute_snpQ2onepair(refcount1, refcount2, totcount1, totcount2)
refcount1 |
Vector of count (genotype or reads) for the reference allele in the first sample |
refcount2 |
Vector of count (genotype or reads) for the reference allele in the second sample |
totcount1 |
Vector of total count or read coverages in the first sample |
totcount2 |
Vector of total count or read coverages in the second sample |
Compute SNP-specific Q2 for a single pair of samples
Return a vector of length nsnps with SNP-specific Q1
#
#
Compute SNP-specific Q2 by averaging over all pairs of samples using weighting averages of pairwise Q2 (eq. A47 in Hivert et al., 2018)
.compute_snpQ2rw(refcount, totcount, pairs, sampsize, readcount, verbose)
.compute_snpQ2rw(refcount, totcount, pairs, sampsize, readcount, verbose)
refcount |
Matrix of nsnpxnpop with counts (genotype or reads) for the reference allele |
totcount |
Matrix of nsnpxnpop with total counts or read coverages |
pairs |
Matrix of npoppairsx2 giving the index for all the pairs of pops included in the computation |
sampsize |
Vector of length npop giving the haploid sample size (not used for count data) |
readcount |
Logical (if TRUE PoolSeq data assumed i.e. weights depending on haploid size, otherwise weights depend on total counts) |
verbose |
Logical (if TRUE progression bar is printed on the terminal) |
Compute SNP-specific Q2 by averaging over all pairs of samples using weighting averages of pairwise Q2 (eq. A47 in Hivert et al., 2018) (useful for Fst computation with method Identity).
Return a vector of length nsnps with SNP-specific Q2
#
#
Compute F4ratio (estimation of admixture rate) from an fstats object
compute.f4ratio(x, num.quadruplet, den.quadruplet)
compute.f4ratio(x, num.quadruplet, den.quadruplet)
x |
A fstats object containing estimates of fstats |
num.quadruplet |
A character string for the F4 quadruplet used in the F4ratio numerator (should be of the form "A,O;C,X" where A, O, C and X are the names of the population as defined in the countdata or pooldata object used to obtain fstats, see details) |
den.quadruplet |
A character string for the F4 quadruplet used in the F4ratio denominator (should be of the form "A,O;C,B" where A, O, C and B are the names of the populations as defined in the countdata or pooldata object used to obtain fstats, see details)) |
Assuming a 4 population phylogeny rooted with an outgroup O of the form (((A,B);C);O) and an admixed population X with two source populations related to B and C, the admixture rate alpha of the B-related ancestry is obtained using the ratio F4(A,O;C,X)/F4(A,O;C,B) (see Patterson et al., 2012 for more details).
A vector with 5 elements corresponding. The first element is always the estimated value. If F2 block-jackknife samples are available in the input fstats object (i.e., compute.fstats was run with return.F2.blockjackknife.samples = TRUE), the four other elements are the block-jackknife mean; the block-jackknife s.e.; and the lower and upper bound of the 95
To generate pooldata object, see vcf2pooldata
, popsync2pooldata
,genobaypass2pooldata
or genoselestim2pooldata
. To generate coundata object, see genobaypass2countdata
or genotreemix2countdata
.
make.example.files(writing.dir=tempdir()) pooldata=popsync2pooldata(sync.file=paste0(tempdir(),"/ex.sync.gz"),poolsizes=rep(50,15)) res.fstats=compute.fstats(pooldata)
make.example.files(writing.dir=tempdir()) pooldata=popsync2pooldata(sync.file=paste0(tempdir(),"/ex.sync.gz"),poolsizes=rep(50,15)) res.fstats=compute.fstats(pooldata)
Estimate the F-statistics (F2, F3, F3star, F4, Dstat) and within and across population diversity
compute.fstats( x, nsnp.per.bjack.block = 0, computeDstat = FALSE, computeF3 = TRUE, computeF4 = TRUE, output.pairwise.fst = TRUE, output.pairwise.div = TRUE, computeQmat = TRUE, return.F2.blockjackknife.samples = FALSE, return.F4.blockjackknife.samples = FALSE, verbose = TRUE )
compute.fstats( x, nsnp.per.bjack.block = 0, computeDstat = FALSE, computeF3 = TRUE, computeF4 = TRUE, output.pairwise.fst = TRUE, output.pairwise.div = TRUE, computeQmat = TRUE, return.F2.blockjackknife.samples = FALSE, return.F4.blockjackknife.samples = FALSE, verbose = TRUE )
x |
A pooldata object containing Pool-Seq information or a countdata object containing allele count information |
nsnp.per.bjack.block |
Number of consecutive SNPs within a block for block-jackknife (default=0, i.e., no block-jackknife sampling) |
computeDstat |
If TRUE compute Dstatistics (i.e. scaled F4). This may add some non negligible computation time if the number of population is large (n>15) |
computeF3 |
If TRUE (default) compute all F3 and all F3star (i.e. scaled F3). |
computeF4 |
If TRUE (default) compute all F4. |
output.pairwise.fst |
If TRUE (default), output the npopxnpop matrix of pairwise-population Fst estimates (corresponding to the "Identity" method implemented in |
output.pairwise.div |
If TRUE (default), output the npopxnpop matrix of pairwise-population divergence (1-Q2) estimates in the pairwise.div slot of the fstats output object (see help(fstats) for details) that may be visualized with e.g. heatmap function or used with a clustering function (e.g., hclust). |
computeQmat |
If TRUE, compute the error covariance matrix between all F3 and F2 statistics (needed for admixture graph construction). This matrix may be very large if the number of pops is large. It is recommended to estimate it on a reduced sample of pops. |
return.F2.blockjackknife.samples |
If TRUE (and nsnp.per.bjack.block>0) return an array of dimension (npopxnpopxnblocks) in an admixtools2 compatible format |
return.F4.blockjackknife.samples |
Deprecated options (since v. 2.2.0) |
verbose |
If TRUE extra information is printed on the terminal |
The function estimates for the n populations (or pools) represented in the input object x:
The F2 statistics for all the pairs of populations (or pools) and their scaled version (equivalent, but faster, than Fst estimated with
compute.pairwiseFST
when method="Identity")
If n>2, The F3 statistics for all the possible triplets of populations (or pools) and their scaled version (named F3star after Patterson et al., 2012)
If n>3, The F4 statistics and the D-statistics (a scaled version of the F4) for all the possible quadruplets of populations
The estimated within population heterozygosities (=1-Q1)
The estimated divergence for each pair of populations (=1-Q2)
An object of class fstats (see help(fstats) for details)
To generate pooldata object, see vcf2pooldata
, popsync2pooldata
,genobaypass2pooldata
or genoselestim2pooldata
. To generate coundata object, see genobaypass2countdata
or genotreemix2countdata
.
make.example.files(writing.dir=tempdir()) pooldata=popsync2pooldata(sync.file=paste0(tempdir(),"/ex.sync.gz"),poolsizes=rep(50,15)) res.fstats=compute.fstats(pooldata)
make.example.files(writing.dir=tempdir()) pooldata=popsync2pooldata(sync.file=paste0(tempdir(),"/ex.sync.gz"),poolsizes=rep(50,15)) res.fstats=compute.fstats(pooldata)
Compute pairwise population population FST matrix (and possibly all pairwise SNP-specific FST)
compute.pairwiseFST( x, method = "Anova", min.cov.per.pool = -1, max.cov.per.pool = 1e+06, min.indgeno.per.pop = -1, min.maf = -1, output.snp.values = FALSE, nsnp.per.bjack.block = 0, verbose = TRUE )
compute.pairwiseFST( x, method = "Anova", min.cov.per.pool = -1, max.cov.per.pool = 1e+06, min.indgeno.per.pop = -1, min.maf = -1, output.snp.values = FALSE, nsnp.per.bjack.block = 0, verbose = TRUE )
x |
A pooldata object containing Pool-Seq information or a countdata object containing allele count information |
method |
Either "Anova" (default method as described in the manuscript) or "Identity" (relies on an alternative modeling consisting in estimating unbiased Probability of Identity within and across pairs of pools) |
min.cov.per.pool |
For Pool-Seq data (i.e., pooldata objects) only: minimal allowed read count (per pool). If at least one pool is not covered by at least min.cov.perpool reads, the position is discarded in the corresponding pairwise comparisons |
max.cov.per.pool |
For Pool-Seq data (i.e., pooldata objects) only: maximal allowed read count (per pool). If at least one pool is covered by more than min.cov.perpool reads, the position is discarded in the corresponding pairwise comparisons. |
min.indgeno.per.pop |
For allele count data (i.e., countdata objects) only: minimal number of overall counts required in each population. If at least one pop is not genotyped for at least min.indgeno.per.pop (haploid) individual, the position is discarded |
min.maf |
Minimal allowed Minor Allele Frequency (computed from the ratio over all read counts for the reference allele over the read coverage) in the pairwise comparisons. |
output.snp.values |
If TRUE, provide SNP-specific pairwise FST for each comparisons (may lead to a huge result object if the number of pools and/or SNPs is large) |
nsnp.per.bjack.block |
Number of consecutive SNPs within a block for block-jackknife (default=0, i.e., no block-jackknife sampling) |
verbose |
If TRUE extra information is printed on the terminal |
An object of class pairwisefst (see help(pairwisefst) for details)
To generate pooldata object, see vcf2pooldata
, popsync2pooldata
,genobaypass2pooldata
or genoselestim2pooldata
. To generate coundata object, see genobaypass2countdata
or genotreemix2countdata
.
make.example.files(writing.dir=tempdir()) pooldata=popsync2pooldata(sync.file=paste0(tempdir(),"/ex.sync.gz"),poolsizes=rep(50,15)) PairwiseFST=compute.pairwiseFST(pooldata)
make.example.files(writing.dir=tempdir()) pooldata=popsync2pooldata(sync.file=paste0(tempdir(),"/ex.sync.gz"),poolsizes=rep(50,15)) PairwiseFST=compute.pairwiseFST(pooldata)
Compute Fst from Pool-Seq data or Count data
computeFST( x, method = "Anova", struct = NULL, weightpid = FALSE, nsnp.per.bjack.block = 0, sliding.window.size = 0, verbose = TRUE )
computeFST( x, method = "Anova", struct = NULL, weightpid = FALSE, nsnp.per.bjack.block = 0, sliding.window.size = 0, verbose = TRUE )
x |
A pooldata object containing Pool-Seq information or countdata object containing allele counts information |
method |
Either "Anova" (default method) or "Identity" (relying on unbiased estimators of Probability of Identity within and across pairs of pools/populations) |
struct |
Vector of length equal to the number of pop. sample that give the pop. sample group name of index (i.e., structure) |
weightpid |
When method="Identity", if TRUE weighting averages of pop. Q1 and pairwise Q2 are performed (see eq. A46 and A47 in Hivert et al., 2018 for PoolSeq and Rousset 2007 for count data) to compute overall Q1 and Q2. If not, unweighted averages are performed. |
nsnp.per.bjack.block |
Number of consecutive SNPs within a block for block-jackknife (default=0, i.e., no block-jackknife sampling) |
sliding.window.size |
Number of consecutive SNPs within a window for multi-locus computation of Fst over sliding window with half-window size step (default=0, i.e., no sliding-window scan) |
verbose |
If TRUE extra information is printed on the terminal |
A list with the four following elements:
"FST": estimate of genome-wide Fst over all the populations. The element is a vector with 5 elements corresponding to i) the estimated value over all SNPs; ii) the block-jackknife mean; iii) the block-jackknife s.e.; iv) the lower; and v) the upper bound of the 95
"FSG": under the hierarchical Fst model (i.e., when struct vector is non-null); estimates estimate of genome-wide within-group differentiation (Fsg). The element is a vector with 5 elements corresponding to i) the estimated value over all SNPs; ii) the block-jackknife mean; iii) the block-jackknife s.e.; iv) the lower; and v) the upper bound of the 95
"FGT": under the hierarchical Fst model (i.e., when struct vector is non-null); estimates estimate of genome-wide between-group differentiation (Fgt). The element is a vector with 5 elements corresponding to i) the estimated value over all SNPs; ii) the block-jackknife mean; iii) the block-jackknife s.e.; iv) the lower; and v) the upper bound of the 95
"snp.Fstats": a data frame containing SNP-specific estimates of Fst and also under the hierarchical (i.e., when struct vector is non-null) SNP-specific estimates Fsg and Fgt
"snp.Q": a data frame containing SNP-specific estimates of Q1 (within-population) and Q2 (between-population) probability of identity and also under the hierarchical (i.e., when struct vector is non-null) SNP-specific estimates of Q3, the probability of identity between populations from different groups (under this model Q2 is then the Pid between populations from the same group).
"sliding.windows.fvalues" (if sliding.window.size>0): a 4 or 6 (under hierarchical Fst model) column data frame containing information on multi-locus Fst (and Fsg and Fgt under the hierarchical Fst model) computed for sliding windows of SNPs over the whole genome with i) column with the chromosome/contig of origin of each window; ii) the mid-position of each window; iii) the cumulated mid-position of each window (to facilitate further plotting); iv) the estimated multi-locus Fst; and under the hierarchical Fst model v) the estimated multi-locus Fsg and ; vi) the estimated multi-locus Fgt
To generate pooldata object, see vcf2pooldata
, popsync2pooldata
,genobaypass2pooldata
or genoselestim2pooldata
. To generate coundata object, see genobaypass2countdata
or genotreemix2countdata
.
make.example.files(writing.dir=tempdir()) pooldata=popsync2pooldata(sync.file=paste0(tempdir(),"/ex.sync.gz"),poolsizes=rep(50,15)) res.fst=computeFST(pooldata) res.hierfst=computeFST(pooldata,struct=c(rep("A",5),rep("B",7),rep("C",3)))
make.example.files(writing.dir=tempdir()) pooldata=popsync2pooldata(sync.file=paste0(tempdir(),"/ex.sync.gz"),poolsizes=rep(50,15)) res.fst=computeFST(pooldata) res.hierfst=computeFST(pooldata,struct=c(rep("A",5),rep("B",7),rep("C",3)))
S4 class to represent a Count data set.
npops
The number of populations
nsnp
The number of SNPs
refallele.count
A matrix (nsnp rows and npops columns) with the allele counts for the reference allele
total.count
A matrix (nsnp rows and npops columns) with the total number of counts (i.e., twice the number of genotyped individual for diploid species and autosomal markers)
snp.info
A data frame (nsnp rows and 4 columns) detailing for each SNP, the chromosome (or scaffold), the position, Reference allele name and Alternate allele name (if available)
popnames
A vector of length npops with the corresponding population names
To generate countdata object, see genobaypass2countdata
and genotreemix2countdata
Create a subset of a countdata object that contains count data as a function of pop or SNP indexes
countdata.subset( countdata, pop.index = 1:countdata@npops, snp.index = 1:countdata@nsnp, min.indgeno.per.pop = -1, min.maf = -1, return.snp.idx = FALSE, verbose = TRUE )
countdata.subset( countdata, pop.index = 1:countdata@npops, snp.index = 1:countdata@nsnp, min.indgeno.per.pop = -1, min.maf = -1, return.snp.idx = FALSE, verbose = TRUE )
countdata |
A countdata object containing Allele count information |
pop.index |
Indexes of the pools (at least two), that should be selected to create the new pooldata object (default=all the pools) |
snp.index |
Indexes of the SNPs (at least two), that should be selected to create the new pooldata object (default=all the SNPs) |
min.indgeno.per.pop |
Minimal number of overall counts required in each population. If at least one pop is not genotyped for at least min.indgeno.per.pop (haploid) individual, the position is discarded |
min.maf |
Minimal allowed Minor Allele Frequency (computed from the ratio overall counts for the reference allele over the overall number of (haploid) individual genotyped) |
return.snp.idx |
If TRUE, the row.names of the snp.info slot of the returned pooldata object are named as "rsx" where x is the index of SNP in the initial pooldata object (default=FALSE) |
verbose |
If TRUE return some information |
This function allows subsetting a pooldata object by selecting only some pools and/or some SNPs (e.g., based on their position on the genome). Additional filtering steps on SNPs can be carried out on the resulting subset to discard SNP with low polymorphism or poorly or too highly covered. In addition, coverage criteria can be applied on a per-pool basis with the cov.qthres.per.pool argument. 'more specific SNP selection based on their positions on the genome or their characteristics. For instance if qmax=0.95, a position is discarded if in a given pool it has a number of reads higher than the 95-th percentile of the empirical coverage distribution in this same pool (defined over the SNPs selected by snp.index). Similarly, if qmax=0.05, a position is discarded if in a given pool it has a number of reads lower than the 5-th percentile of the empirical coverage distribution in this same pool. This mode of selection may be more relevant when considering pools with heterogeneous read coverages.
A countdata object with 6 elements:
"refallele.count": a matrix (nsnp rows and npops columns) with the allele counts for the reference allele
"total.count": a matrix (nsnp rows and npops columns) with the total number of counts (i.e., twice the number of genotyped individual for diploid species and autosomal markers)
"snp.info": a matrix with nsnp rows and four columns containing respectively the contig (or chromosome) name (1st column) and position (2nd column) of the SNP; the allele taken as reference in the refallele.count matrix (3rd column); and the alternative allele (4th column)
"popnames": a vector of length npops containing the names of the pops
"nsnp": a scalar corresponding to the number of SNPs
"npops": a scalar corresponding to the number of populations
To generate countdata object, see genobaypass2countdata
, genotreemix2countdata
make.example.files(writing.dir=tempdir()) pooldata=popsync2pooldata(sync.file=paste0(tempdir(),"/ex.sync.gz"),poolsizes=rep(50,15)) pooldata2genobaypass(pooldata=pooldata,writing.dir=tempdir()) ##NOTE: This example is just for the sake of illustration as it amounts to ##interpret read count as allele count which must not be done in practice! countdata=genobaypass2countdata(genobaypass.file=paste0(tempdir(),"/genobaypass")) subset.by.snps=countdata.subset(countdata,snp.index=10:100) subset.by.pops.and.snps=countdata.subset(countdata,pop.index=c(1,2),snp.index=10:100)
make.example.files(writing.dir=tempdir()) pooldata=popsync2pooldata(sync.file=paste0(tempdir(),"/ex.sync.gz"),poolsizes=rep(50,15)) pooldata2genobaypass(pooldata=pooldata,writing.dir=tempdir()) ##NOTE: This example is just for the sake of illustration as it amounts to ##interpret read count as allele count which must not be done in practice! countdata=genobaypass2countdata(genobaypass.file=paste0(tempdir(),"/genobaypass")) subset.by.snps=countdata.subset(countdata,snp.index=10:100) subset.by.pops.and.snps=countdata.subset(countdata,pop.index=c(1,2),snp.index=10:100)
Convert a countdata object into BayPass allele count file. A file containing SNP details is also printed out. Options to generate sub-samples (e.g., for large number of SNPs) are also available.
countdata2genobaypass( countdata, writing.dir = getwd(), prefix = "", subsamplesize = -1, subsamplingmethod = "thinning" )
countdata2genobaypass( countdata, writing.dir = getwd(), prefix = "", subsamplesize = -1, subsamplingmethod = "thinning" )
countdata |
A countdata object |
writing.dir |
Directory where to create the files (e.g., set writing.dir=getwd() to copy in the current working directory) |
prefix |
Prefix used for output file names |
subsamplesize |
Size of the sub-samples. If <=1 (default), all the SNPs are considered in the output |
subsamplingmethod |
If sub-sampling is activated (argument subsamplesize), define the method used for subsampling that might be either i) "random" (A single data set consisting of randmly chosen SNPs is generated) or ii) "thinning", sub-samples are generated by taking SNPs one every nsub=floor(nsnp/subsamplesize) in the order of the map (a suffix ".subn" is added to each sub-sample files where n varies from 1 to nsub). |
Files containing allele count (in BayPass format) and SNP details (as in the snp.info matrix from the countdata object)
To generate countdata object, see genotreemix2countdata
, genobaypass2countdata
make.example.files(writing.dir=tempdir()) pooldata=popsync2pooldata(sync.file=paste0(tempdir(),"/ex.sync.gz"),poolsizes=rep(50,15)) ##NOTE: This example is just for the sake of illustration as it amounts to ##interpret read count as allele count which must not be done in practice! countdata=genobaypass2countdata(genobaypass.file=paste0(tempdir(),"/genobaypass")) countdata2genobaypass(countdata=countdata,writing.dir=tempdir())
make.example.files(writing.dir=tempdir()) pooldata=popsync2pooldata(sync.file=paste0(tempdir(),"/ex.sync.gz"),poolsizes=rep(50,15)) ##NOTE: This example is just for the sake of illustration as it amounts to ##interpret read count as allele count which must not be done in practice! countdata=genobaypass2countdata(genobaypass.file=paste0(tempdir(),"/genobaypass")) countdata2genobaypass(countdata=countdata,writing.dir=tempdir())
Extract the alleles from the REF and ALT fields
.extract_allele_names(allele_info, allele_idx)
.extract_allele_names(allele_info, allele_idx)
allele_info |
a character string vector (concatenated REF and ALT field of the vcf) |
allele_idx |
Matrix with indexes of the two alleles of interest for the different markers |
Extract the alleles from the REF and ALT fields
Return a matrix with the two alleles after parsing the alleles info
.extract_allele_names(c("A,C","A,C,T"),rbind(c(1,2),c(1,3)))
.extract_allele_names(c("A,C","A,C,T"),rbind(c(1,2),c(1,3)))
Extract counts from vcf produced by other caller than VarScan (e.g., bcftools, FreeBayes, GATK)
.extract_nonvscan_counts(vcf_data, nb_all, ad_idx, min_rc)
.extract_nonvscan_counts(vcf_data, nb_all, ad_idx, min_rc)
vcf_data |
a matrix of String containing count information |
nb_all |
a vector containing the number of alleles for the different markers |
ad_idx |
the index of the FORMAT AD field |
min_rc |
Minimal allowed read count per base (same as min.rc option in |
Extract VarScan counts and return read counts for the reference and alternate allele
A numeric matrix of read count with nsnp rows and 2*npools+6 columns. The first npools columns consist of read count for the reference allele, columns npools+1 to 2*npools consist of read coverage. The last 6 columns correspond to the index of the two most frequent alleles (idx_all1 and idx_all2) and their count (cnt_all1 and cnt_all2); the min_rc filtering criterion and count of variant (cnt_bases) other than two first most frequent. The min_rc crit is set to -1 for polymorphisms with more than 2 alleles and with the third most frequent alleles having more than min_rc count
.extract_nonvscan_counts(rbind(c("0/0:20,0","1/1:1,18"),c("0/2:12,1,15","1/1:27,1,0")),c(2,3),2,0) .extract_nonvscan_counts(rbind(c("0/0:20,0","1/1:1,18"),c("0/2:12,1,15","1/1:27,1,0")),c(2,3),2,2)
.extract_nonvscan_counts(rbind(c("0/0:20,0","1/1:1,18"),c("0/2:12,1,15","1/1:27,1,0")),c(2,3),2,0) .extract_nonvscan_counts(rbind(c("0/0:20,0","1/1:1,18"),c("0/2:12,1,15","1/1:27,1,0")),c(2,3),2,2)
Extract VarScan counts
.extract_vscan_counts(vcf_data, ad_idx, rd_idx)
.extract_vscan_counts(vcf_data, ad_idx, rd_idx)
vcf_data |
a matrix of String containing count information in VarScan format |
ad_idx |
the index of the FORMAT AD field |
rd_idx |
the index of the FORMAT RD field |
Extract VarScan counts and return read counts for the reference and alternate allele. For VarScan generated vcf, SNPs with more than one alternate allele are discarded (because only a single count is then reported in the AD fields) making the min.rc unavailable (of vcf2pooldata). The VarScan –min-reads2 option might replace to some extent the min.rc functionality although SNP where the two major alleles in the Pool-Seq data are different from the reference allele (e.g., expected to be more frequent when using a distantly related reference genome for mapping) will be disregarded.
A numeric matrix of read count with nsnp rows and 2*npools columns. The first npools columns consist of read count for the reference allele (RD), columns npools+1 to 2*npools consist of read coverage (RD+AD)
.extract_vscan_counts(rbind(c("0/0:0:20","1/1:18:1"),c("0/1:12:15","1/1:27:2")),3,2)
.extract_vscan_counts(rbind(c("0/0:0:20","1/1:18:1"),c("0/1:12:15","1/1:27:2")),3,2)
Search for the closest indels of the markers
.find_indelneighbor_idx(contig, position, indels_idx, min_dist, indels_size)
.find_indelneighbor_idx(contig, position, indels_idx, min_dist, indels_size)
contig |
a character string vector corresponding to the CHR field value of the vcf for the markers |
position |
an integer vector corresponding to the POSITION value for the markers |
indels_idx |
vector of (0-indexed) indices of indels |
min_dist |
same as min.dist.from.indels option in |
indels_size |
size of the indels (associated to indels_idx) |
Identify if the SNPs are close to an indel
Return a vector consisting of 1 (if the marker is close to an indel) or 0 (if not)
.find_indelneighbor_idx(c("chr1","chr1","chr1"),c(1000,1004,1020),1,5,2)
.find_indelneighbor_idx(c("chr1","chr1","chr1"),c(1000,1004,1020),1,5,2)
Find sets of populations that may used as scaffold tree
find.tree.popset( fstats, f3.zcore.threshold = -1.65, f4.zscore.absolute.threshold = 1.96, excluded.pops = NULL, nthreads = 1, verbose = TRUE )
find.tree.popset( fstats, f3.zcore.threshold = -1.65, f4.zscore.absolute.threshold = 1.96, excluded.pops = NULL, nthreads = 1, verbose = TRUE )
fstats |
Object of class fstats containing estimates of fstats (see the function compute.fstats) |
f3.zcore.threshold |
The significance threshold for Z-score of formal test of admixture based on the F3-statistics (default=-2) |
f4.zscore.absolute.threshold |
The significance threshold for |Z-score| of formal test of treeness based on the F4-statistics (default=2) |
excluded.pops |
Vector of pop names to be exclude from the exploration |
nthreads |
Number of available threads for parallelization of some part of the parsing (default=1, i.e., no parallelization) |
verbose |
If TRUE extra information is printed on the terminal |
The procedure first discards all the populations P that shows a significant signal of admixture with a Z-score for F3 statistics of the form F3(P;Q,R) < f3.zscore.thresholds. It then identifies all the sets of populations that pass the F4-based treeness with themselves. More precisely, for a given set E containing n populations, the procedure ensure that all the n(n-1)(n-2)(n-3)/8 possible F4 quadruplets have a |Z-score|<f4.zscore.absolute.threshold. The function aims at maximizing the size of the sets.
A list with the following elements:
"n.sets": The number of sets of (scaffold) unadmixed populations identified
"set.size": The number of populations included in each set
"pop.sets": A character matrix of n.sets rows and set.size columns giving for each set identified the names of the included populations.
"Z_f4.range": A matrix of n.sets rows and 2 columns reported for each set the range of variation (min and max value) of the absolute F4 Z-scores for the quadruplets passing the treeness test. More precisely, for a given set consisting of n=set.size populations, a total of n(n-1)(n-2)(n-3)/8 quadruplets can be formed. Yet, any set of four populations A, B, C and D is represented by three quadruplets A,B;C,D (or one of its seven other equivalent combinations formed by permuting each pairs); A,C;B,D (or one of its seven other equivalent combinations) and A,D;B,C (or one of its seven other combinations). Among these three, only a single quadruplet is expected to pass the treeness test (i.e., if the correct unrooted tree topology is (A,C;B,D), then the absoulte value of the Z-scores associated to F4(A,B;C,D) and F4(A,D;B,C) or their equivalent will be high.
"passing.quadruplets": A matrix of n.sets rows and set.size columns reporting for each sets the n(n-1)(n-2)(n-3)/24 quadruplets that pass the treeness test (see Z_f4.range detail).
see compute.fstats
.
make.example.files(writing.dir=tempdir()) pooldata=popsync2pooldata(sync.file=paste0(tempdir(),"/ex.sync.gz"),poolsizes=rep(50,15)) res.fstats=compute.fstats(pooldata,nsnp.per.bjack.block = 50) #NOTE: toy example (in practice nsnp.per.bjack.block should be higher) popsets=find.tree.popset(res.fstats,f3.zcore.threshold=-3)
make.example.files(writing.dir=tempdir()) pooldata=popsync2pooldata(sync.file=paste0(tempdir(),"/ex.sync.gz"),poolsizes=rep(50,15)) res.fstats=compute.fstats(pooldata,nsnp.per.bjack.block = 50) #NOTE: toy example (in practice nsnp.per.bjack.block should be higher) popsets=find.tree.popset(res.fstats,f3.zcore.threshold=-3)
Estimate parameters of an admixture graph
fit.graph( graph.params, Q.lambda = 0, eps.admix.prop = 1e-06, edge.fact = 1000, admix.fact = 100, compute.ci = F, drift.scaling = F, outfileprefix = NULL, verbose = TRUE )
fit.graph( graph.params, Q.lambda = 0, eps.admix.prop = 1e-06, edge.fact = 1000, admix.fact = 100, compute.ci = F, drift.scaling = F, outfileprefix = NULL, verbose = TRUE )
graph.params |
An object of class graph.params containing graph information and relevant Fstats estimates (see the function generate.graph.params) |
Q.lambda |
A scalar (usually small) to add to the diagonal elements of the error covariance matrix of fstats estimates (may improve numerical stability of its decomposition for large number of populations) |
eps.admix.prop |
A scalar defining admixture proportion domain (eps.admix.prop vary between eps.admix.prop and 1-eps.admix.prop) |
edge.fact |
The multiplying factor of edges length in graph representation |
admix.fact |
The multiplying factor of admixture proportion in graph representation |
compute.ci |
Derive 95% Confidence Intervals for the parameters of the admixture graph (edge lengths and admixture rates) |
drift.scaling |
If TRUE scale edge lengths in drift units (require estimates of leave heterozygosities) |
outfileprefix |
The prefix of the dot file that will represent the graph (with extension ".dot"). If NULL, no graph file generated |
verbose |
If TRUE extra information is printed on the terminal |
Let represent the n-length vector of basis target (i.e., observed) F2 and F3 statistics and
the vector of their expected values given the vector of graph edges lengths
and the incidence matrix
that depends on the structure of the graph and the admixture rates
(if there is no admixture in the graph,
only contains 0 or 1). The function attempts to find the
and
graph parameter values that minimize a cost (score of the model) defined as
. Assuming
(i.e., the observed f-statistics vector is multivariate normal distributed around an expected g vector specified by the admixture graph and a covariance structure empirically estimated),
where
is the likelihood of the fitted graph and
. Also, for model comparison purpose, a standard
is then derived from
as
(p being the number of graph parameters, i.e., edge lengths and admixture rates).
As mentioned by Patterson et al. (2012), the score
is quadratic in edge lengths
given
. The function uses the Lawson-Hanson non-negative linear least squares algorithm implemented in the nnls function (package nnls) to estimate
(subject to the constraint of positive edge lengths) by finding the vector
that minimize
(where
results from the Cholesky decomposition of
, i.e.,
). Note that the *Q.lambda* argument may be used to add a small constant (e.g.,
) to the diagonal elements of
to avoid numerical problems (see Patterson et al., 2012). Yet *Q.lambda* is always disregarded when computing the final score
and
. Minimization of
is thus reduced to the identification of the admixture rates (
vector) which is performed using the L-BFGS-B method (i.e., Limited-memory Broyden-Fletcher-Goldfarb-Shanno algorithm with box constraints) implemented in the optim function (stats package). The *eps.admix.prop* argument allows specifying the lower and upper bound of the admixture rates to *eps.admix.prop* and *1-eps.admix.prop* respectively.
Scaling of the edges lengths in drift units (i.e., in units of
where
is time in generations and
is the effective population size) is performed as described in Lipson et al. (MBE, 2013) by dividing the estimated edges lengths by half the estimated heterozygosity of their parental nodes (using the property
where
and
are the heterozygosities of a child C and its parent P node and
is the estimated length of the branch relating C and P.
Finally, if compute.ci=TRUE, a (rough)
confidence intervals is computed using a bisection method (with a
precision) for each parameters in turn (all others being set to their estimated value). Note that
CI are here defined as the set of values associated to a score
such that
(where
is the optimized score), i.e., with a likelihood-ratio test statistic with respect to the fitted values
(the
threshold of a one ddl Chi-square distribution).
An object of class fitted.graph (see help(fitted.graph) for details)
To generate a graph.params object, see generate.graph.params
. The fitted graph may be plotted directly using plot that calls grViz() function and the resulting fitted fstats may be compared to the estimated ones with compare.fitted.fstats
.
S4 class to represent a population tree or admixture graph and its underlying fitted parameter.
The dot.graph element allows to plot the graph using grViz() from the DiagrammeR package or with the dot program after writing the files (e.g., dot -Tpng inputgraph.dot in terminal). Note that the dot file may be customized (e.g., to change leave color, parameter names...).
graph
The graph in 3 column format originated from the fitted graph.params object
dot.graph
The fitted graph in dot format
score
the score of the model (squared Mahalanobis distance between the observed and fitted basis F-statistics vectors)
bic
The Bayesian Information Criterion associated to the model
fitted.outstats
a matrix containing the target values of the fstats, the fitted values and the Z-score measuring the deviation of the fitted values from the target values in units of standard errors (i.e., Z=(fitted.value-target.value)/se(target.value))
edges.length
a vector containing the estimated edges.length. Note finally, that the (two) edges coming from the roots are assumed of equal length (i.e., unrooted branch) as these are non-identifiable by the method.
edges.length.scaled
If drift.scaling=TRUE, the estimated edges.length in units of t/2N
edges.length.ci
A matrix with two columns (or four columns if drift scaled lengths are computed) containing for each edge length (in a row) the 95% CI lower and higher bounds (columns 3 and 4 containing 95% CI lower and higher bounds of drift scaled lengths, if any)
admix.prop
a vector containing the estimated admixture proportions (if any)
admix.prop.ci
a matrix with two columns containing for each admixture proportion (in a row) the 95% CI lower and higher bounds
nodes.het
The estimated heterozygosities for all nodes (if available; see drift.scaling argument in fit.graph)
fitted.f2.mat
the matrix of all the fitted F2 statistics (obtained from fitted admixture graph parameter values) from which all the fitted fstats can be derived.
optim.results
list containing results of the optim call
To generate fitted.graph object, see fit.graph
.
S4 class to represent fstats results obtained with computeFstats.
f2.values
A data frame with npop(npop-1)/2 rows and 1 (or 3 if blockjackknife is TRUE) columns containing estimates of the f2-statistics over all the SNPs and if blockjackknife=TRUE, the estimated block-jackknife and standard error (s.e.)
fst.values
A data frame with npop(npop-1)/2 rows and 1 (or 3 if blockjackknife is TRUE) columns containing estimates of the scaled f2.values (same as obtained with compute.pairwiseFST with method="Identity") over all the SNPs and if blockjackknife=TRUE, the estimated block-jackknife and standard error (s.e.). The F2 scaling factor is equal to 1-Q2 (where Q2 is the AIS probability between the two populations)
f3.values
A data frame with npops(npops-1)(npops-2)/2 rows and 1 (or 4 if blockjackknife is TRUE) columns containing estimates of the f3-statistics over all the SNPs and if blockjackknife=TRUE, the estimated block-jackknife and standard error (s.e.) and Z-score measuring the deviation of the f3-statistics from 0 in units of s.e.
f3star.values
A data frame with npops(npops-1)(npops-2)/2 rows and 1 (or 4 if blockjackknife is TRUE) columns containing estimates of the scaled f3-statistics over all the SNPs and if blockjackknife=TRUE, the estimated block-jackknife and standard error (s.e.) and Z-score measuring the deviation of the f3-statistics from 0 in units of s.e. The F3 scaling factor is equal to 1-Q1 (where Q1 is the AIS probability within the target population, i.e., population C for F3(C;A,B))
f4.values
A data frame with npops(npops-1)(npops-2)(npops-3)/8 rows and 1 (or 4 if blockjackknife is TRUE) columns containing estimates of the f4-statistics over all the SNPs and if blockjackknife=TRUE, the estimated block-jackknife and standard error (s.e.) and Z-score measuring the deviation of the f4-statistics from 0 in units of s.e.
Dstat.values
A data frame with npops(npops-1)(npops-2)(npops-3)/8 rows and 1 (or 4 if blockjackknife is TRUE) columns containing estimates of the D-statistics (scaled f4-statistics) over all the SNPs and if blockjackknife=TRUE, the estimated block-jackknife and standard error (s.e.) and Z-score measuring the deviation of the f3-statistics from 0 in units of s.e. For a given quadruplet (A,B;C,D), the parameter D corresponds to F4(A,B;C,D) scaled by (1-Q2(A,B))*(1-Q2(C,D)) where Q2(X,Y) is the is the AIS probability between the X and Y populations.
F2.bjack.samples
If blockjackknife=TRUE and options return.F2.blockjackknife.samples is actived in compute.fstats, an array of dimension (npop x npop x nblocks) in an admixtools2 compatible format
comparisons
A list containing matrices with population names associated to the different test comparisons (e.g., the "F2" elements of the list is a npop(npop-1)/2 rows x 2 columns with each row containing the name of the two populations compared)
Q.matrix
The estimated error covariance matrix for all the F2 and F3 estimates (required by graph fitting functions to compute graph scores)
heterozygosities
A data frame with npop rows and 1 (or 3 if blockjackknife is TRUE) columns containing estimates of the within population heterozygosities (1-Q1) over all the SNPs and if blockjackknife=TRUE, the estimated block-jackknife and standard error (s.e.)
divergence
A data frame with npop(npop-1)/2 rows and 1 (or 3 if blockjackknife is TRUE) column(s) containing estimates of each population pairwise (absolute) divergence (1-Q2) over all the SNPs and if blockjackknife=TRUE, the estimated block-jackknife and standard error (s.e.). This statistic is related to dXY (a.k.a. PiXY) but it is computed on the ascertained SNPs that were included in the original pooldata or countdata objects.
pairwise.fst
A npop x npop (symmetric) matrix containing the pairwise-population Fst estimates (same as in the fst.values object) that may directly be visualized with e.g. heatmap function or used with a clustering function (e.g., hclust).
pairwise.div
A npop x npop (symmetric) matrix containing the pairwise-population divergence (1-Q2) estimates (same as in the fst.values object) that may directly be visualized with e.g. heatmap function or used with a clustering function (e.g., hclust).
blockjacknife
A logical indicating whether block-jackknife estimates of standard errors are available (TRUE) or not (FALSE)
To generate pairwise object, see compute.pairwiseFST
Generate a graph parameter object to fit admixture graph to observed fstats
generate.graph.params( graph, fstats = NULL, popref = NULL, outfileprefix = NULL, verbose = TRUE )
generate.graph.params( graph, fstats = NULL, popref = NULL, outfileprefix = NULL, verbose = TRUE )
graph |
A three columns matrix containing graph information in a simple format (see details) |
fstats |
A fstats object containing estimates of fstats |
popref |
Reference population of the fstats basis used to fit the graph. |
outfileprefix |
The prefix of the dot file that will represent the graph (with extension ".dot"). If NULL, no graph file generated |
verbose |
If TRUE some information is printed on the terminal |
The graph needs to be specified by a three column (character) matrix corresponding for each edge (wether admixed or not) to i) the child node; ii) the parent node; iii) the admixture proportion. For non-admixed edge, the third column must be blank. An admixed node should be referred two times as a child node with two different parent node and two different admixture proportions coded as alpha and (1-alpha) (Note that the parentheses are mandatory) if alpha is the name of the admixture proportion. The root is automatically identified as a node only present in the parent node column. Several checks are made within the function but it is recommended to check the graph by plotting the resulting dot file named [outfileprefix].dot using for instance the grViz() from the DiagrammeR package that may be called directly with plot or with the dot program (e.g., dot -Tpng inputgraph.dot in terminal). Note that the dot file may be easily customized (e.g., to change leave color, parameter names...). The fstats object should be of class fstats (see help(fstats) for details) containing estimates of F2 and F3 statistics and block jackknife as generated with the compute.fstats
function with computeF3 set to TRUE. If no fstats object is provided, only graph parameters will be generated.
An object of class graph.params (see help(graph.params) for details)
The object may be used to estimate graph parameters with the function fit.graph
or to generate files for the qpGraph software with graph.params2qpGraphFiles
. See also graph.params2symbolic.fstats
to obtain symbolic representation of Fstats.
graph=rbind(c("P1","P7",""),c("P2","s1",""),c("P3","s2",""),c("P6","S",""), c("S","s1","a"),c("S","s2","(1-a)"),c("s2","P8",""),c("s1","P7",""), c("P4","P9",""),c("P5","P9",""),c("P7","P8",""), c("P8","R",""),c("P9","R","")) graph.params=generate.graph.params(graph) plot(graph.params) ##NOTE: this calls grViz from DiagrammeR which cannot easily be plotted #within pdf or other device. To that end the easiest is to output #the graph in a dot file (using the outfileprefix argument) and #then to use the dot program out of R in a terminal: dot -Tpng inputgraph.dot
graph=rbind(c("P1","P7",""),c("P2","s1",""),c("P3","s2",""),c("P6","S",""), c("S","s1","a"),c("S","s2","(1-a)"),c("s2","P8",""),c("s1","P7",""), c("P4","P9",""),c("P5","P9",""),c("P7","P8",""), c("P8","R",""),c("P9","R","")) graph.params=generate.graph.params(graph) plot(graph.params) ##NOTE: this calls grViz from DiagrammeR which cannot easily be plotted #within pdf or other device. To that end the easiest is to output #the graph in a dot file (using the outfileprefix argument) and #then to use the dot program out of R in a terminal: dot -Tpng inputgraph.dot
Generate block coordinates for block-jackknife
generate.jackknife.blocks(x, nsnp.per.bjack.block, verbose = TRUE)
generate.jackknife.blocks(x, nsnp.per.bjack.block, verbose = TRUE)
x |
A pooldata or countdata object containing SNP positions (snp.info slot) |
nsnp.per.bjack.block |
Number of consecutive SNPs of each block-jackknife block |
verbose |
If TRUE extra information is printed on the terminal |
A list with the two following elements:
"blocks.det": A matrix with three columns containing for each identified block (in row) the index of the start SNP, the index of the end SNP and the block Size in bp
"snp.block.id": A vector containing the blocks assigned to each SNP eligible for block-Jacknife (non eligible SNPs ares assigned NA)
"nblocks": A scalar corresponding to the number of blocks
"nsnps": Number of SNPs eligible for block-jackknife 'i.e., included in one block
make.example.files(writing.dir=tempdir()) pooldata=popsync2pooldata(sync.file=paste0(tempdir(),"/ex.sync.gz"),poolsizes=rep(50,15)) bjack.blocks=generate.jackknife.blocks(pooldata,nsnp.per.bjack.block=50)
make.example.files(writing.dir=tempdir()) pooldata=popsync2pooldata(sync.file=paste0(tempdir(),"/ex.sync.gz"),poolsizes=rep(50,15)) bjack.blocks=generate.jackknife.blocks(pooldata,nsnp.per.bjack.block=50)
Generate all names for F3 stats (same order as computation)
.generateF3names(popnames)
.generateF3names(popnames)
popnames |
String vector with the names of all the pops |
Generate all the npops*(npops-1)*(npops-2)/2 names for F3 stats (same order as computation)
Return a string matrix with 4 columns including the complete F3 configuration names (of the form Px;P1,P2), and the names of each pop involved in the configuration
#
#
Generate all names for F4 stats (same order as computation)
.generateF4names(popnames)
.generateF4names(popnames)
popnames |
String vector with the names of all the pops |
Generate all the nf4=(npops*(npops-1)/2)*((npops-2)*(npops-3)/2)/2 names for F4 stats (same order as computation)
Return a string matrix with 5 columns including the complete F4 configuration names (of the form P1,P2;P3,P4), and the names of each pop involved in the configuration
#
Convert BayPass allele count input files into a coundata object
genobaypass2countdata( genobaypass.file = "", snp.pos = NA, popnames = NA, min.indgeno.per.pop = -1, min.maf = -1, verbose = TRUE )
genobaypass2countdata( genobaypass.file = "", snp.pos = NA, popnames = NA, min.indgeno.per.pop = -1, min.maf = -1, verbose = TRUE )
genobaypass.file |
The name (or a path) of the BayPass allele count file (see the BayPass manual https://forgemia.inra.fr/mathieu.gautier/baypass_public/) |
snp.pos |
An optional two column matrix with nsnps rows containing the chromosome (or contig/scaffold) of origin and the position of each markers |
popnames |
A character vector with the names of pool |
min.indgeno.per.pop |
Minimal number of overall counts required in each population. If at least one pop is not genotyped for at least min.indgeno.per.pop (haploid) individual, the position is discarded |
min.maf |
Minimal allowed Minor Allele Frequency (computed from the ratio overall counts for the reference allele over the overall number of (haploid) individual genotyped) |
verbose |
If TRUE extra information is printed on the terminal |
Information on SNP position is only required for some graphical display or to carried out block-jacknife sampling estimation of confidence intervals. If no mapping information is given (default), SNPs will be assumed to be ordered on the same chromosome and separated by 1 bp. As blocks are defined with a number of consecutive SNPs (rather than a length), the latter assumption has actually no effect (except in the reported estimated block sizes in Mb).
A countdata object containing 6 elements:
"refallele.count": a matrix (nsnp rows and npops columns) with the allele counts for the reference allele
"total.count": a matrix (nsnp rows and npops columns) with the total number of counts (i.e., twice the number of genotyped individual for diploid species and autosomal markers)
"snp.info": a matrix with nsnp rows and four columns containing respectively the contig (or chromosome) name (1st column) and position (2nd column) of the SNP; the allele taken as reference in the refallele.count matrix (3rd column); and the alternative allele (4th column)
"popnames": a vector of length npops containing the names of the pops
"nsnp": a scalar corresponding to the number of SNPs
"npops": a scalar corresponding to the number of populations
make.example.files(writing.dir=tempdir()) pooldata=popsync2pooldata(sync.file=paste0(tempdir(),"/ex.sync.gz"),poolsizes=rep(50,15)) pooldata2genobaypass(pooldata=pooldata,writing.dir=tempdir()) ##NOTE: This example is just for the sake of illustration as it amounts ##to interpret read count as allele count which must not be done in practice! countdata=genobaypass2countdata(genobaypass.file=paste0(tempdir(),"/genobaypass"))
make.example.files(writing.dir=tempdir()) pooldata=popsync2pooldata(sync.file=paste0(tempdir(),"/ex.sync.gz"),poolsizes=rep(50,15)) pooldata2genobaypass(pooldata=pooldata,writing.dir=tempdir()) ##NOTE: This example is just for the sake of illustration as it amounts ##to interpret read count as allele count which must not be done in practice! countdata=genobaypass2countdata(genobaypass.file=paste0(tempdir(),"/genobaypass"))
Convert BayPass read count and haploid pool size input files into a pooldata object
genobaypass2pooldata( genobaypass.file = "", poolsize.file = "", snp.pos = NA, poolnames = NA, min.cov.per.pool = -1, max.cov.per.pool = 1e+06, min.maf = -1, verbose = TRUE )
genobaypass2pooldata( genobaypass.file = "", poolsize.file = "", snp.pos = NA, poolnames = NA, min.cov.per.pool = -1, max.cov.per.pool = 1e+06, min.maf = -1, verbose = TRUE )
genobaypass.file |
The name (or a path) of the BayPass read count file (see the BayPass manual https://forgemia.inra.fr/mathieu.gautier/baypass_public/) |
poolsize.file |
The name (or a path) of the BayPass (haploid) pool size file (see the BayPass manual https://forgemia.inra.fr/mathieu.gautier/baypass_public/) |
snp.pos |
An optional two column matrix with nsnps rows containing the chromosome (or contig/scaffold) of origin and the position of each markers |
poolnames |
A character vector with the names of pool |
min.cov.per.pool |
Minimal allowed read count (per pool). If at least one pool is not covered by at least min.cov.perpool reads, the position is discarded |
max.cov.per.pool |
Maximal allowed read count (per pool). If at least one pool is covered by more than min.cov.perpool reads, the position is discarded |
min.maf |
Minimal allowed Minor Allele Frequency (computed from the ratio overall read counts for the reference allele over the read coverage) |
verbose |
If TRUE extra information is printed on the terminal |
Information on SNP position is only required for some graphical display or to carried out block-jacknife sampling estimation of confidence intervals. If no mapping information is given (default), SNPs will be assumed to be ordered on the same chromosome and separated by 1 bp. As blocks are defined with a number of consecutive SNPs (rather than a length), the latter assumption has actually no effect (except in the reported estimated block sizes in Mb).
A pooldata object containing 7 elements:
"refallele.readcount": a matrix with nsnp rows and npools columns containing read counts for the reference allele (chosen arbitrarily) in each pool
"readcoverage": a matrix with nsnp rows and npools columns containing read coverage in each pool
"snp.info": a matrix with nsnp rows and four columns containing respectively the contig (or chromosome) name (1st column) and position (2nd column) of the SNP; the allele taken as reference in the refallele.readcount matrix (3rd column); and the alternative allele (4th column)
"poolsizes": a vector of length npools containing the haploid pool sizes
"poolnames": a vector of length npools containing the names of the pools
"nsnp": a scalar corresponding to the number of SNPs
"npools": a scalar corresponding to the number of pools
make.example.files(writing.dir=tempdir()) pooldata=popsync2pooldata(sync.file=paste0(tempdir(),"/ex.sync.gz"),poolsizes=rep(50,15)) pooldata2genobaypass(pooldata=pooldata,writing.dir=tempdir()) pooldata=genobaypass2pooldata(genobaypass.file=paste0(tempdir(),"/genobaypass"), poolsize.file=paste0(tempdir(),"/poolsize"))
make.example.files(writing.dir=tempdir()) pooldata=popsync2pooldata(sync.file=paste0(tempdir(),"/ex.sync.gz"),poolsizes=rep(50,15)) pooldata2genobaypass(pooldata=pooldata,writing.dir=tempdir()) pooldata=genobaypass2pooldata(genobaypass.file=paste0(tempdir(),"/genobaypass"), poolsize.file=paste0(tempdir(),"/poolsize"))
Convert SelEstim read count input files into a pooldata object
genoselestim2pooldata( genoselestim.file = "", poolnames = NA, min.cov.per.pool = -1, max.cov.per.pool = 1e+06, min.maf = -1, nlines.per.readblock = 1e+06, verbose = TRUE )
genoselestim2pooldata( genoselestim.file = "", poolnames = NA, min.cov.per.pool = -1, max.cov.per.pool = 1e+06, min.maf = -1, nlines.per.readblock = 1e+06, verbose = TRUE )
genoselestim.file |
The name (or a path) of the SelEstim read count file (see the SelEstim manual https://www1.montpellier.inrae.fr/CBGP/software/selestim/) |
poolnames |
A character vector with the names of pool |
min.cov.per.pool |
Minimal allowed read count (per pool). If at least one pool is not covered by at least min.cov.perpool reads, the position is discarded |
max.cov.per.pool |
Maximal allowed read count (per pool). If at least one pool is covered by more than min.cov.perpool reads, the position is discarded |
min.maf |
Minimal allowed Minor Allele Frequency (computed from the ratio overal read counts for the reference allele over the read coverage) |
nlines.per.readblock |
Number of Lines read simultaneously. Should be adapted to the available RAM. |
verbose |
If TRUE extra information is printed on the terminal |
A pooldata object containing 7 elements:
"refallele.readcount": a matrix with nsnp rows and npools columns containing read counts for the reference allele (chosen arbitrarily) in each pool
"readcoverage": a matrix with nsnp rows and npools columns containing read coverage in each pool
"snp.info": a matrix with nsnp rows and four columns containing respectively the contig (or chromosome) name (1st column) and position (2nd column) of the SNP; the allele taken as reference in the refallele.readcount matrix (3rd column); and the alternative allele (4th column)
"poolsizes": a vector of length npools containing the haploid pool sizes
"poolnames": a vector of length npools containing the names of the pools
"nsnp": a scalar corresponding to the number of SNPs
"npools": a scalar corresponding to the number of pools
make.example.files(writing.dir=tempdir()) pooldata=popsync2pooldata(sync.file=paste0(tempdir(),"/ex.sync.gz"),poolsizes=rep(50,15)) pooldata2genoselestim(pooldata=pooldata,writing.dir=tempdir()) pooldata=genoselestim2pooldata(genoselestim.file=paste0(tempdir(),"/genoselestim"))
make.example.files(writing.dir=tempdir()) pooldata=popsync2pooldata(sync.file=paste0(tempdir(),"/ex.sync.gz"),poolsizes=rep(50,15)) pooldata2genoselestim(pooldata=pooldata,writing.dir=tempdir()) pooldata=genoselestim2pooldata(genoselestim.file=paste0(tempdir(),"/genoselestim"))
Convert allele count input files from the Treemix program into a coundata object
genotreemix2countdata( genotreemix.file = "", snp.pos = NA, min.indgeno.per.pop = -1, min.maf = -1, verbose = TRUE )
genotreemix2countdata( genotreemix.file = "", snp.pos = NA, min.indgeno.per.pop = -1, min.maf = -1, verbose = TRUE )
genotreemix.file |
The name (or a path) of the Treemix allele count file (see the Treemix manual https://bitbucket.org/nygcresearch/treemix/wiki/Home) |
snp.pos |
An optional two column matrix with nsnps rows containing the chromosome (or contig/scaffold) of origin and the position of each markers |
min.indgeno.per.pop |
Minimal number of overall counts required in each population. If at least one pop is not genotyped for at least min.indgeno.per.pop (haploid) individual, the position is discarded |
min.maf |
Minimal allowed Minor Allele Frequency (computed from the ratio overall counts for the reference allele over the overall number of (haploid) individual genotyped) |
verbose |
If TRUE extra information is printed on the terminal |
Information on SNP position is only required for some graphical display or to carried out block-jacknife sampling estimation of confidence intervals. If no mapping information is given (default), SNPs will be assumed to be ordered on the same chromosome and separated by 1 bp. As blocks are defined with a number of consecutive SNPs (rather than a length), the latter assumption has actually no effect (except in the reported estimated block sizes in Mb).
A countdata object containing 6 elements:
"refallele.count": a matrix (nsnp rows and npops columns) with the allele counts for the reference allele
"total.count": a matrix (nsnp rows and npops columns) with the total number of counts (i.e., twice the number of genotyped individual for diploid species and autosomal markers)
"snp.info": a matrix with nsnp rows and four columns containing respectively the contig (or chromosome) name (1st column) and position (2nd column) of the SNP; the allele taken as reference in the refallele.count matrix (3rd column); and the alternative allele (4th column)
"popnames": a vector of length npops containing the names of the pops
"nsnp": a scalar corresponding to the number of SNPs
"npops": a scalar corresponding to the number of populations
make.example.files(writing.dir=tempdir()) pooldata=popsync2pooldata(sync.file=paste0(tempdir(),"/ex.sync.gz"),poolsizes=rep(50,15)) ##NOTE: This example is just for the sake of illustration as it amounts ##to interpret read count as allele count which must not be done in practice! dum=matrix(paste([email protected], pooldata@[email protected],sep=","), ncol=pooldata@npools) colnames(dum)=pooldata@poolnames write.table(dum,file=paste0(tempdir(),"/genotreemix"),quote=FALSE,row.names=FALSE) countdata=genotreemix2countdata(genotreemix.file=paste0(tempdir(),"/genotreemix"))
make.example.files(writing.dir=tempdir()) pooldata=popsync2pooldata(sync.file=paste0(tempdir(),"/ex.sync.gz"),poolsizes=rep(50,15)) ##NOTE: This example is just for the sake of illustration as it amounts ##to interpret read count as allele count which must not be done in practice! dum=matrix(paste(pooldata@refallele.readcount, pooldata@readcoverage-pooldata@refallele.readcount,sep=","), ncol=pooldata@npools) colnames(dum)=pooldata@poolnames write.table(dum,file=paste0(tempdir(),"/genotreemix"),quote=FALSE,row.names=FALSE) countdata=genotreemix2countdata(genotreemix.file=paste0(tempdir(),"/genotreemix"))
Implement a graph builder heuristic by successively adding leaves to an initial graph
graph.builder( x, leaves.to.add, fstats, heap.dbic = 6, max.heap.size = 25, verbose = TRUE, ... )
graph.builder( x, leaves.to.add, fstats, heap.dbic = 6, max.heap.size = 25, verbose = TRUE, ... )
x |
An object (or list of objects) of class graph.params or fitted.graph (see details) |
leaves.to.add |
Names of the leaves to successively add (in the given order) |
fstats |
Object of class fstats that contains estimates of the fstats (see compute.fstats) |
heap.dbic |
Maximal BIC distance from the best graph to be kept in the heap (heap.dbic=6 by default) |
max.heap.size |
Maximal number of graphs stored in the heap (max.heap.size=25 by default) |
verbose |
If TRUE extra information is printed on the terminal |
... |
Some parameters to be passed the function add.leaf called internally |
The input object x needs to be of class graph.params as generated by the function generate.graph.params; or fitted.graph as generated by the functions fit.graph, add.leaf (in the output list element named "fitted.graphs.list") or rooted.nj.builder (in the output element named "best.rooted.tree"). This is to ensure that the matrix describing the structure of the graph (graph slot of these objects) is valid (note that it can be plotted for checks). Hence graph.params objects may have been generated without fstats information (that should be supplied independently to the add.leaf function to obtain information on the fstats involving the candidate leaf defined with the leaf.to.add argument). The functions successively add each leaf given in the leaves.to.add vector to the list of fitted graph stored in a heap using the function add.leaf. For the first iteration (i.e., first tested leaf) the heap consists of the input graph or list of graph x. At each iteration, the function add.leaf is used to test the candidate leaf to each graph from the current heap in turn. A new heap of graphs is then built by each time including the fitted graphs with a BIC less than heap.dbic larger than the best resulting graphs (treating each graph independently). If the final number of graphs in the heap is larger than max.heap.size, the max.heap.size graphs with the lowest BIC are kept in the heap. After testing the latest leaf, graphs with a BIC larger than heap.dbic units of the best graph are discarded from the final list of graphs. In practice, it is recommended to test different orders of inclusion of the leaves (as specified in the vector leaves.to.add)
A list with the following elements:
"n.graphs": The final number of fitted graphs
"fitted.graphs.list": a list of fitted.graph objects (indexed from 1 to n.graphs and in the same order as the list "graphs") containing the results of fitting of each graph.
"best.fitted.graph": The graph (object of class fitted.graph) with the minimal BIC (see function fit.graph) among all the graphs within fitted.graphs.list
"bic": a vector of the n.graphs BIC (indexed from 1 to n.graphs and in the same order as the "fitted.graphs.list" list) (see fit.graph details for the computation of the scores).
see fit.graph
, generate.graph.params
and add.leaf
.
S4 class to represent a population tree or admixture graph and its underlying parameter.
The graph is specified by a three column (character) matrix giving for each edge (whether admixed or not) to i) the child node; ii) the parent node; iii) the admixture proportion. For non-admixed edge, the third column must be blank. An admixed node should be referred two times as a child node with two different parent node and two different admixture proportions coded as alpha and (1-alpha) (parentheses are mandatory) if alpha is the name of the parameter for admixture proportion. The dot.graph element allows to plot the graph using grViz() from the DiagrammeR package or with the dot program after writing the files (e.g., dot -Tpng inputgraph.dot in terminal). Note that the dot file may be customized (e.g., to change leave color, parameter names...).
graph
The graph in 3 column format (see details)
dot.graph
The graph in dot format
is.admgraph
If FALSE the graph is binary tree (i.e., no admixture events), if TRUE the graph is an admixture graph
n.leaves
Number of leaves of the graph
leaves
Name of the leaves
root.name
Name of the root
n.nodes
Number of nodes (including root)
nodes.names
Name of the nodes
n.edges
Number of edges (including admixture edges)
edges.names
Names of the edges (coded as "Parent node Name"<->"Child node Name")
n.adm.nodes
Number of admixed nodes (=0 if is.admgraph=FALSE). This is also the number of admixed parameters since only two-ways admixture are assumed for a given node
adm.params.names
Names of the admixed parameters
graph.matrix
The graph incidence matrix consisting of n.leaves rows and n.edges columns. The elements of the matrix are the weights of each edge (in symbolic representation) for the different possible paths from the leaves to the graph root.
root.edges.idx
Indexes of the graph.matrix columns associated to the (two) edges connected to the root
f2.target
The (n.leaves-1) stats F2 involving popref (i.e., of the form F2(popref;pop))
f2.target.pops
A matrix of (n.leaves-1) rows and 2 columns containing the names of populations of the F2 stats. The first column is by construction always popref. The order is the same as in f2.target
f3.target
The (n.leaves-1)(n.leaves-2)/2 stats F3 involving popref as a target (i.e., of the form F3(popref;popA,popB))
f3.target.pops
A matrix of (n.leaves-1)(n.leaves-2)/2 rows and 3 columns containing the name of popref in the first column and the names of the two populations involved in the F3 stats. The order is the same as in f3.target
popref
The name of the reference population defining the fstats basis
f.Qmat
A square matrix of rank n.leaves(n.leaves-1)/2 corresponding to the error covariance matrix of the F2 and F3 estimates
Het
Estimated leave heterozygosities (if present in the fstats object)
To generate graph.params object, see generate.graph.params
. The object may be used to estimate graph parameters with the function fit.graph
or to generate files for the qpGraph software with graph.params2qpGraphFiles
. See also graph.params2symbolic.fstats
to obtain symbolic representation of Fstats from the matrix "Omega".
Generate files for the qpGraph software from a graph.params object
graph.params2qpGraphFiles( graph.params, outfileprefix = "out", n.printed.dec = 4, verbose = TRUE )
graph.params2qpGraphFiles( graph.params, outfileprefix = "out", n.printed.dec = 4, verbose = TRUE )
graph.params |
An object of class graph.params containing graph information with Fstats information (see the function generate.graph.params) |
outfileprefix |
The prefix of the qpGraph files |
n.printed.dec |
Number of decimal to be printed (if not enough may lead to fatalx error in qpGraph) |
verbose |
If TRUE extra information is printed on the terminal |
This function generates the three files required by qpGraph: i) a file named [outfileprefix].graph containing the graph in appropriate format; ii) a file named [outfileprefix].fstats file containing the fstats estimates of fstats (and their covariance); iii) a file named [outfileprefix].parqpGraph containing essential parameter information to run qpGraph (this may be edited by hand if other options are needed). The qpGraph software may then be run using the following options -p [outfileprefix].parqpGraph -g [outfileprefix].graph -o out.ggg -d out.dot.
The three files described in the details section
To generate graph.params object, see generate.graph.params
Provide a symbolic representation of all the F-statistics and the model system of equations
graph.params2symbolic.fstats(x, outfile = NULL)
graph.params2symbolic.fstats(x, outfile = NULL)
x |
An object of class graph.params containing graph information and relevant Fstats estimates (see the function generate.graph.params) |
outfile |
The file where to print the equations (default=NULL, equations are not printed in a file) |
A list with the following elements:
"model.matrix": A symbolic representation of the matrix M relating the basis F-statistics and graph edge length as F=M*b where F is the vector of the basis Fstats (row names of model.matrix M) and b is the vector of graph edges (column names of model.matrix M).
"omega": A symbolic representation of the scaled covariance matrix of allele frequency with edge names and admixture parameter names as specified in the edges.names and adm.params.names slot of the input graph.params object x
"F2.equations": A symbolic representation of the nleaves(nleaves-1)/2 different F2 as a function of graph parameters
"F3.equations": A symbolic representation of the nleaves(nleaves-1)(nleaves-2)/2 different F3 as a function of graph parameters
"F4.equations": A symbolic representation of the npops(npops-1)(npops-2)(npops-3)/8 different F4 as a function of graph parameters
To generate a graph.params object, see generate.graph.params
.
graph=rbind(c("P1","P7",""),c("P2","s1",""),c("P3","s2",""),c("P6","S",""), c("S","s1","a"),c("S","s2","(1-a)"),c("s2","P8",""),c("s1","P7",""), c("P4","P9",""),c("P5","P9",""),c("P7","P8",""), c("P8","R",""),c("P9","R","")) graph.params=generate.graph.params(graph) graph.equations=graph.params2symbolic.fstats(graph.params)
graph=rbind(c("P1","P7",""),c("P2","s1",""),c("P3","s2",""),c("P6","S",""), c("S","s1","a"),c("S","s2","(1-a)"),c("s2","P8",""),c("s1","P7",""), c("P4","P9",""),c("P5","P9",""),c("P7","P8",""), c("P8","R",""),c("P9","R","")) graph.params=generate.graph.params(graph) graph.equations=graph.params2symbolic.fstats(graph.params)
Show pairwisefst object
## S4 method for signature 'pairwisefst' heatmap( x, Rowv = NULL, Colv = if (symm) "Rowv" else NULL, distfun = as.dist, hclustfun = hclust, reorderfun = function(d, w) reorder(d, w), add.expr, symm = FALSE, revC = identical(Colv, "Rowv"), scale = c("row", "column", "none"), na.rm = TRUE, margins = c(5, 5), ColSideColors, RowSideColors, cexRow = 0.2 + 1/log10(nrow(x@PairwiseFSTmatrix)), cexCol = 0.2 + 1/log10(ncol(x@PairwiseFSTmatrix)), labRow = NULL, labCol = NULL, main = NULL, xlab = NULL, ylab = NULL, keep.dendro = FALSE, verbose = getOption("verbose"), ... )
## S4 method for signature 'pairwisefst' heatmap( x, Rowv = NULL, Colv = if (symm) "Rowv" else NULL, distfun = as.dist, hclustfun = hclust, reorderfun = function(d, w) reorder(d, w), add.expr, symm = FALSE, revC = identical(Colv, "Rowv"), scale = c("row", "column", "none"), na.rm = TRUE, margins = c(5, 5), ColSideColors, RowSideColors, cexRow = 0.2 + 1/log10(nrow(x@PairwiseFSTmatrix)), cexCol = 0.2 + 1/log10(ncol(x@PairwiseFSTmatrix)), labRow = NULL, labCol = NULL, main = NULL, xlab = NULL, ylab = NULL, keep.dendro = FALSE, verbose = getOption("verbose"), ... )
x |
Object of class pairwisefst |
Rowv |
determines if and how the row dendrogram should be computed and reordered. Either a dendrogram or a vector of values used to reorder the row dendrogram or NA to suppress any row dendrogram (and reordering) or by default, NULL, see ‘Details’ below. |
Colv |
determines if and how the column dendrogram should be reordered. Has the same options as the Rowv argument above and additionally when x is a square matrix, Colv = "Rowv" means that columns should be treated identically to the rows (and so if there is to be no row dendrogram there will not be a column one either). |
distfun |
function used to compute the distance (dissimilarity) between both rows and columns. Defaults to as.dist. |
hclustfun |
function used to compute the hierarchical clustering when Rowv or Colv are not dendrograms. Defaults to hclust. Should take as argument a result of distfun and return an object to which as.dendrogram can be applied. |
reorderfun |
function(d, w) of dendrogram and weights for reordering the row and column dendrograms. The default uses reorder.dendrogram. |
add.expr |
expression that will be evaluated after the call to image. Can be used to add components to the plot. |
symm |
logical indicating if x should be treated symmetrically; can only be true when x is a square matrix. |
revC |
logical indicating if the column order should be reversed for plotting, such that e.g., for the symmetric case, the symmetry axis is as usual. |
scale |
character indicating if the values should be centered and scaled in either the row direction or the column direction, or none. The default is "row" if symm false, and "none" otherwise. |
na.rm |
logical indicating whether NA's should be removed. |
margins |
numeric vector of length 2 containing the margins (see par(mar = *)) for column and row names, respectively. |
ColSideColors |
(optional) character vector of length ncol(x) containing the color names for a horizontal side bar that may be used to annotate the columns of x. |
RowSideColors |
(optional) character vector of length nrow(x) containing the color names for a vertical side bar that may be used to annotate the rows of x. |
cexRow , cexCol
|
positive numbers, used as cex.axis in for the row or column axis labeling. The defaults currently only use number of rows or columns, respectively. |
labRow , labCol
|
character vectors with row and column labels to use; these default to rownames(x) or colnames(x), respectively. |
main , xlab , ylab
|
main, x- and y-axis titles; defaults to none. |
keep.dendro |
logical indicating if the dendrogram(s) should be kept as part of the result (when Rowv and/or Colv are not NA). |
verbose |
logical indicating if information should be printed. |
... |
additional arguments passed on to image, e.g., col specifying the colors. |
Check countdata objects
is.countdata(x)
is.countdata(x)
x |
The name of the object to be tested |
Check fitted.graph objects
is.fitted.graph(x)
is.fitted.graph(x)
x |
Object to be tested |
Check fstats objects
is.fstats(x)
is.fstats(x)
x |
The name of the object to be tested |
Check graph.params objects
is.graph.params(x)
is.graph.params(x)
x |
The name (or a path) of the graph.params objet |
Check pairwisefst objects
is.pairwisefst(x)
is.pairwisefst(x)
x |
The name (or a path) of the pairwisefst object |
Check pooldata objects
is.pooldata(x)
is.pooldata(x)
x |
The name of the object to be tested |
Write in the current directory example files corresponding to a sync (as obtained when parsing mpileup files with PoPoolation) and vcf (as obtained when parsing mpileup files with VarScan) gzipped files
make.example.files(writing.dir = "")
make.example.files(writing.dir = "")
writing.dir |
Directory where to copy example files (e.g., set writing.dir=getwd() to copy in the current working directory) |
make.example.files(writing.dir=tempdir())
make.example.files(writing.dir=tempdir())
S4 class to represent a pairwise Fst results obtained with the compute.pairwiseFST
values
A data frame with npop*(npop-1)/2 rows and 3 (or 7 if blockjackknife is TRUE) columns containing for both the Fst and Q2, estimates over all the SNPs and if blockjackknife=TRUE, the estimated block-jackknife and standard error (s.e.). The seventh (or third if blockjackknife=FALSE) column gives the number of SNPs.
PairwiseFSTmatrix
A npxnp matrix containing the pairwise FST estimates
PairwiseSnpFST
A matrix (nsnp rows and npops columns) with read count data for the reference allele
PairwiseSnpQ1
A matrix (nsnp rows and npops columns) with overall read coverage
PairwiseSnpQ2
A matrix (nsnp rows and 4 columns) detailing for each SNP, the chromosome (or scaffold), the position, allele 1 and allele 2
blockjacknife
A logical indicating whether block-jackknife estimates of standard errors are available (TRUE) or not (FALSE)
To generate pairwise object, see compute.pairwiseFST
Plot F2, F3, F3star, F4, D or pairwise Fst values with their Confidence Intervals
plot_fstats( x, stat.name = "F2", ci.perc = 95, value.range = c(NA, NA), pop.sel = NA, pop.f3.target = NA, highlight.signif = TRUE, main = stat.name, ... )
plot_fstats( x, stat.name = "F2", ci.perc = 95, value.range = c(NA, NA), pop.sel = NA, pop.f3.target = NA, highlight.signif = TRUE, main = stat.name, ... )
x |
An object of class fstats (to plot heterozygosities, divergence, F2, F3, F3star, F4 or D statistics) or pairwisefst (to plot pairwise fst) |
stat.name |
For fstats object, the name of the stat (either heterozygosities, divergence, F2, F3, F3star, F4 or Dstat) |
ci.perc |
Percentage of the Confidence Interval in number of standard errors (default=95%) |
value.range |
Range of test values (x-axis) to be plotted (default=NA,NA: i.e., all test values are plotted) |
pop.sel |
Only plot test values involving these populations (default=NA: i.e., all test values are plotted) |
pop.f3.target |
For F3-statistics, only plot F3 involving pop.f3.target as a target |
highlight.signif |
If TRUE highlight significant tests in red (see details) |
main |
Main title of the plot (default=stat.name) |
... |
Some other graphical arguments to be passed |
Data will only be plotted if jackknife estimates of the estimator s.e. have been performed i.e. if the functions compute.fstats or compute.pairwiseFST were run with nsnp.per.block>0
A plot of the Fstats of interest. Significant F3 statistics (i.e., showing formal evidence for admixture of the target population) are highlighted in red. Significant F4 statistics (i.e., showing formal evidence against treeness of the pop. quadruplet) are highlighted in red.
To generate x object, see compute.pairwiseFST
(for pairwisefst object) or compute.fstats
(for fstats object)
make.example.files(writing.dir=tempdir()) pooldata=popsync2pooldata(sync.file=paste0(tempdir(),"/ex.sync.gz"), poolsizes=rep(50,15),poolnames=paste0("P",1:15)) res.fstats=compute.fstats(pooldata,nsnp.per.bjack.block=25) plot_fstats(res.fstats,stat.name="F3",cex=0.5) plot_fstats(res.fstats,stat.name="F3",value.range=c(NA,0.001), pop.f3.target=c("P7","P5"),cex.axis=0.7) plot_fstats(res.fstats,stat.name="F4",cex=0.5) #allow to reduce the size of the test name (y-axis) plot_fstats(res.fstats,stat.name="F4",cex=0.5, pop.sel=c("P1","P2","P3","P4","P5")) plot_fstats(res.fstats,stat.name="F4",cex=0.5, pop.sel=c("P1","P2","P3","P4","P5"),highlight.signif=FALSE)
make.example.files(writing.dir=tempdir()) pooldata=popsync2pooldata(sync.file=paste0(tempdir(),"/ex.sync.gz"), poolsizes=rep(50,15),poolnames=paste0("P",1:15)) res.fstats=compute.fstats(pooldata,nsnp.per.bjack.block=25) plot_fstats(res.fstats,stat.name="F3",cex=0.5) plot_fstats(res.fstats,stat.name="F3",value.range=c(NA,0.001), pop.f3.target=c("P7","P5"),cex.axis=0.7) plot_fstats(res.fstats,stat.name="F4",cex=0.5) #allow to reduce the size of the test name (y-axis) plot_fstats(res.fstats,stat.name="F4",cex=0.5, pop.sel=c("P1","P2","P3","P4","P5")) plot_fstats(res.fstats,stat.name="F4",cex=0.5, pop.sel=c("P1","P2","P3","P4","P5"),highlight.signif=FALSE)
plot pairwisefst object
## S4 method for signature 'fitted.graph' plot(x, y)
## S4 method for signature 'fitted.graph' plot(x, y)
x |
Object of class fitted.graph |
y |
dummy argument |
plot fstats object
## S4 method for signature 'fstats' plot(x, y, ...)
## S4 method for signature 'fstats' plot(x, y, ...)
x |
Object of class fstats |
y |
dummy argument |
... |
Other arguments to be passed to plot_fstats |
see plot_fstats
for details on plot_fstats arguments
plot graph in graph.params object
## S4 method for signature 'graph.params' plot(x, y)
## S4 method for signature 'graph.params' plot(x, y)
x |
Object of class fitted.graph |
y |
dummy argument |
plot pairwisefst object
## S4 method for signature 'pairwisefst' plot(x, y, ...)
## S4 method for signature 'pairwisefst' plot(x, y, ...)
x |
Object of class pairwisefst |
y |
dummy argument |
... |
Some arguments to be passed to plot_fstats |
see plot_fstats
for details on plot_fstats arguments
S4 class to represent a Pool-Seq data set.
npools
The number of pools
nsnp
The number of SNPs
refallele.readcount
A matrix (nsnp rows and npools columns) with read count data for the reference allele
readcoverage
A matrix (nsnp rows and npools columns) with overall read coverage
snp.info
A data frame (nsnp rows and 4 columns) detailing for each SNP, the chromosome (or scaffold), the position, Reference allele name and Alternate allele name (if available)
poolsizes
A vector of length npools with the corresponding haploid pool sizes
poolnames
A vector of length npools with the corresponding haploid pool names
To generate pooldata object, see vcf2pooldata
, popsync2pooldata
, genobaypass2pooldata
and genoselestim2pooldata
Create a subset of the pooldata object that contains Pool-Seq data as a function of pool and/or SNP indexes
pooldata.subset( pooldata, pool.index = 1:pooldata@npools, snp.index = 1:pooldata@nsnp, min.cov.per.pool = -1, max.cov.per.pool = 1e+06, min.maf = -1, cov.qthres.per.pool = c(0, 1), return.snp.idx = FALSE, verbose = TRUE )
pooldata.subset( pooldata, pool.index = 1:pooldata@npools, snp.index = 1:pooldata@nsnp, min.cov.per.pool = -1, max.cov.per.pool = 1e+06, min.maf = -1, cov.qthres.per.pool = c(0, 1), return.snp.idx = FALSE, verbose = TRUE )
pooldata |
A pooldata object containing Pool-Seq information |
pool.index |
Indexes of the pools (at least two), that should be selected to create the new pooldata object (default=all the pools) |
snp.index |
Indexes of the SNPs (at least two), that should be selected to create the new pooldata object (default=all the SNPs) |
min.cov.per.pool |
Minimal allowed read count (per pool). If at least one pool is not covered by at least min.cov.perpool reads, the position is discarded |
max.cov.per.pool |
Maximal allowed read count (per pool). If at least one pool is covered by more than min.cov.perpool reads, the position is discarded |
min.maf |
Minimal allowed Minor Allele Frequency (computed from the ratio over all read counts for the reference allele over the read coverage) |
cov.qthres.per.pool |
A two-elements vector containing the minimal (qmin) and maximal (qmax) quantile coverage thresholds applied to each pools (0<=qmin<qmax<=1). See details below |
return.snp.idx |
If TRUE, the row.names of the snp.info slot of the returned pooldata object are named as "rsx" where x is the index of SNP in the initial pooldata object (default=FALSE) |
verbose |
If TRUE return some information |
This function allows subsetting a pooldata object by selecting only some pools and/or some SNPs (e.g., based on their position on the genome). Additional filtering steps on SNPs can be carried out on the resulting subset to discard SNP with low polymorphism or poorly or too highly covered. In addition, coverage criteria can be applied on a per-pool basis with the cov.qthres.per.pool argument. 'more specific SNP selection based on their positions on the genome or their characteristics. For instance if qmax=0.95, a position is discarded if in a given pool it has a number of reads higher than the 95-th percentile of the empirical coverage distribution in this same pool (defined over the SNPs selected by snp.index). Similarly, if qmax=0.05, a position is discarded if in a given pool it has a number of reads lower than the 5-th percentile of the empirical coverage distribution in this same pool. This mode of selection may be more relevant when considering pools with heterogeneous read coverages.
A pooldata object with 7 elements:
"refallele.readcount": a matrix with nsnp rows and npools columns containing read counts for the reference allele (chosen arbitrarily) in each pool
"readcoverage": a matrix with nsnp rows and npools columns containing read coverage in each pool
"snp.info": a matrix with nsnp rows and four columns containing respectively the contig (or chromosome) name (1st column) and position (2nd column) of the SNP; the allele in the reference assembly (3rd column); the allele taken as reference in the refallele matrix.readcount matrix (4th column); and the alternative allele (5th column)
"poolsizes": a vector of length npools containing the haploid pool sizes
"poolnames": a vector of length npools containing the names of the pools
"nsnp": a scalar corresponding to the number of SNPs
"npools": a scalar corresponding to the number of pools
To generate pooldata object, see vcf2pooldata
, popsync2pooldata
make.example.files(writing.dir=tempdir()) pooldata=popsync2pooldata(sync.file=paste0(tempdir(),"/ex.sync.gz"),poolsizes=rep(50,15)) subset.by.pools=pooldata.subset(pooldata,pool.index=c(1,2)) subset.by.snps=pooldata.subset(pooldata,snp.index=10:100) subset.by.pools.and.snps=pooldata.subset(pooldata,pool.index=c(1,2),snp.index=10:100) subset.by.pools.qcov.thr=pooldata.subset(pooldata,pool.index=1:8,cov.qthres.per.pool=c(0.05,0.95))
make.example.files(writing.dir=tempdir()) pooldata=popsync2pooldata(sync.file=paste0(tempdir(),"/ex.sync.gz"),poolsizes=rep(50,15)) subset.by.pools=pooldata.subset(pooldata,pool.index=c(1,2)) subset.by.snps=pooldata.subset(pooldata,snp.index=10:100) subset.by.pools.and.snps=pooldata.subset(pooldata,pool.index=c(1,2),snp.index=10:100) subset.by.pools.qcov.thr=pooldata.subset(pooldata,pool.index=1:8,cov.qthres.per.pool=c(0.05,0.95))
Convert a pooldata object into DIYABC data file for pool-seq data. A file containing SNP details is also printed out. Options to generate sub-samples (e.g., for large number of SNPs) are also available. Note that DIYABC SNP filtering criterion is based on MRC (minimal read count) which may be more stringent than usual MAF-based filtering criterion. It is recommended to parse vcf files and pooldata objects without any MAF criterion or to prefilter pooldata objects with the desired MRC (using option snp.index pooldata.subset
).
pooldata2diyabc( pooldata, writing.dir = getwd(), prefix = "", diyabc.mrc = 1, subsamplesize = -1, subsamplingmethod = "thinning" )
pooldata2diyabc( pooldata, writing.dir = getwd(), prefix = "", diyabc.mrc = 1, subsamplesize = -1, subsamplingmethod = "thinning" )
pooldata |
A pooldata object containing Pool-Seq information (see |
writing.dir |
Directory where to create the files (e.g., set writing.dir=getwd() to copy in the current working directory) |
prefix |
Prefix used for output file names |
diyabc.mrc |
MRC to be applied by DIYABC (note that no filtering based on MRC is done by the function) |
subsamplesize |
Size of the sub-samples. If <=1 (default), all the SNPs are considered in the output |
subsamplingmethod |
If sub-sampling is activated (argument subsamplesize), define the method used for subsampling that might be either i) "random" (A single data set consisting of randmly chosen SNPs is generated) or ii) "thinning", sub-samples are generated by taking SNPs one every nsub=floor(nsnp/subsamplesize) in the order of the map (a suffix ".subn" is added to each sub-sample files where n varies from 1 to nsub). |
DIYABC data file for pool-seq data
To generate pooldata object, see vcf2pooldata
, popsync2pooldata
make.example.files(writing.dir=tempdir()) pooldata=popsync2pooldata(sync.file=paste0(tempdir(),"/ex.sync.gz"),poolsizes=rep(50,15)) pooldata2diyabc(pooldata=pooldata,writing.dir=tempdir())
make.example.files(writing.dir=tempdir()) pooldata=popsync2pooldata(sync.file=paste0(tempdir(),"/ex.sync.gz"),poolsizes=rep(50,15)) pooldata2diyabc(pooldata=pooldata,writing.dir=tempdir())
Convert a pooldata object into BayPass allele read count and haploid pool size files. A file containing SNP details is also printed out. Options to generate sub-samples (e.g., for large number of SNPs) are also available.
pooldata2genobaypass( pooldata, writing.dir = getwd(), prefix = "", subsamplesize = -1, subsamplingmethod = "thinning" )
pooldata2genobaypass( pooldata, writing.dir = getwd(), prefix = "", subsamplesize = -1, subsamplingmethod = "thinning" )
pooldata |
A pooldata object containing Pool-Seq information (see |
writing.dir |
Directory where to create the files (e.g., set writing.dir=getwd() to copy in the current working directory) |
prefix |
Prefix used for output file names |
subsamplesize |
Size of the sub-samples. If <=1 (default), all the SNPs are considered in the output |
subsamplingmethod |
If sub-sampling is activated (argument subsamplesize), define the method used for subsampling that might be either i) "random" (A single data set consisting of randmly chosen SNPs is generated) or ii) "thinning", sub-samples are generated by taking SNPs one every nsub=floor(nsnp/subsamplesize) in the order of the map (a suffix ".subn" is added to each sub-sample files where n varies from 1 to nsub). |
Files containing allele count (in BayPass format), haploid pool size (in BayPass format), and SNP details (as in the snp.info matrix from the pooldata object)
To generate pooldata object, see vcf2pooldata
, popsync2pooldata
make.example.files(writing.dir=tempdir()) pooldata=popsync2pooldata(sync.file=paste0(tempdir(),"/ex.sync.gz"),poolsizes=rep(50,15)) pooldata2genobaypass(pooldata=pooldata,writing.dir=tempdir())
make.example.files(writing.dir=tempdir()) pooldata=popsync2pooldata(sync.file=paste0(tempdir(),"/ex.sync.gz"),poolsizes=rep(50,15)) pooldata2genobaypass(pooldata=pooldata,writing.dir=tempdir())
Convert a pooldata object into SelEstim allele read count. A file containing SNP details is also printed out. Options to generate sub-samples (e.g., for large number of SNPs) are also available.
pooldata2genoselestim( pooldata, writing.dir = getwd(), prefix = "", subsamplesize = -1, subsamplingmethod = "thinning" )
pooldata2genoselestim( pooldata, writing.dir = getwd(), prefix = "", subsamplesize = -1, subsamplingmethod = "thinning" )
pooldata |
A pooldata object containing Pool-Seq information (see |
writing.dir |
Directory where to create the files (e.g., set writing.dir=getwd() to copy in the current working directory) |
prefix |
Prefix used for output file names |
subsamplesize |
Size of the sub-samples. If <=1 (default), all the SNPs are considered in the output |
subsamplingmethod |
If sub-sampling is activated (argument subsamplesize), define the method used for subsampling that might be either i) "random" (A single data set consisting of randmly chosen SNPs is generated) or ii) "thinning", sub-samples are generated by taking SNPs one every nsub=floor(nsnp/subsamplesize) in the order of the map (a suffix ".subn" is added to each sub-sample files where n varies from 1 to nsub). |
Files containing allele count (in SelEstim Pool-Seq format) and SNP details (as in the snp.info matrix from the pooldata object)
To generate pooldata object, see vcf2pooldata
, popsync2pooldata
make.example.files(writing.dir=tempdir()) pooldata=popsync2pooldata(sync.file=paste0(tempdir(),"/ex.sync.gz"),poolsizes=rep(50,15)) pooldata2genoselestim(pooldata=pooldata,writing.dir=tempdir())
make.example.files(writing.dir=tempdir()) pooldata=popsync2pooldata(sync.file=paste0(tempdir(),"/ex.sync.gz"),poolsizes=rep(50,15)) pooldata2genoselestim(pooldata=pooldata,writing.dir=tempdir())
Compute the index of the pairwise comparison from the idx of each pop
idx_pop1 |
Integer giving the (0-indexed) index of the first pop |
idx_pop2 |
Integer giving the (0-indexed) index of the second pop |
nidx |
Integer giving the total number of indexes (i.e., number of pops) |
If idx_pop2 < idx_pop1, indexes are reversed
Return the (0-indexed) index for the row associated to the pairwise comparison in the ordered flat list of all (npop*(npop-1))/2 pairwise stats
#
#
Convert Popoolation Sync files into a pooldata object
popsync2pooldata( sync.file = "", poolsizes = NA, poolnames = NA, min.rc = 1, min.cov.per.pool = -1, max.cov.per.pool = 1e+06, min.maf = 0.01, noindel = TRUE, nlines.per.readblock = 1e+06, nthreads = 1 )
popsync2pooldata( sync.file = "", poolsizes = NA, poolnames = NA, min.rc = 1, min.cov.per.pool = -1, max.cov.per.pool = 1e+06, min.maf = 0.01, noindel = TRUE, nlines.per.readblock = 1e+06, nthreads = 1 )
sync.file |
The name (or a path) of the Popoolation sync file (might be in compressed format) |
poolsizes |
A numeric vector with haploid pool sizes |
poolnames |
A character vector with the names of pool |
min.rc |
Minimal allowed read count per base. Bases covered by less than min.rc reads are discarded and considered as sequencing error. For instance, if nucleotides A, C, G and T are covered by respectively 100, 15, 0 and 1 over all the pools, setting min.rc to 0 will lead to discard the position (the polymorphism being considered as tri-allelic), while setting min.rc to 1 (or 2, 3..14) will make the position be considered as a SNP with two alleles A and C (the only read for allele T being disregarded). |
min.cov.per.pool |
Minimal allowed read count (per pool). If at least one pool is not covered by at least min.cov.perpool reads, the position is discarded |
max.cov.per.pool |
Maximal allowed read count (per pool). If at least one pool is covered by more than min.cov.perpool reads, the position is discarded |
min.maf |
Minimal allowed Minor Allele Frequency (computed from the ratio overal read counts for the reference allele over the read coverage) |
noindel |
If TRUE, positions with at least one indel count are discarded |
nlines.per.readblock |
Number of Lines read simultaneously. Should be adapted to the available RAM. |
nthreads |
Number of available threads for parallelization of some part of the parsing (default=1, i.e., no parallelization) |
A pooldata object containing 7 elements:
"refallele.readcount": a matrix with nsnp rows and npools columns containing read counts for the reference allele (chosen arbitrarily) in each pool
"readcoverage": a matrix with nsnp rows and npools columns containing read coverage in each pool
"snp.info": a matrix with nsnp rows and four columns containing respectively the contig (or chromosome) name (1st column) and position (2nd column) of the SNP; the allele taken as reference in the refallele.readcount matrix (3rd column); and the alternative allele (4th column)
"poolsizes": a vector of length npools containing the haploid pool sizes
"poolnames": a vector of length npools containing the names of the pools
"nsnp": a scalar corresponding to the number of SNPs
"npools": a scalar corresponding to the number of pools
make.example.files(writing.dir=tempdir()) pooldata=popsync2pooldata(sync.file=paste0(tempdir(),"/ex.sync.gz"),poolsizes=rep(50,15))
make.example.files(writing.dir=tempdir()) pooldata=popsync2pooldata(sync.file=paste0(tempdir(),"/ex.sync.gz"),poolsizes=rep(50,15))
PCA of a pooldata or countdata object using a random allele approach
randomallele.pca( x, scale = TRUE, return.snploadings = FALSE, plot.pcs = c(1, 2), ... )
randomallele.pca( x, scale = TRUE, return.snploadings = FALSE, plot.pcs = c(1, 2), ... )
x |
A pooldata object containing Pool-Seq information or a countdata object containing allele count information |
scale |
If FALSE the random allele data matrix is not scaled (default=TRUE) |
return.snploadings |
If TRUE return the SNP loadings (may be large) |
plot.pcs |
A vector with two-elements giving the two PCs to plot. If NULL, no plotting is done. |
... |
graphical parameters (see |
PCA is performed by singular-value decomposition (SVD) of a npop (or npools) x nsnp matrix of a single randomly sampled allele (i.e. or read for pooldata object) for each SNP and for each population (inspired by Skoglund and Jakobsson, 2011, https://doi.org/10.1073/pnas.1108181108). Although this approach leads to information loss, it allows to efficiently account for unequal sample size (and read coverages for pool-seq data) and have little impact on the resulting representation when the number of SNPs is large. Note also that the implemented approach is similar to that implemented in the PCA_MDS module of the software ANGSD by Korneliussen et al. (2014) (see http://www.popgen.dk/angsd/index.php/PCA_MDS).
An object of class fstats (see help(fstats) for details)
To generate pooldata object, see vcf2pooldata
, popsync2pooldata
,genobaypass2pooldata
or genoselestim2pooldata
. To generate coundata object, see genobaypass2countdata
or genotreemix2countdata
.
make.example.files(writing.dir=tempdir()) pooldata<-popsync2pooldata(sync.file=paste0(tempdir(),"/ex.sync.gz"),poolsizes=rep(50,15)) res.pca<-randomallele.pca(pooldata)
make.example.files(writing.dir=tempdir()) pooldata<-popsync2pooldata(sync.file=paste0(tempdir(),"/ex.sync.gz"),poolsizes=rep(50,15)) res.pca<-randomallele.pca(pooldata)
Construct and root an Neighbor-Joining tree of presumably nonadmixed leaves
rooted.njtree.builder( fstats, pop.sel, edge.fact = 1000, plot.nj = FALSE, verbose = TRUE )
rooted.njtree.builder( fstats, pop.sel, edge.fact = 1000, plot.nj = FALSE, verbose = TRUE )
fstats |
Object of class fstats that contains estimates of the fstats (see compute.fstats) |
pop.sel |
Names of the leaves (pops) used to build the nj tree (at least 3 required) |
edge.fact |
The multiplying factor of edges length in graph representation |
plot.nj |
If TRUE plot the Neighbor-Joining tree |
verbose |
If TRUE extra information is printed on the terminal |
A Neighbor-Joining tree is first built (using nj function from the package ape) based on the F2-distance matrix of the leaves in pop.sel which are presumably non-admixed (see the function find.tree.popset to find such groups of scaffold populations using estimated F3 and F4 test statistics). For non-admixed leaves, F2 are indeed expected to be additive along the resulting binary tree (see Lipson et al., 2013). The resulting tree is then rooted using the method described in Lipson et al. (2013) which is based on the property that the estimated heterozygosity of the root h_R equals h_R=1-Q2(A,B) if A and B are two populations sharing R as the only common ancestor in the tree. This estimator should then be consistent across all the possible pairs of populations A and B that are only connected through R in the tree (i.e., that each belong to one of the two partitions of the tree defined by a root position R). Note that 1-Q2(A,B)=(1-Q1(A))/2 + (1-Q1(B))/2 + F2(A,B)=(h_A+h_B)/2+F2(A,B) where h_A, h_B and F2(A,B) are estimated with the function compute.fstats.
A list with the following elements:
"n.rooted.trees": The number of possible rooted binary trees that were evaluated
"fitted.rooted.trees.list": a list of objects of class fitted.graph containing information on all the possible graphs (indexed from 1 to n.rooted.trees). Each tree may be visualized or further used using functions applied to objects of class fitted.graph (e.g., plot, add.leave)
best.rooted.tree The tree (object of class fitted.graph) among all the graphs within fitted.rooted.trees.list displaying the minimal the minimal sd over estimates of h_P (see details)
"root.het.est.var": For a matrix of n.tree rows (same order as in the list rooted.tree) and 4 columns with i) the average estimated root heterozygosity h_R across all the pairs of population leave that are relevant for estimation (see details); ii) the size of the range of variation and iii) the s.d. of the estimates of h_R, and iv) the number of population pairs relevant for estimation
"nj.tree.eval": If n.edges>3, gives the five worst configuration fit (by calling the compare.fitted.fstats function) which are the same irrespective of rooting
see fit.graph
, generate.graph.params
and add.leaf
.
Scan allele information in ALT field of a vcf
.scan_allele_info(allele_info)
.scan_allele_info(allele_info)
allele_info |
a character string vector (ALT field of the vcf) |
Scan allele information in ALT field of a vcf to identify the number of alleles and if there is indels
Return a vector with two elements consisting i) the number of alleles (1+number of comma) and ii) 0 or 1 if an indel is detected
.scan_allele_info(c("A,C","T","AAT"))
.scan_allele_info(c("A,C","T","AAT"))
Show countdata object
## S4 method for signature 'countdata' show(object)
## S4 method for signature 'countdata' show(object)
object |
Object of class countdata |
Show fitted.graph object
## S4 method for signature 'fitted.graph' show(object)
## S4 method for signature 'fitted.graph' show(object)
object |
Object of class fitted.graph |
Show fstats object
## S4 method for signature 'fstats' show(object)
## S4 method for signature 'fstats' show(object)
object |
Object of class fstats |
Show graph.params object
## S4 method for signature 'graph.params' show(object)
## S4 method for signature 'graph.params' show(object)
object |
Object of class graph.params |
Show pairwisefst object
## S4 method for signature 'pairwisefst' show(object)
## S4 method for signature 'pairwisefst' show(object)
object |
Object of class pairwisefst |
Show pooldata object
## S4 method for signature 'pooldata' show(object)
## S4 method for signature 'pooldata' show(object)
object |
Object of class pooldata |
Simulate read counts from count data and return a pooldata object
sim.readcounts( x, lambda.cov = rep(50, x@npops), overdisp = 1, seq.eps = 0, exp.eps = 0, maf.thr = 0, min.rc = 2, genome.size = 0, verbose = TRUE )
sim.readcounts( x, lambda.cov = rep(50, x@npops), overdisp = 1, seq.eps = 0, exp.eps = 0, maf.thr = 0, min.rc = 2, genome.size = 0, verbose = TRUE )
x |
A countdata object containing allele count information |
lambda.cov |
Numeric vector of length npop giving the expected coverage of each pool |
overdisp |
Numeric value giving overdispersion of coverages (see details) |
seq.eps |
Numeric value giving the sequencing error rate |
exp.eps |
Numeric value giving the experimental error leading to unequal contribution of individual to the pool reads |
maf.thr |
Float giving the MAF threshold for SNP filtering |
min.rc |
Integer giving the minimal read count for an allele to be considered as true allele |
genome.size |
Size of the genome (only considered when seq.eps>0 to simulated spurious SNPs generated at monomorphic position) |
verbose |
If TRUE extra information is printed on the terminal |
The function implements a simulation approach similar to that described in Gautier et al. (2021). Read coverages are sampled from a distribution specified by the lambda.cov vector and the overdisp scalar. Note that overdisp is the same for all pop sample but the expected coveragese (specified in the lambda.cov vector) may vary across pool. If overdisp=1 (default), coverages are assumed Poisson distributed with mean (and variance) equal to the value specified in the lambda.cov vector. If overdisp>1, coverages follows a Negative Binomial distribution with a mean equal to lambda and variance equal to overdisp*lambda. Finally, if overdisp<1, no variation in coverage is introduced and all coverages are equal to the value specified in the lambda vector although they may (slightly) vary in the output when seq.eps>0 due to the removal of error reads. The seq.eps parameter control sequencing error rate. Sequencing errors are modeled following Gautier et al. (2021) i.e. read counts for the four possible bases are sampled from a multinomial distribution Multinom(c,{f*(1-eps)+(1-f)*eps/3;f*eps/3+(1-f)*(1-eps),eps/3,eps/3}) where c is the read coverage and f the reference allele frequencies (obtained from the count data). When seq.eps>0, spurious SNPs may be generated at monomorphic positions (the number of which being equal to the size of the genome, provided with the genome.size argument, minus the number of SNPs in the countdata object). These spurious SNPs are simulated using the same error model (Multinom(c,{1-eps,eps/3,eps/3,eps/3}). Only bi-allelic SNPs passing filtering conditions specified by min.rc (which controls the minimal read count for an allele to be deemed as true, i.e. if more than two alleles have >= min.rc counts then the SNP is excluded because non-bi-allelic) and maf.thr (threshold on the major allele frequency computed over all read counts) are included in the output. Experimental error exp.eps control the contribution of individual (assumed diploid) to the pools following the model described in Gautier et al. (2013). The parameter exp.eps corresponds to the coefficient of variation of the individual contributions. For example, in a pool of 10 individuals and a Poisson distributed coverage of mean 100, exp.eps=0.5 correspond to a situation where the 5 most contributing individuals contribute $>2$ times reads than the others. When exp.eps tends toward 0, all individuals contribute equally to the pool and there is no experimental error. Note that the number of (diploid) individuals for each SNP and pop. sample is deduced from the input total count (it may thus differ over SNP when the total counts are not the same).
A pooldata object containing simulated read counts
To generate coundata object, see genobaypass2countdata
or genotreemix2countdata
.
#not run
#not run
Simulate read counts for monomorphic position when there is sequencing error
.simureads_mono(npos, npop, lambda, overdisp, min_rc, min_maf, eps)
.simureads_mono(npos, npop, lambda, overdisp, min_rc, min_maf, eps)
npos |
Integer giving the number of positions (close to genome size) |
npop |
Integer giving the number of population samples |
lambda |
Numeric Vector of length npop giving the expected coverage of each pool |
overdisp |
Numeric value giving overdispersion of coverages and their distribution (see details) |
min_rc |
Integer giving the minimal read count for an allele to be considered as true allele |
min_maf |
Float giving the MAF threshold for SNP filtering |
eps |
Numeric value giving the sequencing error |
The function implements a simulation approach similar to that described in Gautier et al. (2021). Read coverages are sampled from a distribution specified by the lambda and overdisp vectors. Note that overdisp is the same for all pop sample but lambda (expected coverages) may vary across pool. If overdisp=1 (default in the R function), coverages are assumed Poisson distributed and the mean and variance of the coverages for the pool are both equal to the value specified in the lambda vector. If overdisp>1, coverages follows a Negative Binomial distribution with a mean equal the lamda but a variance equal to overdisp*lambda. Finally, if overdisp<1, no variation in coverage is introduced and all coverages are equal to the value specified in the lambda vector although they may (slightly) vary in the output when eps>0 due to the removal of error reads. The eps parameter control sequencing error rate. Sequencing errors are modeled following Gautier et al. (2021) i.e. read counts for the four possible bases are sampled from a multinomial distribution Multinom(c,{1-eps;eps/3,eps/3,eps/3}) where c is the read coverage. Only bi-allelic SNPs (after considering min_rc) satisfying with MAF>min_maf are included in the output.
Return an Integer matrix with nsnp rows and 2*npop columns (1:npop=ref allele readcount; (npop+1):2*npop=coverage)
#
#
Simulate read counts from count data
.simureads_poly( y_count, n_count, lambda, overdisp, min_rc, min_maf, eps, eps_exp )
.simureads_poly( y_count, n_count, lambda, overdisp, min_rc, min_maf, eps, eps_exp )
y_count |
Integer Matrix with nsnp rows and npop columns giving allele counts at the reference allele |
n_count |
Integer Matrix with nsnp rows and npop columns giving total counts |
lambda |
Numeric Vector of length npop giving the expected coverage of each pool |
overdisp |
Numeric value giving overdispersion of coverages and their distribution (see details) |
min_rc |
Integer giving the minimal read count for an allele to be considered as true allele |
min_maf |
Float giving the MAF threshold for SNP filtering |
eps |
Numeric value giving the sequencing error |
eps_exp |
Numeric value giving the experimental error leading to unequal contribution of individual to the pool reads |
The function implements a simulation approach similar to that described in Gautier et al. (2021). Read coverages are sampled from a distribution specified by the lambda and overdisp vectors. Note that overdisp is the same for all pop sample but lambda (expected coverages) may vary across pool. If overdisp=1 (default in the R function), coverages are assumed Poisson distributed and the mean and variance of the coverages for the pool are both equal to the value specified in the lambda vector. If overdisp>1, coverages follows a Negative Binomial distribution with a mean equal the lamda but a variance equal to overdisp*lambda. Finally, if overdisp<1, no variation in coverage is introduced and all coverages are equal to the value specified in the lambda vector although they may (slightly) vary in the output when eps>0 due to the removal of error reads. The eps parameter control sequencing error rate. Sequencing errors are modeled following Gautier et al. (2021) i.e. read counts for the four possible bases are sampled from a multinomial distribution Multinom(c,{f*(1-eps)+(1-f)*eps/3;f*eps/3+(1-f)*(1-eps),eps/3,eps/3}) where c is the read coverage and f the reference allele frequencies (obtained from the count data). Experimental error eps_exp control the contribution of individual (assumed diploid) to the pools following the model described in Gautier et al. (2013). The parameter eps_exp corresponds to the coefficient of variation of the individual contributions When eps_exp tends toward 0, all individuals contribute equally to the pool and there is no experimental error. For example, with 10 individuals, eps_exp=0.5 correspond to a situation where 5 individuals contribute 2.8x more reads than the five others. Note that the number of (diploid) individuals for each SNP and pop. sample is deduced from the input total count (it may thus differ over SNP when the total counts are not the same).
Return an Integer matrix with nsnp rows and 2*npop columns (1:npop=ref allele readcount; (npop+1):2*npop=coverage)
#
#
Compute sliding window estimates of F-statistics or ratio of F-statistics over the genome
sliding.windows.fstat( x, num.pop.idx = NULL, den.pop.idx = NULL, num.stat = NULL, den.stat = NULL, window.def = c("nsnp", "bp")[1], sliding.window.size = NULL, window.overlap.fact = 0.5, bp.start.first.snp = TRUE, verbose = TRUE )
sliding.windows.fstat( x, num.pop.idx = NULL, den.pop.idx = NULL, num.stat = NULL, den.stat = NULL, window.def = c("nsnp", "bp")[1], sliding.window.size = NULL, window.overlap.fact = 0.5, bp.start.first.snp = TRUE, verbose = TRUE )
x |
A pooldata object containing Pool-Seq information or a countdata object containing allele count information |
num.pop.idx |
A vector of length 1 to 4 (depending on num.stat) giving the index of the populations. If num.stat="het", num.pop.idx must be of length 1: num.pop.idx=i specifies the ith pop in x. If num.stat="div", "F2" or "Fst", num.pop.idx must be of length 2: num.pop.idx=c(i,j) specifies the pairs of populations with indexes i and j in x. If num.stat="F3" or "F3star", num.pop.idx must be of length 3 (num.pop.idx=c(i,j,k) specifies the F3(pop_i;pop_j,pop_k) populations triplet). Finally, if num.stat="F4" or "Dstat", num.pop.idx must be of length 4: num.pop.idx=c(i,j,k) specifies the F4(pop_i,pop_j;pop_k,pop_l) populations quadruplet i.e. the computed (numerator) statistic computed is (F2(pop_i,pop_k)-F2(pop_i,pop_l)-F2(pop_j,pop_k)+F2(pop_j,pop_l))/2. |
den.pop.idx |
A vector of length 1 to 4 (see num.pop.idx description) giving the index of the populations specifying the F-statistic. If NULL, the computed statistic is the one specified by num.pop.idx. |
num.stat |
the name of the (numerator) stat which must be "het" (1-Q1), "div" (1-Q2), "F2", "Fst", "F3", "F3star", "F4" or "Dstat" |
den.stat |
the name of the (numerator) stat which must be "het" (1-Q1), "div" (1-Q2), "F2", "Fst", "F3", "F3star", "F4" or "Dstat" |
window.def |
Either "nsnp" or "bp" to define windows by either a number of SNPs or a size in bp, respectively |
sliding.window.size |
A numeric value giving the number of SNPs or the size (in bp) of the windows depending window.def |
window.overlap.fact |
A numeric value (between 0 and 1) giving the percentage of overlap between consecutive windows (default=0.5) |
bp.start.first.snp |
When window.def="bp", if TRUE (default) the windowing start at the first SNP position, if FALSE the windowing start at position 1 |
verbose |
If TRUE extra information is printed on the terminal |
Compute sliding window estimates of F-statistics or ratio of F-statistics over the genome.
A data frame with 7 columns with for each window in a row their i) chromosome/contig of origin; ii) start and iii) end position; iv) the mid-position of each window; v) the cumulated mid-position of each window (to facilitate further plotting); vi) the number of SNPs included in the computation of window value; and vii) the estimated value of the statistic
To generate pooldata object, see vcf2pooldata
, popsync2pooldata
,genobaypass2pooldata
or genoselestim2pooldata
. To generate coundata object, see genobaypass2countdata
or genotreemix2countdata
.
make.example.files(writing.dir=tempdir()) pooldata=popsync2pooldata(sync.file=paste0(tempdir(),"/ex.sync.gz"),poolsizes=rep(50,15))
make.example.files(writing.dir=tempdir()) pooldata=popsync2pooldata(sync.file=paste0(tempdir(),"/ex.sync.gz"),poolsizes=rep(50,15))
Convert VCF files into a pooldata object.
vcf2pooldata( vcf.file = "", poolsizes = NA, poolnames = NA, min.cov.per.pool = -1, min.rc = 1, max.cov.per.pool = 1e+06, min.maf = -1, remove.indels = FALSE, min.dist.from.indels = 0, nlines.per.readblock = 1e+06, verbose = TRUE )
vcf2pooldata( vcf.file = "", poolsizes = NA, poolnames = NA, min.cov.per.pool = -1, min.rc = 1, max.cov.per.pool = 1e+06, min.maf = -1, remove.indels = FALSE, min.dist.from.indels = 0, nlines.per.readblock = 1e+06, verbose = TRUE )
vcf.file |
The name (or a path) of the Popoolation sync file (might be in compressed format) |
poolsizes |
A numeric vector with haploid pool sizes |
poolnames |
A character vector with the names of pool |
min.cov.per.pool |
Minimal allowed read count (per pool). If at least one pool is not covered by at least min.cov.perpool reads, the position is discarded |
min.rc |
Minimal allowed read count per base (options silenced for VarScan vcf). Bases covered by less than min.rc reads are discarded and considered as sequencing error. For instance, if nucleotides A, C, G and T are covered by respectively 100, 15, 0 and 1 over all the pools, setting min.rc to 0 will lead to discard the position (the polymorphism being considered as tri-allelic), while setting min.rc to 1 (or 2, 3..14) will make the position be considered as a SNP with two alleles A and C (the only read for allele T being disregarded). For VarScan vcf, markers with more than one alternative allele are discarded because the VarScan AD field only contains one alternate read count. |
max.cov.per.pool |
Maximal allowed read count (per pool). If at least one pool is covered by more than min.cov.perpool reads, the position is discarded |
min.maf |
Minimal allowed Minor Allele Frequency (computed from the ratio overall read counts for the reference allele over the read coverage) |
remove.indels |
Remove indels identified using the number of characters of the alleles in the REF or ALT fields (i.e., if at least one allele is more than 1 character, the position is discarded) |
min.dist.from.indels |
Remove SNPs within min.dist.from.indels from an indel i.e. SNP with position p verifying (indel.pos-min.dist)<=p<=(indel.pos+min.dist+l.indels-1) where l.indel=length of the ref. indel allele. If min.dist.from.indels>0, INDELS are also removed (i.e., remove.indels is set to TRUE). |
nlines.per.readblock |
Number of Lines read simultaneously. Should be adapted to the available RAM. |
verbose |
If TRUE extra information is printed on the terminal |
Genotype format in the vcf file for each pool is assumed to contain either i) an AD field containing allele counts separated by a comma (as produced by popular software such as GATK or samtools/bcftools) or ii) both a RD (reference allele count) and a AD (alternate allele count) as obtained with the VarScan mpileup2snp program (when run with the –output-vcf option). The underlying format is automatically detected by the function. For VarScan generated vcf, it should be noticed that SNPs with more than one alternate allele are discarded (because only a single count is then reported in the AD fields) making the min.rc unavailable. The VarScan –min-reads2 option might replace to some extent this functionality although SNP where the two major alleles in the Pool-Seq data are different from the reference allele (e.g., expected to be more frequent when using a distantly related reference genome for mapping) will be disregarded.
A pooldata object containing 7 elements:
"refallele.readcount": a matrix with nsnp rows and npools columns containing read counts for the reference allele (chosen arbitrarily) in each pool
"readcoverage": a matrix with nsnp rows and npools columns containing read coverage in each pool
"snp.info": a matrix with nsnp rows and four columns containing respectively the contig (or chromosome) name (1st column) and position (2nd column) of the SNP; the allele taken as reference in the refallele.readcount matrix (3rd column); and the alternative allele (4th column)
"poolsizes": a vector of length npools containing the haploid pool sizes
"poolnames": a vector of length npools containing the names of the pools
"nsnp": a scalar corresponding to the number of SNPs
"npools": a scalar corresponding to the number of pools
make.example.files(writing.dir=tempdir()) pooldata=vcf2pooldata(vcf.file=paste0(tempdir(),"/ex.vcf.gz"),poolsizes=rep(50,15))
make.example.files(writing.dir=tempdir()) pooldata=vcf2pooldata(vcf.file=paste0(tempdir(),"/ex.vcf.gz"),poolsizes=rep(50,15))