Package: preseqR 4.0.0

Chao Deng

preseqR: Predicting Species Accumulation Curves

Originally an R version of Preseq <doi:10.1038/nmeth.2375>, the package has since been extended to predict the r-species accumulation curve (r-SAC), which is the number of species represented at least r times as a function of the sampling effort. When r = 1, the curve is known as the species accumulation curve, or as the library complexity curve in high-throughput genomic sequencing. The package includes both parametric and nonparametric methods, as described by Deng C, et al. (2018) <arxiv:1607.02804v3>.
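
To make the definition of the r-SAC concrete, here is a minimal base-R sketch that computes the observed (not predicted) curve from a sequence of draws. It does not use preseqR; the function name `empirical_rsac` and the toy data are assumptions made for illustration only.

```r
# Observed r-SAC: number of species seen at least r times after each draw.
# `draws` is a vector of species labels in sampling order (hypothetical input).
empirical_rsac <- function(draws, r = 1) {
  # occ[i] = how many times the species of draw i has been seen so far
  occ <- ave(seq_along(draws), draws, FUN = seq_along)
  # a species joins the "at least r times" set at its r-th occurrence
  cumsum(occ == r)
}

draws <- c("a", "b", "a", "c", "a", "b")
empirical_rsac(draws, r = 1)  # 1 2 2 3 3 3
empirical_rsac(draws, r = 2)  # 0 0 1 1 1 2
```

With r = 1 this is the ordinary species accumulation curve; larger r counts only species that have accumulated r or more observations.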

Authors: Chao Deng, Timothy Daley and Andrew D. Smith


# Install 'preseqR' in R:
install.packages('preseqR', repos = 'https://cloud.r-project.org')

Conda: r-preseqr-4.0.0 (2025-03-25)

This package does not link to any GitHub/GitLab/R-Forge repository. No issue tracker or development information is available.

3.26 score, 1 star, 3 packages, 600 downloads, 4 mentions, 20 exports, 1 dependency

Last updated 7 years ago from: 311a1e5602. Checks: 3 OK. Indexed: yes.

Target           Result  Latest binary
Doc / Vignettes  OK      Mar 23 2025
R-4.5-linux      OK      Mar 23 2025
R-4.4-linux      OK      Mar 23 2025

Exports: bbc.rSAC, cs.rSAC, ds.rSAC, ds.rSAC.bootstrap, fisher.alpha, fisher.rSAC, kmer.frac.curve, kmer.frac.curve.bootstrap, preseqR.interpolate.rSAC, preseqR.nonreplace.sampling, preseqR.optimal.sequencing, preseqR.rSAC, preseqR.rSAC.bootstrap, preseqR.rSAC.sequencing.rmdup, preseqR.sample.cov, preseqR.sample.cov.bootstrap, preseqR.simu.hist, preseqR.ztnb.em, ztnb.rSAC, ztp.rSAC

Dependencies: polynom

Citation

To cite preseqR in publications use:

Deng C, Daley T and Smith AD (2015). Applications of species accumulation curves in large-scale biological data analysis. Quantitative Biology, 3(3), 135-144. URL https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4885658.

Deng C, Daley T, Calabrese P, Ren J and Smith AD (2018). Estimating the number of species to attain sufficient representation in a random sample. arXiv preprint. URL https://arxiv.org/abs/1607.02804v3.

Corresponding BibTeX entries:

  @Article{,
    title = {Applications of species accumulation curves in large-scale
      biological data analysis},
    author = {Chao Deng and Timothy Daley and Andrew D. Smith},
    journal = {Quantitative Biology},
    year = {2015},
    volume = {3},
    number = {3},
    pages = {135--144},
    url = {https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4885658},
  }
  @Article{,
    title = {Estimating the number of species to attain sufficient
      representation in a random sample},
    author = {Chao Deng and Timothy Daley and Peter Calabrese and Jie
      Ren and Andrew D. Smith},
    journal = {arXiv},
    year = {2018},
    url = {https://arxiv.org/abs/1607.02804v3},
  }

Readme and manuals

UPDATES TO VERSION 4.0.0

  1. Improve the user interface for core functions
  2. Add functions to optimize the depth of single-cell whole-genome sequencing experiments and whole-exome sequencing experiments
  3. Add functions to predict the sample coverage, which is the probability of sampling an observed species from a population
  4. Add functions to predict the fraction of k-mers represented at least r times in a sequencing experiment
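
As background for the sample coverage mentioned in item 3, the classical Good-Turing estimate of the coverage of the current sample is 1 - n_1 / N, where n_1 is the number of species seen exactly once and N is the total number of individuals. The base-R sketch below computes it from a two-column count histogram (column 1: frequency j, column 2: number of species seen j times). The function name and the histogram layout are assumptions for illustration; this is not the package's own estimator, which predicts coverage at other sampling efforts.

```r
# Good-Turing estimate of sample coverage from a count histogram.
# freq_hist: two-column matrix, column 1 = frequency j,
#            column 2 = number of species observed exactly j times
#            (layout assumed for this sketch).
good_turing_coverage <- function(freq_hist) {
  total_individuals <- sum(freq_hist[, 1] * freq_hist[, 2])
  singletons <- sum(freq_hist[freq_hist[, 1] == 1, 2])
  1 - singletons / total_individuals
}

# 5 singletons, 2 doubletons, 1 tripleton: N = 5 + 4 + 3 = 12
toy_hist <- cbind(j = 1:3, n_j = c(5, 2, 1))
good_turing_coverage(toy_hist)  # 1 - 5/12
```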

UPDATES TO VERSION 3.1.2

  1. Fix a bug for removing defects

UPDATES TO VERSION 3.1.1

  1. Substitute embedded C++ code with R code
  2. Remove the dependencies on the software preseq

UPDATES TO VERSION 3.0.1

  1. Fix a bug in Chao's estimator
  2. Fix issues for a Solaris C++ compiler.

UPDATES TO VERSION 3.0.0

  1. We have changed the return types of many functions in the package. These functions no longer generate estimated accumulation curves. Instead, they return functions, which are estimators for the number of species represented by at least r individuals in a random sample.

  2. We added several estimators for predicting the number of species represented by at least r individuals in a random sample.
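
The "return a function" design described in item 1 can be illustrated with a small base-R sketch. It fits a zero-truncated Poisson model to a count histogram and returns a closure that evaluates the predicted number of species seen at least r times. The name `ztp_rsac_sketch`, the histogram layout, and the convention that `t` is the sampling effort relative to the initial sample are assumptions of this sketch; it is not the package's ztp.rSAC implementation.

```r
# Sketch: build an r-SAC estimator (a function of relative effort t)
# from a zero-truncated Poisson fit. Hypothetical helper, base R only.
ztp_rsac_sketch <- function(freq_hist, r = 1) {
  S <- sum(freq_hist[, 2])                    # observed species
  N <- sum(freq_hist[, 1] * freq_hist[, 2])   # observed individuals
  xbar <- N / S
  stopifnot(xbar > 1)  # need at least one species seen more than once
  # MLE of lambda solves lambda / (1 - exp(-lambda)) = xbar
  lambda <- uniroot(function(l) l / (-expm1(-l)) - xbar,
                    interval = c(1e-8, xbar), tol = 1e-10)$root
  L <- S / (-expm1(-lambda))                  # estimated total species
  # Returned estimator: expected species with count >= r at effort t
  function(t) L * ppois(r - 1, lambda * t, lower.tail = FALSE)
}

toy_hist <- cbind(j = 1:3, n_j = c(5, 2, 1))
estimator <- ztp_rsac_sketch(toy_hist, r = 1)
estimator(c(1, 2, 5))  # evaluate the curve at several efforts
```

Returning a closure means the model is fitted once and the curve can then be evaluated at any number of sampling efforts without refitting.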

UPDATES TO VERSION 2.1.1

We have changed the interfaces of most of our exported functions and added new estimators for the number of species represented by at least r individuals in a random sample.

preseqR

The code in this repository aims to expand the functionality of Preseq available in the R statistical computing environment. It does so in five ways:

  1. The basic functionality of the preseq program, initially focusing only on library complexity, is available. These functions contain the string "rfa" as part of their names.
  2. The mathematical routines for doing rational function approximation via continued fractions are implemented as a wrapper for our existing functionality in C++.
  3. Fitting a zero-truncated negative binomial distribution to the sample is available. These functions include the string "ztnb" as part of the names.
  4. The simulation module is used to generate samples based on a mixture of Poisson distributions.
  5. Extra functions are provided to estimate the number of species represented at least r times in a random sample.
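
The Poisson-mixture simulation in item 4 can be sketched in base R as follows. Each species gets its own Poisson rate drawn from a mixing distribution (a gamma distribution here, chosen only as an example), unobserved species are dropped, and the counts are tabulated into a two-column histogram. The function name, its parameters, and the output layout are assumptions for illustration and do not reproduce the interface of preseqR.simu.hist.

```r
# Sketch: simulate a count histogram from a gamma mixture of Poissons.
# Hypothetical helper; not the package's simulation routine.
simulate_hist_sketch <- function(n_species, shape, scale, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  rates <- rgamma(n_species, shape = shape, scale = scale)  # per-species rates
  counts <- rpois(n_species, rates)                         # individuals per species
  observed <- counts[counts > 0]                            # zeros are never seen
  tab <- table(observed)
  # column 1: frequency j, column 2: number of species seen exactly j times
  cbind(j = as.integer(names(tab)), n_j = as.vector(tab))
}

sim <- simulate_hist_sketch(1000, shape = 0.5, scale = 2, seed = 1)
head(sim)
```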

See https://cran.r-project.org/package=preseqR for details.

INSTALLATION

  1. We recommend installing the package preseqR from CRAN. This is easily done by opening an R shell and typing:

    >install.packages("preseqR")

  2. The following instructions install the package from source. They assume that the source code of preseqR has been pulled from the git repo and is in the current directory. Open an R shell and type:

    >install.packages("polynom")

    >install.packages("preseqR", repos=NULL, type="source")

    Note that the package polynom is required by preseqR.

Help Manual

Help page                                                    Topics
Predicting r-species accumulation curves                     preseqR-package
BBC estimator                                                bbc.rSAC
CS estimator                                                 cs.rSAC
Dickens' vocabulary                                          Dickens
RFA estimator                                                ds.rSAC
RFA estimator with bootstrap                                 ds.rSAC.bootstrap
Parameter alpha in the logseries estimator                   fisher.alpha
Logseries estimator                                          fisher.rSAC
Fisher's butterfly data                                      FisherButterfly
Fraction of k-mers observed at least r times                 kmer.frac.curve
Fraction of k-mers observed at least r times with bootstrap  kmer.frac.curve.bootstrap
Interpolation                                                preseqR.interpolate.rSAC
Sampling                                                     preseqR.nonreplace.sampling
Optimal amount of sequencing for scWGS                       preseqR.optimal.sequencing
Best practice for r-SAC - a fast version                     preseqR.rSAC
Best practice for r-SAC                                      preseqR.rSAC.bootstrap
Predicting r-SAC in WES/WGS                                  preseqR.rSAC.sequencing.rmdup
Predicting generalized sample coverage                       preseqR.sample.cov
Predicting generalized sample coverage with bootstrap        preseqR.sample.cov.bootstrap
Simulation                                                   preseqR.simu.hist
Fitting a zero-truncated negative binomial distribution      preseqR.ztnb.em
Shakespeare's word type frequencies                          Shakespeare
k-mer counts of a metagenomic data                           SRR061157_k31
Coverage histogram of a WES data                             SRR1301329_1M_base
Read counts of a WES data                                    SRR1301329_1M_read
Coverage histogram of a WES data                             SRR1301329_base
Read counts of a WES data                                    SRR1301329_read
Coverage histogram of a scWGS data                           SRR611492
Coverage histogram of a scWGS data                           SRR611492_5M
Social network                                               Twitter
Fisher's butterfly data                                      WillButterfly
ZTNB estimator                                               ztnb.rSAC
ZTP estimator                                                ztp.rSAC