Title: | Estimating Abundance of Clones from DNA Fragmentation Data |
---|---|
Description: | Estimate the abundance of cell clones from the distribution of lengths of DNA fragments (as created by sonication, whence `sonicLength'). The algorithm in "Estimating abundances of retroviral insertion sites from DNA fragment length data" by Berry CC, Gillet NA, Melamed A, Gormley N, Bangham CR, Bushman FD. Bioinformatics; 2012 Mar 15;28(6):755-62 is implemented. The experimental setting and estimation details are described in detail there. Briefly, integration of new DNA in a host genome (due to retroviral infection or gene therapy) can be tracked using DNA sequencing, potentially allowing characterization of the abundance of individual cell clones bearing distinct integration sites. The locations of integration sites can be determined by fragmenting the host DNA (via sonication or fragmentase), breaking the newly integrated DNA at a known sequence, amplifying the fragments containing both host and integrated DNA, sequencing those amplicons, then mapping the host sequences to positions on the reference genome. The relative number of fragments containing a given position in the host genome estimates the relative abundance of cells hosting the corresponding integration site, but that number is not available and the count of amplicons per fragment varies widely. However, the expected number of distinct fragment lengths is a function of the abundance of cells hosting an integration site at a given position and a certain nuisance parameter. The algorithm implicitly estimates that function to estimate the relative abundance. |
Authors: | Charles Berry <[email protected]> |
Maintainer: | Charles Berry <[email protected]> |
License: | GPL (>= 2) |
Version: | 1.4.7 |
Built: | 2025-01-03 07:16:40 UTC |
Source: | CRAN |
Estimate the number of fragments produced by sonication of integration sites in DNA (as arise in gene therapy or in retroviral infections) from tallies of the number of distinct lengths. See the Berry et al. reference where the experimental setting and algorithm are spelled out in detail.
Package: | sonicLength |
Type: | Package |
License: | see file LICENSE distributed with this software |
LazyLoad: | yes |
This package has the necessary utilities to provide a function that will do calibration of the number of distinct lengths (for an integration site recovered via sonication) to estimate the number of distinct sonication fragments that underlie the observed number of lengths.
Charles C. Berry
Maintainer: <[email protected]>
The reference to the algorithm is:
Estimating abundances of retroviral insertion sites from DNA fragment length data. Berry CC, Gillet NA, Melamed A, Gormley N, Bangham CR, Bushman FD. Bioinformatics; 2012 Mar 15;28(6):755-62.
Some applications that illustrate use of this method are found here:
Gillet, Nicolas A., et al. "The host genomic environment of the provirus determines the abundance of HTLV-1 infected T-cell clones." Blood 117.11 (2011): 3113-3122.
Maldarelli, F., et al. "Specific HIV integration sites are linked to clonal expansion and persistence of infected cells." Science 345.6193 (2014): 179-183.
A sample from a patient with HTLV-1 was analyzed as described in Gillet et al (2001).
data(A1)
data(A1)
A data frame with 6176 observations on the following 3 variables.
locations
a factor with levels 01 F 101480450
... X R 99972891
indicating the genomic sites at which
insertions were detected
lengths
a numeric vector of fragment lengths
replicates
a numeric vector giving replicate IDs
For each replicate, the unique combinations of the genome location at which an insertion site was detected and the length(s) of the fragment(s) sequenced were noted. There is one such row in he data.frame for each such combination.
See Gillet et al 2011 for more details.
Gillet, N.A., Malani, N., Melamed, A., Gormley, N., Carter, R., Bentley, D., Berry, C., Bushman, F.D., Taylor, G.P., and Bangham, C.R. (2011). The host genomic environment of the provirus determines the abundance of HTLV-1-infected T-cell clones. Blood 117, 3113-3122.
data(A1) str(A1)
data(A1) str(A1)
When one or more replicate samples are sonicated and the lengths for each integration site are recorded, it is of interest to estimate actual number of sonicants allowing for coincidentally equal lengths. A likelihood approach is used here.
estAbund(locations, lengths, replicates = NULL, jackknife = F, kmax=0, min.length = 20, ...)
estAbund(locations, lengths, replicates = NULL, jackknife = F, kmax=0, min.length = 20, ...)
locations |
a vector of IDs for distinct locations |
lengths |
a vector of the corresponding lengths |
replicates |
an optional vector of the replicate sample ID |
jackknife |
Whether to do leave-one-out jackknife samples over the replicates |
kmax |
highest count to bother with when computing mass function
for counts. All higher values are
globbed together in the result. If |
min.length |
Least length expected for a fragment. Shorter lengths will not be permitted |
... |
Other arguments that may be passed along to functions that do the real work. |
This is a wrapper function for maxEM
so study that
function to see what is going on.
a list with components
theta |
estimated abundances |
phi |
estimated bin probabilities |
var.theta |
variances of |
iter |
number of iterations till convergence was achieved |
call |
the function call |
lframe |
the |
obs |
the observed counts of distinct lengths |
jackknife |
a list of calls to |
data |
a |
Charles C Berry [email protected]
Estimating abundances of retroviral insertion sites from DNA fragment length data. Berry CC, Gillet NA, Melamed A, Gormley N, Bangham CR, Bushman FD. Bioinformatics; 2012 Mar 15;28(6):755-62.
maxEM
phi.update.lframe
The package vignette:Estimating Abundances with sonicLength
Take the E-step of the EM algorithm for estimating the number of sonicants in a sample
Ey.given.x(x, theta, phi) pr.y.given.x(x, theta, phi, kmax=20 )
Ey.given.x(x, theta, phi) pr.y.given.x(x, theta, phi, kmax=20 )
x |
a zero-one indicator matrix whose rows correspond to unique lengths with rownames indicating those lengths |
theta |
a vector of the abundance estimates. |
phi |
a vector of the probabilities of sonicant
lengths. |
kmax |
highest count to bother with (all higher values are globbed together in the result) |
Supposing Poisson sampling of sonicants,
Ey.given.x(x,theta,phi)[i,j]
gives the expected value of the
number of sonicants given that x[i,j]
distinct sonicant lengths
were observed. pr.y.given.x(x,theta,phi.kmax)[k,j]
gives the
probability of k
sonicants (censored at kmax+1
) of
insertion site j
Ey.given.x()
is a double matrix of the same dimension as
x
. pr.y.given.x(...,kmax)
is a double matrix of
dimension c( kmax+1, ncol(x) )
Charles C. Berry [email protected]
mat <- diag(10) mat[ ,10 ] <- 1.0 phi1 <- prop.table( rep(1,10)) theta1 <- 1:10 Ey.given.x( mat, theta1, phi1 ) pr.mat <- pr.y.given.x( mat, theta1, phi1 ) ## Estimate Seen plus Unseen sites popsize.chao <- function(tab) sum(tab)+tab[1]*(tab[1]-1)/(2*(tab[2]+1)) ## evaluate at expected counts popsize.chao( rowSums(pr.mat) ) ## average randomly sampled counts cnt.samp <- function(x) sample( seq_along(x) , 1 ,pr=x ) mean(replicate(100,popsize.chao( table(apply(pr.mat,2, cnt.samp) ))))
mat <- diag(10) mat[ ,10 ] <- 1.0 phi1 <- prop.table( rep(1,10)) theta1 <- 1:10 Ey.given.x( mat, theta1, phi1 ) pr.mat <- pr.y.given.x( mat, theta1, phi1 ) ## Estimate Seen plus Unseen sites popsize.chao <- function(tab) sum(tab)+tab[1]*(tab[1]-1)/(2*(tab[2]+1)) ## evaluate at expected counts popsize.chao( rowSums(pr.mat) ) ## average randomly sampled counts cnt.samp <- function(x) sample( seq_along(x) , 1 ,pr=x ) mean(replicate(100,popsize.chao( table(apply(pr.mat,2, cnt.samp) ))))
Density, distribution function, quantile function and random generation for a simple parametric distribution for fragments lengths (given by conditioning a geometric distribution on whether the length can be recovered).
dfrag( x, loc=45, lscale=2.5, rate=0.02, maxx=qgeom(1-1e-7,rate) ) pfrag( q, loc=45, lscale=2.5, rate=0.02, maxx=qgeom(1-1e-7,rate), lower.tail=TRUE ) qfrag( p, loc=45, lscale=2.5, rate=0.02, maxx=qgeom(1-1e-7,rate), lower.tail=TRUE ) rfrag( n, loc=45, lscale=2.5, rate=0.02, maxx=qgeom(1-1e-7,rate) )
dfrag( x, loc=45, lscale=2.5, rate=0.02, maxx=qgeom(1-1e-7,rate) ) pfrag( q, loc=45, lscale=2.5, rate=0.02, maxx=qgeom(1-1e-7,rate), lower.tail=TRUE ) qfrag( p, loc=45, lscale=2.5, rate=0.02, maxx=qgeom(1-1e-7,rate), lower.tail=TRUE ) rfrag( n, loc=45, lscale=2.5, rate=0.02, maxx=qgeom(1-1e-7,rate) )
x , q
|
vector of quantile (of lengths). |
n |
number of lengths to sample. |
p |
vector of probabilities |
loc |
vector of locations of logistic for prob of recovery. |
lscale |
vector of scales of logistic for prob of recovery. |
rate |
probability for geometric probability for fragment lengths. |
maxx |
integer; largest value of x to bother with. |
lower.tail |
logical; if TRUE (default), probabilities are
|
The mass function is given by
plogis(x,loc,lscale)*dgeom(x,rate) / denom
, where denom
scales the result to sum to 1.0 in the range 0:maxx
. The other
functions all depend on this in the obvious manner. If maxx
is
not large enough a warning may be issued, but even without this warning
the results may be slightly innaccurate if
pgeom(maxx,rate,lower.tail=FALSE)
is non-negligible.
dfrag
gives the mass function,
pfrag
gives the distribution function,
qfrag
gives the quantile function, and
rfrag
generates random deviates.
Charles C. Berry [email protected]
plot( 0:300, table(factor(rfrag(2000),0:300)) ) lines( 0:300, 2000*dfrag(0:300) )
plot( 0:300, table(factor(rfrag(2000),0:300)) ) lines( 0:300, 2000*dfrag(0:300) )
From information about the lengths of sonicants of integration sites, infer the relative abundances of different clones and the distributon of sonicant lengths
maxEM(slmat, theta.var=FALSE, phi.update=NULL, phi.deriv=NULL, lframe = NULL, glm.frm = NULL, iter.control=NULL, ... )
maxEM(slmat, theta.var=FALSE, phi.update=NULL, phi.deriv=NULL, lframe = NULL, glm.frm = NULL, iter.control=NULL, ... )
slmat |
a matrix whose rows correspond to unique lengths with rownames indicating those lengths |
theta.var |
logical, return variance of theta estimates? |
phi.update |
a function of an object like |
phi.deriv |
function of theta and phi that returns derivatives of phi wrt beta (its parameters) |
lframe |
a |
glm.frm |
a formula like |
iter.control |
a list of default values for iteration control - see |
... |
possibly other args to pass to |
The EM method is used to infer the relative abundances of different sites.
theta |
a vector of the abundance estimates |
phi |
a vector of the probabilities of sonicant lengths |
call |
the call used |
Charles C. Berry [email protected]
mat0 <- matrix(0,nr=48,nc=140) vals <- c(rep(1,100),2:40,100) mat1 <- sapply( vals, function(x) as.numeric(is.element(1:200 ,rgeom(x,.02)))) mat <- rbind(mat0,mat1) posVals <- colSums(mat) > 0 vals <- vals[ posVals ] mat <- mat[, posVals ] rownames(mat) <- 1:nrow(mat) res <- maxEM(mat) matplot( vals, cbind(res$theta, colSums(mat)), pch=1:2, xlab='original values', ylab='estimated values', main='Simulated Sonicants and Estimates') legend( "bottomright", pch=1:2, col=1:2, legend=c(expression(hat(theta)[j]),expression(sum(y[ij],i)))) abline(a=0,b=1,col='gray')
mat0 <- matrix(0,nr=48,nc=140) vals <- c(rep(1,100),2:40,100) mat1 <- sapply( vals, function(x) as.numeric(is.element(1:200 ,rgeom(x,.02)))) mat <- rbind(mat0,mat1) posVals <- colSums(mat) > 0 vals <- vals[ posVals ] mat <- mat[, posVals ] rownames(mat) <- 1:nrow(mat) res <- maxEM(mat) matplot( vals, cbind(res$theta, colSums(mat)), pch=1:2, xlab='original values', ylab='estimated values', main='Simulated Sonicants and Estimates') legend( "bottomright", pch=1:2, col=1:2, legend=c(expression(hat(theta)[j]),expression(sum(y[ij],i)))) abline(a=0,b=1,col='gray')
Set parameters for control of iteration in maxEM
maxEM.iter.control(min.reps = 3, max.reps = 2000, max.abs.le = 0.1, max.rel.le = 1e-05, phi.min=.Machine$double.eps)
maxEM.iter.control(min.reps = 3, max.reps = 2000, max.abs.le = 0.1, max.rel.le = 1e-05, phi.min=.Machine$double.eps)
min.reps |
a positive integer |
max.reps |
a positive integer |
max.abs.le |
a smallish double |
max.rel.le |
a tiny double |
phi.min |
a smallish double |
Most users should not need to tinker with these settings
a list such as formals(maxEM.iter.control)
Charles C. Berry [email protected]
This is a utility function for maxEM
mstep(x, theta, phi)
mstep(x, theta, phi)
x |
a zero-one indicator matrix whose rows correspond to unique lengths with rownames indicating those lengths |
theta |
a vector of the abundance estimates. |
phi |
a vector of the probabilities of sonicant lengths. |
The M-step for theta is computed. Probably, there is no need to use this function directly, but just in case it is here.
a vector like the input theta
Charles C. Berry
mat <- diag(10) mat[ ,10 ] <- 1.0 phi1 <- prop.table( rep(1,10)) theta1 <- 1:10 sonicLength:::mstep( mat, theta1,phi1)
mat <- diag(10) mat[ ,10 ] <- 1.0 phi1 <- prop.table( rep(1,10)) theta1 <- 1:10 sonicLength:::mstep( mat, theta1,phi1)
Estimate phi in a flexible manner as guided by lframe
- usually
involving multiple strata, and allowing for formulas other than the
default choice
phi.update.lframe(obj, return.fit = FALSE, lframe, glm.frm = NULL, ...)
phi.update.lframe(obj, return.fit = FALSE, lframe, glm.frm = NULL, ...)
obj |
a matrix whose rowSums are used to estimate phi |
return.fit |
logical - the result of the fit should be returned instead of phi |
lframe |
A |
glm.frm |
a formula. If omitted |
... |
curently not used |
fitting phi - the probabilities that a sonicant lands in a particular
bin - is crucial to estimating theta, then number of sonicants when
more than one lans in a bin. bins may be defined by sonicant lengths
when there is just one sample. When there are multiple samples, then
one may wish to estimate phi separately for each
one. lframe$strata
indicates separate sample and the default
glm.frm
will fit each one separately.
a vector whose sum is 1.0 with one element for each row of obj
Charles C. Berry [email protected]
Sonication and Processing of DNA containing retroviral sequences yields genomic locations of retroviral insertion sites and a set of associated fragment lengths. The simulation depends on the expected number of fragments from each site and the distribution of lengths.
simSonic( theta, phi )
simSonic( theta, phi )
theta |
a vector of expected number of fragments from each site. If the vector has unqiue names, these will be used to label the locations |
phi |
a vector of probabilities (or a list containing a vector
named |
This object can provide the arguments used by estAbund
a data.frame
with columns locations
,
lengths
, and replicates
. See estAbund
for
more details.
Charles C. Berry [email protected]
theta <- seq(0.5,20.5,by=0.5) phi <- prop.table(1:10) names(phi) <- paste( 1 , 51:60 ) res <- simSonic( theta, phi ) head(res) tail(res) summary(res)
theta <- seq(0.5,20.5,by=0.5) phi <- prop.table(1:10) names(phi) <- paste( 1 , 51:60 ) res <- simSonic( theta, phi ) head(res) tail(res) summary(res)