Title: | Estimation in Dirichlet-Multinomial Distribution |
---|---|
Description: | Estimate parameters in Dirichlet-Multinomial and compute log-likelihoods. |
Authors: | Torben Tvedebrink <[email protected]> |
Maintainer: | Torben Tvedebrink <[email protected]> |
License: | GPL (>= 2) |
Version: | 0.1.3-5 |
Built: | 2024-12-18 06:53:24 UTC |
Source: | CRAN |
Computes the profile log-likelihood of
for an interval determined by a given difference in log-likelihood
value from the maximum log-likelihood value.
adapGridProf(data, delta, stepsize=50)
adapGridProf(data, delta, stepsize=50)
data |
A matrix or table with counts. Rows represent subpopulations and columns the different categories of the data. Zero rows or columns are automaticly removed. |
delta |
The difference between max of log-likelihood and the profile log-likelihood. May be used to construct approximate confidence intervals, e.g. with delta = qchisq(0.95,df=1)*2. |
stepsize |
The stepsize used when stepping left/right of the
MLE. The stepsize used by the algorithm is given by the MLE of theta
divided by |
Gives a data frame with theta values and associated profile log-likelihood values.
data(us) fit <- dirmult(us[[1]],epsilon=10^(-12),trace=FALSE) adapGridProf(us[[1]],delta=0.5) ## Not run: adapGridProf(us[[1]],delta=qchisq(0.95,df=1)*2)
data(us) fit <- dirmult(us[[1]],epsilon=10^(-12),trace=FALSE) adapGridProf(us[[1]],delta=0.5) ## Not run: adapGridProf(us[[1]],delta=qchisq(0.95,df=1)*2)
Consider allele frequencies from different
subpopulations. The allele counts, , (or equivalently
allele frequencies) are expected to vary between subpopulation. This
variability are sometimes refered to as identity-by-decent, but may be
modelled as overdispersion due to intra-class correlation
. The allele counts within each subpopulation is
assumed to follow a multinomial distribution conditioned on the allele
probabilities,
. When
follows a Dirichlet distribution the marginal
distribution of
is Dirichlet-multinomial with parameters
and
with density:
Using an alternative parameterization the density may be written as:
where and
.
This formulation second parameterization is used in the iterations
since it converges much faster than the original parameterization.
The function dirmult
estimates the parameters
in the Dirichlet-multinomial distribution and
transform these into
and
.
dirmult(data, init, initscalar, epsilon=10^(-4), trace=TRUE, mode)
dirmult(data, init, initscalar, epsilon=10^(-4), trace=TRUE, mode)
data |
A matrix or table with counts. Rows represent subpopulations and columns the different categories of the data. Zero rows or columns are automaticly removed. |
init |
Initial values for the |
initscalar |
Initial value for
|
epsilon |
Convergence tolerance. On termination the difference between to succeeding log-likelihoods must be smaller than epsilon. |
trace |
Logical. If TRUE the parameter estimates and log-likelihood value is printed to the screen after each iteration, otherwise no out-put is produces while iterating. |
mode |
Takes values "obs" (default) or "exp" determining whether the observed or expected FIM should be used in the Fisher Scoring. All other arguments produces an error message, but the observed FIM is used in the iterations. |
Returns a list containing:
loglik |
The final log-likelihood value. |
ite |
Number of iterations used. |
gamma |
A vector of |
pi |
A vector of |
theta |
Estimated |
data(us) fit <- dirmult(us[[1]],epsilon=10^(-4),trace=FALSE) dirmult.summary(us[[1]],fit)
data(us) fit <- dirmult(us[[1]],epsilon=10^(-4),trace=FALSE) dirmult.summary(us[[1]],fit)
Produces a summary table based on the estimated parameters from
dirmult
. The table contains MLE estimates and standard
errors together with method of moment (MoM) estimates and standard
errors based on MoM estimates from 'Weir and Hill (2002)'.
dirmult.summary(data, fit, expectedFIM=FALSE)
dirmult.summary(data, fit, expectedFIM=FALSE)
data |
A matrix or table with counts. Rows represent subpopulations and columns the different categories of the data. Zero rows or columns are automaticly removed. |
fit |
Output from |
expectedFIM |
Logical. Determines whether the observed or expected Fisher Information Matrix should be used. For speed use observed (i.e. FALSE) - for accuracy (and theoretical support) use expected (i.e. TRUE). |
Summary table with estimates and standard errors for
and
.
data(us) fit <- dirmult(us[[1]],epsilon=10^(-4),trace=FALSE) dirmult.summary(us[[1]],fit)
data(us) fit <- dirmult(us[[1]],epsilon=10^(-4),trace=FALSE) dirmult.summary(us[[1]],fit)
Estimates parameters for each table under the
contraint that
is equal for all tables.
equalTheta(data, theta, epsilon=10^(-4), trace=TRUE, initPi, maxit=1000)
equalTheta(data, theta, epsilon=10^(-4), trace=TRUE, initPi, maxit=1000)
data |
A list of matrix or table with counts. Rows in the tables represent subpopulations and columns the different categories of the data. Zero columns are automaticly removed. |
theta |
Initial value of the commen theta paramter. |
epsilon |
Tolerance of the convergence, see |
trace |
Logical. TRUE: print estimates while iterating. |
initPi |
Initial values for each pi vector (one of each table). |
maxit |
Maximum number of iterations. |
Returns a list similar to the output of dirmult
.
## Not run: data(us) fit <- lapply(us[1:2],dirmult,epsilon=10^(-12),trace=FALSE) thetas <- unlist(lapply(fit,function(x) x$theta)) logliks <- unlist(lapply(fit,function(x) x$loglik)) fit1 <- equalTheta(us[c(1:2)],theta=mean(thetas),epsilon=10^(-12)) lr <- -2*(fit1$loglik-sum(logliks)) 1-pchisq(lr,df=1) fit1$theta[[1]] ## End(Not run)
## Not run: data(us) fit <- lapply(us[1:2],dirmult,epsilon=10^(-12),trace=FALSE) thetas <- unlist(lapply(fit,function(x) x$theta)) logliks <- unlist(lapply(fit,function(x) x$loglik)) fit1 <- equalTheta(us[c(1:2)],theta=mean(thetas),epsilon=10^(-12)) lr <- -2*(fit1$loglik-sum(logliks)) 1-pchisq(lr,df=1) fit1$theta[[1]] ## End(Not run)
Computes the profile log-likelihood of
for a given value of
,
i.e.
.
estProfLogLik(data, theta, epsilon=10^(-4), trace=TRUE, initPi, maxit=1000)
estProfLogLik(data, theta, epsilon=10^(-4), trace=TRUE, initPi, maxit=1000)
data |
A matrix or table with counts. Rows represent subpopulations and columns the different categories of the data. Zero rows or columns are automaticly removed. |
theta |
The theta-value of which the profile log-likelihood is to be computed. |
epsilon |
Tolerance used in the iterations. Succeeding log-likelihood values need to be within epsilon for convergence. |
trace |
Logical. Whether parameter estimates and log-likelihood values should be printed to the screen while iterating. |
initPi |
Initial pi vector. |
maxit |
Maximum number of iterations. Default is 1000 and will often not be envoked, but if theta is to extreme compared to the MLE of theta the log-likelihood may misbehave near theta. |
Gives a list of components (similar to output from
dirmult
where loglik
and lambda
(the
Lagrange multiplier) are the most interesting.
data(us) fit <- dirmult(us[[1]],epsilon=10^(-12),trace=FALSE) estProfLogLik(us[[1]],fit$theta*1.2,epsilon=10^(-12),trace=FALSE)
data(us) fit <- dirmult(us[[1]],epsilon=10^(-12),trace=FALSE) estProfLogLik(us[[1]],fit$theta*1.2,epsilon=10^(-12),trace=FALSE)
Computes the profile log-likelihood of
for a given sequence of
by calling
estProfLogLik
.
gridProf(data, theta, from, to, len)
gridProf(data, theta, from, to, len)
data |
A matrix or table with counts. Rows represent subpopulations and columns the different categories of the data. Zero rows or columns are automaticly removed. |
theta |
A theta-value used as offset for the interval: [theta+from; theta+to]. |
from |
Left endpoint in the interval: [theta+from; theta+to]. |
to |
Right endpoint in the interval: [theta+from; theta+to]. |
len |
Number of points in the [from; to] interval. Similar to the
|
Gives a data frame with theta values and associated profile log-likelihood values.
data(us) fit <- dirmult(us[[1]],epsilon=10^(-12),trace=FALSE) ## Not run: grid <- gridProf(us[[1]],fit$theta,from=-0.001,to=0.001,len=10) plot(loglik ~ theta, data=grid, type="l") ## End(Not run)
data(us) fit <- dirmult(us[[1]],epsilon=10^(-12),trace=FALSE) ## Not run: grid <- gridProf(us[[1]],fit$theta,from=-0.001,to=0.001,len=10) plot(loglik ~ theta, data=grid, type="l") ## End(Not run)
Simulates data sets under the null-hypothesis,
. This corresponds to an ordinary multinomial
model without any overdispersion. Based on the returned data frame
simulated
-values may be computed.
nullTest(data, m=1000, prec=6)
nullTest(data, m=1000, prec=6)
data |
A matrix or table with counts. Rows represent subpopulations and columns the different categories of the data. Zero rows or columns are automaticly removed. |
m |
Number of simulated data tables. |
prec |
The tolerance of the iterations. Corresponds to
epsilon=1e-prec in |
Returns a data frame with theta estimates and log-likelihood values.
data(us) ## Not run: nullTest(us[[1]],m=50)
data(us) ## Not run: nullTest(us[[1]],m=50)
Simulates from a Dirichlet distribution
rdirichlet(n=1, alpha)
rdirichlet(n=1, alpha)
n |
The number of samples. |
alpha |
The shape parameters, need to be positive. |
Return an n x length(alpha) matrix where each row is drawn from a Dirichlet.
rdirichlet(n=100, alpha=rep(1,10))
rdirichlet(n=100, alpha=rep(1,10))
Simulates data using user defined value and allele
probabilities in the reference population,
.
simPop(J=10, K=20, n, pi, theta)
simPop(J=10, K=20, n, pi, theta)
J |
The number of subpopulations sampled. |
K |
Number of different alleles. If argument |
n |
The number of alleles sampled in each subpopulation. If
scalar repeated for all subpopulations, otherwise a vector of length
|
pi |
Vector of allele probabilities. If missing a random vector
of length |
theta |
The theta-value used for simulations. |
Return an J x K matrix with allelic counts.
simPop(n=100, theta=0.03)
simPop(n=100, theta=0.03)
9 STR loci were typed in sample populations of African Americans, U.S. Caucasians, Hispanics, Bahamians, Jamaicans, and Trinidadians.
A list of tables with allele counts.
http://www.fbi.gov/hq/lab/fsc/backissu/july1999/budowle.htm
Budowle, B., Moretti, T. R., Baumstark, A. L., Defenbaugh, D. A., and Keys, K. M. Population data on the thirteen CODIS core short tandem repeat loci in African Americans, U.S. Caucasians, Hispanics, Bahamians, Jamaicans, and Trinidadians, Journal of Forensic Sciences. 1999.
Estimates using a method of moment (MoM) estimate
by 'Weir and Hill (2002).'
weirMoM(data, se=FALSE)
weirMoM(data, se=FALSE)
data |
A matrix or table with counts. Rows represent subpopulations and columns the different categories of the data. Zero rows or columns are automaticly removed. |
se |
Logical. Determines if a standard error of theta sould be computed or not. The variance is based on an expression by Li cited in 'Weir and Hill (2002)'. |
MoM-estimate (and standard error) of theta.
Weir, B. S. and W. G. Hill (2002). 'Esimating F-statistics'. Annu Rev Genet 36: 721-750
data(us) weirMoM(us[[1]],se=TRUE)
data(us) weirMoM(us[[1]],se=TRUE)