Title: | Tools for ABC Analyses |
---|---|
Description: | Tools for approximate Bayesian computation including summary statistic selection and assessing coverage. See Nunes and Prangle (2015) <doi:10.32614/RJ-2015-030> and Rodrigues, Prangle and Sisson (2018) <doi:10.1016/j.csda.2018.04.004>. |
Authors: | Matt Nunes [aut, cre], Dennis Prangle [aut], Guilherme Rodrigues [ctb] |
Maintainer: | Matt Nunes <[email protected]> |
License: | GPL (>= 2) |
Version: | 1.1.7 |
Built: | 2024-12-12 07:06:18 UTC |
Source: | CRAN |
Tools for approximate Bayesian computation including summary statistic selection and assessing coverage.
Matt Nunes and Dennis Prangle
Maintainer: Matthew Nunes <[email protected]>
This package is described in:
Nunes, M. A. and Prangle, D. (2016) abctools: an R package for tuning approximate Bayesian computation analyses. The R Journal 7, Issue 2, 189–205.
For details on methods for summary statistics selection, see:
Blum, M. G. B, Nunes, M. A., Prangle, D. and Sisson, S. A. (2013) A
comparative review of dimension reduction methods in approximate
Bayesian computation. Stat. Sci. 28, Issue 2, 189–208.
Fearnhead, P. and Prangle, D. (2012) Constructing summary statistics for approximate Bayesian computation: semi-automatic approximate Bayesian computation. J. R. Stat. Soc. B 74, Part 3, 1–28.
Joyce, P. and P. Marjoram (2008) Approximately sufficient statistics and Bayesian computation. Stat. Appl. Gen. Mol. Biol. 7 Article 26.
Nunes, M. A. and Balding, D. J. (2010) On Optimal Selection of Summary Statistics for Approximate Bayesian Computation. Stat. Appl. Gen. Mol. Biol. 9, Iss. 1, Art. 34.
Details of ABC coverage methods appear in:
Prangle D., Blum M. G. B., Popovic G., Sisson S. A. (2014) Diagnostic tools of approximate Bayesian computation using the coverage property. Australian and New Zealand Journal of Statistics 56, Issue 4, 309–329.
This function uses approximate sufficiency to assess subsets of summary statistics for ABC inference.
AS.select(obs, param, sumstats, obspar=NULL, abcmethod=abc, grid=10, inturn=TRUE, limit=ncol(sumstats), allow.none=FALSE, do.err=FALSE, final.dens=FALSE, errfn=rsse, trace=TRUE, ...)
AS.select(obs, param, sumstats, obspar=NULL, abcmethod=abc, grid=10, inturn=TRUE, limit=ncol(sumstats), allow.none=FALSE, do.err=FALSE, final.dens=FALSE, errfn=rsse, trace=TRUE, ...)
obs |
(matrix of) observed summary statistics. |
param |
matrix of simulated model parameter values. |
sumstats |
matrix of simulated summary statistics. |
obspar |
optional observed parameters (for use to assess simulation performance). |
abcmethod |
a function to perform ABC inference, e.g. the |
grid |
the number of bins into which to divide the posterior sample for the approximate sufficiency calculation. |
inturn |
a boolean value indicating whether "bad" statistics should be dropped and tested sequentially ( |
limit |
an optional integer indicating whether to limit summary selection to subsets of a maximum size. |
allow.none |
a boolean values indicating whether an empty subset of statistics is considered in the selection procedure. |
do.err |
a boolean value indicating whether the simulation error should be returned. Note: if |
final.dens |
a boolean value indicating whether the posterior sample should be returned. |
errfn |
an error function to assess ABC inference performance. |
trace |
whether to show progress messages. |
... |
any other optional arguments to the ABC inference procedure (e.g. arguments to the |
The summary selection procedure works by sequentially testing randomly chosen statistics for inclusion, using the ratio of ABC posterior samples to determine whether a statistic is added. Since adding a statistic may result in a suboptimal subset of summaries, the included statistics are then individually dropped and retested, to determine whether a smaller subset of statistics is equally / more informative than the accepted set of statistics.
A list with the following components:
best |
the final subset of included statistics. |
err |
simulation error (if |
post.sample |
an array of dimension |
The approximate sufficiency techniques described here are only suitable for single parameters only.
Matt Nunes
Blum, M. G. B, Nunes, M. A., Prangle, D. and Sisson, S. A. (2013) A
comparative review of dimension reduction methods in approximate
Bayesian computation. Stat. Sci. 28, Issue 2, 189–208.
Joyce, P. and P. Marjoram (2008) Approximately sufficient statistics and
Bayesian computation. Stat. Appl. Gen. Mol. Biol. 7
Article 26.
Nunes, M. A. and Prangle, D. (2016) abctools: an R package for tuning
approximate Bayesian computation analyses. The R Journal
7, Issue 2, 189–205.
# load example data: data(coal) data(coalobs) param<-coal[,2] simstats<-coal[,4:6] # use matrix below just in case to preserve dimensions. obsstats<-matrix(coalobs[1,4:6],nrow=1) # example of AS.select: ## Not run: tmp <-AS.select(obsstats, param, simstats, tol=.1, method="neuralnet", nument=5, allow.none=FALSE, inturn=TRUE) tmp$best ## End(Not run)
# load example data: data(coal) data(coalobs) param<-coal[,2] simstats<-coal[,4:6] # use matrix below just in case to preserve dimensions. obsstats<-matrix(coalobs[1,4:6],nrow=1) # example of AS.select: ## Not run: tmp <-AS.select(obsstats, param, simstats, tol=.1, method="neuralnet", nument=5, allow.none=FALSE, inturn=TRUE) tmp$best ## End(Not run)
The function tests to determine adding a (set of) statistics is informative in ABC inference.
AS.test(grid = 10, x1, x2, supp=NULL)
AS.test(grid = 10, x1, x2, supp=NULL)
grid |
the number of bins into which to divide the posterior sample for the approximate sufficiency calculation. |
x1 |
the posterior sample using the first set of summary statistics. |
x2 |
the posterior sample using the second (alternative) set of summary statistics. |
supp |
the "support" of the prior (e.g. uniform bounds). |
After dividing each posterior sample into a number of bins (specified by grid
), the function computes the ratio of the posterior densities. This is seen as a measure of information added (sufficiency) by using the alternative posterior sample instead of the first posterior sample. If the ratio exceeds a particular threshold (a number of standard deviations away from the expected counts in each bin), then the alternative set of summaries is seen as being more informative.
extreme |
a boolean value indicating whether the alternative posterior sample is more informative than the first (i.e. the extra summary statistics add information. |
Matt Nunes
Blum, M. G. B, Nunes, M. A., Prangle, D. and Sisson, S. A. (2013) A
comparative review of dimension reduction methods in approximate
Bayesian computation. Stat. Sci. 28, Issue 2, 189–208.
Joyce, P. and P. Marjoram (2008) Approximately sufficient statistics and
Bayesian computation. Stat. Appl. Gen. Mol. Biol. 7
Article 26.
Nunes, M. A. and Prangle, D. (2016) abctools: an R package for tuning
approximate Bayesian computation analyses. The R Journal
7, Issue 2, 189–205.
#create two fake posterior samples: x1<-runif(10000) x2<-rnorm(10000) AS.test(x1=x1,x2=x2,supp=range(x2))
#create two fake posterior samples: x1<-runif(10000) x2<-rnorm(10000) AS.test(x1=x1,x2=x2,supp=range(x2))
Data generated from a coalescent model for genetic variation.
data(coal)
data(coal)
Matrices of parameters (first 2 columns) and summary statistics (remaining columns) from a coalescent model.
"coal" contains 100,000 simulated datasets to use for an ABC analysis. "coalobs" contains 100 which can each be used as observed datasets. This allows methods to be tested on many different observations.
The parameters are the scaled mutation rate (theta) and scaled recombination rate (rho).
The summary statistics are the number of segregating sites (segsites), independent Uniform[0,25] random noise (unif), the pairwise mean number of nucleotidic differences (meandiff), the mean R^2 across pairs separated by < 10% of the simulated genomic regions (R2), the number of distinct haplotypes (nhap), the frequency of the most common haplotype (fhap), and the number of singleton haplotypes (shap).
To simulate a dataset 5,001 basepair DNA sequences for 50 individuals are generated from the coalescent model, with recombination, under the infinite-sites mutation model, using the software ms (Hudson 2002)
Data of this form has been analysed in several ABC papers. See the references for more details.
Blum, M. G. B., M. A. Nunes, D. Prangle and S. A. Sisson 'A comparative review of dimension reduction methods in approximate Bayesian computation' Statistical Applications in Genetics and Molecular Biology 13 (2014)
Hudson, R. R. 'Generating samples under a Wright-Fisher neutral model of genetic variation' Bioinformatics 18 (2002)
Joyce, P. and P. Marjoram 'Approximately sufficient statistics and Bayesian computation' Statistical Applications in Genetics and Molecular Biology 7 (2008)
Nunes, M. A. and D. J. Balding 'On optimal selection of summary statistics for approximate Bayesian computation' Statistical Applications in Genetics and Molecular Biology 9 (2010)
Nunes, M. A. and Prangle, D. abctools: an R package for tuning approximate Bayesian computation analyses. The R Journal 7 (2016)
This function creates a table of binary masks representing combinations of statistics.
combmat(n,limit = NULL)
combmat(n,limit = NULL)
n |
number of statistics |
limit |
an optional (integer) value indicating whether to limit the table to subsets up to a certain size. |
m |
The matrix of binary masks. |
Matt Nunes
# # Find all binary masks of a set of statistics {C1,C2,C3,C4}, # listing all singlets, pairs, triples and then the whole set: combmat(4,TRUE)
# # Find all binary masks of a set of statistics {C1,C2,C3,C4}, # listing all singlets, pairs, triples and then the whole set: combmat(4,TRUE)
These functions produce diagnostic statistics for an ABC analysis to judge when the tolerance level is small enough to produce roughly no approximation error. This is done by running analyses for many pseudo-observed test data sets and assessing whether the results satisfy the "coverage property" (roughly speaking: credible intervals have the claimed coverage levels).
cov.pi(param, sumstat, testsets, tol, eps, diagnostics = c(), multicore = FALSE, cores, method = "rejection", nacc.min=20, ...) cov.mc(index, sumstat, testsets, tol, eps, diagnostics = c(), multicore = FALSE, cores, method = "rejection", nacc.min=20, ...) covstats.pi(raw, diagnostics = c("KS", "CGR"), nacc.min = 20) covstats.mc(raw, index, diagnostics = c("freq", "loglik.binary", "loglik.multi"), nacc.min = 20)
cov.pi(param, sumstat, testsets, tol, eps, diagnostics = c(), multicore = FALSE, cores, method = "rejection", nacc.min=20, ...) cov.mc(index, sumstat, testsets, tol, eps, diagnostics = c(), multicore = FALSE, cores, method = "rejection", nacc.min=20, ...) covstats.pi(raw, diagnostics = c("KS", "CGR"), nacc.min = 20) covstats.mc(raw, index, diagnostics = c("freq", "loglik.binary", "loglik.multi"), nacc.min = 20)
param |
A data frame of parameter values. It must have the same number of
rows as |
index |
A vector of model indices. Any value which can be converted to
factor is ok (e.g. character or numeric entries). It must have the
same length as |
sumstat |
A data frame of summary statistic values whose the ith row has been
simulated using |
testsets |
A numeric vector giving the rows of |
tol |
A vector of proportions of ABC acceptances which will be investigated. |
eps |
A vector of ABC thresholds which will be investigated. These are
used when |
diagnostics |
A character vector containing diagnostics to be calculated.
Allowable values for parameter inference are "KL" (Kullback-Leibler
based test) or "CGR" (Cook, Gelman and Rubin test). Allowable values
for model choice are "freq" (a separate frequency-based test for
each model), "loglik.binary" (a separate |
multicore |
Whether to use the |
cores |
Number of cores to use when |
method |
Method used for ABC analysis. The default is "rejection". For
alternatives see |
nacc.min |
Minimum number of ABC acceptances required to compute diagnostics. See Values for details of how this is used. |
... |
Extra arguments to be supplied to the function performing abc
analysis i.e. |
raw |
Raw output component from |
These functions are intended to be applied as follows (i) random
models/parameters are generated, data sets simulated for each and
summary statistics calculated (ii) these are input to cov.pi
(parameter inference) or cov.mc
(model choice) which outputs raw
results and diagnostics (see below) (iii) the output can be passed to
covstats.pi
or covstats.mc
if further diagnostics are
required later (or to find diagnostics for a subset of the pseudo-observed
data).
The cov.pi
and cov.mc
functions operate by performing
many ABC analyses. The user specifies which datasets amongst those
simulated will used as pseudo-observed "test" data to be analysed.
The results of each analysis are compared to the known
model/parameters which produced the data to see whether they are
consistent in a particular sense (i.e. if the coverage property is
satisfied). Various diagnostics are provided to judge this easily, and
determine what happens as the ABC threshold is varied. Raw results
are also returned which can be investigated in more detail.
All ABC analyses use a rejection sampling algorithm implemented by the
abc
package. The user may specify regression
post-processing as part of this analysis.
Output of cov.pi
or cov.mc
is a list of two data frames,
raw
and diag
. The covstats.pi
and
covstats.mc
functions just return the latter data frame.
For parameter inference, raw
contains estimated cdfs (referred
to as p0 estimates in Prangle et al 2013) of the true parameter values
for each input configuration (i.e. for every tol/eps value at every
test dataset). diag
is a data frame of tol/eps value,
parameter name, diagnostic name and p-value. Here the p-value relates
to the test statistic used as a diagnostic. It is NA if any analyses
had fewer than nacc.min
acceptances (Diagnostics based on a
small number of acceptances can be misleading.)
For model choice, raw
contains estimated model weights for each input
configuration, and diag
is a data frame of tol/eps value, model,
diagnostic name and p-value (NA under the same conditions as before.)
In both cases, raw
also reports the number of acceptances. Note that
raw
contains p0 estimates/weights of NA if regression correction is
requested but there are too few acceptances to compute it.
Dennis Prangle
Nunes, M. A. and Prangle, D. (2016) abctools: an R package for tuning approximate Bayesian computation analyses. The R Journal 7, Issue 2, 189–205.
Prangle D., Blum M. G. B., Popovic G., Sisson S. A. (2014) Diagnostic tools of approximate Bayesian computation using the coverage property. Australian and New Zealand Journal of Statistics 56, Issue 4, 309–329.
mc.ci
for a diagnostic plot of raw model choice results
abc
and postpr
to perform ABC for a given dataset
##The examples below are chosen to run relatively quickly (<5 mins) ##and do not represent recommended tuning choices. ## Not run: data(musigma2) library(ggplot2) ##Parameter inference example parameters <- data.frame(par.sim) sumstats <- data.frame(stat.sim) covdiag <- cov.pi(param=parameters, sumstat=sumstats, testsets=1:100, tol=seq(0.1,1,by=0.1), diagnostics=c("KS")) #Plot of diagnostic results qplot(x=tol, y=pvalue, facets=.~parameter, data=covdiag$diag) #Plot of raw results for tol=0.5 qplot(x=mu, data=subset(covdiag$raw, tol==0.5)) #Plot of raw results for tol=0.5 qplot(x=sigma2, data=subset(covdiag$raw, tol==0.5)) #Compute CGR statistic and plot cgrout <- covstats.pi(covdiag$raw, diagnostics="CGR") qplot(x=tol, y=pvalue, facets=.~parameter, data=cgrout) ##Model choice example, based on simple simulated data index <- sample(1:2, 1E4, replace=TRUE) sumstat <- ifelse(index==1, rnorm(1E4,0,1), rnorm(1E4,0,rexp(1E4,1))) sumstat <- data.frame(ss=sumstat) covdiag <- cov.mc(index=index, sumstat=sumstat, testsets=1:100, tol=seq(0.1,1,by=0.1), diagnostics=c("freq")) qplot(x=tol, y=pvalue, data=covdiag$diag) llout <- covstats.mc(covdiag$raw, index=index, diagnostics="loglik.binary") qplot(x=tol, y=pvalue, data=llout) mc.ci(covdiag$raw, tol=0.5, modname=1, modtrue=index[1:200]) ## End(Not run)
##The examples below are chosen to run relatively quickly (<5 mins) ##and do not represent recommended tuning choices. ## Not run: data(musigma2) library(ggplot2) ##Parameter inference example parameters <- data.frame(par.sim) sumstats <- data.frame(stat.sim) covdiag <- cov.pi(param=parameters, sumstat=sumstats, testsets=1:100, tol=seq(0.1,1,by=0.1), diagnostics=c("KS")) #Plot of diagnostic results qplot(x=tol, y=pvalue, facets=.~parameter, data=covdiag$diag) #Plot of raw results for tol=0.5 qplot(x=mu, data=subset(covdiag$raw, tol==0.5)) #Plot of raw results for tol=0.5 qplot(x=sigma2, data=subset(covdiag$raw, tol==0.5)) #Compute CGR statistic and plot cgrout <- covstats.pi(covdiag$raw, diagnostics="CGR") qplot(x=tol, y=pvalue, facets=.~parameter, data=cgrout) ##Model choice example, based on simple simulated data index <- sample(1:2, 1E4, replace=TRUE) sumstat <- ifelse(index==1, rnorm(1E4,0,1), rnorm(1E4,0,rexp(1E4,1))) sumstat <- data.frame(ss=sumstat) covdiag <- cov.mc(index=index, sumstat=sumstat, testsets=1:100, tol=seq(0.1,1,by=0.1), diagnostics=c("freq")) qplot(x=tol, y=pvalue, data=covdiag$diag) llout <- covstats.mc(covdiag$raw, index=index, diagnostics="loglik.binary") qplot(x=tol, y=pvalue, data=llout) mc.ci(covdiag$raw, tol=0.5, modname=1, modtrue=index[1:200]) ## End(Not run)
Plots credible interval estimates for raw model choice output from
cov.mc
. This is used to investigate whether the coverage
property holds and validate whether diagnostic statistics are acting
as intended.
mc.ci(raw, tol, eps, modname, modtrue, nbins=5, bintype=c("interval", "quantile"), bw=FALSE, ...)
mc.ci(raw, tol, eps, modname, modtrue, nbins=5, bintype=c("interval", "quantile"), bw=FALSE, ...)
raw |
The |
tol |
The value of |
eps |
The value of |
modname |
The name of the model to test. |
modtrue |
Vector containing the true models generating the pseudo-observed
test data. i.e. |
nbins |
Number of bins to display. |
bintype |
How to choose the bins (see Details). |
bw |
Whether to produce a black and white image. Default is FALSE. Colour is used to make different bins stand out. |
... |
Additional plotting arguments - anything that can be used by |
This function provides a plot which can be used as an informal test of
the model choice coverage hypothesis for a particular value of
eps
or tol
and choice of model. The plot is more
flexible than the diagnostics, but not suitable as the basis of a
formal test.
For each pseudo-observed data set, the ABC probability that the model
is modname
is taken from raw
, and the true model is
taken from modtrue
. The probabilities are binned into
nbins
intervals, either of equal length or based on nbins+1
equally spaced empirical quantiles. The function estimates the
observed probability of modname
within each bin using Bayesian
inference for a binomial proportion under a uniform prior. The plot
shows the mean and 95% credible interval plotted against predicted
probabilities. Informally, the coverage property should be rejected
if predicted values are too unlikely given the observed values.
Dennis Prangle
Nunes, M. A. and Prangle, D. (2016) abctools: an R package for tuning approximate Bayesian computation analyses. The R Journal 7, Issue 2, 189–205.
Prangle D., Blum M. G. B., Popovic G., Sisson S. A. (2014) Diagnostic tools of approximate Bayesian computation using the coverage property. Australian and New Zealand Journal of Statistics 56, Issue 4, 309–329.
cov.mc
to produce the input for this function
##The examples below are chosen to run relatively quickly (<5 mins) ##and do not represent recommended tuning choices. ## Not run: index <- sample(1:2, 1E4, replace=TRUE) sumstat <- ifelse(index==1, rnorm(1E4,0,1), rnorm(1E4,0,rexp(1E4,1))) sumstat <- data.frame(ss=sumstat) covdiag <- cov.mc(index=index, sumstat=sumstat, testsets=1:100, tol=seq(0.1,1,by=0.1), diagnostics=c("freq")) mc.ci(covdiag$raw, tol=0.5, modname=1, modtrue=index[1:100]) ## End(Not run)
##The examples below are chosen to run relatively quickly (<5 mins) ##and do not represent recommended tuning choices. ## Not run: index <- sample(1:2, 1E4, replace=TRUE) sumstat <- ifelse(index==1, rnorm(1E4,0,1), rnorm(1E4,0,rexp(1E4,1))) sumstat <- data.frame(ss=sumstat) covdiag <- cov.mc(index=index, sumstat=sumstat, testsets=1:100, tol=seq(0.1,1,by=0.1), diagnostics=c("freq")) mc.ci(covdiag$raw, tol=0.5, modname=1, modtrue=index[1:100]) ## End(Not run)
The function cycles through all possible subsets of summary statistics and computes a criterion from the posterior sample. The subset which achieves the minimum is chosen as the most informative subset.
mincrit(obs, param, sumstats, obspar = NULL, abcmethod = abc, crit = nn.ent, sumsubs = 1:ncol(sumstats), limit=length(sumsubs), do.only = NULL, verbose = TRUE, do.crit = TRUE, do.err = FALSE, final.dens = FALSE, errfn = rsse, ...)
mincrit(obs, param, sumstats, obspar = NULL, abcmethod = abc, crit = nn.ent, sumsubs = 1:ncol(sumstats), limit=length(sumsubs), do.only = NULL, verbose = TRUE, do.crit = TRUE, do.err = FALSE, final.dens = FALSE, errfn = rsse, ...)
obs |
(matrix of) observed summary statistics. |
param |
matrix of simulated model parameter values. |
sumstats |
matrix of simulated summary statistics. |
obspar |
optional observed parameters (for use to assess simulation performance). |
abcmethod |
a function to perform ABC inference, e.g. the |
crit |
a function to minimize to measure information from a posterior sample, e.g. |
sumsubs |
an optional index into the summary statistics to limit summary selection to a specific subset of summaries. |
limit |
an optional integer indicating whether to limit summary selection to subsets of a maximum size. |
do.only |
an optional index into the summary statistics combination table. Can be used to limit entropy calculations to certain summary statistics subsets only. |
verbose |
a boolean value indicating whether informative statements should be printed to screen. |
do.crit |
a boolean value indicating whether the measure on the posterior sample should be returned. |
do.err |
a boolean value indicating whether the simulation error should be returned. Note: if |
final.dens |
a boolean value indicating whether the posterior sample should be returned. |
errfn |
an error function to assess ABC inference performance. |
... |
any other optional arguments to the ABC inference procedure (e.g. arguments to the |
The function uses a criterion (e.g.sample entropy) as a proxy for information in a posterior sample. The criterion for each possible subset of statistics is computed, and the best subset is judged as the one which minimises this vector of values.
A list with the following components:
best |
the best subset(s) of statistics. |
critvals |
the calculated criterion values (if |
err |
simulation error (if |
order |
the subsets considered during the algorithm (same as the input |
post.sample |
an array of dimension |
sumsubs |
an index into the subsets considered during the algorithm. |
These functions are computationally intensive due to the cyclic ABC inference procedure.
Matt Nunes
Nunes, M. A. and Balding, D. J. (2010) On Optimal Selection of Summary Statistics for Approximate Bayesian Computation. Stat. Appl. Gen. Mol. Biol. 9, Iss. 1, Art. 34.
Nunes, M. A. and Prangle, D. (2016) abctools: an R package for tuning approximate Bayesian computation analyses. The R Journal 7, Issue 2, 189–205.
# load example data: data(coal) data(coalobs) param<-coal[1:50000,2] simstats<-coal[1:50000,4:6] # use matrix below just in case to preserve dimensions. obsstats<-matrix(coalobs[1,4:6],nrow=1) obsparam<-matrix(coalobs[1,1]) # example of entropy minimization algorithm: tmp <-mincrit(obsstats, param, simstats, tol=.01, method="rejection", do.crit=TRUE) tmp$critvals
# load example data: data(coal) data(coalobs) param<-coal[1:50000,2] simstats<-coal[1:50000,4:6] # use matrix below just in case to preserve dimensions. obsstats<-matrix(coalobs[1,4:6],nrow=1) obsparam<-matrix(coalobs[1,1]) # example of entropy minimization algorithm: tmp <-mincrit(obsstats, param, simstats, tol=.01, method="rejection", do.crit=TRUE) tmp$critvals
The function computes the k nearest neighbour sample entropy.
nn.ent(th, k=4)
nn.ent(th, k=4)
th |
The sample from which to compute the entropy. |
k |
The order (number of neighbours) of the sample entropy calculation. |
The sample entropy gives a measure of information in a (posterior) sample, or lack of it.
The k nearest neighbour entropy from the sample.
For high-dimensional posterior samples, the nn.ent
calculation is quite computationally intensive.
Matt Nunes
Nunes, M. A. and Prangle, D. (2016) abctools: an R package for tuning
approximate Bayesian computation analyses. The R Journal
7, Issue 2, 189–205.
Singh, H. et al. (2003) Nearest neighbor estimates of entropy. Am. J. Math. Man. Sci.,23, 301–321.
Shannon, C. E. and Weaver, W. (1948) A mathematical theory of communication. Bell Syst. Tech. J., 27, 379–423.
# create a dummy sample to calculate an entropy measure: theta<-rnorm(10000) nn.ent(theta)
# create a dummy sample to calculate an entropy measure: theta<-rnorm(10000) nn.ent(theta)
This function post-processes ABC output with the aim of calibrating its credible intervals to have the correct probabilities. The approach can be thought of as extending coverage tests to correct deviations from the desired posterior. See the reference for details.
recalibrationABC(target, param, sumstat, eps, tol, method="rejection", multicore=FALSE, cores=NULL, abc.p.options=list(method="loclinear"), ...)
recalibrationABC(target, param, sumstat, eps, tol, method="rejection", multicore=FALSE, cores=NULL, abc.p.options=list(method="loclinear"), ...)
target |
A vector of the observed summary statistics. |
param |
A data frame of the simulated parameter values. |
sumstat |
A data frame of the simulated summary statistics. |
eps |
The ABC acceptance threshold i.e. max acceptable distance. This or
|
tol |
The ABC acceptance tolerance i.e. proportion of simulations to
accept. This or |
method |
A character string indicating the type of ABC algorithm to be applied. Possible values are "rejection", "loclinear", "neuralnet" and "ridge". |
multicore |
Whether to use the |
cores |
Number of cores to use when |
abc.p.options |
A list of further arguments to be supplied to the |
... |
Further arguments to be supplied to the |
A list with the following components is returned.
sample.abc
is the ordinary ABC output sample (with any
regression correction requested). The rows represent accepted samples
and the columns represent the parameters.
sample.recal
is the ABC output sample following
recalibration. The rows are Monte Carlo draws from an approximation to
the posterior and the columns represent the parameters.
sample.regrecal
is the ABC output sample following coverage
correction including regression correction of p-values. It has a
similar form to sample.recal
.
weights
are weights for the ABC output. These apply to all
types of ABC.
Each row of pvalues
corresponds to a particular accepted
dataset. It gives the p-values of the true parameters within the ABC
sample generated from this data.
Each row of pvalues.reg
corresponds to pvalues
after
a regression-adjustment has been performed on them.
svalues
is a subset of the rows of sumstat
corresponding
to accepted datasets. These can be used in conjunction with
pvalues
to perform a recalibration correction manually.
Dennis Prangle and Guilherme Rodrigues
G. S. Rodrigues, D. Prangle and S. A. Sisson (2017) Recalibration: A post-processing method for approximate Bayesian computation. In submission
## Not run: data(musigma2) P <- data.frame(par.sim) S <- data.frame(stat.sim) out <- recalibrationABC(target=stat.obs, param=P, sumstat=S, tol=0.3) plot(rbind(out$sample.plain[1:500,], out$sample.recal[1:500,]), col=c(rep("red",500), rep("blue", 500))) ##Red shows plain ABC sample, blue shows recalibrated output ## End(Not run)
## Not run: data(musigma2) P <- data.frame(par.sim) S <- data.frame(stat.sim) out <- recalibrationABC(target=stat.obs, param=P, sumstat=S, tol=0.3) plot(rbind(out$sample.plain[1:500,], out$sample.recal[1:500,]), col=c(rep("red",500), rep("blue", 500))) ##Red shows plain ABC sample, blue shows recalibrated output ## End(Not run)
Computes error measures between two sets of data.
rsse(a, b, v = 1) sse(a, b, v = 1)
rsse(a, b, v = 1) sse(a, b, v = 1)
a |
Dataset 1, of dimension (n x p) |
b |
Dataset 2: of dimension (1 x p) or a vector. |
v |
an optional factor to normalise the data before computation of the RSSE. |
The RSSE between dataset a
and b
.
Matt Nunes
a<-matrix(rnorm(1000),ncol=2) b<-runif(2) rsse(a,b)
a<-matrix(rnorm(1000),ncol=2) b<-runif(2) rsse(a,b)
saABC
fits parameter estimators based on simulated data to be
used as summary statistics within ABC. Fitting is by linear
regression. Some simple diagnostics are provided for assistance.
saABC(theta, X, plot = TRUE)
saABC(theta, X, plot = TRUE)
theta |
A n x d matrix or data frame of simulated parameter values.
|
X |
A n x p matrix or data frame of simulated data and/or associated
transformations. |
plot |
When |
The semi-automatic ABC method of Fearnhead and Prangle (2012) is as follows:
1) Simulate parameter vectors and corresponding data sets
for i=1,2,...,N.
2) Use the simulations to fit an estimator of each parameter as a linear combination of f(x), where f(x) is a vector of transformations of x (including a constant term).
3) Run ABC using these simulations.
The saABC
function automates step 2 of this process. The user
must supply simulated parameter values theta
and corresponding
f(x) values x
(n.b. excluding the constant term). The function
returns weights for the linear combinations which can easily be used
for step 3. In particular, fitted weights are returned as a matrix
of weights for the columns of x
and a vector of constants. The
vector can usually be discarded, as it is not needed to find
differences between summary statistics.
The function also returns BIC values for each parameter so that the user can judge the quality of the fits, and compare different choices of f(x). Diagnostic plots of supplied parameter values against fitted values are also optionally provided. These are useful for exploratory purposes when there are a small number of parameters, but provide less protection from overfitting than BIC values.
B0 |
Vector of constant terms from fitted regressions. |
B |
Matrix of weights from fitted regressions. |
BICs |
Vector of BIC values for each fitted regression. |
Dennis Prangle
Blum, M. G. B, Nunes, M. A., Prangle, D. and Sisson, S. A. (2013) A
comparative review of dimension reduction methods in approximate
Bayesian computation. Stat. Sci. 28, Issue 2, 189–208.
Fearnhead, P. and Prangle, D. (2012) Constructing summary statistics for approximate Bayesian computation:
semi-automatic approximate Bayesian
computation. J. R. Stat. Soc. B 74, Part 3, 1–28.
Nunes, M. A. and Prangle, D. (2016) abctools: an R package for tuning
approximate Bayesian computation analyses. The R Journal
7, Issue 2, 189–205.
set.seed(1) theta <- matrix(runif(2E3),ncol=2) colnames(theta) <- c("Mean", "Variance") X <- replicate(5, rnorm(1E3, theta[,1], theta[,2])) saABC(theta, X)$BICs saABC(theta, cbind(X, X^2))$BICs ##Variance parameter estimated better
set.seed(1) theta <- matrix(runif(2E3),ncol=2) colnames(theta) <- c("Mean", "Variance") X <- replicate(5, rnorm(1E3, theta[,1], theta[,2])) saABC(theta, X)$BICs saABC(theta, cbind(X, X^2))$BICs ##Variance parameter estimated better
The function implements functions which implement summary statistics selection methods.
selectsumm(obs, param, sumstats, obspar=NULL, ssmethod = mincrit, verbose = TRUE, final.dens = FALSE, ...)
selectsumm(obs, param, sumstats, obspar=NULL, ssmethod = mincrit, verbose = TRUE, final.dens = FALSE, ...)
obs |
(matrix of) observed summary statistics. |
param |
matrix of simulated model parameter values. |
sumstats |
matrix of simulated summary statistics. |
obspar |
optional observed parameters (for use to assess simulation performance). |
ssmethod |
a function to perform summary statistics selection. Current methods are |
verbose |
a boolean value indicating whether informative statements should be printed to screen. |
final.dens |
a boolean value indicating whether the posterior sample should be returned. |
... |
any other optional arguments to the summary selection procedure. |
The function is essentially a wrapper for more specific summary selection methods, and is designed to be flexible for future additions and minimization criteria. See the help files for each summary selection method for more details.
A list with the following components:
best |
the best subset(s) of statistics. |
critvals |
the calculated criterion values (if |
err |
simulation error (if |
order |
the subsets considered during the algorithm (same as the input |
post.sample |
an array of dimension |
sumsubs |
an index into the subsets considered during the algorithm. |
Matt Nunes
Blum, M. G. B, Nunes, M. A., Prangle, D. and Sisson, S. A. (2013) A
comparative review of dimension reduction methods in approximate
Bayesian computation. Stat. Sci. 28, Issue 2, 189–208.
Fearnhead, P. and Prangle, D. (2012) Constructing summary statistics for approximate Bayesian computation: semi-automatic approximate Bayesian computation. J. R. Stat. Soc. B 74, Part 3, 1–28.
Joyce, P. and P. Marjoram (2008) Approximately sufficient statistics and Bayesian computation. Stat. Appl. Gen. Mol. Biol. 7 Article 26.
Nunes, M. A. and Balding, D. J. (2010) On Optimal Selection of Summary
Statistics for Approximate Bayesian Computation.
Stat. Appl. Gen. Mol. Biol. 9, Iss. 1, Art. 34.
Nunes, M. A. and Prangle, D. (2016) abctools: an R package for tuning
approximate Bayesian computation analyses. The R Journal
7, Issue 2, 189–205.
# load example data: data(coal) data(coalobs) param<-coal[,2] simstats<-coal[,4:6] # use matrix below just in case to preserve dimensions. obsstats<-matrix(coalobs[1,4:6],nrow=1) tmp<-selectsumm(obsstats, param, simstats, ssmethod =AS.select, tol =.1, method = "rejection", allow.none = FALSE, inturn = TRUE, hcorr = TRUE) tmp$best
# load example data: data(coal) data(coalobs) param<-coal[,2] simstats<-coal[,4:6] # use matrix below just in case to preserve dimensions. obsstats<-matrix(coalobs[1,4:6],nrow=1) tmp<-selectsumm(obsstats, param, simstats, ssmethod =AS.select, tol =.1, method = "rejection", allow.none = FALSE, inturn = TRUE, hcorr = TRUE) tmp$best
Performs semi-automatic ABC based on summary statistics regression.
semiauto.abc(obs, param, sumstats, obspar=NULL, abcmethod = abc, saprop = 0.5, abcprop = 0.5, overlap = FALSE, satr = list(), plot = FALSE, verbose = TRUE, do.err = FALSE, final.dens = FALSE, errfn = rsse, ...)
semiauto.abc(obs, param, sumstats, obspar=NULL, abcmethod = abc, saprop = 0.5, abcprop = 0.5, overlap = FALSE, satr = list(), plot = FALSE, verbose = TRUE, do.err = FALSE, final.dens = FALSE, errfn = rsse, ...)
obs |
(matrix of) observed summary statistics. |
param |
matrix of simulated model parameter values. |
sumstats |
matrix of simulated summary statistics. |
obspar |
optional observed parameters (for use to assess simulation performance). |
abcmethod |
a function to perform ABC inference, e.g. the |
saprop |
a proportion, denoting the proportion of simulated datasets with which to perform semi-automatic ABC regression. |
abcprop |
a proportion, denoting the proportion of simulated datasets with which to perform ABC using |
overlap |
a boolean value indicating whether the simulated datasets specified by |
satr |
a list of functions indicating transformations of the summary statistics |
plot |
When plot==TRUE, a plot of parameter values against fitted values is produced for each parameter as a side-effect. This is most useful when the number of parameters is reasonably small. |
verbose |
a boolean value indicating whether informative statements should be printed to screen. |
do.err |
a boolean value indicating whether the simulation error should be returned. Note: if |
final.dens |
a boolean value indicating whether the posterior sample should be returned. |
errfn |
an error function to assess ABC inference performance. |
... |
any other optional arguments to the ABC inference procedure (e.g. arguments to the |
This function is essentially a wrapper for saABC
. See the details section of saABC
for more details on the implementation. The argument satr
can be almost anything sensible in function
form, see Examples section for example specifications.
A list with the following components:
err |
simulation error (if |
post.sample |
an array of dimension |
sainfo |
A list with the following information about the semi-automatic ABC run: |
The argument satr
must be supplied with valid functions. Whilst there are checks, these are minimal, since doing sophisticated checks is quite difficult.
Matt Nunes and Dennis Prangle
Blum, M. G. B, Nunes, M. A., Prangle, D. and Sisson, S. A. (2013) A
comparative review of dimension reduction methods in approximate
Bayesian computation. Stat. Sci. 28, Issue 2, 189–208.
Fearnhead, P. and Prangle, D. (2012) Constructing summary statistics for approximate Bayesian computation:
semi-automatic approximate Bayesian
computation. J. R. Stat. Soc. B 74, Part 3, 1–28.
Nunes, M. A. and Prangle, D. (2016) abctools: an R package for tuning
approximate Bayesian computation analyses. The R Journal
7, Issue 2, 189–205.
## Not run: data(coal) data(coalobs) param<-coal[,2] simstats<-coal[,4:6] # use matrix below just in case to preserve dimensions. obsstats<-matrix(coalobs[1,4:6],nrow=1) obsparam<-matrix(coalobs[1,1]) # perform semi-automatic ABC with summary statistics defined by # X, X^2,X^3,X^4: # other alternative specifications for this could be: # list(function(x){ cbind(x,x^2,x^3,x^4) }) # list(as.function(alist(x=,cbind(x,x^2,x^3)))) etc tmp<-semiauto.abc(obsstats, param, simstats,tol=.01,method="rejection", satr=list(function(x){outer(x,Y=1:4,"^")})) tmp$sa.info # both these functions may be problematic: tmp<-semiauto.abc(obsstats, param, simstats,tol=.01,method="rejection", satr=list(unique,sum)) ## End(Not run)
## Not run: data(coal) data(coalobs) param<-coal[,2] simstats<-coal[,4:6] # use matrix below just in case to preserve dimensions. obsstats<-matrix(coalobs[1,4:6],nrow=1) obsparam<-matrix(coalobs[1,1]) # perform semi-automatic ABC with summary statistics defined by # X, X^2,X^3,X^4: # other alternative specifications for this could be: # list(function(x){ cbind(x,x^2,x^3,x^4) }) # list(as.function(alist(x=,cbind(x,x^2,x^3)))) etc tmp<-semiauto.abc(obsstats, param, simstats,tol=.01,method="rejection", satr=list(function(x){outer(x,Y=1:4,"^")})) tmp$sa.info # both these functions may be problematic: tmp<-semiauto.abc(obsstats, param, simstats,tol=.01,method="rejection", satr=list(unique,sum)) ## End(Not run)
Summary statistics selection for ABC inference using estimated posterior error.
stage2(obs, param, sumstats, obspar = NULL, init.best, dsets = 100, sumsubs = 1:ncol(sumstats), limit = length(sumsubs), do.only=NULL, do.err = FALSE, final.dens = FALSE, ...)
stage2(obs, param, sumstats, obspar = NULL, init.best, dsets = 100, sumsubs = 1:ncol(sumstats), limit = length(sumsubs), do.only=NULL, do.err = FALSE, final.dens = FALSE, ...)
obs |
observed summary statistics. |
param |
matrix of simulated model parameter values. |
sumstats |
matrix of simulated summary statistics. |
obspar |
optional observed parameters (for use to assess simulation performance). |
init.best |
an initial estimate of the best summary statistics subset. Can be either an index into the summaries combination table (see |
dsets |
the number of simulated datasets to treat as observed when estimating the posterior error. See details. |
sumsubs |
an optional index into the summary statistics to limit summary selection to a specific subset of summaries. |
limit |
an optional integer indicating whether to limit summary selection to subsets of a maximum size. |
do.only |
an optional index into the summary statistics combination table. Can be used to limit entropy calculations to certain summary statistics subsets only. |
do.err |
a boolean value indicating whether the simulation error should be returned. Note: if |
final.dens |
a boolean value indicating whether the posterior sample should be returned. |
... |
any other optional arguments to the ABC inference procedure (e.g. arguments to the |
The function uses the init.best
set of summaries to determine the dsets
simulated datasets which are closest (in Euclidean norm) to the observed dataset. Since the model parameters generating the summary statistics are known for these simulated datasets, for each candidate subset of summary statistics, we can compute the error under ABC inference for each of these datasets. The best subset of summary statistics is that which minimizes this (average) error over all dsets
datasets.
A list with the following components:
best |
the best subset of statistics. |
closest |
the indices of the |
err |
simulation error (if |
order |
the subsets considered during the algorithm (same as the input |
post.sample |
an array of dimension |
sumsubs |
an index into the subsets considered during the algorithm. |
This function is computationally intensive due to its cyclic ABC inference procedure.
Matt Nunes
Blum, M. G. B, Nunes, M. A., Prangle, D. and Sisson, S. A. (2013) A
comparative review of dimension reduction methods in approximate
Bayesian computation. Stat. Sci. 28, Issue 2, 189–208.
Nunes, M. A. and Balding, D. J. (2010) On Optimal Selection of Summary
Statistics for Approximate Bayesian Computation.
Stat. Appl. Gen. Mol. Biol. 9, Iss. 1, Art. 34.
Nunes, M. A. and Prangle, D. (2016) abctools: an R package for tuning
approximate Bayesian computation analyses. The R Journal
7, Issue 2, 189–205.
# load example data: data(coal) data(coalobs) param<-coal[,2] simstats<-coal[,5:8] # use matrix below just in case to preserve dimensions. obsstats<-matrix(coalobs[1,5:8],nrow=1) obsparam<-matrix(coalobs[1,1]) ## Not run: tmp<-stage2(obsstats, param, simstats, init.bes=c(1,3), dsets = 10) tmp$best ## End(Not run)
# load example data: data(coal) data(coalobs) param<-coal[,2] simstats<-coal[,5:8] # use matrix below just in case to preserve dimensions. obsstats<-matrix(coalobs[1,5:8],nrow=1) obsparam<-matrix(coalobs[1,1]) ## Not run: tmp<-stage2(obsstats, param, simstats, init.bes=c(1,3), dsets = 10) tmp$best ## End(Not run)