Title: | EBGM Disproportionality Scores for Adverse Event Data Mining |
---|---|
Description: | An implementation of DuMouchel's (1999) <doi:10.1080/00031305.1999.10474456> Bayesian data mining method for the market basket problem. Calculates Empirical Bayes Geometric Mean (EBGM) and posterior quantile scores using the Gamma-Poisson Shrinker (GPS) model to find unusually large cell counts in large, sparse contingency tables. Can be used to find unusually high reporting rates of adverse events associated with products. In general, can be used to mine any database where the co-occurrence of two variables or items is of interest. Also calculates relative and proportional reporting ratios. Builds on the work of the 'PhViD' package, from which much of the code is derived. Some of the added features include stratification to adjust for confounding variables and data squashing to improve computational efficiency. Includes an implementation of the EM algorithm for hyperparameter estimation loosely derived from the 'mederrRank' package. |
Authors: | John Ihrie [cre, aut], Travis Canida [aut], Ismaïl Ahmed [ctb] (author of 'PhViD' package (derived code)), Antoine Poncet [ctb] (author of 'PhViD'), Sergio Venturini [ctb] (author of 'mederrRank' package (derived code)), Jessica Myers [ctb] (author of 'mederrRank') |
Maintainer: | John Ihrie <[email protected]> |
License: | GPL-2 | GPL-3 |
Version: | 0.9.1 |
Built: | 2024-10-25 06:42:05 UTC |
Source: | CRAN |
autoHyper finds a single hyperparameter estimate using an algorithm that evaluates results from multiple starting points (see exploreHypers). The algorithm verifies that the optimization converges within the bounds of the parameter space and that the chosen estimate (smallest negative log-likelihood) is similar to at least one (see the min_conv argument) of the other convergent solutions.
autoHyper( data, theta_init, squashed = TRUE, zeroes = FALSE, N_star = 1, tol = c(0.05, 0.05, 0.2, 0.2, 0.025), min_conv = 1, param_limit = 100, max_pts = 20000, conf_ints = FALSE, conf_level = c("95", "80", "90", "99") )
data |
A data frame from processRaw() or squashData() containing columns named N, E, and (if squashed) weight. |
theta_init |
A data frame of initial hyperparameter guesses with columns ordered as: alpha1, beta1, alpha2, beta2, P. |
squashed |
A scalar logical (TRUE or FALSE) indicating whether or not data squashing was used. |
zeroes |
A scalar logical specifying if zero counts are included. |
N_star |
A positive scalar whole number value for the minimum count size to be used for hyperparameter estimation. If zeroes are used, set N_star to NULL. |
tol |
A numeric vector of tolerances for determining how close the chosen estimate must be to at least min_conv other convergent solutions. Order is alpha1, beta1, alpha2, beta2, P. |
min_conv |
A scalar positive whole number for defining the minimum number of convergent solutions that must be close to the convergent solution with the smallest negative log-likelihood. Must be at least one and at most one less than the number of rows in theta_init. |
param_limit |
A scalar numeric value for the largest acceptable value for the alpha and beta estimates. |
max_pts |
A scalar whole number for the largest number of data points allowed. Used to help prevent extremely long run times. |
conf_ints |
A scalar logical indicating if confidence intervals and standard errors should be returned. |
conf_level |
A scalar string for the confidence level used if confidence intervals are requested. |
The algorithm first attempts to find a consistently convergent solution using nlminb. If it fails, it will next try nlm. If that also fails, it will try optim (method = "BFGS"). If all three approaches fail, the function returns an error message.
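To make the fallback order concrete, here is a conceptual sketch (not autoHyper's internal code) that hand-rolls the same sequence on squashed data, assuming the default likelihood negLLsquash applies:

# Conceptual sketch only: try nlminb first, then nlm, then optim(method = "BFGS")
# on the squashed negative log-likelihood.
library(openEBGM)
data(caers)
proc     <- processRaw(caers)
squashed <- squashData(proc, bin_size = 300, keep_pts = 10)
start    <- c(1, 1, 3, 3, .2)   #alpha1, beta1, alpha2, beta2, P
fit <- tryCatch(
  stats::nlminb(start, negLLsquash, ni = squashed$N, ei = squashed$E,
                wi = squashed$weight)$par,
  error = function(e) tryCatch(
    stats::nlm(negLLsquash, start, ni = squashed$N, ei = squashed$E,
               wi = squashed$weight)$estimate,
    error = function(e) stats::optim(start, negLLsquash, ni = squashed$N,
                                     ei = squashed$E, wi = squashed$weight,
                                     method = "BFGS")$par
  )
)
fit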
Since this function runs multiple optimization procedures, it is best to start with five or fewer initial starting points (rows in theta_init). If the function runs in a reasonable amount of time, this number can be increased.
This function should not be used with very large data sets since each optimization call will take a long time. squashData can be used first to reduce the size of the data.
It is recommended to use N_star = 1 when practical. Data squashing (see squashData) can be used to further reduce the number of data points.
Asymptotic normal confidence intervals, if requested, use standard errors calculated from the observed Fisher information matrix as discussed in DuMouchel (1999).
A list containing the following elements:
method: A scalar character string for the method used to find the hyperparameter estimate (possibilities are "nlminb", "nlm", and "bfgs").
estimates: A named numeric vector of length 5 for the hyperparameter estimate corresponding to the smallest negative log-likelihood.
conf_int: A data frame including the standard errors and confidence limits. Only included if conf_ints = TRUE.
num_close: A scalar integer for the number of other convergent solutions that were close (within tolerance) to the chosen estimate.
theta_hats: A data frame for the estimates corresponding to the initial starting points defined by theta_init. See exploreHypers.
DuMouchel W (1999). "Bayesian Data Mining in Large Frequency Tables, With an Application to the FDA Spontaneous Reporting System." The American Statistician, 53(3), 177-190.
nlminb, nlm, and optim for optimization details
squashData for data preparation
Other hyperparameter estimation functions: exploreHypers(), hyperEM()
data.table::setDTthreads(2) #only needed for CRAN checks

#Start with 2 or more guesses
theta_init <- data.frame(
  alpha1 = c(0.5, 1),
  beta1 = c(0.5, 1),
  alpha2 = c(2, 3),
  beta2 = c(2, 3),
  p = c(0.1, 0.2)
)

data(caers)
proc <- processRaw(caers)
squashed <- squashData(proc, bin_size = 300, keep_pts = 10)
squashed <- squashData(squashed, count = 2, bin_size = 13, keep_pts = 10)

suppressWarnings(
  hypers <- autoHyper(squashed, theta_init = theta_init)
)
print(hypers)
autoSquash squashes data by calling squashData once for each count (N), removing the need to repeatedly squash the same data set.
autoSquash( data, keep_pts = c(100, 75, 50, 25), cut_offs = c(500, 1000, 10000, 1e+05, 5e+05, 1e+06, 5e+06), num_super_pts = c(50, 75, 150, 500, 750, 1000, 2000, 5000) )
data |
A data frame (typically from processRaw()) containing columns named N, E, and (possibly) weight. |
keep_pts |
A vector of whole numbers for the number of points to leave unsquashed for each count (N). See the 'Details' section. |
cut_offs |
A vector of whole numbers for the cutoff values of unsquashed data used to determine how many "super points" to end up with after squashing each count (N). See the 'Details' section. |
num_super_pts |
A vector of whole numbers for the number of "super points" to end up with after squashing each count (N). Length must be 1 more than the length of cut_offs. See the 'Details' section. |
See squashData for details on squashing a given count (N).
The elements in keep_pts determine how many points are left unsquashed for each count (N). The first element in keep_pts is used for the smallest N (usually 1). Each successive element is used for each successive N. Once the last element is reached, it is used for all remaining N.
For counts that are squashed, cut_offs and num_super_pts determine how the points are squashed. For instance, by default, if a given N contains fewer than 500 points to be squashed, those points are squashed to 50 "super points".
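As a small illustration of how the defaults pair up (the boundary handling below is an assumption for illustration only; the function itself is authoritative):

#Illustrative sketch only: which num_super_pts entry applies to a count (N)
#that has n_pts points to squash, using the default arguments.
cut_offs      <- c(500, 1000, 10000, 1e+05, 5e+05, 1e+06, 5e+06)
num_super_pts <- c(50, 75, 150, 500, 750, 1000, 2000, 5000)
n_pts <- 2500   #hypothetical number of points to squash for some N
num_super_pts[findInterval(n_pts, cut_offs) + 1]   #150 super points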
A data frame with column names N, E, and weight containing the reduced data set.
DuMouchel W, Pregibon D (2001). "Empirical Bayes Screening for Multi-item Associations." In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '01, pp. 67-76. ACM, New York, NY, USA. ISBN 1-58113-391-X.
processRaw for data preparation and squashData for squashing individual counts
data.table::setDTthreads(2) #only needed for CRAN checks

data(caers)
proc <- processRaw(caers)
table(proc$N)

squash1 <- autoSquash(proc)
ftable(squash1[, c("N", "weight")])

## Not run: squash2 <- autoSquash(proc, keep_pts = c(50, 5))
## Not run: ftable(squash2[, c("N", "weight")])

## Not run: 
squash3 <- autoSquash(proc, keep_pts = 100,
                      cut_offs = c(250, 500),
                      num_super_pts = c(20, 60, 125))
## End(Not run)
## Not run: ftable(squash3[, c("N", "weight")])
A dataset of dietary supplement adverse event reports from 2012, containing CAERS product and adverse event reports as reported to the FDA. This dataset contains only products reported to be dietary supplements (industry code 54) in the year 2012 and includes 2874 unique product names and 1328 unique adverse events. There are a total of 3356 unique reports. There is also one stratification variable, indicating whether the patient is male or female.
caers
A data frame with 20156 rows and 4 variables:
id: Identification number
var1: Name of the product
var2: Name of the symptom/event category
strat1: Gender of the patient associated with report
Further details about the data can be found using the links below.
CFSAN Adverse Event Reporting System (FDA Center for Food Safety and Nutrition)
https://www.fda.gov/food/compliance-enforcement-food
https://www.fda.gov/media/97035/download
A small subset of raw, publicly available CAERS data used to demonstrate how to prepare data for use by openEBGM's functions.
caers_raw
A data frame with 117 rows and 6 variables:
RA_Report..: CAERS report identification number.
PRI_Reported.Brand.Product.Name: The verbatim brand and/or product names indicated to have been used by the consumer reported to have experienced the adverse event.
CI_Age.at.Adverse.Event: The age of the consumer reported to have experienced the adverse event, in units specified by CI_Age.Unit.
CI_Age.Unit: The time unit (day, week, month, year) of the age provided in the CI_Age.at.Adverse.Event data field for the consumer reported to have experienced the adverse event.
CI_Gender: The sex of the individual reported to have experienced the adverse event.
SYM_One.Row.Coded.Symptoms: The symptom(s) experienced by the injured consumer as specified by the reporter and coded by FDA according to the Medical Dictionary for Regulatory Activities (MedDRA).
The column names appear exactly as they would if you had used read.csv() to pull the data directly from the website below.
Further details about the data can be found using the links below.
CFSAN Adverse Event Reporting System (FDA Center for Food Safety and Nutrition)
https://www.fda.gov/food/compliance-enforcement-food
https://www.fda.gov/media/97035/download
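As a rough preparation sketch (not taken from the package documentation), the raw columns can be renamed into the id/var1/var2 layout expected by processRaw. The ", " separator used to split multiple symptoms is an assumption and should be checked against the raw data:

library(openEBGM)
data(caers_raw)

dat <- data.frame(
  id   = caers_raw$RA_Report..,
  var1 = caers_raw$PRI_Reported.Brand.Product.Name,
  var2 = caers_raw$SYM_One.Row.Coded.Symptoms,
  stringsAsFactors = FALSE
)

#Split multi-symptom rows into one row per symptom (assumed ", " separator).
syms <- strsplit(dat$var2, ", ", fixed = TRUE)
dat <- data.frame(
  id   = rep(dat$id, lengths(syms)),
  var1 = rep(dat$var1, lengths(syms)),
  var2 = unlist(syms),
  stringsAsFactors = FALSE
)

proc <- processRaw(dat)
head(proc)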
ebgm calculates the Empirical Bayes Geometric Mean (EBGM), which is 'the geometric mean of the empirical Bayes posterior distribution of the "true" RR' (DuMouchel 1999, Eq. 11). The EBGM is essentially a version of the relative reporting ratio (RR) that uses Bayesian shrinkage.
ebgm(theta_hat, N, E, qn, digits = 2)
theta_hat |
A numeric vector of hyperparameter estimates (likely from autoHyper() or from directly minimizing negLLsquash()) ordered as: alpha1, beta1, alpha2, beta2, P. |
N |
A whole number vector of actual counts from processRaw(). |
E |
A numeric vector of expected counts from processRaw(). |
qn |
A numeric vector of posterior probabilities that lambda came from the first component of the mixture, given N = n (i.e., the posterior mixture fraction). See Qn(). |
digits |
A scalar whole number that determines the number of decimal places used when rounding the results. |
The hyperparameter estimates (theta_hat) are:
alpha1, beta1: Parameter estimates of the first component of the prior distribution
alpha2, beta2: Parameter estimates of the second component
P: Mixture fraction estimate of the prior distribution
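For intuition, the EBGM can be computed by hand from the posterior distribution, which under the GPS model is a mixture of Gamma(alpha1 + N, beta1 + E) and Gamma(alpha2 + N, beta2 + E) with mixing fraction qn. The sketch below is illustrative only and is not a substitute for ebgm():

#By-hand sketch of the geometric mean of the posterior (illustrative only).
#E[log(lambda)] for a Gamma(shape, rate) is digamma(shape) - log(rate).
ebgm_by_hand <- function(theta_hat, N, E, qn) {
  a1 <- theta_hat[1]; b1 <- theta_hat[2]
  a2 <- theta_hat[3]; b2 <- theta_hat[4]
  mean_log <- qn * (digamma(a1 + N) - log(b1 + E)) +
    (1 - qn) * (digamma(a2 + N) - log(b2 + E))
  exp(mean_log)
}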
A numeric vector of EBGM scores.
DuMouchel W (1999). "Bayesian Data Mining in Large Frequency Tables, With an Application to the FDA Spontaneous Reporting System." The American Statistician, 53(3), 177-190.
autoHyper, exploreHypers, negLLsquash, negLL, negLLzero, and negLLzeroSquash for hyperparameter estimation.
processRaw for finding counts.
Qn for finding mixture fractions.
Other posterior distribution functions: Qn(), quantBisect()
data.table::setDTthreads(2) #only needed for CRAN checks

theta_init <- data.frame(
  alpha1 = c(0.5, 1),
  beta1 = c(0.5, 1),
  alpha2 = c(2, 3),
  beta2 = c(2, 3),
  p = c(0.1, 0.2)
)

data(caers)
proc <- processRaw(caers)
squashed <- squashData(proc, bin_size = 300, keep_pts = 10)
squashed <- squashData(squashed, count = 2, bin_size = 13, keep_pts = 10)
theta_hat <- autoHyper(data = squashed, theta_init = theta_init)$estimates

qn <- Qn(theta_hat, N = proc$N, E = proc$E)
proc$EBGM <- ebgm(theta_hat, N = proc$N, E = proc$E, qn = qn)
head(proc)
ebScores
calculates EBGM scores as well as the quantiles from the
posterior distribution and returns an object of class openEBGM.
ebScores(processed, hyper_estimate, quantiles = c(5, 95), digits = 2)
processed |
A data frame resulting from running processRaw(). |
hyper_estimate |
A list resulting from running autoHyper(). |
quantiles |
Either a numeric vector of desired quantiles to be calculated from the posterior distribution or NULL for no calculation of quantiles. |
digits |
A whole number scalar specifying how many decimal places to use for rounding EBGM and the quantile scores. |
This function takes the processed data as well as the hyperparameter estimates and instantiates an object of class openEBGM. This object then contains additional calculations, such as the EBGM score and the quantiles supplied by the quantiles parameter at the time the function is called.
The function allows for the choice of an arbitrary number of quantiles, or no quantiles at all, to be calculated. This may be helpful for large datasets, or when the EBGM score is the only metric of interest.
An openEBGM object (class S3) containing:
data: A data frame containing the results (scores, etc.).
hyper_parameters: A list containing the hyperparameter estimation results.
quantiles: The chosen percentiles.
data.table::setDTthreads(2) #only needed for CRAN checks

theta_init <- data.frame(
  alpha1 = c(0.5, 1),
  beta1 = c(0.5, 1),
  alpha2 = c(2, 3),
  beta2 = c(2, 3),
  p = c(0.1, 0.2)
)

data(caers)
proc <- processRaw(caers)
squashed <- squashData(proc, bin_size = 300, keep_pts = 10)
squashed <- squashData(squashed, count = 2, bin_size = 13, keep_pts = 10)

suppressWarnings(
  hypers <- autoHyper(data = squashed, theta_init = theta_init)
)

obj <- ebScores(processed = proc, hyper_estimate = hypers, quantiles = 5)
print(obj)
exploreHypers
finds hyperparameter estimates using a variety of
starting points to examine the consistency of the optimization procedure.
exploreHypers( data, theta_init, squashed = TRUE, zeroes = FALSE, N_star = 1, method = c("nlminb", "nlm", "bfgs"), param_limit = 100, max_pts = 20000, std_errors = FALSE )
data |
A data frame from processRaw() or squashData() containing columns named N, E, and (if squashed) weight. |
theta_init |
A data frame of initial hyperparameter guesses with columns ordered as: alpha1, beta1, alpha2, beta2, P. |
squashed |
A scalar logical (TRUE or FALSE) indicating whether or not data squashing was used. |
zeroes |
A scalar logical specifying if zero counts are included. |
N_star |
A positive scalar whole number value for the minimum count size to be used for hyperparameter estimation. If zeroes are used, set N_star to NULL. |
method |
A scalar string indicating which optimization procedure is to be used. Choices are "nlminb", "nlm", or "bfgs". |
param_limit |
A scalar numeric value for the largest acceptable value for the alpha and beta estimates. |
max_pts |
A scalar whole number for the largest number of data points allowed. Used to help prevent extremely long run times. |
std_errors |
A scalar logical indicating if standard errors should be returned for the hyperparameter estimates. |
The method argument determines which optimization procedure is used. All of the options use functions from the stats package: "nlminb" uses nlminb(), "nlm" uses nlm(), and "bfgs" uses optim() with method = "BFGS".
Since this function runs multiple optimization procedures, it is best to start with five or fewer initial starting points (rows in theta_init). If the function runs in a reasonable amount of time, this number can be increased.
This function should not be used with very large data sets unless data squashing is used first, since each optimization call will take a long time.
It is recommended to use N_star = 1 when practical. Data squashing (see squashData) can be used to reduce the number of data points.
The converge column in the resulting data frame was determined by examining the convergence code of the chosen optimization method. In some instances, the code is somewhat ambiguous. The determination of converge was intended to be conservative (leaning towards FALSE when questionable). See the documentation for the chosen method for details about its convergence codes.
Standard errors, if requested, are calculated using the observed Fisher information matrix as discussed in DuMouchel (1999).
A list including the data frame estimates of hyperparameter estimates corresponding to the initial guesses from theta_init (plus convergence results):
code: The convergence code returned by the chosen optimization function (see nlminb, nlm, and optim for details).
converge: A logical indicating whether or not convergence was reached. See the "Details" section for more information.
in_bounds: A logical indicating whether or not the estimates were within the bounds of the parameter space (the upper bound for the alpha and beta estimates was determined by the param_limit argument).
minimum: The negative log-likelihood value corresponding to the estimated optimal value of the hyperparameter.
Also returns the data frame std_errs if standard errors are requested.
Make sure to properly specify the squashed, zeroes, and N_star arguments for your data set, since these will determine the appropriate likelihood function. Also, this function will not filter out data points. For instance, if you use N_star = 2, you must filter out the ones and zeroes (if present) from data prior to using this function (see the filtering sketch below).
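For example, a simple filter (assuming squashed and theta_init are built as in the example at the end of this section) drops the counts excluded by N_star = 2 before estimation:

#Keep only counts of 2 or more before estimating with N_star = 2.
squashed2 <- squashed[squashed$N >= 2, ]
exploreHypers(squashed2, theta_init = theta_init, N_star = 2)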
DuMouchel W (1999). "Bayesian Data Mining in Large Frequency Tables, With an Application to the FDA Spontaneous Reporting System." The American Statistician, 53(3), 177-190.
nlminb, nlm, and optim for optimization details
squashData for data preparation
Other hyperparameter estimation functions: autoHyper(), hyperEM()
data.table::setDTthreads(2) #only needed for CRAN checks

#Start with 2 or more guesses
theta_init <- data.frame(
  alpha1 = c(0.5, 1),
  beta1 = c(0.5, 1),
  alpha2 = c(2, 3),
  beta2 = c(2, 3),
  p = c(0.1, 0.2)
)

data(caers)
proc <- processRaw(caers)
squashed <- squashData(proc, bin_size = 300, keep_pts = 10)
squashed <- squashData(squashed, count = 2, bin_size = 13, keep_pts = 10)

suppressWarnings(
  exploreHypers(squashed, theta_init = theta_init)
)
hyperEM finds hyperparameter estimates using a variation on the Expectation-Maximization (EM) algorithm known as the Expectation/Conditional Maximization (ECM) algorithm (Meng and Rubin, 1993). The algorithm estimates each element of the hyperparameter vector, theta, while holding fixed (conditioning on) the other parameters in the vector. Alternatively, it can estimate both parameters for each distribution in the mixture while holding the parameters from the other distribution and the mixing fraction fixed.
hyperEM( data, theta_init_vec, squashed = TRUE, zeroes = FALSE, N_star = 1, method = c("score", "nlminb"), profile = c("parameter", "distribution"), LL_tol = 1e-04, consecutive = 100, param_lower = 1e-05, param_upper = 20, print_level = 2, max_iter = 5000, conf_ints = FALSE, conf_level = c("95", "80", "90", "99"), track = FALSE )
data |
A data frame from processRaw() or squashData() containing columns named N, E, and (if squashed) weight. |
theta_init_vec |
A numeric vector of initial hyperparameter guesses ordered as: alpha1, beta1, alpha2, beta2, P. |
squashed |
A scalar logical (TRUE or FALSE) indicating whether or not data squashing was used. |
zeroes |
A scalar logical specifying if zero counts are included. |
N_star |
A positive scalar whole number value for the minimum count size to be used for hyperparameter estimation. If zeroes are used, set N_star to NULL. |
method |
A scalar string indicating which method (i.e., score functions or log-likelihood function) to use for the maximization steps. Possible values are "score" and "nlminb". |
profile |
A scalar string indicating which approach to use to optimize the log-likelihood function if method = "nlminb" (ignored if method = "score"): optimize one parameter at a time ("parameter") or one distribution at a time ("distribution"). |
LL_tol |
A scalar numeric value for the tolerance used for determining when the change in log-likelihood at each iteration is sufficiently small. Used for convergence assessment. |
consecutive |
A positive scalar whole number value for the number of consecutive iterations the change in log-likelihood must be below LL_tol in order to reach convergence. |
param_lower |
A scalar numeric value for the smallest acceptable value for each alpha and beta estimate. |
param_upper |
A scalar numeric value for the largest acceptable value for each alpha and beta estimate. |
print_level |
A value that determines how much information is printed during execution; larger values print more detail. |
max_iter |
A positive scalar whole number value for the maximum number of iterations to use. |
conf_ints |
A scalar logical indicating if confidence intervals and standard errors should be returned. |
conf_level |
A scalar string for the confidence level used if confidence intervals are requested. |
track |
A scalar logical indicating whether or not to retain the hyperparameter estimates and log-likelihood value at each iteration. Can be used for plotting to better understand the algorithm's behavior. |
If method = "score", the maximization step finds a root of the score function. If method = "nlminb", nlminb is used to find a minimum of the negative log-likelihood function.
If method = "score" and zeroes = FALSE, then N_star must equal 1.
If method = "score", the model is reparameterized: the parameters are transformed so that the parameter space includes all real numbers (the alpha and beta parameters, which must be positive, and the mixing fraction P, which must lie between 0 and 1, are each mapped onto the real line). This approach addresses numerical issues at the edge of the parameter space. The values returned in estimates are on the original scale (back-transformed).
On every 100th iteration, the procedure described in Millar (2011) is used to accelerate the estimate of theta.
The score vector and its Euclidean norm should be close to zero at a local maximum and can therefore be used to help assess the reliability of the results. A local maximum might not be the global MLE, however.
Asymptotic normal confidence intervals, if requested, use standard errors calculated from the observed Fisher information matrix as discussed in DuMouchel (1999).
A list including the following:
estimates: The maximum likelihood estimate (MLE) of the hyperparameter, theta.
conf_int: A data frame including the standard errors and confidence limits for estimates. Only included if conf_ints = TRUE.
maximum: The log-likelihood function evaluated at estimates.
method: The method used in the maximization step.
elapsed: The elapsed function execution time in seconds.
iters: The number of iterations used.
score: The score functions (i.e., the score vector) evaluated at estimates. All elements should be close to zero.
score_norm: The Euclidean norm of the score vector evaluated at estimates. Should be close to zero.
tracking: The estimates of theta at each iteration and the log-likelihood function evaluated at those estimates. Unless track = TRUE, only the starting point of the algorithm is shown.
DuMouchel W (1999). "Bayesian Data Mining in Large Frequency Tables, With an Application to the FDA Spontaneous Reporting System." The American Statistician, 53(3), 177-190.
Meng X-L, Rubin D (1993). "Maximum likelihood estimation via the ECM algorithm: A general framework", Biometrika, 80(2), 267-278.
Millar, Russell B (2011). "Maximum Likelihood Estimation and Inference", John Wiley & Sons, Ltd, 111-112.
uniroot for finding a zero of the score function and nlminb for minimizing the negative log-likelihood function
Other hyperparameter estimation functions: autoHyper(), exploreHypers()
data.table::setDTthreads(2) #only needed for CRAN checks

data(caers)
proc <- processRaw(caers)
squashed <- squashData(proc, bin_size = 300, keep_pts = 10)
squashed <- squashData(squashed, count = 2, bin_size = 13, keep_pts = 10)

hypers <- hyperEM(squashed, theta_init_vec = c(1, 1, 2, 2, .3),
                  consecutive = 10, LL_tol = 1e-03)
negLL computes the negative log-likelihood based on the conditional marginal distribution of the counts, N, given that N >= N*, where N* is the smallest count used for estimating the hyperparameters (DuMouchel et al. 2001). This function is minimized to estimate the hyperparameters of the prior distribution. Use this function when neither zero counts nor data squashing are being used. Generally this function is not recommended unless using a small data set, since data squashing (see squashData and negLLsquash) can increase efficiency (DuMouchel et al. 2001).
negLL(theta, N, E, N_star = 1)
theta |
A numeric vector of hyperparameters ordered as: alpha1, beta1, alpha2, beta2, P. |
N |
A whole number vector of actual counts from processRaw(). |
E |
A numeric vector of expected counts from processRaw(). |
N_star |
A scalar whole number for the minimum count size used. |
The conditional marginal distribution for the counts, N, given that N >= N*, is based on a mixture of two negative binomial distributions. The hyperparameters for the prior distribution (mixture of gammas) are estimated by optimizing the likelihood equation from this conditional marginal distribution. It is recommended to use N_star = 1 when practical.
The hyperparameters are:
alpha1, beta1: Parameters of the first component of the marginal distribution of the counts (also the prior distribution)
alpha2, beta2: Parameters of the second component
P: Mixture fraction
This function will not need to be called directly if using exploreHypers or autoHyper.
A scalar negative log-likelihood value
Make sure N_star matches the smallest actual count in N before using this function. Filter N and E if needed.
Make sure the data were not squashed before using this function.
DuMouchel W, Pregibon D (2001). "Empirical Bayes Screening for Multi-item Associations." In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '01, pp. 67-76. ACM, New York, NY, USA. ISBN 1-58113-391-X.
nlm, nlminb, and optim for optimization
Other negative log-likelihood functions: negLLsquash(), negLLzeroSquash(), negLLzero()
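A minimal usage sketch (unsquashed counts, no zero counts, N_star = 1):

library(openEBGM)
data(caers)
proc <- processRaw(caers)
theta_init <- c(1, 1, 3, 3, .2)   #alpha1, beta1, alpha2, beta2, P
negLL(theta = theta_init, N = proc$N, E = proc$E, N_star = 1)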
negLLsquash
computes the negative log-likelihood based on the
conditional marginal distribution of the counts, N, given that
N >= N*, where N* is the smallest count used for estimating the
hyperparameters. This function is minimized to estimate the hyperparameters
of the prior distribution. Use this function when zero counts are not used
and data squashing is used as described by DuMouchel et al. (2001). This
function is the likelihood function that should usually be chosen.
negLLsquash(theta, ni, ei, wi, N_star = 1)
theta |
A numeric vector of hyperparameters ordered as: alpha1, beta1, alpha2, beta2, P. |
ni |
A whole number vector of squashed actual counts from squashData(). |
ei |
A numeric vector of squashed expected counts from squashData(). |
wi |
A whole number vector of bin weights from squashData(). |
N_star |
A scalar whole number for the minimum count size used. |
The conditional marginal distribution for the counts, N, given that N >= N*, is based on a mixture of two negative binomial distributions. The hyperparameters for the prior distribution (mixture of gammas) are estimated by optimizing the likelihood equation from this conditional marginal distribution. It is recommended to use N_star = 1 when practical.
The hyperparameters are:
alpha1, beta1: Parameters of the first component of the marginal distribution of the counts (also the prior distribution)
alpha2, beta2: Parameters of the second component
P: Mixture fraction
This function will not need to be called directly if using exploreHypers or autoHyper.
A scalar negative log-likelihood value
Make sure N_star matches the smallest actual count in ni before using this function. Filter ni, ei, and wi if needed.
Make sure the data were actually squashed (see squashData) before using this function.
DuMouchel W, Pregibon D (2001). "Empirical Bayes Screening for Multi-item Associations." In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '01, pp. 67-76. ACM, New York, NY, USA. ISBN 1-58113-391-X.
nlm, nlminb, and optim for optimization and squashData for data squashing
Other negative log-likelihood functions: negLLzeroSquash(), negLLzero(), negLL()
data.table::setDTthreads(2) #only needed for CRAN checks

theta_init <- c(1, 1, 3, 3, .2)  #initial guess

data(caers)
proc <- processRaw(caers)
squashed <- squashData(proc, bin_size = 300, keep_pts = 10)
squashed <- squashData(squashed, count = 2, bin_size = 13, keep_pts = 10)

negLLsquash(theta = theta_init, ni = squashed$N, ei = squashed$E,
            wi = squashed$weight)

#For hyperparameter estimation...
stats::nlminb(start = theta_init, objective = negLLsquash,
              ni = squashed$N, ei = squashed$E, wi = squashed$weight)
negLLzero computes the negative log-likelihood based on the unconditional marginal distribution of N (equation 12 in DuMouchel 1999, except taking the negative natural log). This function is minimized to estimate the hyperparameters of the prior distribution. Use this function if including zero counts but not squashing data. Generally this function is not recommended (negLLsquash is typically more efficient).
negLLzero(theta, N, E)
theta |
A numeric vector of hyperparameters ordered as: alpha1, beta1, alpha2, beta2, P. |
N |
A whole number vector of actual counts from processRaw(). |
E |
A numeric vector of expected counts from processRaw(). |
The marginal distribution of the counts, N, is a mixture of two negative binomial distributions. The hyperparameters for the prior distribution (mixture of gammas) are estimated by optimizing the likelihood equation from this marginal distribution.
The hyperparameters are:
alpha1, beta1: Parameters of the first component of the marginal distribution of the counts (also the prior distribution)
alpha2, beta2: Parameters of the second component
P: Mixture fraction
This function will not need to be called directly if using exploreHypers or autoHyper.
A scalar negative log-likelihood value.
Make sure N actually contains zeroes before using this function. You should have used the zeroes = TRUE option when calling the processRaw function.
Make sure the data were not squashed before using this function.
DuMouchel W (1999). "Bayesian Data Mining in Large Frequency Tables, With an Application to the FDA Spontaneous Reporting System." The American Statistician, 53(3), 177-190.
nlm, nlminb, and optim for optimization
Other negative log-likelihood functions: negLLsquash(), negLLzeroSquash(), negLL()
negLLzeroSquash computes the negative log-likelihood based on the unconditional marginal distribution of N (DuMouchel et al. 2001). This function is minimized to estimate the hyperparameters of the prior distribution. Use this function if including zero counts and using data squashing. Generally this function is not recommended unless convergence issues occur without zero counts (negLLsquash is typically more efficient).
negLLzeroSquash(theta, ni, ei, wi)
theta |
A numeric vector of hyperparameters ordered as: alpha1, beta1, alpha2, beta2, P. |
ni |
A whole number vector of squashed actual counts from squashData(). |
ei |
A numeric vector of squashed expected counts from squashData(). |
wi |
A whole number vector of bin weights from squashData(). |
The marginal distribution of the counts, N, is a mixture of two negative binomial distributions. The hyperparameters for the prior distribution (mixture of gammas) are estimated by optimizing the likelihood equation from this marginal distribution.
The hyperparameters are:
alpha1, beta1: Parameters of the first component of the marginal distribution of the counts (also the prior distribution)
alpha2, beta2: Parameters of the second component
P: Mixture fraction
This function will not need to be called directly if using exploreHypers or autoHyper.
A scalar negative log-likelihood value.
Make sure ni actually contains zeroes before using this function. You should have used the zeroes = TRUE option when calling the processRaw function.
Make sure the data were actually squashed (see squashData) before using this function.
DuMouchel W, Pregibon D (2001). "Empirical Bayes Screening for Multi-item Associations." In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '01, pp. 67-76. ACM, New York, NY, USA. ISBN 1-58113-391-X.
nlm, nlminb, and optim for optimization and squashData for data squashing
Other negative log-likelihood functions: negLLsquash(), negLLzero(), negLL()
Plot an openEBGM object
## S3 method for class 'openEBGM'
plot(x, y = NULL, event = NULL, plot.type = "bar", ...)
x |
An openEBGM object constructed by ebScores(). |
y |
Unused parameter to satisfy generic function requirement. |
event |
An (optional) specification of an event to subset the data by. |
plot.type |
A character vector specifying which type of plot should be output. See the 'Details' section. |
... |
Arguments to be passed to methods. |
There are three different types of plots that the plot function may produce when called on an openEBGM object: "bar", "shrinkage", and "histogram".
A bar chart displays the top ten product-symptom EBGM scores, as well as error bars which display the highest and lowest of the quantiles chosen at the time of instantiating the openEBGM object. A shrinkage plot plots EBGM score on the y axis, and the natural log of the RR on the x axis. This plot is also called a squid plot and was conceived by Stuart Chirtel. Finally, a histogram simply displays a histogram of the EBGM scores.
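A brief usage sketch (assuming obj is an openEBGM object created by ebScores(), as in the examples elsewhere on this page; the event value shown is hypothetical):

plot(obj)                                        #bar chart of the top EBGM scores
plot(obj, plot.type = "shrinkage")               #EBGM vs. log(RR) "squid" plot
plot(obj, plot.type = "histogram")               #histogram of the EBGM scores
plot(obj, event = "CHOKING", plot.type = "bar")  #hypothetical event subset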
Print an openEBGM object
## S3 method for class 'openEBGM'
print(x, threshold = 2, ...)
x |
An openEBGM object constructed by ebScores(). |
threshold |
A numeric value indicating the minimum threshold for QUANT or EBGM values to display. |
... |
Arguments to be passed to other methods. |
processRaw
finds the actual and expected counts using the methodology
described by DuMouchel (1999); however, an adjustment is applied to expected
counts to prevent double-counting (i.e., using unique marginal ID counts
instead of contingency table marginal totals). Also calculates the relative
reporting ratio (RR) and the proportional reporting ratio
(PRR).
processRaw( data, stratify = FALSE, zeroes = FALSE, digits = 2, max_cats = 10, list_ids = FALSE )
data |
A data frame containing columns named: id, var1, and var2. Possibly includes columns for stratification variables identified by the substring 'strat' (e.g. strat_age). Other columns will be ignored. |
stratify |
A logical scalar (TRUE or FALSE) specifying if stratification is to be used. |
zeroes |
A logical scalar specifying if zero counts should be included. Using zero counts is only recommended for small data sets because it will dramatically increase run time. |
digits |
A whole number scalar specifying how many decimal places to use for rounding RR and PRR. |
max_cats |
A whole number scalar specifying the maximum number of categories to allow in any given stratification variable. Used to help prevent a situation where the user forgets to categorize a continuous variable, such as age. |
list_ids |
A logical scalar specifying if a column for pipe-concatenated IDs should be returned. |
An id column must be included in data. If your data set does not include IDs, make a column of unique IDs using df$id <- 1:nrow(df). However, unique IDs should only be constructed if the cells in the contingency table are mutually exclusive. For instance, unique IDs for each row in data are not appropriate with CAERS data since a given report can include multiple products and/or adverse events.
Stratification variables are identified by searching for the substring 'strat'. Only variables containing 'strat' (case sensitive) will be used as stratification variables. PRR calculations ignore stratification, but E and RR calculations are affected. A warning will be displayed if any stratum contains less than 50 unique IDs.
If a PRR calculation results in division by zero, Inf is returned.
A data frame with actual counts (N), expected counts (E), relative reporting ratio (RR), and proportional reporting ratio (PRR) for var1-var2 pairs. Also includes a column for IDs (ids) if list_ids = TRUE.
Use of the zeroes = TRUE option will result in a considerable increase in runtime. Using zero counts is not recommended if the contingency table is moderate or large in size (~500K cells or larger). However, using zeroes could be useful if the optimization algorithm fails to converge when estimating hyperparameters.
Any columns in data containing the substring 'strat' in the column name will be considered stratification variables, so verify that you do not have any extraneous columns with that substring.
DuMouchel W (1999). "Bayesian Data Mining in Large Frequency Tables, With an Application to the FDA Spontaneous Reporting System." The American Statistician, 53(3), 177-190.
data.table::setDTthreads(2) #only needed for CRAN checks

var1 <- c("product_A", rep("product_B", 3), "product_C",
          rep("product_A", 2), rep("product_B", 2), "product_C")
var2 <- c("event_1", rep("event_2", 2), rep("event_3", 2),
          "event_2", rep("event_3", 3), "event_1")
strat1 <- c(rep("Male", 5), rep("Female", 3), rep("Male", 2))
strat2 <- c(rep("age_cat1", 5), rep("age_cat1", 3), rep("age_cat2", 2))
dat <- data.frame(
  var1 = var1, var2 = var2, strat1 = strat1, strat2 = strat2,
  stringsAsFactors = FALSE
)
dat$id <- 1:nrow(dat)

processRaw(dat)
suppressWarnings(
  processRaw(dat, stratify = TRUE)
)
processRaw(dat, zeroes = TRUE)
suppressWarnings(
  processRaw(dat, stratify = TRUE, zeroes = TRUE)
)
processRaw(dat, list_ids = TRUE)
Qn calculates Qn, the posterior probability that lambda came from the first component of the mixture, given N = n (Eq. 6, DuMouchel 1999). Qn is the mixture fraction for the posterior distribution.
Qn(theta_hat, N, E)
theta_hat |
A numeric vector of hyperparameter estimates (likely from autoHyper() or from directly minimizing negLLsquash()) ordered as: alpha1, beta1, alpha2, beta2, P. |
N |
A whole number vector of actual counts from processRaw(). |
E |
A numeric vector of expected counts from processRaw(). |
The hyperparameter estimates (theta_hat) are:
alpha1, beta1: Parameter estimates of the first component of the prior distribution
alpha2, beta2: Parameter estimates of the second component
P: Mixture fraction estimate of the prior distribution
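For intuition, Qn can also be written out by hand: under the GPS model each mixture component of the marginal distribution of N is negative binomial with size alpha and probability beta/(beta + E). The sketch below is illustrative only and is not a substitute for Qn():

#By-hand sketch of Qn (illustrative only).
qn_by_hand <- function(theta_hat, N, E) {
  a1 <- theta_hat[1]; b1 <- theta_hat[2]
  a2 <- theta_hat[3]; b2 <- theta_hat[4]; P <- theta_hat[5]
  f1 <- dnbinom(N, size = a1, prob = b1 / (b1 + E))
  f2 <- dnbinom(N, size = a2, prob = b2 / (b2 + E))
  P * f1 / (P * f1 + (1 - P) * f2)
}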
A numeric vector of probabilities.
DuMouchel W (1999). "Bayesian Data Mining in Large Frequency Tables, With an Application to the FDA Spontaneous Reporting System." The American Statistician, 53(3), 177-190.
autoHyper, exploreHypers, negLLsquash, negLL, negLLzero, and negLLzeroSquash for hyperparameter estimation.
processRaw for finding counts.
Other posterior distribution functions: ebgm(), quantBisect()
data.table::setDTthreads(2) #only needed for CRAN checks

theta_init <- data.frame(
  alpha1 = c(0.5, 1),
  beta1 = c(0.5, 1),
  alpha2 = c(2, 3),
  beta2 = c(2, 3),
  p = c(0.1, 0.2)
)

data(caers)
proc <- processRaw(caers)
squashed <- squashData(proc, bin_size = 300, keep_pts = 10)
squashed <- squashData(squashed, count = 2, bin_size = 13, keep_pts = 10)
theta_hat <- autoHyper(data = squashed, theta_init = theta_init)$estimates

qn <- Qn(theta_hat, N = proc$N, E = proc$E)
head(qn)
quantBisect
finds the desired quantile of the posterior distribution
using the bisection method. Used to create credibility limits.
quantBisect( percent, theta_hat, N, E, qn, digits = 2, limits = c(-1e+05, 1e+05), max_iter = 2000 )
percent |
A numeric scalar between 1 and 99 for the desired percentile (e.g., 5 for the 5th percentile). |
theta_hat |
A numeric vector of hyperparameter estimates (likely from autoHyper() or from directly minimizing negLLsquash()) ordered as: alpha1, beta1, alpha2, beta2, P. |
N |
A whole number vector of actual counts from processRaw(). |
E |
A numeric vector of expected counts from processRaw(). |
qn |
A numeric vector of posterior probabilities that lambda came from the first component of the mixture, given N = n (i.e., the posterior mixture fraction). See Qn(). |
digits |
A scalar whole number that determines the number of decimal places used when rounding the results. |
limits |
A whole number vector of length 2 for the upper and lower bounds of the search space. |
max_iter |
A whole number scalar for the maximum number of iterations. Used to prevent infinite loops. |
The hyperparameter estimates (theta_hat) are:
alpha1, beta1: Parameter estimates of the first component of the prior distribution
alpha2, beta2: Parameter estimates of the second component
P: Mixture fraction estimate of the prior distribution
Although this function can find any quantile of the posterior distribution, it will often be used to calculate the 5th and 95th percentiles to create a 90% credibility interval.
The quantile is calculated by solving for x in the general equation F(x) = cutoff, or equivalently, F(x) - cutoff = 0, where F(x) is the cumulative distribution function of the posterior distribution and cutoff is the appropriate cutoff level (e.g., 0.05 for the 5th percentile).
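A generic illustration of the bisection idea (not the package's implementation): repeatedly halve an interval whose endpoints bracket the root of F(x) - cutoff = 0.

#Generic bisection sketch (illustrative only).
bisect <- function(f, lower, upper, tol = 1e-3, max_iter = 2000) {
  for (i in seq_len(max_iter)) {
    mid <- (lower + upper) / 2
    if ((upper - lower) / 2 < tol) return(mid)
    if (sign(f(mid)) == sign(f(lower))) lower <- mid else upper <- mid
  }
  mid
}
#Example: the 5th percentile of a standard normal distribution (cutoff = 0.05)
bisect(function(x) pnorm(x) - 0.05, lower = -10, upper = 10)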
A numeric vector of quantile estimates.
The digits argument determines the tolerance for the bisection algorithm. The more decimal places you want returned, the longer the run time.
https://en.wikipedia.org/wiki/Bisection_method
autoHyper, exploreHypers, negLLsquash, negLL, negLLzero, and negLLzeroSquash for hyperparameter estimation.
processRaw for finding counts.
Qn for finding mixture fractions.
Other posterior distribution functions: Qn(), ebgm()
data.table::setDTthreads(2) #only needed for CRAN checks

theta_init <- data.frame(
  alpha1 = c(0.5, 1),
  beta1 = c(0.5, 1),
  alpha2 = c(2, 3),
  beta2 = c(2, 3),
  p = c(0.1, 0.2)
)

data(caers)
proc <- processRaw(caers)
squashed <- squashData(proc, bin_size = 300, keep_pts = 10)
squashed <- squashData(squashed, count = 2, bin_size = 13, keep_pts = 10)
theta_hat <- autoHyper(data = squashed, theta_init = theta_init)$estimates

qn <- Qn(theta_hat, N = proc$N, E = proc$E)
proc$QUANT_05 <- quantBisect(percent = 5, theta = theta_hat, N = proc$N,
                             E = proc$E, qn = qn)
## Not run: 
proc$QUANT_95 <- quantBisect(percent = 95, theta = theta_hat, N = proc$N,
                             E = proc$E, qn = qn)
## End(Not run)
head(proc)
squashData
squashes data by binning expected counts, E, for a
given actual count, N, using bin means as the expected counts for
the reduced data set. The squashed points are weighted by bin size. Data
can be squashed to reduce computational burden (see DuMouchel et al.,
2001) when estimating the hyperparameters.
squashData( data, count = 1, bin_size = 50, keep_pts = 100, min_bin = 50, min_pts = 500 )
data |
A data frame (typically from processRaw()) containing columns named N, E, and (possibly) weight. |
count |
A non-negative scalar whole number for the count size, N, used for binning |
bin_size |
A scalar whole number (>= 2) |
keep_pts |
A nonnegative scalar whole number for number of points with the largest expected counts to leave unsquashed. Used to help prevent “oversquashing”. |
min_bin |
A positive scalar whole number for the minimum number of bins needed. Used to help prevent “oversquashing”. |
min_pts |
A positive scalar whole number for the minimum number of original (unsquashed) points needed for squashing. Used to help prevent “oversquashing”. |
Can be used iteratively (count = 1, then 2, etc.).
The N column in data will be coerced using as.integer, and E will be coerced using as.numeric. Missing data are not allowed.
Since the distribution of expected counts, E, tends to be skewed to the right, the largest Es are not squashed by default. This behavior can be changed by setting the keep_pts argument to zero (0); however, this is not recommended. Squashing the largest Es could result in a large loss of information, so it is recommended to use a value of 100 or more for keep_pts.
Values for keep_pts, min_bin, and min_pts should typically be at least as large as the default values.
A data frame with column names N, E, and weight containing the reduced data set.
DuMouchel W, Pregibon D (2001). "Empirical Bayes Screening for Multi-item Associations." In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '01, pp. 67-76. ACM, New York, NY, USA. ISBN 1-58113-391-X.
processRaw for data preparation and autoSquash for automatically squashing an entire data set in one function call
set.seed(483726)

dat <- data.frame(
  var1 = letters[1:26],
  var2 = LETTERS[1:26],
  N = c(rep(0, 11), rep(1, 10), rep(2, 4), rep(3, 1)),
  E = round(abs(c(rnorm(11, 0), rnorm(10, 1), rnorm(4, 2), rnorm(1, 3))), 3),
  stringsAsFactors = FALSE
)

(zeroes <- squashData(dat, count = 0, bin_size = 3, keep_pts = 1,
                      min_bin = 2, min_pts = 2))
(ones <- squashData(zeroes, bin_size = 2, keep_pts = 1,
                    min_bin = 2, min_pts = 2))
(twos <- squashData(ones, count = 2, bin_size = 2, keep_pts = 1,
                    min_bin = 2, min_pts = 2))

squashData(zeroes, bin_size = 2, keep_pts = 0, min_bin = 2, min_pts = 2)
squashData(zeroes, bin_size = 2, keep_pts = 1, min_bin = 2, min_pts = 2)
squashData(zeroes, bin_size = 2, keep_pts = 2, min_bin = 2, min_pts = 2)
squashData(zeroes, bin_size = 2, keep_pts = 3, min_bin = 2, min_pts = 2)
Summarize an openEBGM object
## S3 method for class 'openEBGM'
summary(object, plot.out = TRUE, log.trans = FALSE, ...)
object |
An openEBGM object constructed by ebScores(). |
plot.out |
A logical value indicating whether or not a histogram of the EBGM scores should be displayed |
log.trans |
A logical value indicating whether or not the data should be log-transformed. |
... |
Additional arguments affecting the summary produced |
This function provides a brief summary of the results of the calculations performed in the ebScores function. In particular, it provides the numerical summary of the EBGM and QUANT_* vectors.
Additionally, calling summary on an openEBGM object will produce a histogram of the EBGM scores. By setting the log.trans parameter to TRUE, one can convert the EBGM scores to EBlog2, which is a Bayesian version of the information criterion (DuMouchel 1999).
DuMouchel W (1999). "Bayesian Data Mining in Large Frequency Tables, With an Application to the FDA Spontaneous Reporting System." The American Statistician, 53(3), 177-190.
data.table::setDTthreads(2) #only needed for CRAN checks

theta_init <- data.frame(
  alpha1 = c(0.5, 1),
  beta1 = c(0.5, 1),
  alpha2 = c(2, 3),
  beta2 = c(2, 3),
  p = c(0.1, 0.2)
)

data(caers)
proc <- processRaw(caers)
squashed <- squashData(proc, bin_size = 300, keep_pts = 10)
squashed <- squashData(squashed, count = 2, bin_size = 13, keep_pts = 10)

suppressWarnings(
  hypers <- autoHyper(data = squashed, theta_init = theta_init)
)

ebout <- ebScores(processed = proc, hyper_estimate = hypers, quantiles = 5)
summary(ebout)
## Not run: summary(ebout, plot.out = FALSE)
## Not run: summary(ebout, log.trans = TRUE)