Title: | Accuracy Statistic Estimation for Imperfect Gold Standards |
---|---|
Description: | Produce maximum likelihood estimates of common accuracy statistics for multiple measurement methods when a gold standard is not available. An R implementation of the expectation maximization algorithms described in Zhou et al. (2011) <doi:10.1002/9780470906514> with additional functions for creating simulated data and visualizing results. Supports binary, ordinal, and continuous measurement methods. |
Authors: | Corie Drake [aut, cre, cph] |
Maintainer: | Corie Drake <[email protected]> |
License: | GPL (>= 3) |
Version: | 0.5.1 |
Built: | 2024-12-17 06:38:45 UTC |
Source: | CRAN |
boot_ML() is a function used to generate bootstrap estimates of results generated by estimate_ML(), primarily for use in creating nonparametric confidence intervals.
boot_ML(
  type = c("binary", "ordinal", "continuous"),
  data,
  n_boot = 100,
  max_iter = 1000,
  tol = 1e-07,
  seed = NULL,
  ...
)
type |
A string specifying the data type of the methods under evaluation. |
data |
An |
n_boot |
number of bootstrap estimates to compute |
max_iter |
The maximum number of EM algorithm iterations to compute before reporting a result. |
tol |
The minimum change in statistic estimates needed to continue iterating the EM algorithm. |
seed |
optional seed for RNG |
... |
Additional arguments |
A list containing accuracy estimates, v, and the parameters used.
v_0 |
result from original data |
v_star |
list containing results from each bootstrap resampling |
params |
list containing the parameters used |
# Set seed for this example
set.seed(11001101)

# Generate data for 4 binary methods
my_sim <- generate_multimethod_data(
  "binary",
  n_obs = 75,
  n_method = 4,
  se = c(0.87, 0.92, 0.79, 0.95),
  sp = c(0.85, 0.93, 0.94, 0.80),
  method_names = c("alpha", "beta", "gamma", "delta"))

# Bootstrap ML results
boot_ex <- boot_ML(
  "binary",
  data = my_sim$generated_data,
  n_boot = 20)
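The nonparametric confidence intervals mentioned above are commonly built with the percentile method. The following is a minimal base R sketch of that idea for a single scalar statistic; it is not the internals of boot_ML(). The helper name percentile_ci and its level argument are assumptions made for illustration, and the v_0 and v_star element names only mirror the return value described above.

```r
# Hypothetical helper: percentile bootstrap interval for one scalar statistic.
# `x` is a numeric vector and `stat` is a function returning a single number.
percentile_ci <- function(x, stat, n_boot = 200, level = 0.95, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  n <- length(x)
  # Resample observations with replacement and recompute the statistic
  boot_stats <- vapply(
    seq_len(n_boot),
    function(b) stat(x[sample.int(n, n, replace = TRUE)]),
    numeric(1)
  )
  alpha <- 1 - level
  list(
    v_0 = stat(x),        # statistic on the original data
    v_star = boot_stats,  # statistic on each bootstrap resample
    ci = unname(quantile(boot_stats, probs = c(alpha / 2, 1 - alpha / 2)))
  )
}

# Toy usage: interval for the proportion of positive results
x <- c(rep(1, 30), rep(0, 45))
res <- percentile_ci(x, mean, n_boot = 200, seed = 1)
res$ci
```

The same quantile step could be applied to any scalar extracted from each bootstrap result, such as one method's sensitivity estimate.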
Censor data randomly rowwise
censor_data(
  n_obs = dis$n_obs,
  first_reads_all = first_reads_all,
  n_method_subset = n_method_subset,
  n_method = n_method
)
n_obs |
An integer representing the number of observations to simulate. |
first_reads_all |
Used for binary methods. A logical which forces method 1 to have a result for every observation |
n_method_subset |
Used for binary methods. An integer defining how many methods to select at random to produce a result for each observation |
n_method |
An integer representing the number of methods to simulate. |
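As an illustration of what row-wise random censoring can look like, the base R sketch below keeps n_method_subset randomly chosen methods in each row and blanks the rest with NA, optionally forcing method 1 to be kept. The function censor_rows_sketch and its matrix argument x are hypothetical and are not the package's censor_data() implementation.

```r
# Hypothetical sketch: keep `n_method_subset` randomly chosen methods per row
# and set the remaining entries to NA.
censor_rows_sketch <- function(x, n_method_subset, first_reads_all = FALSE) {
  n_method <- ncol(x)
  for (k in seq_len(nrow(x))) {
    if (first_reads_all) {
      # Method 1 is always kept; draw the rest from methods 2..n_method
      others <- setdiff(seq_len(n_method), 1L)
      extra <- others[sample.int(length(others), n_method_subset - 1L)]
      keep <- c(1L, extra)
    } else {
      keep <- sample.int(n_method, n_method_subset)
    }
    x[k, -keep] <- NA
  }
  x
}

# Toy usage: 5 observations, 4 binary methods, 2 results kept per row
set.seed(42)
toy <- matrix(
  rbinom(5 * 4, 1, 0.5),
  nrow = 5,
  dimnames = list(NULL, c("alpha", "beta", "gamma", "delta"))
)
censored <- censor_rows_sketch(toy, n_method_subset = 2, first_reads_all = TRUE)
censored
```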
Define the True disease state of a simulated sample
define_disease_state(D = NULL, n_obs = NULL, prev = NULL)
D |
Optional binary vector representing the true classification of each observation. |
n_obs |
An integer representing the number of observations to simulate. |
prev |
A value between 0 and 1 which represents the proportion of "positive" results in the target population. |
A list of features defining the true disease status of each observation
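One plausible way to build such a list is sketched below in base R. It assumes each observation is positive independently with probability prev when D is not supplied; the package may assign states differently. The function name disease_state_sketch and the list elements other than n_obs are hypothetical.

```r
# Hypothetical sketch: build a true-state vector from either a supplied D
# or from n_obs and prev.
disease_state_sketch <- function(D = NULL, n_obs = NULL, prev = NULL) {
  if (is.null(D)) {
    stopifnot(!is.null(n_obs), !is.null(prev), prev >= 0, prev <= 1)
    # Assumption: independent Bernoulli(prev) draws for each observation
    D <- rbinom(n_obs, size = 1, prob = prev)
  }
  stopifnot(all(D %in% c(0, 1)))
  list(
    D = D,
    n_obs = length(D),
    prev = mean(D),       # realized proportion of positives in this sample
    pos = which(D == 1),  # indices of positive observations
    neg = which(D == 0)   # indices of negative observations
  )
}

set.seed(7)
dis <- disease_state_sketch(n_obs = 50, prev = 0.3)
dis$prev
```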
estimate_ML() is a general function for estimating the maximum likelihood accuracy statistics for a set of methods with no known reference value, i.e. "truth", or "gold standard".
estimate_ML(
  type = c("binary", "ordinal", "continuous"),
  data,
  init = list(NULL),
  max_iter = 1000,
  tol = 1e-07,
  save_progress = TRUE,
  ...
)

estimate_ML_binary(
  data,
  init = list(prev_1 = NULL, se_1 = NULL, sp_1 = NULL),
  max_iter = 100,
  tol = 1e-07,
  save_progress = TRUE
)

estimate_ML_continuous(
  data,
  init = list(prev_1 = NULL, mu_i1_1 = NULL, sigma_i1_1 = NULL, mu_i0_1 = NULL,
    sigma_i0_1 = NULL),
  max_iter = 100,
  tol = 1e-07,
  save_progress = TRUE
)

estimate_ML_ordinal(
  data,
  init = list(pi_1_1 = NULL, phi_1ij_1 = NULL, phi_0ij_1 = NULL, n_level = NULL),
  level_names = NULL,
  max_iter = 1000,
  tol = 1e-07,
  save_progress = TRUE
)
type |
A string specifying the data type of the methods under evaluation. |
data |
An |
init |
An optional list of initial values used to seed the EM algorithm.
If initial values are not provided, the |
max_iter |
The maximum number of EM algorithm iterations to compute before reporting a result. |
tol |
The minimum change in statistic estimates needed to continue iterating the EM algorithm. |
save_progress |
A logical indication of whether to save interim calculations used in the EM algorithm. |
... |
Additional arguments |
level_names |
An optional, ordered, character vector of unique names corresponding to the levels of the methods. |
The lack of an infallible reference method is referred to as an imperfect gold standard (GS). Accuracy statistics which rely on a GS method, such as sensitivity, specificity, and AUC, can be estimated using imperfect gold standards by iteratively estimating the maximum likelihood values of these statistics, provided that the conditional independence assumption holds. estimate_ML() relies on a collection of expectation maximization (EM) algorithms to achieve this. The EM algorithms used in this function are based on those presented in Statistical Methods in Diagnostic Medicine, Second Edition (Zhou et al. 2011) and have been validated on several examples therein. Additional details about these algorithms can be found for binary (Walter and Irwig 1988), ordinal (Zhou et al. 2005), and continuous (Hsieh et al. 2009) methods. Minor changes to the literal calculations have been made for efficiency and code readability, but the underlying steps remain functionally unchanged.
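To make the iterative estimation concrete, here is a minimal base R sketch of a textbook latent class EM algorithm for binary methods under conditional independence. It is written for illustration only and is not the package's code: the function em_binary_sketch, its default starting values of 0.8, and the toy data are assumptions, and it handles only complete data (no NA values).

```r
# Hypothetical sketch of EM for binary methods with no gold standard.
# `y` is an n_obs x n_method matrix of 0/1 results without missing values.
em_binary_sketch <- function(y, prev = 0.5,
                             se = rep(0.8, ncol(y)),
                             sp = rep(0.8, ncol(y)),
                             max_iter = 1000, tol = 1e-7) {
  for (iter in seq_len(max_iter)) {
    # E-step: posterior probability q_k that observation k is truly positive
    lik_pos <- apply(y, 1, function(r) prod(se^r * (1 - se)^(1 - r)))
    lik_neg <- apply(y, 1, function(r) prod((1 - sp)^r * sp^(1 - r)))
    q <- prev * lik_pos / (prev * lik_pos + (1 - prev) * lik_neg)

    # M-step: update prevalence, sensitivity, and specificity
    new_prev <- mean(q)
    new_se <- colSums(q * y) / sum(q)
    new_sp <- colSums((1 - q) * (1 - y)) / sum(1 - q)

    # Stop once no estimate moves by more than `tol`
    change <- max(abs(c(new_prev - prev, new_se - se, new_sp - sp)))
    prev <- new_prev
    se <- new_se
    sp <- new_sp
    if (change < tol) break
  }
  list(prev = prev, se = se, sp = sp, q = q, iter = iter)
}

# Toy data: 4 conditionally independent binary methods
set.seed(11001101)
n <- 500
D <- rbinom(n, 1, 0.4)
true_se <- c(0.87, 0.92, 0.79, 0.95)
true_sp <- c(0.85, 0.93, 0.94, 0.80)
y <- sapply(
  seq_along(true_se),
  function(j) rbinom(n, 1, ifelse(D == 1, true_se[j], 1 - true_sp[j]))
)

fit <- em_binary_sketch(y)
fit$prev
round(fit$se, 2)
round(fit$sp, 2)
```

Starting values above 0.5 for sensitivity and specificity steer the algorithm toward the labeling in which a result of 1 indicates a positive; the likelihood itself cannot distinguish that solution from its mirror image.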
estimate_ML() returns an S4 object of class "MultiMethodMLEstimate" containing the maximum likelihood accuracy statistics calculated by EM.
Zhou X, Obuchowski NA, McClish DK (2011). Statistical Methods in Diagnostic Medicine. Wiley. doi:10.1002/9780470906514.
Walter SD, Irwig LM (1988). “Estimation of test error rates, disease prevalence and relative risk from misclassified data: a review.” J. Clin. Epidemiol., 41(9), 923–937. doi:10.1016/0895-4356(88)90110-2.
Zhou X, Castelluccio P, Zhou C (2005). “Nonparametric estimation of ROC curves in the absence of a gold standard.” Biometrics, 61(2), 600–609. doi:10.1111/j.1541-0420.2005.00324.x.
Hsieh H, Su H, Zhou X (2009). “Interval Estimation for the Difference in Paired Areas under the ROC Curves in the Absence of a Gold Standard Test.” Statistics in Medicine. https://doi.org/10.1002/sim.3661.
# Set seed for this example
set.seed(11001101)

# Generate data for 4 binary methods
my_sim <- generate_multimethod_data(
  "binary",
  n_obs = 75,
  n_method = 4,
  se = c(0.87, 0.92, 0.79, 0.95),
  sp = c(0.85, 0.93, 0.94, 0.80),
  method_names = c("alpha", "beta", "gamma", "delta"))

# View the data
my_sim$generated_data

# View the parameters used to generate the data
my_sim$params

# Estimate ML accuracy values by EM algorithm
my_result <- estimate_ML(
  "binary",
  data = my_sim$generated_data,
  save_progress = FALSE # this reduces the data stored in the resulting object
)

# View results of ML estimate
my_result@results
generate_multimethod_data() is a general function for creating a data set which simulates the results one might see when using several different methods to measure a set of objects.
generate_multimethod_data(
  type = c("binary", "ordinal", "continuous"),
  n_method = 3,
  n_obs = 100,
  prev = 0.5,
  D = NULL,
  method_names = NULL,
  obs_names = NULL,
  ...
)

generate_multimethod_binary(
  n_method = 3,
  n_obs = 100,
  prev = 0.5,
  D = NULL,
  se = rep(0.9, n_method),
  sp = rep(0.9, n_method),
  method_names = NULL,
  obs_names = NULL,
  n_method_subset = n_method,
  first_reads_all = FALSE
)

generate_multimethod_ordinal(
  n_method = 3,
  n_obs = 100,
  prev = 0.5,
  D = NULL,
  n_level = 5,
  pmf_pos = matrix(rep(1:n_level - 1, n_method), nrow = n_method, byrow = TRUE),
  pmf_neg = matrix(rep(n_level:1 - 1, n_method), nrow = n_method, byrow = TRUE),
  method_names = NULL,
  level_names = NULL,
  obs_names = NULL,
  n_method_subset = n_method,
  first_reads_all = FALSE
)

generate_multimethod_continuous(
  n_method = 2,
  n_obs = 100,
  prev = 0.5,
  D = NULL,
  mu_i1 = rep(12, n_method),
  sigma_i1 = diag(n_method),
  mu_i0 = rep(10, n_method),
  sigma_i0 = diag(n_method),
  method_names = NULL,
  obs_names = NULL,
  n_method_subset = n_method,
  first_reads_all = FALSE
)
type |
A string specifying the data type of the methods being simulated. |
n_method |
An integer representing the number of methods to simulate. |
n_obs |
An integer representing the number of observations to simulate. |
prev |
A value between 0 and 1 which represents the proportion of "positive" results in the target population. |
D |
Optional binary vector representing the true classification of each observation. |
method_names |
Optional vector of names used to identify each method. |
obs_names |
Optional vector of names used to identify each observation. |
... |
Additional parameters |
se, sp |
Used for binary methods. A vector of length n_method of values between 0 and 1 representing the sensitivity and specificity of the methods. |
n_method_subset |
Used for binary methods. An integer defining how many methods to select at random to produce a result for each observation |
first_reads_all |
Used for binary methods. A logical which forces method 1 to have a result for every observation |
n_level |
Used for ordinal methods. An integer representing the number of ordinal levels each method has |
pmf_pos, pmf_neg |
Used for ordinal methods. An n_method by n_level matrix representing the probability mass functions for positive and negative results, respectively |
level_names |
Used for ordinal methods. Optional vector of names used to identify each level |
mu_i1, mu_i0 |
Used for continuous methods. Vectors of length n_method of the method mean values for positive (negative) observations |
sigma_i1, sigma_i0 |
Used for continuous methods. Covariance matrices of method positive (negative) observations |
The function supports binary measurement methods, e.g., Pass/Fail; ordinal measurement methods, e.g., the Likert scale; and continuous measurement methods, e.g., height. The data are generated under the assumption that the underlying population consists of a mixture of two groups. The primary application of this is to simulate a sample from a population which has some prevalence of disease.
A list containing a simulated data set and the parameters used to create it
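The two-group mixture described above can be sketched for binary methods in a few lines of base R. This is an illustrative approximation rather than the package's generator: simulate_binary_sketch is a hypothetical name, the contents of the params element are assumed, and only the generated_data and params element names follow the examples in this manual.

```r
# Hypothetical sketch of the two-group mixture for binary methods.
simulate_binary_sketch <- function(n_obs = 100, prev = 0.5,
                                   se = rep(0.9, 3), sp = rep(0.9, 3)) {
  stopifnot(length(se) == length(sp))
  # Latent group membership: 1 = "positive", 0 = "negative"
  D <- rbinom(n_obs, 1, prev)
  # P(result = 1) is se for positives and 1 - sp for negatives
  y <- vapply(
    seq_along(se),
    function(j) rbinom(n_obs, 1, ifelse(D == 1, se[j], 1 - sp[j])),
    integer(n_obs)
  )
  colnames(y) <- paste0("method", seq_along(se))
  list(
    generated_data = y,
    params = list(n_obs = n_obs, prev = prev, se = se, sp = sp, D = D)
  )
}

set.seed(11001101)
sim <- simulate_binary_sketch(
  n_obs = 75,
  prev = 0.5,
  se = c(0.87, 0.92, 0.79, 0.95),
  sp = c(0.85, 0.93, 0.94, 0.80)
)
head(sim$generated_data)
```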
# Set seed for this example
set.seed(11001101)

# Generate data for 4 binary methods
my_sim <- generate_multimethod_data(
  "binary",
  n_obs = 75,
  n_method = 4,
  se = c(0.87, 0.92, 0.79, 0.95),
  sp = c(0.85, 0.93, 0.94, 0.80),
  method_names = c("alpha", "beta", "gamma", "delta"))

# View the data
my_sim$generated_data

# View the parameters used to generate the data
my_sim$params

# Estimate ML accuracy values by EM algorithm
my_result <- estimate_ML(
  "binary",
  data = my_sim$generated_data,
  save_progress = FALSE # this reduces the data stored in the resulting object
)

# View results of ML estimate
my_result@results
S4 object containing the results of multi-method ML accuracy estimates
results
a list of estimated accuracy statistics
names
a list containing vectors of names of various dimensions
data
a copy of the data used to generate the estimated values
iter
an integer number of iterations needed for the EM algorithm to converge
prog
a list containing the values calculated during each iteration of the EM algorithm
type
a string describing the data type
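For readers less familiar with S4, the sketch below shows how a class with slots like these can be defined and how slots are read with the @ operator. DemoEstimate is a hypothetical toy class with only three of the slots listed above; it is not the definition of MultiMethodMLEstimate.

```r
library(methods)

# Hypothetical toy class; not the package's class definition
setClass(
  "DemoEstimate",
  slots = c(results = "list", iter = "numeric", type = "character")
)

# A show method controls what is printed when the object is auto-printed
setMethod("show", "DemoEstimate", function(object) {
  cat("Estimates for", object@type, "methods after",
      object@iter, "iterations\n")
  print(object@results)
})

demo <- new(
  "DemoEstimate",
  results = list(prev = 0.4, se = c(0.9, 0.8)),
  iter = 12,
  type = "binary"
)

# Slots are accessed with @
demo@results$prev
slotNames(demo)
```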
Create unique names for a set of things
name_thing(thing = "", n = 1)
thing |
a string that describes the set of items to name |
n |
an integer number of unique names to create |
a vector of unique names
plot_ML() is a general function for visualizing results generated by estimate_ML().
plot_ML(ML_est, params = NULL)

plot_ML_binary(
  ML_est,
  params = list(prev = NULL, se = NULL, sp = NULL, D = NULL)
)

plot_ML_ordinal(
  ML_est,
  params = list(pi_1_1 = NULL, phi_1ij_1 = NULL, phi_0ij_1 = NULL, D = NULL)
)

plot_ML_continuous(
  ML_est,
  params = list(prev_1 = NULL, mu_i1_1 = NULL, sigma_i1_1 = NULL, mu_i0_1 = NULL,
    sigma_i0_1 = NULL, D = NULL)
)
ML_est |
A MultiMethodMLEstimate class object |
params |
A list of population parameters. This is primarily used to evaluate results from a simulation where the target parameters are known, but can be used to visualize results with respect to some True value. |
A list of ggplot2 plots.
Binary:
prev |
A plot showing how the prevalence estimate changes with each iteration of the EM algorithm |
se |
A plot showing how the sensitivity estimates of each method change with each iteration of the EM algorithm |
sp |
A plot showing how the specificity estimates of each method change with each iteration of the EM algorithm |
qk |
A plot showing how the q values for each observation k change over each iteration of the EM algorithm |
qk_hist |
A histogram of q values. Observations, k, can be colored by True
state if it is passed by |
se_sp |
A plot showing the path the sensitivity and specificity estimates
for each method follows during the EM algorithm. True sensitivity and specificity
values can be passed by |
Ordinal:
ROC |
The Receiver Operating Characteristic (ROC) curves estimated for each method |
q_k1 |
A plot showing how the q values for each observation, k, change when d=1
over each iteration of the EM algorithm. Observations can be colored by True
state if it is passed ( |
q_k0 |
A plot showing how the q values for each observation, k, change when d=0
over each iteration of the EM algorithm. Observations can be colored by True
state if it is passed by |
q_k1_hist |
A histogram of q_1 values. Observations, k, can be colored by True
state if it is passed by |
phi_d |
A stacked bar graph representing the estimated CMFs of each
method when |
Continuous:
ROC |
The Receiver Operating Characteristic (ROC) curves estimated for each method |
z_k1 |
A plot showing how the z_k1 values for each observation change
over each iteration of the EM algorithm. Observations can be colored by True
state if it is passed ( |
z_k0 |
A plot showing how the z_k0 values for each observation change
over each iteration of the EM algorithm. Observations can be colored by True
state if it is passed ( |
z_k1_hist |
A histogram of z_k1 values. Observations can be colored by True
state if it is passed ( |
# Set seed for this example
set.seed(11001101)

# Generate data for 4 binary methods
my_sim <- generate_multimethod_data(
  "binary",
  n_obs = 75,
  n_method = 4,
  se = c(0.87, 0.92, 0.79, 0.95),
  sp = c(0.85, 0.93, 0.94, 0.80),
  method_names = c("alpha", "beta", "gamma", "delta"))

# View the data
my_sim$generated_data

# View the parameters used to generate the data
my_sim$params

# Estimate ML accuracy values by EM algorithm
my_result <- estimate_ML(
  "binary",
  data = my_sim$generated_data,
  save_progress = FALSE # this reduces the data stored in the resulting object
)

# View results of ML estimate
my_result@results
Create a list of plots visualizing the expectation maximization process and resulting accuracy statistics stored in a MultiMethodMLEstimate object.
## S4 method for signature 'MultiMethodMLEstimate'
plot(x, y, ...)
x |
a MultiMethodMLEstimate S4 object |
y |
not used |
... |
Arguments passed on to
|
A list of ggplot2 plots
pollinate_ML() is a general helper function which can be used to generate starting values, i.e. seeds, for the estimate_ML() function from a multi-method data set.
pollinate_ML(type = c("binary", "ordinal", "continuous"), data, ...)

pollinate_ML_binary(data, ...)

pollinate_ML_ordinal(
  data,
  n_level = NULL,
  threshold_level = ceiling(n_level/2),
  level_names = NULL
)

pollinate_ML_continuous(
  data,
  prev = 0.5,
  q_seeds = c((1 - prev)/2, 1 - (prev/2)),
  high_pos = TRUE
)
type |
A string specifying the data type of the methods under evaluation. |
data |
An |
... |
Additional arguments |
n_level |
Used for ordinal methods. Integer number of levels each method contains |
threshold_level |
Used for ordinal methods. A value from 1 to |
level_names |
Used for ordinal methods. Optional vector of length |
prev |
A double between 0 and 1 representing the proportion of positives in the population |
q_seeds |
Used for continuous methods. A vector of length 2 representing the quantiles at which the two groups are assumed to be centered |
high_pos |
Used for continuous methods. A logical indicating whether larger values are considered "positive" |
a list of EM algorithm initialization values
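The default q_seeds expression has a simple reading: (1 - prev)/2 is the midpoint of the lower 1 - prev share of the data and 1 - prev/2 is the midpoint of the upper prev share, so prev = 0.5 gives quantiles 0.25 and 0.75. The base R sketch below turns those quantiles into starting group centers for continuous methods. It is an assumption about how such seeds might be used, not the package's pollinate_ML_continuous() code; quantile_seeds_sketch is a hypothetical name, and the returned element names simply echo the init argument of estimate_ML_continuous().

```r
# Hypothetical sketch: quantile-based starting centers for continuous methods.
quantile_seeds_sketch <- function(data, prev = 0.5,
                                  q_seeds = c((1 - prev) / 2, 1 - (prev / 2)),
                                  high_pos = TRUE) {
  # One row per quantile, one column per method
  centers <- apply(data, 2, quantile, probs = q_seeds, names = FALSE)
  # If smaller values indicate "positive", swap the two rows
  if (!high_pos) centers <- centers[2:1, , drop = FALSE]
  list(
    prev_1 = prev,
    mu_i0_1 = centers[1, ],  # starting means for the negative group
    mu_i1_1 = centers[2, ]   # starting means for the positive group
  )
}

# Toy data: two methods, negatives centered near 10 and positives near 12
set.seed(3)
toy <- cbind(
  a = c(rnorm(50, 10), rnorm(50, 12)),
  b = c(rnorm(50, 10), rnorm(50, 12))
)
seeds <- quantile_seeds_sketch(toy)
seeds$mu_i0_1
seeds$mu_i1_1
```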
Print the accuracy statistic estimates stored in a MultiMethodMLEstimate object.
## S4 method for signature 'MultiMethodMLEstimate'
show(object)
object |
An object of class MultiMethodMLEstimate. |
A list containing relevant accuracy statistic estimates. This is a subset of the list stored in the results slot of the MultiMethodMLEstimate object.