Title: | Power Analysis via Monte Carlo Simulation for Correlated Data |
---|---|
Description: | A flexible framework for power analysis using Monte Carlo simulation for settings in which considerations of the correlations between predictors are important. Users can set up a data generative model that preserves dependence structures among predictors given existing data (continuous, binary, or ordinal). Users can also generate power curves to assess the trade-offs between sample size, effect size, and power of a design. This package includes several statistical models common in environmental mixtures studies. For more details and tutorials, see Nguyen et al. (2022) <arXiv:2209.08036>. |
Authors: | Phuc H. Nguyen [aut, cre] |
Maintainer: | Phuc H. Nguyen <[email protected]> |
License: | LGPL |
Version: | 0.1.0 |
Built: | 2024-12-03 06:46:27 UTC |
Source: | CRAN |
Fits a BKMR model with significance criteria: PIP and group-wise PIP
bkmr_wrapper(y, x, args = list())
bkmr_wrapper(y, x, args = list())
y |
A vector of outcome |
x |
A matrix of predictors |
args |
A list of arguments, see R function 'bkmr::kmbayes()' |
A list of vectors whose values are between 0 and 1
pip |
PIP for component-wise selection or conditional (with-in group) PIP for hierarchical variable selection. |
group_pip |
PIP for group-specific selection. |
time |
elapsed time to fit the model. |
Bobb JF, Henn BC, Valeri L, Coull BA (2018). “Statistical software for analyzing the health effects of multiple concurrent exposures via Bayesian kernel machine regression.”Environ-mental Health,17(67).doi:10.1186/s12940-018-0413-y.
Fits a linear model with Bayesian model selection with significance criteria: PIP and posterior probability of nonzero coefficients being on one side of zero.
bma_wrapper(y, x, args = list())
bma_wrapper(y, x, args = list())
y |
A vector of outcome |
x |
A matrix of predictors |
args |
A list of arguments see R function 'BMA::bic.glm()'. |
A list of vectors whose values are between 0 and 1
beta |
The smaller posterior probability of the coefficients being to one side of zero: min(Pr(beta >0), Pr(beta<0)). |
pip |
PIP of the effect not being zero. |
time |
elapsed time to fit the model. |
Raftery A, Hoeting J, Volinsky C, Painter I, Yeung KY (2021).BMA: Bayesian model averaging. R package version 3.18.15.
Fits a Bayesian weighted sums
bws_wrapper(y, x, args = list(iter = 2000))
bws_wrapper(y, x, args = list(iter = 2000))
y |
A vector of outcome |
x |
A matrix of predictors |
args |
A list of arguments see R 'bws::bws()“ function. |
A list
beta |
The smaller posterior probability of the combined overall effect being to one side of zero: min(Pr(beta >0), Pr(beta<0)). The same for all predictor. |
weights |
The 95% CI of the contribution of each predictor to the overall effect. Between 0 and 1. |
time |
elapsed time to fit the model. |
Hamra GB, MacLehose RF, Croen L, Kauffman EM, Newschaffer C (2021). “Bayesian weighted sums: a flexible approach to estimate summed mixture effects.” International Journal of Environmental Research and Public Health, 18(4), 1373.
Convert a correlation matrix into a partial correlation matrix
cor2partial(r)
cor2partial(r)
r |
A correlation matrix |
A partial correlation matrix
Citation: Daniel Lewandowski, Dorota Kurowicka, Harry Joe, Generating random correlation matrices based on vines and extended onion method, Journal of Multivariate Analysis, Volume 100, Issue 9, 2009, Pages 1989-2001, ISSN 0047-259X, https://doi.org/10.1016/j.jmva.2009.04.008.
cvine(d, alpha = 10, beta = 10, S = NULL, m = 100)
cvine(d, alpha = 10, beta = 10, S = NULL, m = 100)
d |
Number of dimension |
alpha |
Parameter for Beta distribution |
beta |
Parameter for Beta distribution |
S |
A 'guess' of the correlation matrix |
m |
A number that indicates how much the random matrices vary from S |
A random positive-definite correlation matrix
Monte Carlo approximation of the SNR
estimate_snr(ymod, xmod, m = 5000, R = 100)
estimate_snr(ymod, xmod, m = 5000, R = 100)
ymod |
A OutcomeModel object |
xmod |
A MixtureModel object |
m |
Number of MC samples |
R |
Number of bootstrap replicates |
An estimate SNR and 95-percent CI.
Fits a Bayesian factor model with interactions
fin_wrapper(y, x, args = list(nrun = 2000))
fin_wrapper(y, x, args = list(nrun = 2000))
y |
A vector of outcome |
x |
A matrix of predictors |
args |
A list of arguments see R function ‘infinitefactor::interactionDL()' in ’infinitefactor' package. |
A list of vectors whose values are between 0 and 1
beta |
The smallest posterior probability of the coefficients being to one side of zero for either main effect or interaction: min(Pr(beta >0), Pr(beta<0)). |
linear_beta |
The smaller of posterior probability of the main effects being to one side of zero. |
interact_beta |
Same as linear_beta but for pair-wise interactions. |
time |
elapsed time to fit the model. |
Ferrari F, Dunson DB (2020). “Bayesian factor analysis for inference on interactions.”Journal of the American Statistical Association, 0(0), 1–12.
Fits the model to given data and gets a list of significance criteria
fit(mod, x, y)
fit(mod, x, y)
mod |
An InferenceModel object |
x |
A matrix of predictors |
y |
A vector of outcome |
A list of some of the following significance criteria:
beta |
The smaller posterior probability of being to one side of zero for linear term, given either the main effect or interaction is non-zero. Applicable to 'bma', 'bws', 'fin', and 'mixselect' model. |
interact_beta |
Same as linear_beta but for pair-wise interactions. Applicable to 'fin' model. |
pip |
Posterior inclusion probability (PIP) of either a linear or non-linear effect. Applicable to 'bma', 'bkmr', and 'mixselect' model. |
group_pip |
PIP of either a linear or non-linear effect. Applicable to 'bkmr' model. |
linear_pip |
PIP of a linear effect. Applicable to 'mixselect' model. |
gp_pip |
PIP of a non-linear effect. Applicable to 'mixselect' model. |
pval |
The p-value of the combined effect, the same for all predictors. Applicable to 'glm', and 'qgc' model. |
time |
elapsed time to fit the model. |
Generates a matrix of n observations of p predictors
genx(obj, n)
genx(obj, n)
obj |
A MixtureModel object. |
n |
An integer, number of observations to generate. |
A (n x p) dataframe.
Generates a vector of outcomes
geny(obj, x)
geny(obj, x)
obj |
An OutcomeModel object |
x |
An (n x p) matrix of predictors |
An n-vector of outcomes
Fits a generalized linear model
glm_wrapper(y, x, args = list())
glm_wrapper(y, x, args = list())
y |
A vector of outcome |
x |
A matrix of predictors |
args |
A list of arguments see R 'glm' function. |
A list
pval |
The p-value of the linear main effect. |
time |
elapsed time to fit the model. |
This function creates a wrapper function for a statistical model and its applicable significance criterion. It finds relationships between a matrix of predictors and a vector of outcomes using the statistical model, and determines if the relationships are 'significant' according to pre-specified criterion for that model.
InferenceModel(model, name = NULL, ...)
InferenceModel(model, name = NULL, ...)
model |
A string of the name of a built-in statistical model or a function implements a statistical model and returns a list of significance criteria for each predictor. Built-in options include 'bma', 'bkmr', 'mixselect', 'bws', 'qgc', 'fin', 'glm'. |
name |
A string, name of the statistical model. Default to string input of model. |
... |
Additional keyword arguments for the statistical model |
An InferenceModel object.
model |
a function that takes matrices of predictors and outcomes and returns a list of significance criteria. |
model_name |
a string. |
imod <- mpower::InferenceModel(model = 'glm', family = 'gaussian', formula = y ~ Poverty*(poly(Age, 2) + HHIncome + HomeOwn + Education))
imod <- mpower::InferenceModel(model = 'glm', family = 'gaussian', formula = y ~ Poverty*(poly(Age, 2) + HHIncome + HomeOwn + Education))
This function creates a generative model for the correlated, mixed-scale predictors.
MixtureModel( method = "estimation", data = NULL, G = NULL, m = 100, nudge = 1e-09, sbg_args = list(nsamp = 1000), cvine_marginals = list(), cvine_dtypes = list(), resamp_prob = NULL )
MixtureModel( method = "estimation", data = NULL, G = NULL, m = 100, nudge = 1e-09, sbg_args = list(nsamp = 1000), cvine_marginals = list(), cvine_dtypes = list(), resamp_prob = NULL )
method |
A string, one of the three options "resampling", "estimation", or "cvine". Default is "estimation". See Details. |
data |
A dataframe or matrix, required for resampling" and "estimation" method. |
G |
A guesstimate pairwise correlation matrix for "cvine" method. See Details. |
m |
A positive number indicating uncertainty in the guesstimate G, larger means more uncertainty. Default is 100. |
nudge |
A number, default 10e-10 to add to the diagonal of the covariance matrix for numerical stability. |
sbg_args |
A list of named arguments, except Y, for function 'sbgcop.mcmc()'. |
cvine_marginals |
A named list describing the univariate distribution of each predictor. See Details. |
cvine_dtypes |
A named list describing the data type of each variable. |
resamp_prob |
A vector of sampling probability for each observation in data. Must sums to 1. |
A MixtureModel object.
There are three methods to generate data:
1. Resampling: if we have enough data of the predictors, we can resample to get realistic joint distributions and dependence among them.
2. Estimation: if we have a small sample from, for example, a pilot study, we can sample from a semi-parametric copula model (Hoff 2007) after learning the dependence and univariate marginals of the predictors.
3. C-vine: if no pilot data exists, we can still set rough guesstimate of the dependence and univariate marginals. The C-vine algorithm (Joe 2006) generates positive semi-definite correlation matrix given the guesstimate G. The guesstimate G is a symmetric p x p matrix whose ij-th item is between -1 and 1 and is the guesstimate correlation between predictor ith and jth. G doesn't need to be a valid correlation matrix. The method works well when values in G are not extreme (i.e., 0.999, -0.999). Built-in functions for univariate marginals include: 'qbeta' , 'qbinom', 'qcauchy', 'qchisq', 'qexp', 'qf', 'qgamma', 'qgeom', 'qhyper', 'qlogis', 'qlnorm', 'qmultinom', 'qnbinom', 'qnorm', 'qpois', 'qt', 'qunif', 'qweibull'.
Hoff P (2007). 'Extending the rank likelihood for semiparametric copula estimation.' Ann. Appl. Stat, 1(1), 265-283.
Joe H (2006). “Generating random correlation matrices based on partial correlations.”Journal of Multivariate Analysis, 97, 2177-2189.
data("nhanes1518") xmod <- mpower::MixtureModel(nhanes1518, method = "resampling")
data("nhanes1518") xmod <- mpower::MixtureModel(nhanes1518, method = "resampling")
Visualize marginals and Gaussian copula correlations of simulated data
mplot(obj, split = TRUE)
mplot(obj, split = TRUE)
obj |
A MixtureModel object. |
split |
A logical, whether to display numbers on half of the covariance matrix. |
A 'ggplot2' graphics.
This package provides tools to set up simulations for power calculation
Combined NHANES data from the 2015-2016 and 2017-2018 cycles The weights have been adjusted according to https://wwwn.cdc.gov/nchs/nhanes/tutorials/module3.aspx
nhanes1518
nhanes1518
Data with the following variables:
Respondent sequence number
Full sample 4 year interview weight
Full sample 4 year MEC exam weight
Environmental B 4-year weights
Interview/Examination status
Gender of the participant
Age in years of the participant at the time of screening. Individuals 80 and over are top-coded at 80 years of age
A ratio of family income to poverty guidelines
Race/Hispanic origin
Total household income (reported as a range value in dollars)
Body Mass Index (kg/m**2)
Waist Circumference (cm)
Weight (kg)
Standing Height (cm)
Creatinine, urine (mg/dL)
MCNP Mono(carboxyisononyl) phthalate (ng/mL), LLOD = 0.2
MCOP Mono(carboxyisoctyl) phthalate (ng/mL), LLOD = 0.3
MECPP Mono-2-ethyl-5-carboxypentyl phthalate (ng/mL), LLOD = 0.4
MHIBP phthalate (ng/mL), LLOD = 0.4
MnBP Mono-n-butyl phthalate (ng/mL), LLOD = 0.4
MCPP Mono-(3-carboxypropyl) phthalate (ng/mL), LLOD = 0.4
MCOCH phthalate (ng/mL), LLOD = 0.5
MEP Mono-ethyl phthalate (ng/mL), LLOD = 1.2
Mono-3-hydroxy-n-butyl phthalate (ng/mL), LLOD = 0.4
MEHHP Mono-(2-ethyl-5-hydroxyhexyl) phthalate (ng/mL), LLOD = 0.4
Cyclohexane 1,2-dicarboxylic acid monohydroxy isononyl ester (ng/mL), LLOD = 0.4
MEHP Mono-(2-ethyl)-hexyl phthalate (ng/mL), LLOD = 0.8
MiBP Mono-isobutyl phthalate (ng/mL), LLOD = 0.8
MCNP Mono-isononyl phthalate (ng/mL), LLOD = 0.9
MEOHP Mono-(2-ethyl-5-oxohexyl) phthalate (ng/mL), LLOD = 0.2
MBzP Mono-benzyl phthalate (ng/mL), LLOD = 0.3
Phthalates comment code for whether the measurement is under the limit of detection
Detailed documentation of the phthalates variables can be found here:
This function creates a generative model of the outcome given a matrix of predictors.
OutcomeModel(f, family = "gaussian", sigma = 1, f_args = list())
OutcomeModel(f, family = "gaussian", sigma = 1, f_args = list())
f |
A string that describes the relationships between the predictors and
outcome or a function that takes an input matrix and returns a vector of
outcome: |
family |
A string, 'gaussian', 'binomial', or 'poisson' for continuous, binary, or count outcomes. |
sigma |
A number, Gaussian noise standard deviation if applicable. |
f_args |
A named list of additional arguments to f |
An OutcomeModel object. Attributes:
f |
mean function. |
sigma |
a number for the Gaussian observation noise. |
family |
a string 'gaussian' or 'binomial'. |
# Define BMI as a ratio of weight and height plus random Gaussian error with standard deviation 1. bmi_model <- mpower::OutcomeModel(f = 'weight/(height^2)', sigma = 1, family = 'gaussian')
# Define BMI as a ratio of weight and height plus random Gaussian error with standard deviation 1. bmi_model <- mpower::OutcomeModel(f = 'weight/(height^2)', sigma = 1, family = 'gaussian')
Partial correlations between elements in x and elements in y
partial(r, x, y)
partial(r, x, y)
r |
A correlation matrix |
x |
A vector of indices |
y |
A vector of indices |
A partial correlation matrix
Plot summaries of power simulation
plot_summary(sim, crit, thres, digits = 3, how = "greater")
plot_summary(sim, crit, thres, digits = 3, how = "greater")
sim |
A Sim or a SimCurve object, output from 'sim_power()' or 'sim_curve()'. |
crit |
A string specifying the significance criteria. |
thres |
A number or vector of numbers specifying the thresholds of "significance". |
digits |
An integer for the number of decimal points to display. |
how |
A string, whether to compare the criterion 'greater' or 'lesser' than the threshold. |
A 'ggplot2' graphics.
Fits a linear Quantile G-Computation model with no interactions
qgcomp_lin_wrapper(y, x, args = list())
qgcomp_lin_wrapper(y, x, args = list())
y |
A vector of outcome |
x |
A matrix of predictors |
args |
A list of arguments see R function 'qgcomp::qgcomp.noboot()'. |
A list
pval |
The p-value of the combined effect, the same for all predictors. |
pos_weights |
Positive weights. See 'qgcomp' package. |
neg_weights |
Negative weights. See 'qgcomp' package. |
time |
elapsed time to fit the model. |
Keil AP, Buckley JP, O’Brien KM, Ferguson KK, Zhao S, White AJ (2020). “A Quantile-based g-computation approach to addressing the effects of exposure mixtures.”Environmental Health Perspectives, 128(4).
Quantile function for the multinomial distribution, size = 1
qmultinom(p, probs)
qmultinom(p, probs)
p |
A quantile. |
probs |
A vector of probabilities for each level. |
Gives the quantile function
Convert R-squared value to the SNR
rsq2snr(r)
rsq2snr(r)
r |
R-squared value |
Rescale the mean function of an OutcomeModel to meet a given SNR
scale_f(snr, ymod, xmod, m = 5000)
scale_f(snr, ymod, xmod, m = 5000)
snr |
A SNR |
ymod |
A OutcomeModel object to modify |
xmod |
A MixtureModel object |
m |
Number of MC samples to estimate the SNR of a proposed noise variance |
A new OutcomeModel object
Rescale the noise variance of a Gaussian OutcomeModel to meet a given SNR
scale_sigma(snr, ymod, xmod, m = 5000)
scale_sigma(snr, ymod, xmod, m = 5000)
snr |
A SNR |
ymod |
A OutcomeModel object to modify |
xmod |
A MixtureModel object |
m |
Number of MC samples to estimate the SNR of a proposed noise variance |
A new OutcomeModel object
This function updates values in an OutcomeModel object
set_value(obj, name, value)
set_value(obj, name, value)
obj |
An OutcomeModel object |
name |
A string for name of the attribute to be changed |
value |
An appropriate data type |
This function can be used to create power curves by calling sim_power() on combinations of many sample sizes and signal-to-noise ratio (SNR).
sim_curve( xmod, ymod, imod, s = 100, n = 100, cores = 1, file = NULL, errorhandling = "stop", snr_iter = 10000, cluster_export = c() )
sim_curve( xmod, ymod, imod, s = 100, n = 100, cores = 1, file = NULL, errorhandling = "stop", snr_iter = 10000, cluster_export = c() )
xmod |
A MixtureModel object. |
ymod |
One or a list of OutcomeModel object(s). |
imod |
An InferenceModel object. |
s |
An integer for the number of Monte Carlo simulations. |
n |
An integer or a vector of sample sizes. |
cores |
An integer for the number of processing cores. When cores > 1, parallelism is automatically applied. |
file |
A string, a file name with no extension to write samples to periodically. By default write to an RDS file. |
errorhandling |
A string "remove", "stop", or "pass". If an error occurs in any iteration, remove that iteration (remove), return the error message verbatim in the output (pass), or terminate the loop (stop). Default is "remove". See R package 'foreach' for more details. |
snr_iter |
An integer for number of Monte Carlo samples to estimate SNR. |
cluster_export |
A vector of functions to pass to the parallel-processing clusters. |
A SimCurve object with the following attributes:
s |
a number of simulations. |
snr |
a real number or array of real numbers for SNR of each OutcomeModel. |
n |
a number or vector of sample sizes. |
xmod |
the MixtureModel used. |
ymod |
the OutcomeModel used. |
imod |
the InferenceModel used. |
sims |
a list of simulation output matrices. |
data("nhanes1518") chems <- c("URXCNP", "URXCOP", "URXECP", "URXHIBP", "URXMBP", "URXMC1", "URXMCOH", "URXMEP","URXMHBP", "URXMHH", "URXMHNC", "URXMHP", "URXMIB", "URXMNP", "URXMOH", "URXMZP") chems_mod <- mpower::MixtureModel(nhanes1518[, chems], method = "resampling") bmi_mod <- mpower::OutcomeModel(f = "0.2*URXCNP + 0.15*URXECP + 0.1*URXCOP*URXECP", family = "binomial") logit_mod <- mpower::InferenceModel(model = "glm", family = "binomial") logit_out <- mpower::sim_curve(xmod=chems_mod, ymod=bmi_mod, imod=logit_mod, s=20, n=c(500, 1000), cores=2, snr_iter=1000) logit_df <- summary(logit_out, crit="pval", thres=0.05, how="lesser") plot_summary(logit_out, crit="pval", thres=0.05, how="lesser")
data("nhanes1518") chems <- c("URXCNP", "URXCOP", "URXECP", "URXHIBP", "URXMBP", "URXMC1", "URXMCOH", "URXMEP","URXMHBP", "URXMHH", "URXMHNC", "URXMHP", "URXMIB", "URXMNP", "URXMOH", "URXMZP") chems_mod <- mpower::MixtureModel(nhanes1518[, chems], method = "resampling") bmi_mod <- mpower::OutcomeModel(f = "0.2*URXCNP + 0.15*URXECP + 0.1*URXCOP*URXECP", family = "binomial") logit_mod <- mpower::InferenceModel(model = "glm", family = "binomial") logit_out <- mpower::sim_curve(xmod=chems_mod, ymod=bmi_mod, imod=logit_mod, s=20, n=c(500, 1000), cores=2, snr_iter=1000) logit_df <- summary(logit_out, crit="pval", thres=0.05, how="lesser") plot_summary(logit_out, crit="pval", thres=0.05, how="lesser")
Power analysis using Monte Carlo simulation
sim_power( xmod, ymod, imod, s = 100, n = 100, cores = 1, file = NULL, errorhandling = "stop", snr_iter = 10000, cluster_export = c() )
sim_power( xmod, ymod, imod, s = 100, n = 100, cores = 1, file = NULL, errorhandling = "stop", snr_iter = 10000, cluster_export = c() )
xmod |
A MixtureModel object. |
ymod |
An OutcomeModel object. |
imod |
An InferenceModel object. |
s |
An integer for number of Monte Carlo simulations. |
n |
An integer for sample size in each simulation. |
cores |
An integer for number of processing cores. When cores > 1, parallelism is automatically applied. |
file |
A string, a file name with no extension to write samples to periodically. By default write to an RDS file. |
errorhandling |
A string "remove", "stop", or "pass". If an error occurs in any iteration, remove that iteration (remove), return the error message verbatim in the output (pass), or terminate the loop (stop). Default is "remove". See R package 'foreach' for more details. |
snr_iter |
An integer for number of Monte Carlo samples to estimate SNR |
cluster_export |
A vector of functions to pass to the parallel-processing clusters |
A PowerSim object. Attributes:
s |
a number of simulations. |
snr |
a real number for SNR of the OutcomeModel. |
n |
a number of sample sizes. |
xmod |
the MixtureModel used. |
ymod |
the OutcomeModel used. |
imod |
the InferenceModel used. |
sims |
an output matrices. |
data("nhanes1518") chems <- c("URXCNP", "URXCOP", "URXECP", "URXHIBP", "URXMBP", "URXMC1", "URXMCOH", "URXMEP","URXMHBP", "URXMHH", "URXMHNC", "URXMHP", "URXMIB", "URXMNP", "URXMOH", "URXMZP") chems_mod <- mpower::MixtureModel(nhanes1518[, chems], method = "resampling") bmi_mod <- mpower::OutcomeModel(f = "0.2*URXCNP + 0.15*URXECP + 0.1*URXCOP*URXECP", family = "binomial") logit_mod <- mpower::InferenceModel(model = "glm", family = "binomial") logit_out <- mpower::sim_power(xmod=chems_mod, ymod=bmi_mod, imod=logit_mod, s=100, n=2000, cores=2, snr_iter=2000) logit_df <- summary(logit_out, crit="pval", thres=0.05, how="lesser") plot_summary(logit_out, crit="pval", thres=0.05, how="lesser")
data("nhanes1518") chems <- c("URXCNP", "URXCOP", "URXECP", "URXHIBP", "URXMBP", "URXMC1", "URXMCOH", "URXMEP","URXMHBP", "URXMHH", "URXMHNC", "URXMHP", "URXMIB", "URXMNP", "URXMOH", "URXMZP") chems_mod <- mpower::MixtureModel(nhanes1518[, chems], method = "resampling") bmi_mod <- mpower::OutcomeModel(f = "0.2*URXCNP + 0.15*URXECP + 0.1*URXCOP*URXECP", family = "binomial") logit_mod <- mpower::InferenceModel(model = "glm", family = "binomial") logit_out <- mpower::sim_power(xmod=chems_mod, ymod=bmi_mod, imod=logit_mod, s=100, n=2000, cores=2, snr_iter=2000) logit_df <- summary(logit_out, crit="pval", thres=0.05, how="lesser") plot_summary(logit_out, crit="pval", thres=0.05, how="lesser")
Tabular summaries of power simulation
summary(sim, crit, thres, digits = 3, how = "greater")
summary(sim, crit, thres, digits = 3, how = "greater")
sim |
A Sim or a SimCurve object, output from 'sim_power()' or 'sim_curve()'. |
crit |
A string specifying the significance criteria. |
thres |
A number or vector of numbers specifying the thresholds of "significance". |
digits |
An integer for the number of decimal points to display. |
how |
A string, whether to compare the criterion 'greater' or 'lesser' than the threshold. |
A data.frame summary of power for each predictor for each combination of thresholds, sample size, signal-to-noise ratios.