Title: | Synthetic Microdata Generator |
---|---|
Description: | This tool fits a non-parametric Bayesian model called a "hierarchically coupled mixture model with local dependence (HCMM-LD)" to the original microdata in order to generate synthetic microdata for privacy protection. The non-parametric nature of the model lets it capture the joint distribution of the original input data flexibly, so the synthetic data have distributional features similar to those of the input data. The package allows the original input data to have missing values and imputes them from the posterior predictive distribution, so no missing values exist in the synthetic data output. The method builds on the work of Murray and Reiter (2016) <doi:10.1080/01621459.2016.1174132>. |
Authors: | Hang J. Kim [aut, cre], Juhee Lee [aut], Young-Min Kim [aut], Jared Murray [aut] |
Maintainer: | Hang J. Kim <[email protected]> |
License: | GPL (>= 3) |
Version: | 2.1.0 |
Built: | 2024-12-25 07:12:27 UTC |
Source: | CRAN |
Create a model object for multipleSyn.
createModel(data_obj, max_R_S_K = c(30, 50, 20))
data_obj |
data object produced by readData. |
max_R_S_K |
maximum values of the mixture component indices (r, s, k). |
createModel returns an Rcpp_modelobject.
Generate synthetic micro datasets using a hierarchically coupled mixture model with local dependence (HCMM-LD).
multipleSyn(data_obj, model_obj, n_burnin, m, interval_btw_Syn, show_iter = TRUE)

## S3 method for class 'synMicro_object'
print(x, ...)
data_obj |
data object produced by readData. |
model_obj |
model object produced by createModel. |
n_burnin |
size of burn-in. |
m |
number of synthetic micro datasets to be generated. |
interval_btw_Syn |
interval between MCMC iterations for generating synthetic micro datasets. |
show_iter |
logical value. If |
x |
object of class synMicro_object. |
... |
further arguments passed to or from other methods. |
multipleSyn returns a list of the following components:
synt_data |
list of |
comp_mat |
list of matrices of the mixture component indices. |
orig_data |
original dataset. |
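The arguments n_burnin, m, and interval_btw_Syn together determine how long the sampler runs. The sketch below shows one plausible reading of that schedule in base R. It assumes that the burn-in runs first and that one synthetic dataset is drawn after every interval_btw_Syn further iterations; the documentation above does not confirm this, and the names draw_iters and total_iters are hypothetical.

```r
# Assumed schedule (not confirmed by the documentation): burn-in first,
# then one synthetic dataset after every `interval_btw_Syn` iterations.
n_burnin <- 100
m <- 5
interval_btw_Syn <- 50

# Iterations at which a synthetic dataset would be drawn under this assumption
draw_iters <- n_burnin + interval_btw_Syn * seq_len(m)

# Total number of MCMC iterations under this assumption
total_iters <- n_burnin + m * interval_btw_Syn

draw_iters
total_iters
```

Under this reading, a larger interval_btw_Syn spaces the synthetic datasets further apart in the chain, which would typically reduce the dependence between them at the cost of a longer run.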
Murray, J. S. and Reiter, J. P. (2016). Multiple imputation of missing categorical and continuous values via Bayesian mixture models with local dependence. Journal of the American Statistical Association, 111(516), pp.1466-1479.
readData, createModel, plot.synMicro_object
## preparing to generate synthetic datasets
dat_obj <- readData(Y_input = iris[,1:4],
                    X_input = data.frame(Species = iris[,5]))
mod_obj <- createModel(dat_obj, max_R_S_K=c(30,50,20))

## generating synthetic datasets
res_obj <- multipleSyn(dat_obj, mod_obj, n_burnin = 100, m = 5,
                       interval_btw_Syn = 50, show_iter = FALSE)

print(res_obj)
The plot method for synMicro_object objects.
This method compares synthetic datasets with original input data.
## S3 method for class 'synMicro_object'
plot(x, vars, plot_num = NULL, ...)
x |
object of class synMicro_object. |
vars |
vector of names or indices of the variables to compare. |
plot_num |
if |
... |
other parameters to be passed through to plotting functions. |
The plot method takes the selected variables and draws a graph. The type of graph depends on the number and type of the selected variables:
A single continuous variable produces a box plot of that variable.
Two or more continuous variables produce pairwise scatter plots for each pair of selected variables.
Categorical variables produce a bar plot of each selected variable.
If plot_num = NULL, the function outputs plots for all generated synthetic datasets.
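The rules above can be summarized as a small decision function. The following base R sketch is only an illustration of the described behaviour: choose_plot_type is a hypothetical helper, not a function of the package, and the handling of mixed continuous and categorical selections is not described in the text, so it is left as a placeholder.

```r
# Hypothetical helper mirroring the documented rules; not part of the package.
choose_plot_type <- function(data, vars) {
  sel <- data[, vars, drop = FALSE]
  # A variable is treated as continuous when it is numeric
  is_cont <- vapply(sel, is.numeric, logical(1))
  if (all(is_cont) && length(vars) == 1) {
    "box plot"
  } else if (all(is_cont)) {
    "pairwise scatter plots"
  } else if (!any(is_cont)) {
    "bar plot"
  } else {
    "mixed selection (behaviour not described here)"
  }
}

choose_plot_type(iris, "Sepal.Length")  # one continuous variable
choose_plot_type(iris, c(1, 2))         # two continuous variables, by index
choose_plot_type(iris, "Species")       # one categorical variable
```

As in the plot method itself, vars may be given either as variable names or as column indices.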
## preparing to generate synthetic datasets
dat_obj <- readData(Y_input = iris[,1:4],
                    X_input = data.frame(Species = iris[,5]))
mod_obj <- createModel(dat_obj, max_R_S_K=c(30,50,20))

## generating synthetic datasets
res_obj <- multipleSyn(dat_obj, mod_obj, n_burnin = 100, m = 2,
                       interval_btw_Syn = 50, show_iter = FALSE)

print(res_obj)

## plotting synthetic datasets
### box plot
par(mfrow=c(3,2))
plot(res_obj, vars = "Sepal.Length") ## variable names

### pairwise scatter plot
plot(res_obj, vars = c(1,2)) ## or variable index

### bar plot
plot(res_obj, vars = "Species")

### specify the synthetic dataset
par(mfrow=c(1,1))
plot(res_obj, vars = "Petal.Length", plot_num=1)
"Rcpp_modelobject"
This class implements a joint modeling approach to generate synthetic microdata with continuous and categorical variables, possibly with missing values. The method builds on the work of Murray and Reiter (2016).
Rcpp_modelobject should be created with createModel. Please see the example below.
Class "C++Object", directly.
data_obj
input dataset generated from readData.
multipleSyn
generates synthetic micro datasets.
Murray, J. S. and Reiter, J. P. (2016). Multiple imputation of missing categorical and continuous values via Bayesian mixture models with local dependence. Journal of the American Statistical Association, 111(516), pp.1466-1479.
## preparing to generate synthetic datasets
dat_obj <- readData(Y_input = iris[,1:4],
                    X_input = data.frame(Species = iris[,5]))
mod_obj <- createModel(dat_obj, max_R_S_K=c(30,50,20))

## generating synthetic datasets
res_obj <- multipleSyn(dat_obj, mod_obj, n_burnin = 100, m = 5,
                       interval_btw_Syn = 50, show_iter = FALSE)

print(res_obj)
Read the original input dataset from which the model learns to generate synthetic data. The package allows the input data to have missing values and imputes them from the posterior predictive distribution, so no missing values exist in the synthetic data output.
readData(Y_input, X_input, RandomSeed = 99)

## S3 method for class 'readData_passed'
print(x, ...)
Y_input |
data.frame consisting of continuous variables of the original data.
It should consist only of |
X_input |
data.frame consisting of categorical variables of the original data.
It should consist only of |
RandomSeed |
random seed number. |
x |
object of class readData_passed. |
... |
further arguments passed to or from other methods. |
readData returns an object of class "readData_passed".
An object of class "readData_passed" is a list containing the following components:
n_sample |
number of records in the input dataset. |
p_Y |
number of continuous variables. |
Y_mat_std |
matrix with standardized values of |
mean_Y_input |
mean vectors of original |
sd_Y_input |
standard deviation vectors of original |
NA_Y_mat |
matrix indicating missing values in |
p_X |
number of categorical variables. |
D_l_vec |
numbers of levels of each categorical variable. |
X_mat_std |
matrix with the numeric-transformed values of |
levels_X_input |
list of levels of each categorical variable. |
NA_X_mat |
matrix indicating missing values in |
var_names |
list containing variable names of |
orig_data |
original dataset. |
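The components listed above describe a standardize-and-encode preprocessing step. The base R sketch below shows how quantities of this kind can be computed. It is an illustration under stated assumptions and not the package's internal code: the variable names (mean_Y, sd_Y, Y_std, NA_Y, X_codes, levels_X, D_l) are hypothetical, and the exact conventions used by readData (for example, how missing values are stored) may differ.

```r
# Illustrative sketch only; this is not the package's internal code.
Y_input <- iris[, 1:4]
Y_input[c(3, 10), "Sepal.Width"] <- NA          # inject two missing values
X_input <- data.frame(Species = iris[, 5])

# Column-wise means and standard deviations, ignoring missing values
mean_Y <- colMeans(Y_input, na.rm = TRUE)
sd_Y   <- apply(Y_input, 2, sd, na.rm = TRUE)

# Standardized continuous values and a missing-value indicator matrix
Y_std <- scale(as.matrix(Y_input), center = mean_Y, scale = sd_Y)
NA_Y  <- is.na(as.matrix(Y_input))

# Integer codes, level labels, and level counts for categorical variables
X_codes  <- sapply(X_input, function(x) as.integer(factor(x)))
levels_X <- lapply(X_input, function(x) levels(factor(x)))
D_l      <- sapply(levels_X, length)

# Standardized values can be mapped back to the original scale
Y_back <- sweep(sweep(Y_std, 2, sd_Y, "*"), 2, mean_Y, "+")
```

Storing the means, standard deviations, and level labels alongside the transformed matrices is what makes it possible to report synthetic values on the original scale and with the original category labels.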
summary method for class "synMicro_object".
## S3 method for class 'synMicro_object'
summary(object, max_print = 4, ...)
object |
object of class synMicro_object. |
max_print |
maximum number of synthetic datasets for which summaries are printed. |
... |
other parameters to be passed through to other functions. |
summary reports the synthesis results for each variable. It compares the summary statistics of each variable for the original dataset (Orig.) and the synthetic datasets (synt.#), together with their average across synthetic datasets (Q_bar) and the between-dataset variance (B_m).
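To make Q_bar and B_m concrete, the base R sketch below computes them for a single statistic. It assumes the usual combining-rule definitions from the multiple imputation literature, namely the mean of the m per-dataset estimates and their sample variance with denominator m - 1; the documentation above does not spell out the formulas. The helper combine_estimates is hypothetical, and bootstrap resamples stand in for synthetic datasets so that the example runs without fitting a model.

```r
# Sketch assuming standard combining-rule definitions; hypothetical helper.
combine_estimates <- function(q) {
  m     <- length(q)
  Q_bar <- mean(q)                       # average of the m point estimates
  B_m   <- sum((q - Q_bar)^2) / (m - 1)  # between-dataset variance
  c(Q_bar = Q_bar, B_m = B_m)
}

# Toy stand-ins for synthetic datasets: bootstrap resamples of one variable
set.seed(1)
synt_list <- lapply(1:5, function(i) sample(iris$Sepal.Length, replace = TRUE))

# Per-dataset estimate of the mean, then the combined quantities
q <- sapply(synt_list, mean)
combine_estimates(q)
```

A small B_m relative to the scale of Q_bar indicates that the synthetic datasets agree closely on the statistic in question.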
## preparing to generate synthetic datasets
dat_obj <- readData(Y_input = iris[,1:4],
                    X_input = data.frame(Species = iris[,5]))
mod_obj <- createModel(dat_obj, max_R_S_K=c(30,50,20))

## generating synthetic datasets
res_obj <- multipleSyn(dat_obj, mod_obj, n_burnin = 100, m = 2,
                       interval_btw_Syn = 50, show_iter = FALSE)

summary(res_obj)