Package 'synMicrodata'

Title: Synthetic Microdata Generator
Description: This tool fits a non-parametric Bayesian model called a "hierarchically coupled mixture model with local dependence (HCMM-LD)" to the original microdata in order to generate synthetic microdata for privacy protection. The non-parametric feature of the adopted model is useful for capturing the joint distribution of the original input data in a highly flexible manner, leading to the generation of synthetic data whose distributional features are similar to those of the input data. The package allows the original input data to have missing values and imputes them with the posterior predictive distribution, so no missing values exist in the synthetic data output. The method builds on the work of Murray and Reiter (2016) <doi:10.1080/01621459.2016.1174132>.
Authors: Hang J. Kim [aut, cre], Juhee Lee [aut], Young-Min Kim [aut], Jared Murray [aut]
Maintainer: Hang J. Kim <[email protected]>
License: GPL (>= 3)
Version: 2.1.0
Built: 2024-12-25 07:12:27 UTC
Source: CRAN

Help Index


Create a model object

Description

Create a model object for multipleSyn.

Usage

createModel(data_obj, max_R_S_K = c(30, 50, 20))

Arguments

data_obj

data object produced by readData

max_R_S_K

vector of the maximum numbers of mixture components, one for each of the component indices (r, s, k).

Value

createModel returns an Rcpp_modelobject object.

See Also

multipleSyn, readData


Generate synthetic micro datasets

Description

Generate synthetic micro datasets using a hierarchically coupled mixture model with local dependence (HCMM-LD).

Usage

multipleSyn(data_obj, model_obj, n_burnin, m, interval_btw_Syn, show_iter = TRUE)

## S3 method for class 'synMicro_object'
print(x, ...)

Arguments

data_obj

data object produced by readData.

model_obj

model object produced by createModel.

n_burnin

size of burn-in.

m

number of synthetic micro datasets to be generated.

interval_btw_Syn

interval between MCMC iterations for generating synthetic micro datasets.

show_iter

logical value. If TRUE, multipleSyn prints the history of the (r, s, k) components to the console.

x

object of class synMicro_object; a result of a call to multipleSyn().

...

further arguments passed to or from other methods.

Value

multipleSyn returns a list of the following components:

synt_data

list of m synthetic micro datasets.

comp_mat

list of matrices of the mixture component indices.

orig_data

original dataset.

References

Murray, J. S. and Reiter, J. P. (2016). Multiple imputation of missing categorical and continuous values via Bayesian mixture models with local dependence. Journal of the American Statistical Association, 111(516), pp.1466-1479.

See Also

readData, createModel, plot.synMicro_object

Examples

## preparing to generate synthetic datasets
dat_obj <- readData(Y_input = iris[,1:4],
                    X_input = data.frame(Species = iris[,5]))
mod_obj <- createModel(dat_obj, max_R_S_K=c(30,50,20))

## generating synthetic datasets
res_obj <- multipleSyn(dat_obj, mod_obj, n_burnin = 100, m = 5, 
                       interval_btw_Syn = 50, show_iter = FALSE)

print(res_obj)

Plot Comparing Synthetic Data with Original Input Data

Description

The plot method for objects of class synMicro_object. This method compares the synthetic datasets with the original input data.

Usage

## S3 method for class 'synMicro_object'
plot(x, vars, plot_num = NULL, ...)

Arguments

x

synMicro_object object.

vars

vector of names or indices of the variables to compare.

plot_num

if plot_num is a number, returns a plot of the corresponding synthetic dataset.

...

other parameters to be passed through to plotting functions.

Details

The plot method takes the selected variables and draws a graph. The type of graph produced depends on the type and number of the selected variables.

  • Selecting a single continuous variable produces a box plot of that variable.

  • Selecting two or more continuous variables produces pairwise scatter plots for each pair of selected variables.

  • Selecting categorical variables produces a bar plot of each selected variable.

If plot_num = NULL, the function outputs plots for all generated synthetic datasets.

See Also

multipleSyn

Examples

## preparing to generate synthetic datasets
dat_obj <- readData(Y_input = iris[,1:4],
                    X_input = data.frame(Species = iris[,5]))
mod_obj <- createModel(dat_obj, max_R_S_K=c(30,50,20))

## generating synthetic datasets
res_obj <- multipleSyn(dat_obj, mod_obj, n_burnin = 100, m = 2, 
                       interval_btw_Syn = 50, show_iter = FALSE)

print(res_obj)

## plotting synthetic datasets
### box plot
par(mfrow=c(3,2))
plot(res_obj, vars = "Sepal.Length") ## variable names


### pairwise scatter plot
plot(res_obj, vars = c(1,2)) ## or variable index


### bar plot
plot(res_obj, vars = "Species")


### specify the synthetic dataset
par(mfrow=c(1,1))
plot(res_obj, vars = "Petal.Length", plot_num=1)

Class "Rcpp_modelobject"

Description

This class implements a joint modeling approach to generate synthetic microdata with continuous and categorical variables with possibly missing values. The method builds on the work of Murray and Reiter (2016).

Details

Rcpp_modelobject should be created with createModel. Please see the example below.

Extends

Class "C++Object", directly.

Fields

  • data_obj input dataset generated from readData.

Methods

  • multipleSyn generates synthetic micro datasets.

References

Murray, J. S. and Reiter, J. P. (2016). Multiple imputation of missing categorical and continuous values via Bayesian mixture models with local dependence. Journal of the American Statistical Association, 111(516), pp.1466-1479.

See Also

Rcpp, C++Object-class

Examples

## preparing to generate synthetic datasets
dat_obj <- readData(Y_input = iris[,1:4],
                    X_input = data.frame(Species = iris[,5]))
mod_obj <- createModel(dat_obj, max_R_S_K=c(30,50,20))

## generating synthetic datasets
res_obj <- multipleSyn(dat_obj, mod_obj, n_burnin = 100, m = 5, 
                       interval_btw_Syn = 50, show_iter = FALSE)

print(res_obj)

Read the original datasets

Description

Read the original input datasets from which the model learns in order to generate synthetic data. The package allows the input data to have missing values and imputes them with the posterior predictive distribution, so no missing values exist in the synthetic data output.

Usage

readData(Y_input, X_input, RandomSeed = 99)

## S3 method for class 'readData_passed'
print(x, ...)

Arguments

Y_input

data.frame consisting of the continuous variables of the original data. It should contain only numeric columns.

X_input

data.frame consisting of the categorical variables of the original data. It should contain only factor columns.

RandomSeed

random seed number.

x

object of class readData_passed; a result of a call to readData().

...

further arguments passed to or from other methods.
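
The split into Y_input and X_input can be prepared with base R before calling readData. The following is a minimal sketch under stated assumptions: the data frame raw and its column names are hypothetical, and converting integer columns to double and all non-numeric columns to factor is one reasonable reading of the requirements above, not a documented rule of the package.

```r
## Hypothetical mixed-type data with missing values (illustrative only)
raw <- data.frame(
  income = c(52.1, NA, 61.3, 48.0),
  age    = c(34L, 45L, NA, 29L),
  region = c("north", "south", NA, "north"),
  owner  = c(TRUE, FALSE, TRUE, NA),
  stringsAsFactors = FALSE
)

## Continuous block: keep the numeric columns and store them as double
is_num  <- vapply(raw, is.numeric, logical(1))
Y_input <- as.data.frame(lapply(raw[is_num], as.numeric))

## Categorical block: convert every remaining column to factor
## (assumption: character and logical columns are treated as categorical)
X_input <- as.data.frame(lapply(raw[!is_num], factor))

str(Y_input)
str(X_input)

## The two blocks can then be passed on, for example:
## dat_obj <- readData(Y_input = Y_input, X_input = X_input)
```

Missing values are left as NA in both blocks, since readData accepts them.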

Value

readData returns an object of "readData_passed" class.

An object of class "readData_passed" is a list containing the following components:

n_sample

number of records in the input dataset.

p_Y

number of continuous variables.

Y_mat_std

matrix with standardized values of Y_input, with mean 0 and standard deviation 1.

mean_Y_input

vector of the means of the original Y_input.

sd_Y_input

vector of the standard deviations of the original Y_input.

NA_Y_mat

matrix indicating missing values in Y_input.

p_X

number of categorical variables.

D_l_vec

numbers of levels of each categorical variable.

X_mat_std

matrix with the numeric-transformed values of X_input.

levels_X_input

list of levels of each categorical variable.

NA_X_mat

matrix indicating missing values in X_input.

var_names

list containing variable names of X_input and Y_input.

orig_data

original dataset.
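
To make the meaning of these components concrete, the following base R sketch computes analogous quantities by hand. It is an illustration only and not the package's implementation: the local names (mean_Y, sd_Y, Y_std, NA_Y, D_l, X_num) are hypothetical, and the use of the sample standard deviation with missing values dropped is an assumption.

```r
## Continuous and categorical blocks, with two values set to NA for illustration
Y_input <- iris[, 1:4]
Y_input[c(3, 10), "Sepal.Width"] <- NA
X_input <- data.frame(Species = iris[, 5])

## Analogues of mean_Y_input and sd_Y_input (assumption: NAs are ignored)
mean_Y <- colMeans(Y_input, na.rm = TRUE)
sd_Y   <- apply(Y_input, 2, sd, na.rm = TRUE)

## Analogue of Y_mat_std: each column centred to mean 0 and scaled to sd 1
Y_std <- scale(as.matrix(Y_input), center = mean_Y, scale = sd_Y)

## Analogue of NA_Y_mat: TRUE where a value is missing
NA_Y <- is.na(as.matrix(Y_input))

## Analogues of D_l_vec and the numeric-transformed categorical matrix
D_l   <- vapply(X_input, nlevels, integer(1))
X_num <- data.matrix(X_input)   # factors replaced by their integer codes

## Standardized values map back to the original scale via the stored mean and sd
Y_back <- sweep(sweep(Y_std, 2, sd_Y, `*`), 2, mean_Y, `+`)
```

Storing the means and standard deviations alongside the standardized matrix is what allows values generated on the standardized scale to be mapped back to the original units.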

See Also

multipleSyn, createModel


Summarizing synthesis results

Description

summary method for class "synMicro_object".

Usage

## S3 method for class 'synMicro_object'
summary(object, max_print = 4, ...)

Arguments

object

synMicro_object object.

max_print

maximum number of synthetic datasets for which summaries are printed.

...

other parameters to be passed through to other functions.

Details

summary reports the synthesis results for each variable. It compares the summary statistics of each variable for the original dataset (Orig.) and the synthetic datasets (synt.#), together with their average (Q_bar) and the between-dataset variance (B_m).
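
As an illustration of the two combined quantities, the sketch below computes an average and a between-dataset variance for the column means of several datasets. It assumes the usual combining-rule definitions, Q_bar = (1/m) * sum(q_i) and B_m = sum((q_i - Q_bar)^2) / (m - 1), which this page does not state explicitly. The list synt_list is a hypothetical stand-in built from bootstrap resamples of iris, not output of multipleSyn.

```r
## Hypothetical stand-ins for m synthetic datasets (bootstrap resamples, illustration only)
set.seed(1)
m <- 5
synt_list <- lapply(seq_len(m), function(i) {
  iris[sample(nrow(iris), replace = TRUE), 1:4]
})

## q_i: the statistic of interest (here the column means) from each dataset,
## arranged as an m x p matrix
q_mat <- t(sapply(synt_list, colMeans))

## Q_bar: average of the m estimates, per variable
Q_bar <- colMeans(q_mat)

## B_m: variance of the m estimates around Q_bar, per variable (denominator m - 1)
B_m <- apply(q_mat, 2, var)

rbind(Orig. = colMeans(iris[, 1:4]), Q_bar = Q_bar, B_m = B_m)
```

A small B_m relative to the scale of the statistic indicates that the datasets agree closely with one another on that statistic.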

See Also

multipleSyn

Examples

## preparing to generate synthetic datasets
dat_obj <- readData(Y_input = iris[,1:4],
                    X_input = data.frame(Species = iris[,5]))
mod_obj <- createModel(dat_obj, max_R_S_K=c(30,50,20))

## generating synthetic datasets
res_obj <- multipleSyn(dat_obj, mod_obj, n_burnin = 100, m = 2, 
                       interval_btw_Syn = 50, show_iter = FALSE)

summary(res_obj)