Title: | Synthesize Data Based on Empirical Quantile Functions and Rank Order Matching |
---|---|
Description: | Data is synthesized by inverse transform sampling from the empirical quantile function of each variable, followed by copying the rank order structure from the original dataset. The synthesizer has a tunable parameter that allows moving gradually from realistic but possibly unsafe synthetic data to decorrelated data of lower utility. |
Authors: | Mark van der Loo [aut, cre] |
Maintainer: | Mark van der Loo <[email protected]> |
License: | EUPL |
Version: | 0.4.0 |
Built: | 2025-03-09 07:02:30 UTC |
Source: | CRAN |
Create a function that accepts a non-negative integer `n` and returns synthetic data sampled from the empirical (multivariate) distribution of `x`.
make_synthesizer(x, ...)

## S3 method for class 'numeric'
make_synthesizer(x, ...)

## S3 method for class 'integer'
make_synthesizer(x, ...)

## S3 method for class 'logical'
make_synthesizer(x, ...)

## S3 method for class 'factor'
make_synthesizer(x, ...)

## S3 method for class 'character'
make_synthesizer(x, ...)

## S3 method for class 'ts'
make_synthesizer(x, ...)

## S3 method for class 'data.frame'
make_synthesizer(x, rankcor = 1, ...)
x | |
... | arguments passed to other methods |
rankcor | |
A function accepting a single integer argument: the number of synthesized values or records to return.
Other synthesis:
synthesize()
synth <- make_synthesizer(cars$speed)
synth(10)

synth <- make_synthesizer(iris)
synth(6)
synth(150)
synth(250)
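The package description mentions two ingredients: inverse transform sampling from the empirical quantile function, and copying the rank order of the original data. The following is a minimal sketch of that idea for a single numeric vector, using base R only. The name `sketch_make_synthesizer` is hypothetical, and the choice of `quantile()` type, the tie-breaking rule, and the decision to skip rank matching when `n` differs from `length(x)` are assumptions made for illustration; the package's own implementation may differ.

```r
# Hypothetical helper, not part of the package: returns a function of n
# that draws synthetic values for one numeric vector x.
sketch_make_synthesizer <- function(x) {
  # Ranks of the original values; ties are broken by position (assumption)
  r <- rank(x, ties.method = "first")
  function(n = length(x)) {
    # Inverse transform sampling: feed uniform draws into the
    # empirical quantile function of x
    y <- quantile(x, probs = runif(n), names = FALSE)
    if (n == length(x)) {
      # Rank order matching: the i-th synthetic value receives the
      # same rank as the i-th original value
      y <- sort(y)[r]
    }
    y
  }
}

set.seed(1)
synth <- sketch_make_synthesizer(cars$speed)
synth(10)     # ten synthetic values, no rank matching
y <- synth()  # fifty values, rank-matched to cars$speed
```

Applying the same reordering to every column of a data frame would carry the joint rank structure of the original records over to the synthetic ones, which is what makes the result realistic and, as the description warns, possibly unsafe.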
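The propensity mean squared error is defined as $\frac{1}{N}\sum_{i=1}^{N}(p_i - c)^2$, where $N$ is the sum of the number of synthetic and real records, $p_i$ is the estimated probability (propensity score) that record $i$ is synthetic, and $c$ is the number of synthetic records divided by $N$.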
pmse(synth, real, model = c("lr", "rf"), nrep = NULL)
synth | |
real | |
model | |
nrep | |
[numeric] scalar.
scars <- synthesize(cars)
pmse(scars, cars)
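To make the quantity concrete, here is a rough sketch of a propensity mean squared error computed with a logistic regression in base R. It follows the common definition: stack synthetic and real records, fit a model that predicts whether a record is synthetic, and average the squared distance between the fitted probabilities and the share of synthetic records. The helper name `sketch_pmse`, the indicator column `is_synth`, and the use of all variables as main effects are assumptions for illustration; `pmse()` in the package may specify its models differently.

```r
# Hypothetical helper illustrating the definition; not the package's pmse().
sketch_pmse <- function(synth, real) {
  stacked <- rbind(synth, real)
  # Indicator: 1 for synthetic records, 0 for real records
  stacked$is_synth <- rep(c(1, 0), times = c(nrow(synth), nrow(real)))
  # Propensity model: logistic regression on all variables
  fit <- glm(is_synth ~ ., data = stacked, family = binomial())
  p <- fitted(fit)                    # estimated propensity scores
  c0 <- nrow(synth) / nrow(stacked)   # share of synthetic records
  mean((p - c0)^2)
}

# A bootstrap resample of cars stands in for synthetic data here
set.seed(1)
fake <- cars[sample(nrow(cars), replace = TRUE), ]
sketch_pmse(fake, cars)
```

Values near zero indicate that the model cannot tell synthetic from real records; larger values indicate that the two sets are easier to distinguish.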
Create `n` values or records based on the empirical (multivariate) distribution of `x`. For data frames, the synthetic variables can be decorrelated from the original variables by lowering the value of the `rankcor` parameter.
synthesize(x, n = NROW(x), rankcor = 1)
x | |
n | |
rankcor | |
A data object of the same type and structure as `x`.
The utility of a synthetic variable is lowered by reducing the rank correlation between the real and synthetic data. If `rankcor=1`, the synthetic data will be ordered such that it has the same rank order as the original data. If `rankcor=0`, no such reordering takes place. For values between 0 and 1, blocks of data are randomly selected and randomly permuted iteratively until the rank correlation between original and synthetic data drops below the parameter.
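The iterative permutation described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the package's code: the helper name `sketch_decorrelate`, the `block` and `max_iter` parameters, the choice of a random subset of positions as a "block", and the use of Spearman correlation are all hypothetical choices made here.

```r
# Hypothetical helper: shuffle random blocks of y until its rank
# correlation with x falls below the target value.
sketch_decorrelate <- function(x, y, rankcor, block = 5, max_iter = 10000) {
  n <- length(y)
  stopifnot(length(x) == n, block >= 2, block <= n)
  iter <- 0
  while (cor(x, y, method = "spearman") >= rankcor && iter < max_iter) {
    idx <- sample.int(n, block)   # randomly select a block of positions
    y[idx] <- y[sample(idx)]      # randomly permute the values in the block
    iter <- iter + 1
  }
  y
}

set.seed(1)
x <- cars$dist
y <- sketch_decorrelate(x, x, rankcor = 0.5)
cor(x, y, method = "spearman")
```

Because only positions are shuffled, the marginal distribution of `y` is untouched; only its association with `x` weakens, which matches the trade-off between utility and decorrelation that `rankcor` controls.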
Other synthesis:
make_synthesizer()
synthesize(cars$speed, 10)
synthesize(cars)
synthesize(cars, 25)

s1 <- synthesize(iris, rankcor=1)
s2 <- synthesize(iris, rankcor=0.5)
s3 <- synthesize(iris, rankcor=c("Species"=0.5))

oldpar <- par(mfrow=c(2,2), pch=16, las=1)
plot(Sepal.Length ~ Sepal.Width, data=iris, col=iris$Species, main="Iris")
plot(Sepal.Length ~ Sepal.Width, data=s1, col=s1$Species, main="Synthetic Iris")
plot(Sepal.Length ~ Sepal.Width, data=s2, col=s2$Species, main="Low utility Iris")
plot(Sepal.Length ~ Sepal.Width, data=s3, col=s3$Species, main="Low utility Species")
par(oldpar)