Statistical agencies and data custodians face a fundamental challenge: releasing data that are useful for analysis while protecting individual privacy. Two main approaches exist: (i) traditional anonymization methods such as perturbation, suppression, and generalization (Hundepool et al. 2012; Templ 2017), and (ii) synthetic data generation via statistical models (Nowok et al. 2016; Templ et al. 2017). Both require rigorous evaluation of disclosure risk and data utility — yet existing tools assess these two dimensions in isolation or cover only a narrow subset of the relevant metrics.
The R package sdcMicro (Templ et al. 2015) provides frequency-based risk estimation but limited utility assessment. synthpop (Nowok et al. 2016) offers the CAP disclosure metric and propensity-score utility but not distance-based or membership inference methods. In Python, SDMetrics covers distance-based metrics and Anonymeter implements GDPR failure criteria, but neither addresses attribution risk. No existing R package provides a unified framework that spans all risk families and supports multivariate comparison of multiple synthesis approaches; the most comprehensive evaluation suites (the Python libraries synthcity and SynthEval) sit outside the R ecosystem.
The riskutility package for R (R Core Team 2025) fills this gap. It implements
over 30 risk measures spanning six paradigms — frequency-based privacy
models, attribution-based (CAP), ML-based (RAPID), distance-based,
record linkage, and membership inference — alongside more than a dozen
utility functions covering global, distributional, structural, and
predictive assessment. All functions share a consistent S3 API with
print(), summary(), and plot()
methods and accept a synth_pair container that bundles data
and metadata.
The need for rigorous evaluation is well established in the
statistical disclosure control literature and has gained urgency as
synthetic data enters regulatory frameworks. The European Article 29
Working Party (Article 29 Data Protection Working
Party 2014) identifies three criteria for anonymization: singling
out, linkability, and inference. riskutility measures
the first two directly (singling_out(),
linkability()); the inference criterion is approached
through the attribution-based CAP and RAPID measures rather than a
dedicated inference-attack function. Importantly, synthetic data does
not automatically satisfy these criteria (Stadler
et al. 2022), making empirical assessment essential.
The distinction between general and specific utility (Snoke et al. 2018) is important: general utility measures (propensity scores, distributional distances) assess overall distributional fidelity, while specific utility measures (regression fidelity, TSTR) assess whether particular analyses yield similar results on original and synthetic data. A thorough evaluation should include both.
riskutility provides empirical privacy metrics rather than formal mathematical guarantees. Unlike differential privacy, which provides worst-case bounds, our metrics quantify observed risk on concrete datasets. This approach is appropriate for practical SDC workflows where the protection mechanism is not differentially private. For differentially private synthesizers, formal epsilon budgets should be used alongside empirical evaluation.
Quick start. A comprehensive risk assessment requires just a few lines:
library(riskutility)
pair <- synth_pair(original, synthetic,
key_vars = c("age", "sex", "region"),
target_var = "income")
report <- disclosure_report(pair)
print(report)
This single call computes attribution-based risk (DCAP, TCAP, WEAP,
DiSCO, RAPID), distance-based risk (DCR, NNDR, IMS, dRisk, hitting
rate), privacy models (\(k\)-anonymity,
\(l\)-diversity, \(t\)-closeness), and membership inference
measures (singling out, linkability, NNAA), producing a pass/fail
summary across all families. For comparing multiple synthesis
approaches, rumap() computes risk and utility measures
simultaneously, normalizes them to a common scale, and identifies
Pareto-optimal methods (Section @ref(sec-case-rumap)).
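The notion of Pareto optimality used here can be illustrated without the package. The following base-R sketch is not rumap()'s implementation: the scores are made-up numbers and pareto_front() is a hypothetical helper. It treats a method as Pareto-optimal when no other method is at least as good on both risk and utility and strictly better on one of them.

```r
# Hypothetical, already-normalized scores in [0, 1] for three methods
# (illustrative numbers, not package output).
scores <- data.frame(
  method  = c("A", "B", "C"),
  risk    = c(0.10, 0.30, 0.35),   # higher = more disclosure risk
  utility = c(0.60, 0.90, 0.80)    # higher = better utility
)

# Hypothetical helper: TRUE for methods that no other method dominates.
pareto_front <- function(risk, utility) {
  vapply(seq_along(risk), function(i) {
    dominates_i <- risk <= risk[i] & utility >= utility[i] &
      (risk < risk[i] | utility > utility[i])
    !any(dominates_i)
  }, logical(1))
}

scores$pareto <- pareto_front(scores$risk, scores$utility)
print(scores)
```

In this toy example, method C is dominated by B (lower risk and higher utility), while A and B represent different risk-utility trade-offs and both remain on the front.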
Paper overview. Section @ref(sec-background)
introduces the disclosure threat taxonomy and reviews related software.
Section @ref(sec-design) describes the package architecture and the
synth_pair container. Sections @ref(sec-risk) and
@ref(sec-utility) present the risk and utility measures with
mathematical formulations and worked examples. Section
@ref(sec-comprehensive) demonstrates the complete practitioner workflow
on a realistic case study, comparing three synthesis approaches with
disclosure_report() and rumap(). Section
@ref(sec-discussion) discusses limitations, remediation strategies, and
future work. Section @ref(sec-computational) provides computational
details.
| Category | Functions | Paradigm | Applicable to |
|---|---|---|---|
| Privacy models | 8 | Equivalence class | Both |
| Attribution (CAP) | 4 | Matching | Both |
| ML-based (RAPID) | 4 | Prediction | Both |
| Distance-based | 6 | Nearest neighbor | Both |
| Record linkage | 1 (8 methods) | Linkage | Both |
| Membership inference | 6 | Attack simulation | Both |
| Utility measures | 15 | Various | Both |
| Frameworks | 3 | Composite | Both |
All analyses in this paper were conducted in R (R Core Team 2025).
Every risk measure in riskutility addresses one or more of four disclosure threats. The classical SDC taxonomy (Hundepool et al. 2012; Templ 2017) recognizes three types: identity, attribute, and inferential disclosure. We extend this to four threats by separating membership disclosure (from the ML privacy literature, Shokri et al. (2017)) and memorization (relevant to generative models) as distinct categories, reflecting the broader scope of modern synthetic data evaluation:
Identity disclosure: recordLinkage(), kanonymity(), individual_risk(), hitting_rate().
Attribute disclosure: dcap(), tcap(), weap(), disco(), rapid().
Membership disclosure: mia_classifier(), domias(), nnaa(), singling_out(), delta_presence().
Memorization: ims(), dcr(), nndr().
| Threat | Definition | Key measures |
|---|---|---|
| Identity | Attacker links record to individual | recordLinkage, kanonymity, individual_risk |
| Attribute | Attacker learns sensitive value via linkage | dcap, tcap, weap, disco, rapid |
| Membership | Attacker determines if individual is in dataset | mia_classifier, domias, nnaa, singling_out |
| Memorization | Generator reproduces training records | ims, dcr, nndr |
riskutility implements six risk assessment paradigms, each addressing different threats from the taxonomy above. We summarize each here; full definitions and worked examples are in Section @ref(sec-risk).
Frequency-based (privacy models). These methods assess privacy properties of a single dataset based on quasi-identifier frequencies. \(k\)-Anonymity (Samarati and Sweeney 1998) requires each combination of quasi-identifiers to appear at least \(k\) times. \(l\)-Diversity (Machanavajjhala et al. 2007) and \(t\)-closeness (Li et al. 2007) progressively strengthen the protection of sensitive attributes within equivalence classes. See Section @ref(sec-privacy-models).
Attribution-based (CAP family). The Correct Attribution Probability framework (Taub et al. 2018) measures whether an attacker can infer sensitive attribute values by matching quasi-identifiers between original and released data. DCAP provides an aggregate measure; TCAP gives per-record risk scores. See Section @ref(sec-cap).
ML-based (RAPID). Risk of Attribute Prediction-Induced Disclosure (Thees et al. 2026) trains predictive models on the released data and evaluates them on the original. Accurate predictions indicate information leakage. RAPID captures complex non-linear relationships and variable interactions that the CAP matching approach may miss. See Section @ref(sec-rapid).
Distance-based. The holdout method compares distances from synthetic records to training data vs. holdout data. If synthetic records are systematically closer to training records, the generator has memorized rather than generalized. However, Yao et al. (2025) demonstrate that passing distance-based tests does not guarantee privacy (the “DCR Delusion”). See Section @ref(sec-distance).
Record linkage. Directly simulates a re-identification attack using one of eight linkage methods, among them deterministic (Gower distance), probabilistic (Fellegi-Sunter), PRAM-aware, predictive (propensity-score), and random forest linkage (Fellegi and Sunter 1969; Domingo-Ferrer and Torra 2003). See Section @ref(sec-recordlinkage).
Membership inference. Shadow model attacks (Shokri et al. 2017), density-based detection (Breugel et al. 2023), and GDPR failure criteria (Stadler et al. 2022) assess whether an attacker can determine if a specific individual’s data was used during synthesis. See Section @ref(sec-membership).
Utility measures quantify how well the released data preserves the statistical properties of the original. We organize them into four groups (details in Section @ref(sec-utility)):
Global utility measures use propensity scores to quantify how distinguishable the released data is from the original (Woo et al. 2009; Snoke et al. 2018). A single number summarizes overall data quality.
Distributional utility measures (Wasserstein, Hellinger, KS test, energy distance, MMD) compare marginal or joint distributions, identifying which specific variables or relationships are poorly reproduced.
Structural utility measures (copula fidelity, contingency fidelity, PCA comparison, correlation matrix comparison) assess whether multivariate relationships — the correlations and interactions that analysts depend on — are preserved.
Predictive utility measures (TSTR, regression fidelity, feature importance stability) test whether models trained on released data generalize to real-world predictions.
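To make the propensity-score idea behind the global measures concrete, here is a minimal base-R sketch of the propensity mean squared error (pMSE). It uses simulated data and a plain logistic regression as the propensity model; the variable names are illustrative and the package's own utility functions may use different models and defaults.

```r
set.seed(1)
n_toy <- 300
orig_toy <- data.frame(x1 = rnorm(n_toy), x2 = rnorm(n_toy))
syn_toy  <- data.frame(x1 = rnorm(n_toy), x2 = rnorm(n_toy))  # same distribution

# Stack both datasets and label the synthetic rows.
stacked <- rbind(cbind(orig_toy, is_syn = 0), cbind(syn_toy, is_syn = 1))

# Propensity model: how well can a logistic regression tell them apart?
fit <- glm(is_syn ~ x1 + x2, data = stacked, family = binomial())
p   <- fitted(fit)

share_syn <- mean(stacked$is_syn)       # 0.5 for equal-sized datasets
pmse_val  <- mean((p - share_syn)^2)    # 0 = indistinguishable
pmse_val
```

Because both toy datasets come from the same distribution, the fitted propensities stay close to 0.5 and the pMSE is close to zero; a synthesizer that distorts the data would push it toward its maximum of 0.25.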
Existing R packages for statistical disclosure control — sdcMicro (Templ et al. 2015), synthpop (Nowok et al. 2016), and simPop (Templ et al. 2017) — focus primarily on applying protection methods, with risk and utility assessment provided as secondary features. riskutility inverts this emphasis: it is a dedicated evaluation package, designed to be used after protection has been applied, regardless of which tool generated the released data.
The package follows four design principles:
Consistent S3 architecture. Every major function
returns a typed S3 object with print(),
summary(), and plot() methods. We chose S3
over S4 classes for three reasons: lighter memory footprint, simpler
method dispatch for the typical R user, and easier interoperability with
data.table and ggplot2 objects. All risk/utility classes follow the same
pattern, making the API predictable once one class is learned.
Direction conventions. Risk measures are
oriented so that higher values always indicate higher disclosure risk.
Utility measures are, as a rule, oriented so that higher values indicate
higher utility (better preservation of statistical properties). The
exceptions are measures that naturally use a “lower is better” scale
(e.g., pMSE, Wasserstein distance); their interpretation is noted in the
documentation, and rumap() applies the necessary direction
transformation when normalizing to \([0, 1]\).
Minimal core dependencies. Frequency-based
privacy models, CAP metrics, and distance-based measures require only
base R, data.table, and ggplot2. ML-based methods require optional
packages — ranger for random forests in RAPID, xgboost for gradient
boosting, and rpart for classification trees. These are loaded
conditionally via requireNamespace() and produce
informative error messages when absent.
Integration over competition. Rather than
reimplementing synthesis or anonymization, the package provides
from_synthpop(), from_simPop(), and
from_sdcMicro() constructors that extract original and
released data from objects created by these packages, wrapping them in
the synth_pair container (Section @ref(sec-synth-pair)).
This ensures users can evaluate any protection method with a single
consistent interface.
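The direction convention can be sketched in a few lines of base R. The example below assumes a simple min-max scaling with a flip for "lower is better" measures; normalize_measure() is a hypothetical helper, and the actual transformation applied by rumap() may differ.

```r
# Hypothetical helper: min-max scale to [0, 1], flipping measures where
# lower raw values are better so that 1 always means "best".
normalize_measure <- function(x, higher_is_better = TRUE) {
  rng <- range(x)
  if (rng[1] == rng[2]) return(rep(0.5, length(x)))  # constant input: no ranking
  z <- (x - rng[1]) / (rng[2] - rng[1])
  if (higher_is_better) z else 1 - z
}

# Illustrative pMSE values for three methods; lower is better.
pmse_raw <- c(m1 = 0.02, m2 = 0.10, m3 = 0.06)
normalize_measure(pmse_raw, higher_is_better = FALSE)
```

After the flip, the method with the smallest pMSE (m1) receives the highest normalized utility, so it can be compared directly with measures that are natively "higher is better".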
The central data structure in riskutility is the
synth_pair object, which bundles original data, released
data, variable roles, and metadata into a single container:
pair <- synth_pair(original, synthetic,
key_vars = c("age", "gender", "region"),
target_var = "income",
holdout = holdout_data)
The constructor stores the original and synthetic data frames
alongside their dimensions, automatically detects categorical
(cat_vars) and numeric (num_vars) columns, and
retains user-specified quasi-identifiers (key_vars),
sensitive attribute (target_var), and optional holdout
data. This metadata eliminates a common source of error: specifying
different key_vars for different risk measures on the same
dataset.
Once constructed, every risk and utility function in the package
accepts a synth_pair object as its first argument via S3
dispatch.
Every exported risk/utility class follows an identical three-part pattern:
# 1. Two equivalent calling conventions:
result <- dcap(pair) # synth_pair method
result <- dcap(X, Y, key_vars = ..., target_var = ...) # default method
# 2. Inspection:
print(result) # One-screen summary with key statistic
s <- summary(result) # Detailed statistics (returns summary.dcap)
print(s) # Formatted multi-line output
# 3. Visualization:
plot(result, which = 1) # Plot type 1
plot(result, which = 1:2) # Multiple plot types
The generic function dispatches via UseMethod(). The
synth_pair method extracts original,
synthetic, key_vars, and
target_var from the container and delegates to the default
method, which performs the actual computation. The return object is a
list with a class attribute (e.g., "dcap"). The
summary() method returns a typed summary object (e.g.,
"summary.dcap") with its own print() method,
separating computation from display.
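The dispatch pattern described above can be reproduced with a toy class in base R. Everything in this sketch is hypothetical (toy_risk, toy_pair, and the "share of matched keys" statistic are stand-ins chosen for brevity); it only mirrors the structure of a generic, a container method that delegates, a default method that computes, and a typed summary.

```r
# Toy generic mirroring the dispatch pattern; all names are hypothetical.
toy_risk <- function(x, ...) UseMethod("toy_risk")

# Default method: does the computation on two plain data frames.
toy_risk.default <- function(x, y, key_vars, ...) {
  key_x <- do.call(paste, c(x[key_vars], sep = "\r"))
  key_y <- do.call(paste, c(y[key_vars], sep = "\r"))
  res <- list(share_matched = mean(key_x %in% key_y), key_vars = key_vars)
  class(res) <- "toy_risk"
  res
}

# Container method: unpacks the stored roles and delegates.
toy_risk.toy_pair <- function(x, ...) {
  toy_risk.default(x$original, x$synthetic, key_vars = x$key_vars, ...)
}

print.toy_risk <- function(x, ...) {
  cat("Share of original keys found in released data:",
      format(x$share_matched, digits = 3), "\n")
  invisible(x)
}

# summary() returns a typed object; its print method handles display.
summary.toy_risk <- function(object, ...) {
  out <- list(share_matched = object$share_matched,
              n_keys = length(object$key_vars))
  class(out) <- "summary.toy_risk"
  out
}

print.summary.toy_risk <- function(x, ...) {
  cat("Key variables used:", x$n_keys, "\n")
  cat("Share matched:     ", format(x$share_matched, digits = 3), "\n")
  invisible(x)
}

orig_toy <- data.frame(sex = c("M", "F", "F"), region = c("R1", "R1", "R2"))
syn_toy  <- data.frame(sex = c("M", "F", "M"), region = c("R1", "R2", "R2"))
pair_toy <- structure(list(original = orig_toy, synthetic = syn_toy,
                           key_vars = c("sex", "region")),
                      class = "toy_pair")

res_pair <- toy_risk(pair_toy)                                       # container method
res_raw  <- toy_risk(orig_toy, syn_toy, key_vars = c("sex", "region"))  # default method
print(summary(res_pair))
```

Both calling conventions return the same object, which is the property that makes the container purely a convenience layer over the default method.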
Plot methods use an integer which parameter to select
among multiple visualization types. The number of available plot types
varies by class, from one (simple measures) to seven
(rumap).
The from_* family of constructors bridges riskutility
with the three main R packages for statistical disclosure control and
synthetic data generation:
# From synthpop: pass synds object + original data
pair <- from_synthpop(synds_object, original_data,
key_vars = c("age", "sex"),
target_var = "income")
# From simPop: original data extracted automatically from simPopObj
pair <- from_simPop(simPopObj,
key_vars = c("age", "sex"),
target_var = "income")
# From sdcMicro: variable roles extracted from sdcMicroObj
pair <- from_sdcMicro(sdcMicroObj)
Each constructor returns a standard synth_pair object.
from_sdcMicro() additionally extracts variable roles
(quasi-identifiers, sensitive attributes, sample weights) from the
sdcMicro S4 object, so that key_vars and
target_var need not be specified manually.
from_synthpop() supports multiple syntheses via the
m parameter, selecting a specific synthetic dataset from a
synds object. from_simPop() extracts sample
weights when available, enabling weighted risk calculations.
This section presents the six risk measure families, each illustrated with a worked example using the same running dataset.
set.seed(42)
n <- 500
original <- data.frame(
age = sample(18:85, n, replace = TRUE),
sex = factor(sample(c("M", "F"), n, replace = TRUE)),
education = factor(sample(c("Primary", "Secondary", "Tertiary"), n,
replace = TRUE, prob = c(0.3, 0.5, 0.2))),
region = factor(sample(paste0("R", 1:5), n, replace = TRUE)),
income = round(rlnorm(n, log(40000), 0.5))
)
# Synthetic: independent draws (low risk expected)
synthetic <- data.frame(
age = sample(18:85, n, replace = TRUE),
sex = factor(sample(c("M", "F"), n, replace = TRUE)),
education = factor(sample(c("Primary", "Secondary", "Tertiary"), n,
replace = TRUE, prob = c(0.3, 0.5, 0.2))),
region = factor(sample(paste0("R", 1:5), n, replace = TRUE)),
income = round(rlnorm(n, log(40000), 0.5))
)
key_vars <- c("age", "sex", "education", "region")
target_var <- "income"
pair <- synth_pair(original, synthetic,
key_vars = key_vars, target_var = target_var)
# Train/holdout split for distance-based metrics
set.seed(123)
train_idx <- sample(n, size = floor(0.7 * n))
train_data <- original[train_idx, ]
holdout_data <- original[-train_idx, ]
These methods assess privacy properties of a single dataset based on quasi-identifier frequencies. Originally developed for traditionally anonymized data, they apply equally to synthetic data. They do not compare original and released data; instead, they evaluate structural properties of the released data alone.
\(k\)-Anonymity (Samarati and Sweeney 1998) partitions records into equivalence classes (ECs) based on quasi-identifier values. The dataset satisfies \(k\)-anonymity if every EC contains at least \(k\) records; the achieved level is \(k = \min_i |\text{EC}(\mathbf{q}_i)|\), where \(\mathbf{q}_i\) is the quasi-identifier vector of record \(i\). Small ECs are vulnerable to identity disclosure because an attacker who knows a target’s quasi-identifiers can narrow the search to fewer than \(k\) candidate records. \(k\)-Anonymity protects against identity disclosure but not attribute disclosure: if all \(k\) records in a class share the same sensitive value, the attribute is trivially revealed.
\(l\)-Diversity (Machanavajjhala et al. 2007) strengthens \(k\)-anonymity by requiring that each EC contains at least \(l\) distinct values of the sensitive attribute. This rules out the attribute disclosure that arises when all records in an EC share the same sensitive value (homogeneity attack).
\(t\)-Closeness (Li et al. 2007) further requires that the distribution of the sensitive attribute within each EC is close to its overall distribution. The distance is measured by the Earth Mover’s Distance (EMD), and \(t\) is the maximum EMD across all ECs.
# k-Anonymity: minimum equivalence class size
k_res <- riskutility::kanonymity(synthetic, key_vars = key_vars)
k_res
#> k-Anonymity Assessment
#> ======================
#>
#> Key variables: age, sex, education, region
#> Records: 500 | Equivalence classes: 430
#>
#> Results:
#> Achieved k-anonymity level: 1
#> Target k: 5
#> Satisfies 5 -anonymity: NO
#>
#> Violations:
#> Records violating k = 5 : 500 (100.0%)
#> Unique records (k=1): 366
#>
#> Risk Assessment:
#> HIGH RISK: Unique records exist that can be directly identified.
# l-Diversity: sensitive attribute diversity per EC
l_res <- riskutility::ldiversity(synthetic, key_vars = key_vars,
sensitive_var = target_var)
print(l_res)
#> l-Diversity Assessment
#> ======================
#>
#> Key variables: age, sex, education, region
#> Sensitive variable: income
#> Records: 500 | Equivalence classes: 430
#>
#> Target l = 2
#>
#> Distinct l-Diversity:
#> Achieved level: 1
#> Satisfies 2 -diversity: NO
#> Violating records: 366 (73.2%)
#>
#> Entropy l-Diversity:
#> Achieved level: 1.00
#> Satisfies 2 -diversity: NO
#> Violating records: 366 (73.2%)
#>
#> Recursive (c,l)-Diversity (c = 2 ):
#> Satisfied: NO
# t-Closeness: EMD between EC and overall distribution
t_res <- riskutility::tcloseness(synthetic, key_vars = key_vars,
sensitive_var = target_var)
t_res
#> t-Closeness Assessment
#> ======================
#>
#> Key variables: age, sex, education, region
#> Sensitive variable: income (numeric)
#> Records: 500 | Equivalence classes: 430
#>
#> Target t = 0.2
#>
#> Results:
#> Maximum EMD: 0.5000
#> Satisfies t-closeness: NO
#> Violating records: 426 (85.2%)
#>
#> Interpretation:
#> 395 of 430 equivalence classes exceed the threshold.
With 68 unique age values, 2 sex levels, 3 education levels, and 5 regions, there are up to \(68 \times 2 \times 3 \times 5 = 2040\) possible QI combinations for only \(n = 500\) records. Most equivalence classes are singletons, yielding \(k = 1\); this is expected for fine-grained quasi-identifiers and does not by itself indicate a problem with the synthetic data. The three models form a hierarchy: \(k\)-anonymity guards against identity disclosure, \(l\)-diversity against homogeneity attacks, and \(t\)-closeness against skewness attacks.
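The definitions of \(k\)-anonymity and distinct \(l\)-diversity reduce to simple group-wise counts. The base-R sketch below computes both on a small made-up table; it is an illustration of the definitions, not the code behind kanonymity() or ldiversity(), and the names toy and qi are hypothetical.

```r
# Toy data (illustrative): two quasi-identifiers and one sensitive attribute.
toy <- data.frame(
  sex     = c("F", "F", "F", "M", "M", "M"),
  region  = c("R1", "R1", "R1", "R2", "R2", "R2"),
  disease = c("flu", "flu", "flu", "flu", "asthma", "flu")
)
qi <- c("sex", "region")

# Equivalence-class label per record: one level per observed QI combination.
ec <- interaction(toy[qi], drop = TRUE)

ec_size     <- tapply(toy$disease, ec, length)                         # |EC|
ec_distinct <- tapply(toy$disease, ec, function(v) length(unique(v)))  # distinct values

k_achieved <- min(ec_size)      # achieved k-anonymity level: 3
l_achieved <- min(ec_distinct)  # achieved distinct l-diversity level: 1
```

The toy table is 3-anonymous, yet the class of women in region R1 contains only one disease value, so it fails 2-diversity. This is the homogeneity problem that \(l\)-diversity was introduced to catch.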
Additional frequency-based measures include
individual_risk() for per-record re-identification
probability based on EC frequencies (Franconi and
Polettini 2004; Skinner and Elliot 2002),
attacker_risk() for scenario-based assessment under
prosecutor, journalist, and marketer attacker models (Hundepool et al.
2012), suda() for detecting records unique on small
QI subsets, and population_uniqueness() for estimating
population-level uniques via super-population models (Reiter 2005).
| Function | Input | Key output | Threats |
|---|---|---|---|
| kanonymity() | Single dataset | Min EC size | Identity |
| ldiversity() | Single dataset | Min distinct values per EC | Attribute |
| tcloseness() | Single dataset | Max EMD across ECs | Attribute |
| suda() | Single dataset | SUDA scores | Identity |
| individual_risk() | Single dataset | Per-record frequency risk | Identity |
| population_uniqueness() | Single dataset | Estimated pop. uniques | Identity |
| attacker_risk() | Single dataset | Scenario-based risk | Identity |
| epsilon_identifiability() | Single dataset | Identifiability fraction | Identity |
The Correct Attribution Probability framework (Taub et al. 2018) measures attribute disclosure: can an attacker infer a sensitive value by matching quasi-identifiers between original and released data?
For each original record \(i\) with
quasi-identifier values \(\mathbf{q}_i\), the attacker finds all
records in the released data whose quasi-identifiers match \(\mathbf{q}_i\) (the equivalence
class). If the sensitive attribute is homogeneous within this
class, the attacker learns the true value. The Targeted
CAP (TCAP) gives each record a risk score between 0 and 1:
\(\text{TCAP}_i = \Pr(\text{correct
attribution} \mid \mathbf{q}_i)\). The mean CAP across all
records is \(\overline{\text{CAP}} = n^{-1}
\sum_i \text{CAP}_i\) (returned as cap). The
Differential CAP subtracts the baseline (modal-class)
attribution rate, \(\text{DCAP} =
\overline{\text{CAP}} - \text{baseline}\) (returned as
dcap; Taub et al. (2018)), so
a value near zero indicates no attribution gain over always guessing the modal class.
The summary() method also reports the risk ratio \(\overline{\text{CAP}} / \text{baseline}\)
to contextualize the result.
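A worked toy example may help fix the definitions. The base-R sketch below computes per-record CAP values by exact key matching and then the differential version. Two choices are assumptions rather than documented package behavior: unmatched records are set to NA and dropped from the mean, and the baseline is taken as the share of the modal target value in the original data.

```r
# Toy data (illustrative only).
orig_toy <- data.frame(sex    = c("F", "F", "M", "M"),
                       region = c("R1", "R2", "R1", "R2"),
                       income = c("low", "high", "low", "high"))
syn_toy  <- data.frame(sex    = c("F", "F", "F", "M"),
                       region = c("R1", "R1", "R2", "R1"),
                       income = c("low", "high", "high", "high"))
qi <- c("sex", "region")

key_of <- function(d) do.call(paste, c(d[qi], sep = "|"))
k_orig <- key_of(orig_toy)
k_syn  <- key_of(syn_toy)

# CAP_i: among released records matching record i's key, the share that
# also carries record i's true target value (NA if nothing matches).
cap_i <- vapply(seq_len(nrow(orig_toy)), function(i) {
  hit <- k_syn == k_orig[i]
  if (!any(hit)) return(NA_real_)
  mean(syn_toy$income[hit] == orig_toy$income[i])
}, numeric(1))

cap_mean <- mean(cap_i, na.rm = TRUE)                      # mean CAP over matched records
baseline <- max(table(orig_toy$income)) / nrow(orig_toy)   # modal-class rate (assumed form)
dcap_toy <- cap_mean - baseline
cap_i     # 0.5 1.0 0.0 NA
dcap_toy  # 0
```

The second original record is perfectly attributed (CAP of 1) because its key matches a single released record with the same income, whereas on average the matching attacker does no better than the baseline guess.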
The WEAP (Within-EC Attribution Probability) evaluates risk from the released data alone, without access to the original, making it suitable for data custodians who cannot share the original data with an auditor. DiSCO (Disclosive in Synthetic, Correct in Original) identifies records that are both confidently attributed in the released data and correctly attributed in the original.
# TCAP: per-record risk (most informative member of CAP family)
tcap_res <- tcap(pair)
summary(tcap_res)
#> Summary: Correct Attribution Probability (CAP) Analysis
#> =======================================================
#> Key variables: age, sex, education, region
#> Target variable: income
#> Method: exact
#>
#> CAP Statistics:
#> Mean (CAPd): 0
#> Max: 0
#> Median: 0
#> SD: 0
#> Baseline: 0.002
#> Risk ratio: 0
#>
#> Certain Disclosure:
#> Records at certain risk: 0 / 117 (0.0%)
#> (Key uniquely determines target in both original and synthetic)
#>
#> CAP Distribution (quantiles):
#> 0% 25% 50% 75% 90% 100%
#> 0 0 0 0 0 0
#>
#> Risk Categories:
#> High risk (CAP >= 0.8): 0 (0.0%)
#> Medium risk (0.5 <= CAP < 0.8): 0
#> Low risk (CAP < 0.5): 117
#>
#> Matching Statistics:
#> Records matched: 117 / 500
#> Records unmatched: 383
#> Avg matches per record: 1.2
plot(tcap_res)
Since our running example uses independently generated synthetic data (no relationship to the original), TCAP values should be close to the baseline attribution probability. Records with TCAP above 0.1 warrant closer inspection.
| Metric | Requires original? | Per-record? | Measures | Low risk |
|---|---|---|---|---|
| DCAP | Yes | No | Mean attribution probability | ratio < 1.5 |
| TCAP | Yes | Yes | Individual attribution risk | < 0.1 per record |
| WEAP | No | Yes | Within-EC homogeneity | < 0.1 |
| DiSCO | Yes | Yes | Correct + confident attribution | < 5% |
Risk of Attribute Prediction-Induced Disclosure (Thees et al. 2026) takes a fundamentally different approach to attribute disclosure. Instead of matching quasi-identifiers, RAPID trains a predictive model \(\hat{f}\) on the released data \((Y_{\mathcal{K}}, Y_s)\) and evaluates its predictions on the original data: \(\hat{s}_i = \hat{f}(X_{\mathcal{K},i})\). For numeric targets, a record is at risk when the prediction error falls below a threshold \(\epsilon\): \(e(s_i, \hat{s}_i) < \epsilon\), where \(e(\cdot, \cdot)\) is a configurable error metric (symmetric percentage error by default). The RAPID score is the fraction of at-risk records: \(\text{RAPID} = n^{-1} \sum_i \mathbf{1}(e(s_i, \hat{s}_i) < \epsilon)\). For categorical targets, a different evaluation applies: a record is at risk when a gain or ratio score exceeds a threshold.
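The numeric-target case can be sketched in base R as follows. The data are simulated so that the target depends strongly on the key variable, and the symmetric percentage error is written as \(200\,|s - \hat{s}| / (|s| + |\hat{s}|)\), which is an assumed form; the package's exact error definition and defaults may differ.

```r
set.seed(1)
n_toy <- 200

# Simulated data in which income depends strongly on age (illustrative).
make_data <- function(n) {
  age <- sample(18:85, n, replace = TRUE)
  data.frame(age = age, income = 20000 + 500 * age + rnorm(n, sd = 1000))
}
orig_toy <- make_data(n_toy)
syn_toy  <- make_data(n_toy)

fit  <- lm(income ~ age, data = syn_toy)    # attacker trains on released data
pred <- predict(fit, newdata = orig_toy)    # and predicts the original targets

# Symmetric percentage error (assumed form, in percent).
spe <- 200 * abs(orig_toy$income - pred) / (abs(orig_toy$income) + abs(pred))

eps         <- 10                 # threshold in percent
rapid_score <- mean(spe < eps)    # fraction of records at risk
rapid_score
```

Here almost every record falls below the threshold, because the released data preserve a relationship that pins down income to within a few percent. Whether that counts as disclosure or as legitimate utility depends on how it compares with a baseline predictor.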
rapid_res <- rapid(pair, model_type = "lm")
summary(rapid_res)
#> Summary: RAPID Disclosure Risk Assessment
#> ==========================================
#>
#> Configuration:
#> Model: lm
#> Method: symmetric_percentage
#> Threshold: 10
#> Key variables: age, sex, education, region
#> Target variable: income
#>
#> Risk Assessment:
#> RAPID score: 0.1500
#> Records at risk: 75 / 500 (15.00%)
#> Risk level: HIGH
#>
#> Model Performance:
#> MAE: 18392.4849
#> RMSE: 25265.2739
#> Relative MAE: 0.4022
#> Relative RMSE: 1.0065
#>
#> Interpretation Guidelines:
#> High risk: Significant disclosure risk detected. Consider additional protection.
plot(rapid_res, which = c(1, 3))
With independently generated synthetic data, we expect the RAPID
score to be close to the baseline. The threshold sensitivity plot
(which = 3) shows how the at-risk fraction changes across
threshold values.
RAPID complements the CAP family in two ways. First, it captures
non-linear relationships and variable interactions. Second, it provides
inferential tools: rapid_test() computes a
permutation-based \(p\)-value,
confint() provides bootstrap confidence intervals, and
rapid_threshold_select() optimizes the threshold in a
data-driven manner.
| Model | Package | Numeric | Categorical | Interactions |
|---|---|---|---|---|
| lm | stats | Yes | No | Manual |
| rf | ranger | Yes | Yes | Automatic |
| cart | rpart | Yes | Yes | Automatic |
| gbm | xgboost | Yes | Yes | Automatic |
| logit | stats | No | Yes | Manual |
| RAPID Score | Risk Level | Interpretation |
|---|---|---|
| < 0.05 | Low | ML model cannot predict target much better than baseline |
| 0.05–0.15 | Moderate | Some predictive signal from synthetic data |
| 0.15–0.30 | Elevated | Significant predictive leakage |
| > 0.30 | High | Strong evidence of disclosure risk |
Distance-based methods detect memorization: the failure mode where a generative model reproduces training records verbatim or near-verbatim. The key idea is the holdout method — split the original data into a training set \(T\) (used for synthesis) and a holdout set \(H\) (unseen by the generator). For each synthetic record \(y_j\), compute the Distance to Closest Record in \(T\) (\(d_T\)) and in \(H\) (\(d_H\)). If the generator has generalized, \(d_T\) and \(d_H\) should be comparable. If it has memorized, \(d_T\) will be systematically smaller:
\[\text{DCR\_share} = n^{-1} \sum_j \mathbf{1}\bigl(d_T(y_j) < d_H(y_j)\bigr), \qquad \text{DCR\_ratio} = \frac{\bar{d}_T}{\bar{d}_H}\]
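A DCR share meaningfully above 0.5 (the package flags shares above
0.55), or a ratio clearly below 1, suggests memorization. These benchmarks presuppose training and holdout sets of similar size: with an unequal split, the share expected by chance shifts toward the training fraction, which is why the null comparison reported by dcr() matters. The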
NNDR (Nearest Neighbor Distance Ratio) provides a
complementary view: for each synthetic record, it is the ratio of the
distance to its nearest neighbor over the distance to its second-nearest
neighbor. A ratio near 0 indicates a single dominant match; near 1
indicates no distinctive match. IMS (Identical Match
Share) counts exact copies. When no explicit holdout is available,
setting holdout_fraction makes the function split the original
data automatically.
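The two DCR statistics can be computed by hand on simulated data. The base-R sketch below uses Euclidean distance on numeric data as a stand-in for the Gower distance used by the package, equal-sized training and holdout sets, and a deliberately "memorizing" generator that returns training records plus tiny noise; nearest_dist() is a hypothetical helper.

```r
set.seed(1)
train_toy   <- matrix(rnorm(200 * 2), ncol = 2)
holdout_toy <- matrix(rnorm(200 * 2), ncol = 2)
# A "memorizing" generator: training records plus tiny noise.
synth_toy   <- train_toy + matrix(rnorm(200 * 2, sd = 0.01), ncol = 2)

# Distance from each row of `a` to its closest row in `b` (Euclidean).
nearest_dist <- function(a, b) {
  tb <- t(b)
  apply(a, 1, function(r) sqrt(min(colSums((tb - r)^2))))
}

d_T <- nearest_dist(synth_toy, train_toy)
d_H <- nearest_dist(synth_toy, holdout_toy)

dcr_share <- mean(d_T < d_H)        # share of synthetic records closer to training
dcr_ratio <- mean(d_T) / mean(d_H)  # ratio of mean closest distances
c(share = dcr_share, ratio = dcr_ratio)
```

With this generator, nearly all synthetic records sit closer to the training set than to the holdout set, so the share is far above 0.5 and the ratio far below 1, which is the signature of memorization described above.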
dcr_res <- dcr(pair, holdout_fraction = 0.2)
summary(dcr_res)
#> Summary: Distance to Closest Record (DCR)
#> ==========================================
#> WARNING: DCR has known limitations - see arXiv:2505.01524
#>
#> Method: gower
#> Variables: 5
#>
#> Dataset Sizes:
#> Training: 400 | Holdout: 100 | Synthetic: 500
#>
#> Key Metrics:
#> DCR ratio: 0.4869 (ideal: ~1.0)
#> DCR share: 81.4% (ideal: ~50%)
#> Privacy: WARNING
#>
#> Statistical Tests:
#> Wilcoxon p-value: < 2.22e-16
#> Null comparison p-value: 0.5
#> Null share distribution: mean = 0.805 , 95% CI = [ 0.68 , 0.921 ]
#>
#> DCR to Training:
#> Mean: 0.022 | Median: 0.017 | SD: 0.0177
#> Quantiles:
#> 0% 5% 25% 50% 75% 95% 100%
#> 0.0001 0.0040 0.0102 0.0170 0.0294 0.0570 0.1402
#>
#> DCR to Holdout:
#> Mean: 0.0452 | Median: 0.0378 | SD: 0.0281
#> Quantiles:
#> 0% 5% 25% 50% 75% 95% 100%
#> 0.0023 0.0106 0.0241 0.0378 0.0634 0.0993 0.2014
#>
#> Proximity Analysis:
#> Closer to training: 407 (81.4%)
#> Closer to holdout: 93 (18.6%)
#> Identical to training (DCR=0): 0
plot(dcr_res, which = 1)
The DCR Delusion. Yao et al.
(2025) show that DCR can fail to detect privacy leakage: datasets
deemed “private” by DCR may still be vulnerable to membership inference
attacks. Their central recommendation is that DCR be interpreted
relative to a proper null distribution rather than in absolute terms;
dcr() implements exactly this, comparing the observed share
against a permutation null (null_test) and reporting a
Wilcoxon test alongside the point estimate. Even so, distance-based
metrics should always be complemented with other risk families.
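One way to build such a null distribution is to reassign at random which original records count as "training" and recompute the share each time. The base-R sketch below does this for independent synthetic data and an 80/20 split. It is an assumed label-permutation scheme using Euclidean distance, not necessarily what null_test does internally, and share_for() and nearest_dist() are hypothetical helpers.

```r
set.seed(1)
pool_toy  <- matrix(rnorm(500 * 2), ncol = 2)  # all original records
synth_toy <- matrix(rnorm(300 * 2), ncol = 2)  # independent draws, no memorization
n_train   <- 400                               # 80/20 split

nearest_dist <- function(a, b) {
  tb <- t(b)
  apply(a, 1, function(r) sqrt(min(colSums((tb - r)^2))))
}

# DCR share when the rows in `train_idx` play the role of the training set.
share_for <- function(train_idx) {
  d_T <- nearest_dist(synth_toy, pool_toy[train_idx, , drop = FALSE])
  d_H <- nearest_dist(synth_toy, pool_toy[-train_idx, , drop = FALSE])
  mean(d_T < d_H)
}

observed   <- share_for(seq_len(n_train))
# Null: random reassignment of training/holdout labels.
null_share <- replicate(100, share_for(sample(nrow(pool_toy), n_train)))

p_value <- mean(null_share >= observed)
c(observed = observed, null_mean = mean(null_share), p_value = p_value)
```

Under this scheme the null shares centre near the training fraction of 0.8 rather than 0.5, simply because four out of five candidate neighbours are training records. That appears consistent with the null mean of about 0.8 in the dcr() output above, where a raw share of 81.4% is unremarkable once compared with the null.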
| Metric | Holdout | Detects | Low risk |
|---|---|---|---|
| DCR | Yes | Memorization | share < 0.55 |
| NNDR | Yes | Memorization | share < 0.55 |
| IMS | No | Exact copies | < 0.01 |
| RF proximity | Yes | Memorization (non-linear) | ratio near 1 |
| dRisk | No | Close records | < 0.05 |
| Hitting rate | No | Close records | < 0.05 |
| Epsilon ID | No | Identifiability | < 0.01 |
| Delta-presence | No | Membership bounds | > 0.5 |
RF proximity offers a data-adaptive alternative: it
trains a random forest to discriminate original from synthetic records
and measures how often synthetic records share terminal nodes with
training vs. holdout records, capturing non-linear proximity that fixed
distance metrics may miss. Use rf_privacy() when complex
interactions are expected.
The recordLinkage() function directly simulates a
re-identification attack by linking each original record to the most
similar record(s) in the anonymized dataset. Eight methods are
implemented, spanning deterministic (Gower distance), probabilistic
(Fellegi-Sunter, Fellegi and Sunter
(1969)), PRAM-aware, predictive (propensity score), random forest
proximity, rank-based (RBRL, Muralidhar and
Domingo-Ferrer (2016)), robust Mahalanobis (Templ and Meindl 2008), and autoencoder
embedding (Guo and Berkhahn 2016). Three
matching modes are available: independent (many-to-one), bijective
(one-to-one via Hungarian algorithm, Herranz et
al. (2016)), and optimal transport (Sinkhorn).
For full details on all methods and matching modes, see
?recordLinkage.
| Method | Distance | Mixed types | Matching |
|---|---|---|---|
| Deterministic | Gower | Yes | All 3 |
| Probabilistic | Fellegi-Sunter | Yes | All 3 |
| PRAM | Transition prob. | Categorical | All 3 |
| Predictive | Propensity | Yes | All 3 |
| RF | RF proximity | Yes | All 3 |
| RBRL | Rank-based | Yes | Independent |
| Mahalanobis | Mahalanobis | Numeric | All 3 |
| Embedding | Autoencoder | Yes | All 3 |
This section groups two related but distinct concerns. The first
three measures (mia_classifier(), domias(),
nnaa()) assess membership disclosure — whether a
membership inference attack (MIA) can determine if a specific
individual’s data was used during synthesis. The singling out and
linkability attacks operationalize two of the GDPR anonymization
criteria (Article 29 Data Protection Working
Party 2014), following the attack-based approach of Giomi et al. (2023).
NNAA (Nearest Neighbor Adversarial Accuracy, Yale et al. (2020)) is based on the adversarial accuracy of a nearest-neighbour two-sample comparison, \(\text{AA}(A,S) = \tfrac{1}{2}\bigl[\Pr(d_{AS} > d_{AA}) + \Pr(d_{SA} > d_{SS})\bigr]\). The reported privacy loss is \(\text{AA}(\text{holdout}, S) - \text{AA}(\text{train}, S)\); a positive value means synthetic records resemble training records more closely than holdout records, indicating memorization:
nnaa_res <- nnaa(train_data, synthetic, holdout = holdout_data,
method = "gower", seed = 42)
print(nnaa_res)
#> Nearest-Neighbor Adversarial Accuracy (NNAA)
#> =============================================
#> Method: gower
#> Variables used: 5
#>
#> Dataset Sizes:
#> Training records: 350
#> Holdout records: 150
#> Synthetic records: 500
#>
#> Adversarial Accuracy (ideal ~ 0.5):
#> AA (train vs synth): 0.4599
#> Left (real NN > synth NN): 0.3457
#> Right (synth NN > real NN): 0.574
#> AA (holdout vs synth): 0.4743
#> Left (hold NN > synth NN): 0.2467
#> Right (synth NN > hold NN): 0.702
#>
#> Privacy Loss (ideal ~ 0):
#> Privacy Loss: 0.0145
#>
#> Privacy Assessment: PASS
#> No evidence of training data memorization.
| Privacy Loss | Interpretation |
|---|---|
| Near 0 | No detectable leakage (ideal) |
| 0.01–0.05 | Minor leakage, likely acceptable |
| > 0.10 | Significant memorization |
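The adversarial accuracy formula above can be sketched in a few lines of base R. This is a minimal illustration assuming Euclidean distance on numeric data and equal-sized sets; the function names and the toy "memorizing" generator are hypothetical and do not reflect how `nnaa()` is implemented.

```r
# Illustrative sketch only; not the riskutility implementation.
# Leave-one-out nearest-neighbour distance within one set
nn_within <- function(D_self) {
  diag(D_self) <- Inf
  apply(D_self, 1, min)
}

# AA(A, S) = 0.5 * [Pr(d_AS > d_AA) + Pr(d_SA > d_SS)]
adversarial_accuracy <- function(A, S) {
  nA <- nrow(A)
  nS <- nrow(S)
  D  <- as.matrix(dist(rbind(A, S)))   # all pairwise Euclidean distances
  ia <- seq_len(nA)
  is <- nA + seq_len(nS)
  d_AA <- nn_within(D[ia, ia])
  d_SS <- nn_within(D[is, is])
  d_AS <- apply(D[ia, is], 1, min)     # each A record to its nearest S record
  d_SA <- apply(D[ia, is], 2, min)     # each S record to its nearest A record
  0.5 * (mean(d_AS > d_AA) + mean(d_SA > d_SS))
}

set.seed(42)
n <- 150
train   <- matrix(rnorm(n * 2), ncol = 2)
holdout <- matrix(rnorm(n * 2), ncol = 2)

# Hypothetical generator that memorizes: training rows plus tiny noise
synth_memorized <- train + matrix(rnorm(n * 2, sd = 0.01), ncol = 2)

aa_train     <- adversarial_accuracy(train, synth_memorized)
aa_holdout   <- adversarial_accuracy(holdout, synth_memorized)
privacy_loss <- aa_holdout - aa_train
c(aa_train = aa_train, aa_holdout = aa_holdout, privacy_loss = privacy_loss)
```

Because the synthetic rows sit almost on top of the training rows, the training-side accuracy collapses toward 0 while the holdout-side accuracy stays near 0.5, which yields a large positive privacy loss.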
Singling out and linkability operationalize two of the Article 29 Working Party’s three anonymization criteria (the third, inference, is addressed by the attribution-based CAP and RAPID measures):
so_res <- singling_out(original, synthetic,
n_attacks = 500, n_cols = 3,
mode = "multivariate", seed = 42)
print(so_res)
#> Singling Out Risk Assessment
#> ============================
#> Mode: multivariate | Columns per predicate: 3
#> Variables used: 5
#>
#> Dataset Sizes:
#> Original (training): 250
#> Holdout: 250
#> Synthetic: 500
#>
#> Attack Results (500 predicates):
#> Singling out in original: 52 (10.4%)
#> Singling out in holdout: 48 (9.6%)
#>
#> Risk Score:
#> Residual risk: 0.0088 [0.0000, 0.0655] (95% CI)
#>
#> Privacy Assessment: PASS
#> Singling out risk is within acceptable bounds (<= 0.1).
link_res <- linkability(original, synthetic,
n_attacks = 500, n_neighbors = 1, seed = 42)
print(link_res)
#> Linkability Risk Assessment
#> ===========================
#> Auxiliary columns: age, income
#> Secret columns: sex, education, region
#> Neighbors considered: 1
#>
#> Dataset Sizes:
#> Original (training): 250
#> Holdout: 250
#> Synthetic: 500
#>
#> Attack Results (500 attacks):
#> Successful links in original: 1 (0.2%)
#> Successful links in holdout: 3 (0.6%)
#>
#> Risk Score:
#> Residual risk: 0 [0.0000, 0.0092] (95% CI)
#>
#> Privacy Assessment: PASS
#> Linkability risk is within acceptable bounds (<= 0.1).
| Metric | Attack type | Holdout | GDPR criterion | Low risk |
|---|---|---|---|---|
| MIA classifier | Shadow model | Yes | – | < 0.55 |
| DOMIAS | Density overfitting | Yes | – | < 0.6 |
| NNAA | Nearest neighbor | Yes | – | < 0.05 |
| Singling out | Predicate-based | Yes | Art. 29 WP | < 0.1 |
| Linkability | Record linkage | Yes | Art. 29 WP | < 0.1 |
| delta-Presence | Membership bounds | No | – | > 0.5 |
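The residual risk point estimates printed for the singling out and linkability attacks are consistent with the usual attack-versus-control correction, in which the success rate against the holdout acts as a baseline. The sketch below assumes that definition, \((p_{\text{attack}} - p_{\text{control}}) / (1 - p_{\text{control}})\) truncated at zero; the function name is hypothetical, and the confidence intervals are not reproduced.

```r
# Illustrative sketch only; assumes the attack-vs-control residual risk
# definition and ignores the confidence interval computation.
residual_risk <- function(n_success_attack, n_success_control, n_attacks) {
  p_attack  <- n_success_attack / n_attacks    # success against training data
  p_control <- n_success_control / n_attacks   # success against holdout data
  max(0, (p_attack - p_control) / (1 - p_control))
}

# Counts taken from the two printed outputs above (500 attacks each)
so_risk   <- residual_risk(52, 48, 500)   # singling out
link_risk <- residual_risk(1, 3, 500)     # linkability
round(c(singling_out = so_risk, linkability = link_risk), 4)
```

Under this assumption the singling out counts give 0.0088 and the linkability counts give 0, matching the printed point estimates. The linkability value is truncated because the attack succeeded less often against the training data than against the holdout.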
No single metric tells the full story. Applying all families to the same dataset reveals complementary and sometimes contradictory information:
# Near-copy: original + small noise (high risk expected)
set.seed(99)
near_copy <- original
near_copy$age <- near_copy$age + sample(-1:1, n, replace = TRUE)
near_copy$income <- near_copy$income + round(rnorm(n, 0, 500))
pair_risky <- synth_pair(original, near_copy,
key_vars = key_vars, target_var = target_var)
# Compare key metrics across the two datasets
comparison <- data.frame(
Metric = c("DCAP", "RAPID (lm)", "IMS"),
Safe = c(
dcap(pair)$dcap,
rapid(pair, model_type = "lm", verbose = FALSE)$rapid,
ims(pair)$ims
),
Risky = c(
dcap(pair_risky)$dcap,
rapid(pair_risky, model_type = "lm", verbose = FALSE)$rapid,
ims(pair_risky)$ims
)
)
comparison$Safe <- round(comparison$Safe, 4)
comparison$Risky <- round(comparison$Risky, 4)
knitr::kable(comparison,
caption = "Cross-family comparison: safe vs. risky synthetic data.")
| Metric | Safe | Risky |
|---|---|---|
| DCAP | -0.0137 | 0.4863 |
| RAPID (lm) | 0.1500 | 0.1240 |
| IMS | 0.0000 | 0.0000 |
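The two datasets separate clearly on only one of the three metrics. DCAP rises from -0.0137 for the safe dataset to 0.4863 for the near-copy, whereas RAPID (lm) is slightly lower for the near-copy (0.1240 versus 0.1500) and IMS is 0 for both, plausibly because the added noise leaves almost no record exactly identical to an original. Attribution measures quantify information leakage; distance-based measures quantify memorization. Because the families measure different things, a dataset can pass one family's tests while failing another's, so a thorough evaluation uses at least one measure from each family.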
A dataset that passes all risk checks but destroys the analytical value of the data is useless. Utility measures quantify how well the released data preserves the statistical properties of the original.
Global utility measures give a single-number verdict by asking: can a classifier tell original and synthetic records apart?
The propensity score method (Woo et al. 2009; Snoke et al. 2018) pools original (\(X\), \(n_X\) records) and synthetic (\(Y\), \(n_Y\) records) data, labels them (0/1), and fits a classifier. The pMSE (propensity score Mean Squared Error) measures how well the model discriminates:
\[\text{pMSE} = \frac{1}{N} \sum_{i=1}^{N} \left(\hat{p}_i - c\right)^2\]
where \(N = n_X + n_Y\) and \(c = n_Y / N\). If original and synthetic records are indistinguishable, pMSE \(\approx 0\).
prop_res <- propscore(pair)
summary(prop_res)
#> Propensity Score Utility Summary
#> ================================
#> Method: rf
#> Sample sizes: n_original = 500 , n_synthetic = 500
#> Class ratio (cr): 0.5
#>
#> Propensity Score Statistic (pMSE): 0.02519
#> PS ratio (below/above cr): 0.8868
#>
#> Mean propensity (original): 0.4926
#> Mean propensity (synthetic): 0.4853
#>
#>
#> Density diagnostics:
#> KL divergence: 0.005848
#> KL divergence (Bayes space): 0.005803
#> Mean density ratio: 1.0014 (sd: 0.0034 )
#> Mean density ratio (Bayes): 0.0136 (sd: 0.0034 )
| pMSE Value | Interpretation |
|---|---|
| < 0.01 | Excellent fidelity |
| 0.01–0.05 | Good fidelity |
| 0.05–0.10 | Moderate differences |
| > 0.10 | Poor fidelity |
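The pMSE definition above can be computed directly with a logistic regression as the propensity model. This is a minimal sketch in base R, assuming a main-effects `glm()` rather than the random forest shown in the output above; the function name and toy data are hypothetical.

```r
# Illustrative sketch only; uses logistic regression, not the "rf" method
# reported in the propscore() output above.
pmse_logit <- function(X, Y) {
  pooled <- rbind(X, Y)
  pooled$is_synth <- rep(c(0, 1), c(nrow(X), nrow(Y)))   # 0 = original, 1 = synthetic
  fit     <- glm(is_synth ~ ., data = pooled, family = binomial())
  p_hat   <- fitted(fit)                                  # estimated propensity scores
  c_ratio <- nrow(Y) / nrow(pooled)                       # c = n_Y / N
  mean((p_hat - c_ratio)^2)
}

set.seed(7)
n <- 500
X      <- data.frame(a = rnorm(n), b = rnorm(n))
Y_good <- data.frame(a = rnorm(n), b = rnorm(n))             # same distribution
Y_bad  <- data.frame(a = rnorm(n, mean = 1), b = rnorm(n))   # shifted mean in a

pmse_good <- pmse_logit(X, Y_good)
pmse_bad  <- pmse_logit(X, Y_bad)
c(good = pmse_good, bad = pmse_bad)
```

With equal sample sizes \(c = 0.5\), so the pMSE is bounded above by 0.25, the value reached when the classifier separates the two datasets perfectly.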
When global utility is poor, per-variable measures identify which variables are responsible. For numeric variables, the Wasserstein distance measures the cost of transforming one distribution into another. For categorical variables, the Hellinger distance measures distributional overlap:
\[H(p, q) = \frac{1}{\sqrt{2}} \sqrt{\sum_{k=1}^{K} \left(\sqrt{p_k} - \sqrt{q_k}\right)^2}\]
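As a worked instance of the formula, the sketch below computes the Hellinger distance from two categorical vectors by aligning their category levels and comparing relative frequencies. The function name is hypothetical, and the sketch is independent of the package's `hellinger()`.

```r
# Illustrative sketch only; a direct transcription of the formula above.
hellinger_dist <- function(x, y) {
  lev <- union(unique(as.character(x)), unique(as.character(y)))
  p <- as.numeric(table(factor(x, levels = lev))) / length(x)
  q <- as.numeric(table(factor(y, levels = lev))) / length(y)
  sqrt(sum((sqrt(p) - sqrt(q))^2)) / sqrt(2)
}

same     <- hellinger_dist(c("a", "a", "b", "b"), c("a", "b", "a", "b"))
disjoint <- hellinger_dist(c("a", "a"), c("b", "b"))
c(same = same, disjoint = disjoint)
```

Identical frequency distributions give 0 and distributions with no category in common give 1, which are the two ends of the scale.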
# Hellinger distance for categorical variables
h_res <- hellinger(original, synthetic, vars = c("sex", "education"))
print(h_res)
#> Hellinger Distance - Categorical Distribution Comparison
#> =========================================================
#>
#> Dataset Sizes:
#> Original (X): 500 | Synthetic (Y): 500
#> Variables compared: 2
#>
#> Summary:
#> Mean Hellinger distance: 0.0283
#> Max Hellinger distance: 0.0467
#> Min Hellinger distance: 0.0098
#> Utility score (1-mean): 0.9717
#>
#> Interpretation:
#> EXCELLENT: Categorical distributions are very similar.
# CI proximity: confidence interval overlap for means
cip_res <- ci_proximity(original, synthetic, vars = c("age", "income"))
print(cip_res)
#> Confidence Interval Proximity - Statistical Inference Preservation
#> ===================================================================
#>
#> Configuration:
#> Confidence level: 95 %
#> Variables compared: 2
#> Sample sizes: X = 500 , Y = 500
#>
#> Summary Metrics:
#> Mean proximity score: 0.7936 (1 = perfect)
#> Mean CI overlap: 0.6075 (1 = complete)
#> Mean relative error: 0.0212 (0 = perfect)
#> CIs containing orig mean: 50.0%
#>
#> Interpretation:
#> MODERATE: Some degradation of statistical properties.
The CI proximity measure (Karr et al. 2006) compares confidence intervals of summary statistics (means) between original and synthetic data. An overlap near 1 means the intervals coincide; a relative error near 0 means point estimates are close.
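One common way to define the interval overlap is to average the share of each interval that is covered by their intersection. The sketch below assumes that definition together with t-based intervals for the mean; the package's exact definition may differ, and both function names are hypothetical.

```r
# Illustrative sketch only; assumes the averaged-overlap definition and
# t-based confidence intervals for a mean.
mean_ci <- function(x, level = 0.95) {
  n    <- length(x)
  half <- qt(1 - (1 - level) / 2, df = n - 1) * sd(x) / sqrt(n)
  c(lower = mean(x) - half, upper = mean(x) + half)
}

ci_overlap <- function(ci_o, ci_s) {
  lo  <- max(ci_o["lower"], ci_s["lower"])
  hi  <- min(ci_o["upper"], ci_s["upper"])
  len <- hi - lo                         # negative when the intervals are disjoint
  unname(0.5 * (len / (ci_o["upper"] - ci_o["lower"]) +
                len / (ci_s["upper"] - ci_s["lower"])))
}

set.seed(3)
x_orig  <- rnorm(500, mean = 40, sd = 12)
x_synth <- rnorm(500, mean = 41, sd = 12)
ci_overlap(mean_ci(x_orig), mean_ci(x_synth))
```

In this form the measure equals 1 for identical intervals and turns negative for disjoint ones; implementations that report values in \([0, 1]\) truncate at zero.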
Marginal distributions can match perfectly while joint distributions diverge. The energy distance (Székely and Rizzo 2013) is a multivariate two-sample statistic sensitive to differences in both location and scale (lower values indicate closer joint distributions):
e_res <- energy_distance(original[, c("age", "income")],
synthetic[, c("age", "income")],
seed = 42)
print(e_res)
#> Energy Distance - Multivariate Numeric Distribution Comparison
#> ===============================================================
#>
#> Dataset Sizes:
#> Original (X): 500
#> Synthetic (Y): 500
#> Variables: 2
#> Standardized: TRUE
#>
#> Energy Distance:
#> Raw: 0.0033
#> Normalized: 0.0019
#> Utility: 0.9967 (exp(-E), higher=better)
#>
#> Distance Components:
#> Mean dist(X,Y): 1.7317
#> Mean dist(X,X): 1.6917
#> Mean dist(Y,Y): 1.7685
#>
#> Interpretation:
#> EXCELLENT: Multivariate distributions are very similar.
The MMD (Maximum Mean Discrepancy; Gretton et al. 2012) provides a kernel-based alternative that supports both exact computation and a random Fourier features (RFF) approximation for large datasets:
mmd_res <- mmd(original[, c("age", "income")],
synthetic[, c("age", "income")],
kernel = "gaussian", method = "rff",
n_features = 500, seed = 42)
print(mmd_res)
#> Maximum Mean Discrepancy (MMD)
#> ==============================
#>
#> Dataset Sizes:
#> Original (X): 500
#> Synthetic (Y): 500
#> Variables: 2
#> Standardized: TRUE
#>
#> Settings:
#> Kernel: gaussian
#> Method: rff
#> Features: 500
#> Bandwidth (sigma): 1.6006
#>
#> Results:
#> MMD^2: 0.002979
#> Utility: 0.9988 (exp(-MMD^2/sigma^2), higher=better)
#>
#> Interpretation:
#> EXCELLENT: Distributions are very similar.
Copula fidelity compares the empirical copula (rank dependence structure) using the Cramér-von Mises statistic on pairwise copula CDFs. Contingency fidelity (Snoke et al. 2018) is its categorical complement, computing total variation distance between bivariate contingency tables:
cop_res <- copula_fidelity(original, synthetic, vars = c("age", "income"))
print(cop_res)
#> Copula Fidelity - Empirical Copula Dependence Comparison
#> ========================================================
#>
#> Dataset Sizes:
#> Original (X): 500
#> Synthetic (Y): 500
#> Variables: 2
#> Grid resolution: 50
#>
#> Results:
#> Mean CvM distance: 0.000124
#> Utility score: 0.9877 (1/(1+CvM*100), higher=better)
#>
#> Pairwise CvM Distances:
#> age vs income : 0.000124
#>
#> Interpretation:
#> EXCELLENT: Dependence structure is very well preserved.
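The total variation distance between bivariate contingency tables, which underlies contingency fidelity, can be sketched as half the summed absolute difference of cell proportions. The example is deliberately constructed so that both datasets have identical marginals but different joint distributions; the function name and toy data are hypothetical.

```r
# Illustrative sketch only; not the contingency_fidelity() implementation.
tv_pair <- function(X, Y, v1, v2) {
  # Shared level sets so both tables have the same cells
  lev1 <- union(levels(factor(X[[v1]])), levels(factor(Y[[v1]])))
  lev2 <- union(levels(factor(X[[v2]])), levels(factor(Y[[v2]])))
  cell_props <- function(df) {
    prop.table(table(factor(df[[v1]], levels = lev1),
                     factor(df[[v2]], levels = lev2)))
  }
  0.5 * sum(abs(cell_props(X) - cell_props(Y)))
}

# Original: education is fully determined by sex
X_toy <- data.frame(sex       = rep(c("F", "M"), each = 50),
                    education = rep(c("Low", "High"), each = 50))
# Synthetic: same marginals, but sex and education are independent
Y_toy <- data.frame(sex       = rep(c("F", "M"), each = 50),
                    education = rep(c("Low", "High"), times = 50))

tv_same  <- tv_pair(X_toy, X_toy, "sex", "education")
tv_indep <- tv_pair(X_toy, Y_toy, "sex", "education")
c(same = tv_same, independent = tv_indep)
```

A per-variable comparison would report no difference here, because each marginal is 50/50 in both datasets; only the bivariate table exposes the lost dependence.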
ctf_res <- contingency_fidelity(original, synthetic,
vars = c("sex", "education", "region"))
print(ctf_res)
#> Contingency Fidelity - Categorical Dependence Comparison
#> ========================================================
#>
#> Dataset Sizes:
#> Original (X): 500
#> Synthetic (Y): 500
#> Categorical variables: 3
#>
#> Results:
#> Mean TV distance: 0.074667
#> Utility score: 0.9253 (1 - mean_tv, higher=better)
#>
#> Pairwise TV Distances (3/3 pairs):
#> sex vs education : 0.066000
#> sex vs region : 0.084000
#> education vs region : 0.074000
#>
#> Interpretation:
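#> GOOD: Categorical dependence structure is reasonably preserved.
TSTR (Train on Synthetic, Test on Real; Zhao et al. 2021) trains a predictive model on the synthetic data and evaluates performance on held-out real data. The ratio of TSTR to TRTR (Train on Real, Test on Real) performance quantifies how well predictive relationships are preserved. The ratio is informative only when the TRTR baseline shows real predictive skill; in the output below both \(R^2\) values are slightly negative, so the ratio of 1.6994 and the resulting utility score should not be over-interpreted: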
set.seed(42)
tstr_res <- tstr(pair, target_var = "income", model = "rf",
test_fraction = 0.3, seed = 42)
print(tstr_res)
#> Train on Synthetic, Test on Real (TSTR)
#> ========================================
#>
#> Dataset Sizes:
#> Original training: 350
#> Original test: 150
#> Synthetic: 500
#> Test fraction: 30.0%
#>
#> Settings:
#> Model: rf
#> Target(s): income
#> Metric: R2
#>
#> Results:
#> TRTR performance: -0.0115 (baseline)
#> TSTR performance: -0.0196
#> TSTR ratio: 1.6994
#> Utility score: 1.0000 (higher = better)
#>
#> Interpretation:
#> EXCELLENT: Synthetic data preserves predictive structure very well.
Regression fidelity (Karr et al. 2006) fits the same regression model on both datasets and compares coefficient estimates via CI overlap, standardized bias, and significance agreement:
reg_res <- regression_fidelity(original, synthetic,
formula = income ~ age + sex + education)
summary(reg_res)
#> Summary: Regression Fidelity
#> ============================
#>
#> Formula: income ~ age + sex + education
#> Model: lm | Conf. level: 0.95
#> Samples: X = 500 , Y = 500
#>
#> Coefficient Comparison:
#> Term Est.Orig Est.Synth Bias Std.Bias CI.Overlap
#> (Intercept) 40156.3727 45857.2451 5700.8724 1.5177 0.6045
#> age 70.3695 -1.0538 -71.4233 -1.1833 0.6887
#> sexM 47.8233 2994.6773 2946.8541 1.3061 0.6686
#> educationSecondary 2162.0872 -3140.4233 -5302.5105 -2.0300 0.4815
#> educationTertiary 4343.8707 574.6118 -3769.2589 -1.1864 0.6971
#> Sig.Agree
#> yes
#> yes
#> yes
#> yes
#> yes
#>
#> Summary Statistics:
#> Utility score (mean CI overlap): 0.6281
#> Mean |standardized bias|: 1.4447
#> Significance agreement rate: 100.0%
plot(reg_res, which = 1)
Tail fidelity assesses how well extreme values are preserved, which is critical for applications where tail behavior matters (financial risk, rare diseases):
tail_res <- tail_fidelity(original, synthetic, vars = c("age", "income"),
percentile = 95, tails = "both")
print(tail_res)
#> Tail Fidelity - Tail Preservation Utility Measure
#> ==================================================
#>
#> Dataset Sizes:
#> Original (X): 500
#> Synthetic (Y): 500
#> Numeric variables: 2
#>
#> Settings:
#> Percentile: 95
#> Tails: both
#> Hill estimator: FALSE
#>
#> Results:
#> QQ tail divergence: 0.0481
#> Utility score: 0.9531 (exp(-QQ_div), higher=better)
#>
#> Per-variable QQ tail divergence:
#> age qq=0.0250 jsd=-19.7770
#> income qq=0.0712 jsd=-14.6344
#>
#> Interpretation:
#> EXCELLENT: Tail distributions are very well preserved.
Subgroup utility (Snoke et al. 2018) applies any utility measure to each subgroup defined by a grouping variable, identifying groups with low utility:
su_res <- subgroup_utility(original, synthetic, group_var = "region",
utility_fun = energy_distance,
threshold = 0.5, seed = 42)
print(su_res)
#> Subgroup Utility Assessment
#> ===========================
#>
#> Group variable: region
#> Number of subgroups: 5
#> Threshold: 0.5
#>
#> Overall utility score: 0.9967
#> Worst subgroup score: 0.9847 (R2)
#> Worst / Overall ratio: 0.9880
#>
#> No subgroups flagged (all above threshold 0.5 ).
The conservative utility_score is the worst subgroup score. A worst-to-overall ratio near 1 indicates homogeneous utility; a ratio below 0.5 indicates substantial disparity.
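The subgroup logic amounts to applying one utility function within each level of the grouping variable and then comparing the worst score with the overall score. The sketch below assumes a simple stand-in utility, one minus the two-sample Kolmogorov-Smirnov statistic, instead of the energy distance used above; all names and the toy data are hypothetical.

```r
# Illustrative sketch only; not the subgroup_utility() implementation.
# Stand-in utility: 1 - KS statistic (1 = identical empirical distributions)
ks_utility <- function(x, y) 1 - unname(ks.test(x, y)$statistic)

subgroup_scores <- function(X, Y, group_var, var, utility_fun) {
  groups <- intersect(unique(as.character(X[[group_var]])),
                      unique(as.character(Y[[group_var]])))
  vapply(groups, function(g) {
    utility_fun(X[[var]][X[[group_var]] == g],
                Y[[var]][Y[[group_var]] == g])
  }, numeric(1))
}

set.seed(11)
region    <- rep(c("R1", "R2", "R3"), each = 100)
orig_toy  <- data.frame(region = region, income = rnorm(300, 50, 10))
synth_toy <- data.frame(region = region, income = rnorm(300, 50, 10))
# Degrade one subgroup only: shift synthetic income in R3
in_r3 <- synth_toy$region == "R3"
synth_toy$income[in_r3] <- synth_toy$income[in_r3] + 20

scores  <- subgroup_scores(orig_toy, synth_toy, "region", "income", ks_utility)
overall <- ks_utility(orig_toy$income, synth_toy$income)
worst   <- names(which.min(scores))
ratio   <- min(scores) / overall
scores
```

Reporting the minimum rather than the mean is what makes the score conservative: a single poorly preserved subgroup cannot be averaged away by well-preserved ones.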
| Use case | Function | Data type | Interpretation |
|---|---|---|---|
| Quick assessment | propscore() | Mixed | < 0.1: good |
| Quick assessment | specks() | Mixed | < 0.05: good |
| Univariate | compare_wasserstein() | Numeric | Lower = better |
| Univariate | hellinger() | Categorical | < 0.1: good |
| Univariate | ci_proximity() | Numeric | > 0.8: good |
| Multivariate | energy_distance() | Numeric | Lower = better |
| Multivariate | mmd() | Numeric | Lower = better |
| Multivariate | copula_fidelity() | Numeric | < 0.1: good |
| Multivariate | contingency_fidelity() | Categorical | < 0.05: good |
| Predictive | tstr() | Mixed | ratio near 1: good |
| Predictive | regression_fidelity() | Mixed | overlap > 0.8: good |
| Predictive | compare_feature_importance() | Mixed | High corr: good |
| Subgroup | subgroup_utility() | Mixed | min > 0.5: good |
This section demonstrates the complete practitioner workflow on a realistic dataset, comparing three synthesis approaches with different privacy-utility trade-offs.
Consider a statistical agency that wants to release a survey dataset (\(n = 1000\)) containing demographic variables (age, sex, education, region) and a sensitive income variable.
set.seed(123)
N <- 1000
edu_levels <- c("Primary", "Secondary", "Tertiary")
age_groups <- c("20-29", "30-39", "40-49", "50-59", "60-69")
orig <- data.frame(
age_group = factor(sample(age_groups, N, replace = TRUE)),
sex = factor(sample(c("M", "F"), N, replace = TRUE)),
education = factor(sample(edu_levels, N, replace = TRUE,
prob = c(0.25, 0.50, 0.25))),
region = factor(sample(paste0("R", 1:4), N, replace = TRUE))
)
edu_effect <- c(Primary = 0, Secondary = 0.3, Tertiary = 0.7)
age_effect <- c("20-29" = 0, "30-39" = 0.15, "40-49" = 0.3,
"50-59" = 0.4, "60-69" = 0.35)
orig$income <- round(exp(
10 + age_effect[as.character(orig$age_group)] +
edu_effect[as.character(orig$education)] + rnorm(N, 0, 0.4)
))
qi <- c("age_group", "sex", "education", "region")
sens <- "income"
We create three synthetic datasets spanning the privacy-utility spectrum:
set.seed(456)
# Method A: Independent marginals (safest, but destroys correlations)
synA <- data.frame(
age_group = factor(sample(age_groups, N, replace = TRUE)),
sex = factor(sample(c("M", "F"), N, replace = TRUE)),
education = factor(sample(edu_levels, N, replace = TRUE,
prob = c(0.25, 0.50, 0.25))),
region = factor(sample(paste0("R", 1:4), N, replace = TRUE)),
income = sample(orig$income, N, replace = TRUE)
)
# Method B: Category-preserving bootstrap with income noise
idx_B <- sample(N, N, replace = TRUE)
synB <- orig[idx_B, ]
rownames(synB) <- NULL
synB$income <- round(synB$income * exp(rnorm(N, 0, 0.15)))
swap_idx <- sample(N, round(0.2 * N))
synB$age_group[swap_idx] <- factor(sample(age_groups,
length(swap_idx), replace = TRUE))
# Method C: Near-copy with minimal perturbation (risky)
synC <- orig
synC$income <- round(synC$income * exp(rnorm(N, 0, 0.03)))
The disclosure_report() function computes multiple risk measures, evaluates each against a threshold, and produces a pass/fail assessment:
pair_A <- synth_pair(orig, synA, key_vars = qi, target_var = sens)
pair_B <- synth_pair(orig, synB, key_vars = qi, target_var = sens)
pair_C <- synth_pair(orig, synC, key_vars = qi, target_var = sens)
rep_A <- disclosure_report(pair_A, compute = c("attribution", "privacy"),
seed = 42, verbose = FALSE)
rep_B <- disclosure_report(pair_B, compute = c("attribution", "privacy"),
seed = 42, verbose = FALSE)
rep_C <- disclosure_report(pair_C, compute = c("attribution", "privacy"),
seed = 42, verbose = FALSE)
verdicts <- data.frame(
Method = c("A: Independent", "B: Bootstrap+noise", "C: Near-copy"),
Overall = c(rep_A$overall_risk, rep_B$overall_risk, rep_C$overall_risk),
Pass = c(rep_A$n_pass, rep_B$n_pass, rep_C$n_pass),
Warn = c(rep_A$n_warn, rep_B$n_warn, rep_C$n_warn)
)
knitr::kable(verdicts, caption = "Quick risk screening across three methods.")
| Method | Overall | Pass | Warn |
|---|---|---|---|
| A: Independent | HIGH | 6 | 5 |
| B: Bootstrap+noise | HIGH | 6 | 5 |
| C: Near-copy | HIGH | 6 | 6 |
Two patterns emerge. First, attribution metrics differentiate the three methods in the expected order: Method A (independent) has the lowest attribution risk, Method C (near-copy) the highest. Second, privacy models flag all three methods because they evaluate the released data’s structure alone — with 120 possible QI combinations and \(n = 1000\) records, some equivalence classes are small regardless of synthesis method. This illustrates a key lesson: privacy models and attribution metrics answer different questions and should be interpreted together.
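The count of 120 quasi-identifier combinations, and the reason some equivalence classes end up small, can be checked with a short tabulation. The sketch below draws a toy table with the same four categorical variables and category probabilities as the case study; the object names and the cut-off of 5 are illustrative assumptions, not package defaults.

```r
# Illustrative sketch only; a toy table built like the case-study QIs.
set.seed(123)
N <- 1000
toy <- data.frame(
  age_group = sample(c("20-29", "30-39", "40-49", "50-59", "60-69"),
                     N, replace = TRUE),
  sex       = sample(c("M", "F"), N, replace = TRUE),
  education = sample(c("Primary", "Secondary", "Tertiary"), N,
                     replace = TRUE, prob = c(0.25, 0.50, 0.25)),
  region    = sample(paste0("R", 1:4), N, replace = TRUE)
)
qi_toy <- names(toy)

# Number of possible QI combinations: 5 * 2 * 3 * 4
n_cells <- prod(vapply(toy[qi_toy], function(x) length(unique(x)), integer(1)))

# Equivalence class sizes (the cross-tabulation includes empty cells)
class_size <- as.vector(table(toy[qi_toy]))
observed   <- class_size[class_size > 0]

k_min         <- min(observed)                        # achieved k-anonymity level
share_below_5 <- sum(observed[observed < 5]) / N      # records in classes smaller than 5
c(n_cells = n_cells, mean_size = N / n_cells, k_min = k_min,
  share_below_5 = share_below_5)
```

With about 8 records per cell on average, random variation alone pushes some cells well below that, which is why frequency-based privacy models can flag a release regardless of how the records were synthesized.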
The rumap() function implements the multivariate
Risk-Utility framework of Thees et al.
(2026). Traditional R-U analysis plots a single risk measure
against a single utility measure, producing a two-dimensional trade-off
curve. This can be misleading: a method may appear optimal on one pair
of measures while performing poorly on another. rumap()
computes multiple risk and utility measures simultaneously, normalizes
to \([0, 1]\), and identifies
Pareto-optimal methods.
set.seed(42)
ru <- rumap(orig,
list("A: Independent" = synA,
"B: Bootstrap+noise" = synB,
"C: Near-copy" = synC),
risk_measures = c("dcap", "tcap", "ims"),
utility_measures = c("pmse", "wasserstein"),
key_vars = qi, target_var = sens,
seed = 42)
print(ru)
#> Multivariate Risk-Utility Map (RU-Map)
#> ======================================
#>
#> Configuration:
#> Synthetic datasets (SDGs): 3
#> Risk measures: dcap, tcap, ims
#> Utility measures: pmse, wasserstein
#> Original data size: 1000
#>
#> Composite Scores:
#> SDG Risk Utility Pareto
#> A: Independent 0.6667 0.3548
#> B: Bootstrap+noise 0.2606 0.3444 *
#> C: Near-copy 0.3333 1.0000 *
#>
#> * = Pareto-optimal
#>
#> Pareto-optimal SDGs: B: Bootstrap+noise, C: Near-copy
The R-U scatterplot places each method in the composite risk-utility plane. Methods in the lower-right corner (low risk, high utility) are preferred.
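The heatmap shows the per-measure values behind these composite scores. On the composite scale printed above, Method C has the highest utility (1.0000) at a composite risk of 0.3333, and Method B has the lowest composite risk (0.2606) with utility 0.3444. Method A is not Pareto-optimal: its composite risk (0.6667) is higher and its utility (0.3548) lower than Method C's.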
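The Pareto flags in the rumap() output can be reproduced from the two composite columns alone. The sketch below applies the standard dominance rule (a method is dominated if another has risk no higher and utility no lower, with at least one strict improvement) to the printed scores; the helper name is hypothetical and the rule is assumed rather than taken from the package source.

```r
# Illustrative sketch only; composite scores copied from the printed output.
scores <- data.frame(
  sdg     = c("A: Independent", "B: Bootstrap+noise", "C: Near-copy"),
  risk    = c(0.6667, 0.2606, 0.3333),
  utility = c(0.3548, 0.3444, 1.0000)
)

# TRUE if some other method is at least as good on both axes
# and strictly better on at least one
is_dominated <- function(i, risk, utility) {
  any(risk <= risk[i] & utility >= utility[i] &
        (risk < risk[i] | utility > utility[i]))
}

pareto <- !vapply(seq_len(nrow(scores)), is_dominated, logical(1),
                  risk = scores$risk, utility = scores$utility)
scores$sdg[pareto]
```

Method A is dominated by Method C, which has both lower composite risk and higher utility, so only B and C remain on the front.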
The analysis supports a structured decision:
This iterative workflow — screen with
disclosure_report(), compare with rumap(), and
refine synthesis parameters — is the core use case that riskutility is
designed to support.
The riskutility package makes five contributions to the R ecosystem for statistical disclosure control:
Comprehensive coverage. It is the first R package to unify all six risk assessment paradigms — frequency-based privacy models, attribution (CAP), ML-based (RAPID), distance-based, record linkage, and membership inference — under a single API.
Novel implementations. It provides the first R implementations of RAPID, two of the three GDPR failure criteria from Stadler et al. (2022) (singling out and linkability), \(t\)-closeness, DOMIAS density-based membership inference, and eight-method record linkage with bijective and optimal transport matching.
Unified API. The synth_pair
container and consistent S3 class pattern eliminate parameter repetition
and ensure practitioners can switch between risk measures without
learning new interfaces.
Multivariate R-U mapping. The
rumap() function implements the framework of Thees et al. (2026) for comparing multiple
synthesis approaches on multiple risk and utility dimensions
simultaneously.
Ecosystem integration. The
from_sdcMicro(), from_synthpop(), and
from_simPop() constructors allow practitioners to evaluate
data produced by any of the three main R packages.
The privacy-utility evaluation differs depending on the synthesis approach (Drechsler 2011):
If disclosure risk is too high:
If utility is too low:
subgroup_utility(), per-variable Hellinger)
No formal privacy guarantees. All measures provide empirical risk assessment. A low DCAP score does not prove that no attacker can succeed. Empirical and formal approaches (differential privacy) are complementary.
Key variable selection. Results depend heavily on the choice of quasi-identifiers. Practitioners should base QI selection on a realistic threat model, not on convenience.
Threshold interpretation. The pass/fail thresholds
used by disclosure_report() are pragmatic defaults.
Different contexts require different thresholds.
We recommend the following minimal evaluation protocol:
disclosure_report() with compute = "all" as a first screening.
rumap() to compare multiple synthesis approaches and identify Pareto-optimal methods.
Four extensions are planned: (i) a Shiny dashboard for interactive evaluation; (ii) integration with differential privacy frameworks; (iii) computational optimizations for large datasets (\(n > 50{,}000\)); and (iv) population-level risk estimation from sample data.
All computations were performed using R 4.6.0 (R Core Team 2025) with the riskutility package
version 0.1.0. Core dependencies include data.table for efficient data
manipulation and ggplot2 for all visualizations. ML-based methods
require optional packages: ranger, rpart, xgboost, and caret, loaded
conditionally via requireNamespace().
| Metric | n=1000 | n=10000 | n=100000 | Scaling |
|---|---|---|---|---|
| dcap() | < 1 s | ~5 s | ~60 s | O(n*k) |
| dcr() | < 1 s | ~10 s | ~5 min | O(n^2) |
| kanonymity() | < 1 s | ~1 s | ~5 s | O(n log n) |
| energy_distance() | < 1 s | ~2 s | ~30 s | O(n^2) |
| mmd(method = "rff") | < 1 s | ~1 s | ~5 s | O(n*D) |
| propscore() | ~1 s | ~5 s | ~30 s | O(n*p) |
| rumap() | ~10 s | ~60 s | depends | Sum of components |
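The different scaling entries for energy_distance() and the RFF variant of mmd() follow from what each method has to store. The base R sketch below contrasts the two: the energy distance builds a full pairwise distance matrix, whereas random Fourier features only build an \(n \times D\) feature matrix. Both functions are hypothetical illustrations under the usual textbook definitions (Gaussian kernel with bandwidth `sigma`, V-statistic form of the energy distance) and are not the package's code.

```r
# Illustrative sketch only; not the riskutility implementation.
# Energy distance: 2 E|X - Y| - E|X - X'| - E|Y - Y'|, needs all pairs (O(n^2))
energy_dist <- function(X, Y) {
  n  <- nrow(X)
  m  <- nrow(Y)
  D  <- as.matrix(dist(rbind(X, Y)))      # (n + m) x (n + m) distance matrix
  ix <- seq_len(n)
  iy <- n + seq_len(m)
  2 * mean(D[ix, iy]) - mean(D[ix, ix]) - mean(D[iy, iy])
}

# MMD^2 with random Fourier features for a Gaussian kernel (O(n * D))
mmd2_rff <- function(X, Y, sigma = 1, n_features = 500) {
  d <- ncol(X)
  W <- matrix(rnorm(d * n_features, sd = 1 / sigma), nrow = d)  # random frequencies
  b <- runif(n_features, 0, 2 * pi)                             # random phases
  feat <- function(Z) sqrt(2 / n_features) * cos(sweep(Z %*% W, 2, b, "+"))
  # Squared distance between the mean feature vectors of the two samples
  sum((colMeans(feat(X)) - colMeans(feat(Y)))^2)
}

set.seed(42)
n <- 300
X       <- matrix(rnorm(n * 2), ncol = 2)
Y_same  <- matrix(rnorm(n * 2), ncol = 2)             # same distribution
Y_shift <- matrix(rnorm(n * 2, mean = 1), ncol = 2)   # shifted location

e_same  <- energy_dist(X, Y_same)
e_shift <- energy_dist(X, Y_shift)
m_same  <- mmd2_rff(X, Y_same)
m_shift <- mmd2_rff(X, Y_shift)
rbind(energy = c(same = e_same, shift = e_shift),
      mmd2   = c(same = m_same, shift = m_shift))
```

Doubling \(n\) quadruples the size of the distance matrix in `energy_dist()` but only doubles the feature matrix in `mmd2_rff()`, which is the practical reason to prefer the RFF approximation on large datasets.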
sessionInfo()
#> R version 4.6.0 (2026-04-24)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.4 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] riskutility_0.1.0 rmarkdown_2.31
#>
#> loaded via a namespace (and not attached):
#> [1] Rdpack_2.6.6 pROC_1.19.0.1 rlang_1.2.0
#> [4] magrittr_2.0.5 otel_0.2.0 matrixStats_1.5.0
#> [7] e1071_1.7-17 compiler_4.6.0 vctrs_0.7.3
#> [10] reshape2_1.4.5 stringr_1.6.0 pkgconfig_2.0.3
#> [13] crayon_1.5.3 fastmap_1.2.0 backports_1.5.1
#> [16] prodlim_2026.03.11 mlr3_1.7.1 purrr_1.2.2
#> [19] xfun_0.59 randomForest_4.7-1.2 cachem_1.1.0
#> [22] mlr3misc_0.22.0 jsonlite_2.0.0 recipes_1.3.3
#> [25] moocore_0.3.1 uuid_1.2-2 parallel_4.6.0
#> [28] R6_2.6.1 bslib_0.11.0 stringi_1.8.7
#> [31] vcd_1.4-13 RColorBrewer_1.1-3 ranger_0.18.0
#> [34] parallelly_1.47.0 car_3.1-5 boot_1.3-32
#> [37] rpart_4.1.27 lmtest_0.9-40 lubridate_1.9.5
#> [40] jquerylib_0.1.4 Rcpp_1.1.1-1.1 iterators_1.0.14
#> [43] knitr_1.51 future.apply_1.20.2 zoo_1.8-15
#> [46] Matrix_1.7-5 splines_4.6.0 nnet_7.3-20
#> [49] timechange_0.4.0 tidyselect_1.2.1 abind_1.4-8
#> [52] yaml_2.3.12 mlr3tuning_1.6.0 timeDate_4052.112
#> [55] codetools_0.2-20 listenv_1.0.0 lattice_0.22-9
#> [58] tibble_3.3.1 plyr_1.8.9 withr_3.0.3
#> [61] S7_0.2.2 evaluate_1.0.5 future_1.70.0
#> [64] survival_3.8-6 proxy_0.4-29 pillar_1.11.1
#> [67] carData_3.0-6 stats4_4.6.0 checkmate_2.3.4
#> [70] VIM_7.0.0 foreach_1.5.2 generics_0.1.4
#> [73] bbotk_1.10.1 sp_2.2-1 ggplot2_4.0.3
#> [76] scales_1.4.0 laeken_0.5.3 globals_0.19.1
#> [79] class_7.3-23 glue_1.8.1 maketools_1.3.2
#> [82] tools_4.6.0 mlr3pipelines_0.11.0 robustbase_0.99-7
#> [85] sys_3.4.3 data.table_1.18.4 ModelMetrics_1.2.2.2
#> [88] gower_1.0.2 buildtools_1.0.0 grid_4.6.0
#> [91] rbibutils_2.4.1 ipred_0.9-15 colorspace_2.1-2
#> [94] paradox_1.0.1 nlme_3.1-169 palmerpenguins_0.1.1
#> [97] Formula_1.2-5 cli_3.6.6 lava_1.9.1
#> [100] dplyr_1.2.1 gtable_0.3.6 DEoptimR_1.2-0
#> [103] sass_0.4.10 digest_0.6.39 caret_7.0-1
#> [106] lgr_0.5.2 farver_2.1.2 htmltools_0.5.9
#> [109] lifecycle_1.0.5 mlr3learners_0.15.0 hardhat_1.4.3
#> [112] MASS_7.3-65