riskutility: Comprehensive Disclosure Risk and Data Utility Assessment for Anonymized and Synthetic Data in R

Introduction

Statistical agencies and data custodians face a fundamental challenge: releasing data that are useful for analysis while protecting individual privacy. Two main approaches exist: (i) traditional anonymization methods such as perturbation, suppression, and generalization (Hundepool et al. 2012; Templ 2017), and (ii) synthetic data generation via statistical models (Nowok et al. 2016; Templ et al. 2017). Both require rigorous evaluation of disclosure risk and data utility — yet existing tools assess these two dimensions in isolation or cover only a narrow subset of the relevant metrics.

The R package sdcMicro (Templ et al. 2015) provides frequency-based risk estimation but limited utility assessment. synthpop (Nowok et al. 2016) offers the CAP disclosure metric and propensity-score utility but not distance-based or membership inference methods. In Python, SDMetrics covers distance-based metrics and Anonymeter implements GDPR failure criteria, but neither addresses attribution risk. No existing R package provides a unified framework that spans all risk families and supports multivariate comparison of multiple synthesis approaches; the most comprehensive evaluation suites (the Python libraries synthcity and SynthEval) sit outside the R ecosystem.

The riskutility package for R (R Core Team 2025) fills this gap. It implements over 30 risk measures spanning six paradigms — frequency-based privacy models, attribution-based (CAP), ML-based (RAPID), distance-based, record linkage, and membership inference — alongside more than a dozen utility functions covering global, distributional, structural, and predictive assessment. All functions share a consistent S3 API with print(), summary(), and plot() methods and accept a synth_pair container that bundles data and metadata.

The need for rigorous evaluation is well established in the statistical disclosure control literature and has gained urgency as synthetic data enters regulatory frameworks. The European Article 29 Working Party (Article 29 Data Protection Working Party 2014) identifies three criteria for anonymization: singling out, linkability, and inference. riskutility measures the first two directly (singling_out(), linkability()); the inference criterion is approached through the attribution-based CAP and RAPID measures rather than a dedicated inference-attack function. Importantly, synthetic data does not automatically satisfy these criteria (Stadler et al. 2022), making empirical assessment essential.

The distinction between general and specific utility (Snoke et al. 2018) is important: general utility measures (propensity scores, distributional distances) assess overall distributional fidelity, while specific utility measures (regression fidelity, TSTR) assess whether particular analyses yield similar results on original and synthetic data. A thorough evaluation should include both.

riskutility provides empirical privacy metrics rather than formal mathematical guarantees. Unlike differential privacy, which provides worst-case bounds, our metrics quantify observed risk on concrete datasets. This approach is appropriate for practical SDC workflows where the protection mechanism is not differentially private. For differentially private synthesizers, formal epsilon budgets should be used alongside empirical evaluation.

Quick start. A comprehensive risk assessment requires just a few lines:

library(riskutility)
pair <- synth_pair(original, synthetic,
                   key_vars = c("age", "sex", "region"),
                   target_var = "income")
report <- disclosure_report(pair)
print(report)

This single call computes attribution-based risk (DCAP, TCAP, WEAP, DiSCO, RAPID), distance-based risk (DCR, NNDR, IMS, dRisk, hitting rate), privacy models (\(k\)-anonymity, \(l\)-diversity, \(t\)-closeness), and membership inference measures (singling out, linkability, NNAA) — producing a pass/fail summary across all families. For comparing multiple synthesis approaches, rumap() computes risk and utility measures simultaneously, normalizes them to a common scale, and identifies Pareto-optimal methods (Section @ref(sec-case-rumap)).

Paper overview. Section @ref(sec-background) introduces the disclosure threat taxonomy and reviews related software. Section @ref(sec-design) describes the package architecture and the synth_pair container. Sections @ref(sec-risk) and @ref(sec-utility) present the risk and utility measures with mathematical formulations and worked examples. Section @ref(sec-comprehensive) demonstrates the complete practitioner workflow on a realistic case study, comparing three synthesis approaches with disclosure_report() and rumap(). Section @ref(sec-discussion) discusses limitations, remediation strategies, and future work. Section @ref(sec-computational) provides computational details.

Package scope at a glance. Both = applicable to traditionally anonymized and synthetic data.

Category              Functions      Paradigm           Applicable to
Privacy models        8              Equivalence class  Both
Attribution (CAP)     4              Matching           Both
ML-based (RAPID)      4              Prediction         Both
Distance-based        6              Nearest neighbor   Both
Record linkage        1 (8 methods)  Linkage            Both
Membership inference  6              Attack simulation  Both
Utility measures      15             Various            Both
Frameworks            3              Composite          Both

All analyses in this paper were conducted in R (R Core Team 2025).

Background: Threat Taxonomy and Related Software

Disclosure Threat Taxonomy

Every risk measure in riskutility addresses one or more of four disclosure threats. The classical SDC taxonomy (Hundepool et al. 2012; Templ 2017) recognizes three types: identity, attribute, and inferential disclosure. We extend this to four threats by separating membership disclosure (from the ML privacy literature; Shokri et al. 2017) and memorization (relevant to generative models) as distinct categories, reflecting the broader scope of modern synthetic data evaluation:

  • Identity disclosure: An attacker links a released record to a specific individual. Measured by: recordLinkage(), kanonymity(), individual_risk(), hitting_rate().
  • Attribute disclosure: An attacker learns a sensitive attribute value through quasi-identifier (QI) matching. Measured by: dcap(), tcap(), weap(), disco(), rapid().
  • Membership disclosure: An attacker determines whether an individual’s data was used to create the released dataset. Measured by: mia_classifier(), domias(), nnaa(), singling_out(), delta_presence().
  • Memorization: A generative model reproduces training records verbatim or near-verbatim. Measured by: ims(), dcr(), nndr().

Disclosure threat taxonomy.

Threat        Definition                                       Key measures
Identity      Attacker links record to individual              recordLinkage, kanonymity, individual_risk
Attribute     Attacker learns sensitive value via linkage      dcap, tcap, weap, disco, rapid
Membership    Attacker determines if individual is in dataset  mia_classifier, domias, nnaa, singling_out
Memorization  Generator reproduces training records            ims, dcr, nndr

Risk Assessment Paradigms

riskutility implements six risk assessment paradigms, each addressing different threats from the taxonomy above. We summarize each here; full definitions and worked examples are in Section @ref(sec-risk).

Frequency-based (privacy models). These methods assess privacy properties of a single dataset based on quasi-identifier frequencies. \(k\)-Anonymity (Samarati and Sweeney 1998) requires each combination of quasi-identifiers to appear at least \(k\) times. \(l\)-Diversity (Machanavajjhala et al. 2007) and \(t\)-closeness (Li et al. 2007) progressively strengthen the protection of sensitive attributes within equivalence classes. See Section @ref(sec-privacy-models).

Attribution-based (CAP family). The Correct Attribution Probability framework (Taub et al. 2018) measures whether an attacker can infer sensitive attribute values by matching quasi-identifiers between original and released data. DCAP provides an aggregate measure; TCAP gives per-record risk scores. See Section @ref(sec-cap).

ML-based (RAPID). Risk of Attribute Prediction-Induced Disclosure (Thees et al. 2026) trains predictive models on the released data and evaluates them on the original. Accurate predictions indicate information leakage. RAPID captures complex non-linear relationships and variable interactions that the CAP matching approach may miss. See Section @ref(sec-rapid).

Distance-based. The holdout method compares distances from synthetic records to training data vs. holdout data. If synthetic records are systematically closer to training records, the generator has memorized rather than generalized. However, Yao et al. (2025) demonstrate that passing distance-based tests does not guarantee privacy (the “DCR Delusion”). See Section @ref(sec-distance).

Record linkage. This paradigm directly simulates a re-identification attack. The linkage methods include deterministic (Gower distance), probabilistic (Fellegi-Sunter), PRAM-aware, predictive (propensity-score), and random forest linkage, among others (Fellegi and Sunter 1969; Domingo-Ferrer and Torra 2003). See Section @ref(sec-recordlinkage).

Membership inference. Shadow model attacks (Shokri et al. 2017), density-based detection (Breugel et al. 2023), and GDPR failure criteria (Stadler et al. 2022) assess whether an attacker can determine if a specific individual’s data was used during synthesis. See Section @ref(sec-membership).

Utility Assessment Paradigms

Utility measures quantify how well the released data preserves the statistical properties of the original. We organize them into four groups (details in Section @ref(sec-utility)):

Global utility measures use propensity scores to quantify how distinguishable the released data is from the original (Woo et al. 2009; Snoke et al. 2018). A single number summarizes overall data quality.

Distributional utility measures (Wasserstein, Hellinger, KS test, energy distance, MMD) compare marginal or joint distributions, identifying which specific variables or relationships are poorly reproduced.

Structural utility measures (copula fidelity, contingency fidelity, PCA comparison, correlation matrix comparison) assess whether multivariate relationships — the correlations and interactions that analysts depend on — are preserved.

Predictive utility measures (TSTR, regression fidelity, feature importance stability) test whether models trained on released data generalize to real-world predictions.
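The global (propensity-score) idea can be made concrete with a short base R sketch. It stacks original and released data, fits a logistic regression that tries to tell them apart, and computes the propensity mean squared error, pMSE = mean((p_i - c)^2), where c is the share of released records in the stacked data. The function name pmse() and the simulated data are illustrative assumptions; this is not the package's implementation.

```r
set.seed(1)
n <- 400
original <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
good_syn <- data.frame(x1 = rnorm(n), x2 = rnorm(n))              # same distribution
bad_syn  <- data.frame(x1 = rnorm(n, mean = 1.5), x2 = rnorm(n))  # shifted mean in x1

# Hypothetical helper: propensity-score MSE from a main-effects logistic model.
pmse <- function(original, released) {
  stacked <- rbind(cbind(original, is_released = 0),
                   cbind(released, is_released = 1))
  fit   <- glm(is_released ~ ., data = stacked, family = binomial())
  p_hat <- fitted(fit)                         # estimated propensity scores
  share <- nrow(released) / nrow(stacked)      # c: expected score if indistinguishable
  mean((p_hat - share)^2)
}

pmse_good <- pmse(original, good_syn)
pmse_bad  <- pmse(original, bad_syn)
c(good = pmse_good, bad = pmse_bad)
```

A value near zero means the classifier cannot separate the two datasets; the shifted dataset should score visibly higher. Because pMSE is a "lower is better" quantity, it needs a direction flip before it can be compared with "higher is better" utility scores.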

Software Design and Architecture

Design Philosophy

Existing R packages for statistical disclosure control — sdcMicro (Templ et al. 2015), synthpop (Nowok et al. 2016), and simPop (Templ et al. 2017) — focus primarily on applying protection methods, with risk and utility assessment provided as secondary features. riskutility inverts this emphasis: it is a dedicated evaluation package, designed to be used after protection has been applied, regardless of which tool generated the released data.

The package follows four design principles:

  1. Consistent S3 architecture. Every major function returns a typed S3 object with print(), summary(), and plot() methods. We chose S3 over S4 classes for three reasons: lighter memory footprint, simpler method dispatch for the typical R user, and easier interoperability with data.table and ggplot2 objects. All risk/utility classes follow the same pattern, making the API predictable once one class is learned.

  2. Direction conventions. Risk measures are oriented so that higher values indicate higher disclosure risk, and utility measures so that higher values indicate higher utility (better preservation of statistical properties). The exceptions are measures that naturally use a “lower is better” scale (e.g., pMSE, Wasserstein distance): their interpretation is noted in the documentation, and rumap() applies the necessary direction transformation when normalizing to \([0, 1]\).

  3. Minimal core dependencies. Frequency-based privacy models, CAP metrics, and distance-based measures require only base R, data.table, and ggplot2. ML-based methods require optional packages — ranger for random forests in RAPID, xgboost for gradient boosting, and rpart for classification trees. These are loaded conditionally via requireNamespace() and produce informative error messages when absent.

  4. Integration over competition. Rather than reimplementing synthesis or anonymization, the package provides from_synthpop(), from_simPop(), and from_sdcMicro() constructors that extract original and released data from objects created by these packages, wrapping them in the synth_pair container (Section @ref(sec-synth-pair)). This ensures users can evaluate any protection method with a single consistent interface.
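The direction transformation in principle 2 and the Pareto comparison mentioned in the introduction can be sketched in base R. The sketch below assumes min-max scaling, which is only one possible way to normalize to \([0, 1]\); the scores are made-up numbers, and normalize01() and is_pareto() are hypothetical helper names rather than package functions.

```r
# Made-up scores for three synthesis methods (illustration only).
scores <- data.frame(
  method = c("A", "B", "C"),
  risk   = c(0.10, 0.30, 0.20),  # higher = more disclosure risk
  pmse   = c(0.08, 0.02, 0.09)   # lower = better utility
)

# Min-max scaling to [0, 1]; flip = TRUE reverses a "lower is better" scale.
normalize01 <- function(x, flip = FALSE) {
  rng <- range(x)
  z <- if (diff(rng) == 0) rep(0.5, length(x)) else (x - rng[1]) / diff(rng)
  if (flip) 1 - z else z
}

scores$risk01    <- normalize01(scores$risk)
scores$utility01 <- normalize01(scores$pmse, flip = TRUE)

# A method is Pareto-optimal if no other method has risk <= and utility >=
# with at least one of the two inequalities strict.
is_pareto <- function(risk, utility) {
  vapply(seq_along(risk), function(i) {
    dominates_i <- risk <= risk[i] & utility >= utility[i] &
      (risk < risk[i] | utility > utility[i])
    !any(dominates_i)
  }, logical(1))
}

scores$pareto <- is_pareto(scores$risk01, scores$utility01)
scores
```

In this toy table, method C is dominated by A (lower risk and higher utility), while A and B represent different trade-offs and both remain on the Pareto front.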

The synth_pair Container

The central data structure in riskutility is the synth_pair object, which bundles original data, released data, variable roles, and metadata into a single container:

pair <- synth_pair(original, synthetic,
                   key_vars = c("age", "sex", "region"),
                   target_var = "income",
                   holdout = holdout_data)

The constructor stores the original and synthetic data frames alongside their dimensions, automatically detects categorical (cat_vars) and numeric (num_vars) columns, and retains user-specified quasi-identifiers (key_vars), sensitive attribute (target_var), and optional holdout data. This metadata eliminates a common source of error: specifying different key_vars for different risk measures on the same dataset.
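To illustrate what such a container does, the following base R sketch validates the variable roles once and records column types at construction time. It is a simplified stand-in under our own assumptions: make_pair_sketch() and the class name "pair_sketch" are hypothetical, and the real synth_pair() may store and check more than is shown here.

```r
# Hypothetical, simplified container constructor (not the package's code).
make_pair_sketch <- function(original, released, key_vars, target_var = NULL) {
  stopifnot(is.data.frame(original), is.data.frame(released))
  roles <- c(key_vars, target_var)
  missing_vars <- unique(c(setdiff(roles, names(original)),
                           setdiff(roles, names(released))))
  if (length(missing_vars) > 0) {
    stop("Variables not found in both datasets: ",
         paste(missing_vars, collapse = ", "))
  }
  is_num <- vapply(original, is.numeric, logical(1))  # detect column types once
  structure(
    list(original = original, released = released,
         key_vars = key_vars, target_var = target_var,
         num_vars = names(original)[is_num],
         cat_vars = names(original)[!is_num],
         dims = list(original = dim(original), released = dim(released))),
    class = "pair_sketch"
  )
}

orig <- data.frame(age = c(25, 40), sex = factor(c("F", "M")),
                   income = c(30000, 52000))
rel  <- data.frame(age = c(27, 38), sex = factor(c("M", "F")),
                   income = c(31000, 50000))
p <- make_pair_sketch(orig, rel, key_vars = c("age", "sex"),
                      target_var = "income")
p$num_vars  # "age" "income"
p$cat_vars  # "sex"
```

Checking roles in one place means a misspelled quasi-identifier fails at construction rather than silently producing different key sets in different measures.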

Once constructed, every risk and utility function in the package accepts a synth_pair object as its first argument via S3 dispatch:

# All functions accept a synth_pair, so variable roles are not repeated:
dcap(pair)                      # Attribution risk
rapid(pair, model_type = "rf")  # ML-based risk
propscore(pair)                 # Propensity score utility
disclosure_report(pair)         # Full risk report
rumap(pair)                     # Risk-Utility map

S3 Method Pattern

Every exported risk/utility class follows an identical three-part pattern:

# 1. Two equivalent calling conventions:
result <- dcap(pair)                                      # synth_pair method
result <- dcap(X, Y, key_vars = ..., target_var = ...)    # default method

# 2. Inspection:
print(result)                # One-screen summary with key statistic
s <- summary(result)         # Detailed statistics (returns summary.dcap)
print(s)                     # Formatted multi-line output

# 3. Visualization:
plot(result, which = 1)      # Plot type 1
plot(result, which = 1:2)    # Multiple plot types

The generic function dispatches via UseMethod(). The synth_pair method extracts original, synthetic, key_vars, and target_var from the container and delegates to the default method, which performs the actual computation. The return object is a list with a class attribute (e.g., "dcap"). The summary() method returns a typed summary object (e.g., "summary.dcap") with its own print() method, separating computation from display.
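The dispatch chain described above can be sketched with a toy measure in base R. All names here (toy_pair, key_match_rate) are hypothetical and chosen so that they cannot be mistaken for package functions; the measure itself, the share of original records whose key combination also occurs in the released data, is only a placeholder for a real risk computation.

```r
# Toy container (hypothetical; stands in for a synth_pair-like object).
toy_pair <- function(original, released, key_vars) {
  structure(list(original = original, released = released,
                 key_vars = key_vars), class = "toy_pair")
}

# Generic: dispatches on the class of the first argument.
key_match_rate <- function(x, ...) UseMethod("key_match_rate")

# Default method: performs the computation on two data frames.
key_match_rate.default <- function(x, y, key_vars, ...) {
  kx <- do.call(paste, c(x[key_vars], sep = "\r"))
  ky <- do.call(paste, c(y[key_vars], sep = "\r"))
  structure(list(rate = mean(kx %in% ky), n = nrow(x), key_vars = key_vars),
            class = "key_match_rate")
}

# Container method: unpacks the stored roles and delegates.
key_match_rate.toy_pair <- function(x, ...) {
  key_match_rate.default(x$original, x$released, key_vars = x$key_vars, ...)
}

print.key_match_rate <- function(x, ...) {
  cat("Key match rate:", format(x$rate, digits = 3), "\n")
  invisible(x)
}

# summary() returns its own typed object with a separate print method.
summary.key_match_rate <- function(object, ...) {
  structure(list(n = object$n, matched = round(object$rate * object$n)),
            class = "summary.key_match_rate")
}
print.summary.key_match_rate <- function(x, ...) {
  cat(x$matched, "of", x$n, "original records have a key match\n")
  invisible(x)
}

orig <- data.frame(age = c(30, 41, 55), sex = c("F", "M", "F"))
rel  <- data.frame(age = c(30, 41, 60), sex = c("F", "F", "F"))
res1 <- key_match_rate(toy_pair(orig, rel, c("age", "sex")))  # container method
res2 <- key_match_rate(orig, rel, key_vars = c("age", "sex")) # default method
print(res1)
print(summary(res1))
```

Both calling conventions return the same object, because the container method only forwards its stored arguments to the default method.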

Plot methods use an integer which parameter to select among multiple visualization types. The number of available plot types varies by class, from one (simple measures) to seven (rumap).

Integration with the R Ecosystem

The from_* family of constructors bridges riskutility with the three main R packages for statistical disclosure control and synthetic data generation:

# From synthpop: pass synds object + original data
pair <- from_synthpop(synds_object, original_data,
                      key_vars = c("age", "sex"),
                      target_var = "income")

# From simPop: original data extracted automatically from simPopObj
pair <- from_simPop(simPopObj,
                    key_vars = c("age", "sex"),
                    target_var = "income")

# From sdcMicro: variable roles extracted from sdcMicroObj
pair <- from_sdcMicro(sdcMicroObj)

Each constructor returns a standard synth_pair object. from_sdcMicro() additionally extracts variable roles (quasi-identifiers, sensitive attributes, sample weights) from the sdcMicro S4 object, so that key_vars and target_var need not be specified manually. from_synthpop() supports multiple syntheses via the m parameter, selecting a specific synthetic dataset from a synds object. from_simPop() extracts sample weights when available, enabling weighted risk calculations.

Risk Measures

This section presents the six risk measure families. The worked examples all use the same running dataset.

set.seed(42)
n <- 500
original <- data.frame(
  age = sample(18:85, n, replace = TRUE),
  sex = factor(sample(c("M", "F"), n, replace = TRUE)),
  education = factor(sample(c("Primary", "Secondary", "Tertiary"), n,
                            replace = TRUE, prob = c(0.3, 0.5, 0.2))),
  region = factor(sample(paste0("R", 1:5), n, replace = TRUE)),
  income = round(rlnorm(n, log(40000), 0.5))
)

# Synthetic: independent draws (low risk expected)
synthetic <- data.frame(
  age = sample(18:85, n, replace = TRUE),
  sex = factor(sample(c("M", "F"), n, replace = TRUE)),
  education = factor(sample(c("Primary", "Secondary", "Tertiary"), n,
                            replace = TRUE, prob = c(0.3, 0.5, 0.2))),
  region = factor(sample(paste0("R", 1:5), n, replace = TRUE)),
  income = round(rlnorm(n, log(40000), 0.5))
)

key_vars <- c("age", "sex", "education", "region")
target_var <- "income"

pair <- synth_pair(original, synthetic,
                   key_vars = key_vars, target_var = target_var)

# Train/holdout split for distance-based metrics
set.seed(123)
train_idx <- sample(n, size = floor(0.7 * n))
train_data <- original[train_idx, ]
holdout_data <- original[-train_idx, ]

Privacy Models and Frequency-Based Risk

These methods assess privacy properties of a single dataset based on quasi-identifier frequencies. Originally developed for traditionally anonymized data, they apply equally to synthetic data. They do not compare original and released data; instead, they evaluate structural properties of the released data alone.

\(k\)-Anonymity (Samarati and Sweeney 1998) partitions records into equivalence classes (ECs) based on quasi-identifier values. The dataset satisfies \(k\)-anonymity if every EC contains at least \(k\) records: \(k = \min_i |\text{EC}(\mathbf{q}_i)|\), where \(\mathbf{q}_i\) is the quasi-identifier vector of record \(i\). Small ECs are vulnerable to identity disclosure because an attacker who knows a target’s quasi-identifiers can narrow them to fewer than \(k\) candidates. \(k\)-Anonymity protects against identity disclosure but not attribute disclosure: if all \(k\) records in a class share the same sensitive value, the attribute is trivially revealed.

\(l\)-Diversity (Machanavajjhala et al. 2007) strengthens \(k\)-anonymity by requiring that each EC contains at least \(l\) distinct values of the sensitive attribute. This guards against the homogeneity attack, in which all records in an EC share the same sensitive value and the attribute is revealed without re-identification.

\(t\)-Closeness (Li et al. 2007) further requires that the distribution of the sensitive attribute within each EC is close to its overall distribution. The distance is measured by the Earth Mover’s Distance (EMD), and \(t\) is the maximum EMD across all ECs.
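Before turning to the package functions, the first two definitions can be checked by hand on a six-record toy table in base R. The data and variable names below are invented for illustration, and the sketch covers only the achieved \(k\) and the distinct \(l\)-diversity level, not entropy or recursive variants and not \(t\)-closeness.

```r
toy <- data.frame(
  age_group = c("18-39", "18-39", "18-39", "40-64", "40-64", "65+"),
  sex       = c("F", "F", "F", "M", "M", "F"),
  diagnosis = c("flu", "flu", "flu", "flu", "asthma", "asthma")
)
qi <- c("age_group", "sex")

# Equivalence class label for every record (one level per observed QI combination).
ec <- interaction(toy[qi], drop = TRUE)

ec_size     <- tapply(toy$diagnosis, ec, length)                          # |EC|
ec_distinct <- tapply(toy$diagnosis, ec, function(v) length(unique(v)))   # distinct sensitive values

k_achieved <- min(ec_size)      # k = min_i |EC(q_i)|
l_achieved <- min(ec_distinct)  # distinct l-diversity level

data.frame(ec = names(ec_size), size = as.vector(ec_size),
           distinct = as.vector(ec_distinct))
```

The class of women aged 18-39 has three records but a single diagnosis, so it would pass a 3-anonymity check on its own and still reveal the sensitive value: this is the homogeneity problem that \(l\)-diversity addresses. The singleton class (65+, F) is what drives the achieved \(k\) down to 1.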

# k-Anonymity: minimum equivalence class size
k_res <- riskutility::kanonymity(synthetic, key_vars = key_vars)
k_res
#> k-Anonymity Assessment
#> ======================
#> 
#> Key variables: age, sex, education, region 
#> Records: 500 | Equivalence classes: 430 
#> 
#> Results:
#>   Achieved k-anonymity level: 1 
#>   Target k: 5 
#>   Satisfies 5 -anonymity: NO 
#> 
#> Violations:
#>   Records violating k = 5 : 500 (100.0%) 
#>   Unique records (k=1): 366 
#> 
#> Risk Assessment:
#>   HIGH RISK: Unique records exist that can be directly identified.

# l-Diversity: sensitive attribute diversity per EC
l_res <- riskutility::ldiversity(synthetic, key_vars = key_vars,
                                 sensitive_var = target_var)
print(l_res)
#> l-Diversity Assessment
#> ======================
#> 
#> Key variables: age, sex, education, region 
#> Sensitive variable: income 
#> Records: 500 | Equivalence classes: 430 
#> 
#> Target l = 2 
#> 
#> Distinct l-Diversity:
#>   Achieved level: 1 
#>   Satisfies 2 -diversity: NO 
#>   Violating records: 366 (73.2%) 
#> 
#> Entropy l-Diversity:
#>   Achieved level: 1.00 
#>   Satisfies 2 -diversity: NO 
#>   Violating records: 366 (73.2%) 
#> 
#> Recursive (c,l)-Diversity (c = 2 ):
#>   Satisfied: NO

# t-Closeness: EMD between EC and overall distribution
t_res <- riskutility::tcloseness(synthetic, key_vars = key_vars,
                                 sensitive_var = target_var)
t_res
#> t-Closeness Assessment
#> ======================
#> 
#> Key variables: age, sex, education, region 
#> Sensitive variable: income (numeric) 
#> Records: 500 | Equivalence classes: 430 
#> 
#> Target t = 0.2 
#> 
#> Results:
#>   Maximum EMD: 0.5000 
#>   Satisfies t-closeness: NO 
#>   Violating records: 426 (85.2%) 
#> 
#> Interpretation:
#>    395 of 430 equivalence classes exceed the threshold.

With 68 unique age values, 2 sex levels, 3 education levels, and 5 regions, there are up to \(68 \times 2 \times 3 \times 5 = 2040\) possible QI combinations for only \(n = 500\) records. Most equivalence classes are singletons, yielding \(k = 1\) — this is expected for fine-grained quasi-identifiers and does not by itself indicate a problem with the synthetic data. The three models form a hierarchy: \(k\)-anonymity guards against identity disclosure, \(l\)-diversity against homogeneity attacks, and \(t\)-closeness against skewness attacks.

Additional frequency-based measures include individual_risk() for per-record re-identification probability based on EC frequencies (Franconi and Polettini 2004; Skinner and Elliot 2002), attacker_risk() for scenario-based assessment under prosecutor, journalist, and marketer attacker models (Hundepool et al. 2012), suda() for detecting records unique on small QI subsets, and population_uniqueness() for estimating population-level uniques via super-population models (Reiter 2005).

Privacy models overview.

Function                   Input           Key output                  Threats
kanonymity()               Single dataset  Min EC size                 Identity
ldiversity()               Single dataset  Min distinct values per EC  Attribute
tcloseness()               Single dataset  Max EMD across ECs          Attribute
suda()                     Single dataset  SUDA scores                 Identity
individual_risk()          Single dataset  Per-record frequency risk   Identity
population_uniqueness()    Single dataset  Estimated pop. uniques      Identity
attacker_risk()            Single dataset  Scenario-based risk         Identity
epsilon_identifiability()  Single dataset  Identifiability fraction    Identity

Attribution-Based Risk: The CAP Family

The Correct Attribution Probability framework (Taub et al. 2018) measures attribute disclosure: can an attacker infer a sensitive value by matching quasi-identifiers between original and released data?

For each original record \(i\) with quasi-identifier values \(\mathbf{q}_i\), the attacker finds all records in the released data whose quasi-identifiers match \(\mathbf{q}_i\) (the equivalence class). If the sensitive attribute is homogeneous within this class, the attacker learns the true value. The Targeted CAP (TCAP) gives each record a risk score between 0 and 1: \(\text{TCAP}_i = \Pr(\text{correct attribution} \mid \mathbf{q}_i)\). The mean CAP across all records is \(\overline{\text{CAP}} = n^{-1} \sum_i \text{CAP}_i\) (returned as cap). The Differential CAP subtracts the baseline (modal-class) attribution rate, \(\text{DCAP} = \overline{\text{CAP}} - \text{baseline}\) (returned as dcap; Taub et al. 2018), so a value near zero indicates no attribution gain over simply guessing the modal class. The summary() method also reports the risk ratio \(\overline{\text{CAP}} / \text{baseline}\) to contextualize the result.
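The matching logic can be traced on a four-record example in base R. This is a minimal sketch under stated assumptions: cap_sketch() is a hypothetical helper, exact key matching is used, records without a key match are set to NA and excluded from the mean, and the baseline is taken as the modal-class share of the target in the original data. The package functions may treat these details differently.

```r
# Hypothetical helper: per-record CAP, mean CAP, baseline, and their difference.
cap_sketch <- function(original, released, key_vars, target_var) {
  key_o <- do.call(paste, c(original[key_vars], sep = "\r"))
  key_r <- do.call(paste, c(released[key_vars], sep = "\r"))
  cap_i <- vapply(seq_len(nrow(original)), function(i) {
    hits <- released[[target_var]][key_r == key_o[i]]  # released records sharing q_i
    if (length(hits) == 0) NA_real_                    # no match: CAP undefined here
    else mean(hits == original[[target_var]][i])       # share with the true value
  }, numeric(1))
  baseline <- max(table(original[[target_var]])) / nrow(original)
  mean_cap <- mean(cap_i, na.rm = TRUE)
  list(cap_i = cap_i, mean_cap = mean_cap, baseline = baseline,
       dcap = mean_cap - baseline)
}

orig <- data.frame(region = c("N", "N", "S", "S"),
                   sex    = c("F", "M", "F", "M"),
                   smoker = c("yes", "no", "no", "no"))
rel  <- data.frame(region = c("N", "N", "S", "S"),
                   sex    = c("F", "F", "F", "M"),
                   smoker = c("yes", "no", "no", "no"))

res <- cap_sketch(orig, rel, c("region", "sex"), "smoker")
res$cap_i     # 0.5  NA 1.0 1.0
res$baseline  # 0.75
```

The first record matches two released records with conflicting values (CAP of 0.5), the second has no match, and the last two are attributed with certainty. Because 75% of the original records are non-smokers, most of the mean CAP of about 0.83 is already explained by the baseline, which is the point of the differential measure.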

The WEAP (Within-EC Attribution Probability) evaluates risk from the released data alone, without access to the original, making it suitable for data custodians who cannot share the original data with an auditor. DiSCO (Disclosive in Synthetic, Correct in Original) identifies records that are both confidently attributed in the released data and correctly attributed in the original.

# TCAP: per-record risk (most informative member of CAP family)
tcap_res <- tcap(pair)
summary(tcap_res)
#> Summary: Correct Attribution Probability (CAP) Analysis
#> =======================================================
#> Key variables: age, sex, education, region 
#> Target variable: income 
#> Method: exact 
#> 
#> CAP Statistics:
#>   Mean (CAPd): 0 
#>   Max: 0 
#>   Median: 0 
#>   SD: 0 
#>   Baseline: 0.002 
#>   Risk ratio: 0 
#> 
#> Certain Disclosure:
#>   Records at certain risk: 0 / 117 (0.0%) 
#>   (Key uniquely determines target in both original and synthetic)
#> 
#> CAP Distribution (quantiles):
#>   0%  25%  50%  75%  90% 100% 
#>    0    0    0    0    0    0 
#> 
#> Risk Categories:
#>   High risk (CAP >= 0.8): 0 (0.0%) 
#>   Medium risk (0.5 <= CAP < 0.8): 0 
#>   Low risk (CAP < 0.5): 117 
#> 
#> Matching Statistics:
#>   Records matched: 117 / 500 
#>   Records unmatched: 383 
#>   Avg matches per record: 1.2
plot(tcap_res)

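Since our running example uses independently generated synthetic data (no relationship to the original), TCAP values should be close to the baseline attribution probability. The output confirms this: only 117 of the 500 original records find a key match in the synthetic data, and none of them is attributed correctly (mean CAP of 0 against a baseline of 0.002). Records with TCAP above 0.1 warrant closer inspection.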

CAP family comparison with interpretation thresholds.

Metric  Requires original?  Per-record?  Measures                         Low risk
DCAP    Yes                 No           Mean attribution probability     ratio < 1.5
TCAP    Yes                 Yes          Individual attribution risk      < 0.1 per record
WEAP    No                  Yes          Within-EC homogeneity            < 0.1
DiSCO   Yes                 Yes          Correct + confident attribution  < 5%

ML-Based Risk: RAPID

Risk of Attribute Prediction-Induced Disclosure (Thees et al. 2026) takes a fundamentally different approach to attribute disclosure. Instead of matching quasi-identifiers, RAPID trains a predictive model \(\hat{f}\) on the released data \((Y_{\mathcal{K}}, Y_s)\) and evaluates its predictions on the original data: \(\hat{s}_i = \hat{f}(X_{\mathcal{K},i})\). For numeric targets, a record is at risk when the prediction error falls below a threshold \(\epsilon\): \(e(s_i, \hat{s}_i) < \epsilon\), where \(e(\cdot, \cdot)\) is a configurable error metric (symmetric percentage error by default). The RAPID score is the fraction of at-risk records: \(\text{RAPID} = n^{-1} \sum_i \mathbf{1}(e(s_i, \hat{s}_i) < \epsilon)\). For categorical targets, a different evaluation applies: a record is at risk when a gain or ratio score exceeds a threshold.
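For a numeric target, the score can be reproduced in a few lines of base R. The sketch below is illustrative only: the data are simulated so that released and original data share the same age-income relationship, and the symmetric percentage error is assumed to take the form 200 |s - ŝ| / (|s| + |ŝ|), which may differ from the package's exact definition.

```r
set.seed(1)
n <- 300

# Simulated data: income depends linearly on age (illustration only).
make_data <- function(n, slope) {
  age <- sample(18:85, n, replace = TRUE)
  data.frame(age = age, income = 20000 + slope * age + rnorm(n, sd = 3000))
}
original <- make_data(n, slope = 500)
released <- make_data(n, slope = 500)   # reproduces the age-income relationship

# Assumed form of the symmetric percentage error, in percent.
sym_pct_error <- function(s, s_hat) 200 * abs(s - s_hat) / (abs(s) + abs(s_hat))

fit   <- lm(income ~ age, data = released)   # f-hat: trained on released data
s_hat <- predict(fit, newdata = original)    # predictions for original records
err   <- sym_pct_error(original$income, s_hat)

epsilon     <- 10                            # threshold in percent
rapid_score <- mean(err < epsilon)           # fraction of at-risk records
rapid_score
```

Here a large share of records is flagged because the released data carries a strong, accurate signal about income. A high score alone does not separate leakage from what any sensible model would predict, which is why a comparison against a baseline is needed.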

rapid_res <- rapid(pair, model_type = "lm")
summary(rapid_res)
#> Summary: RAPID Disclosure Risk Assessment
#> ==========================================
#> 
#> Configuration:
#>   Model: lm 
#>   Method: symmetric_percentage 
#>   Threshold: 10 
#>   Key variables: age, sex, education, region 
#>   Target variable: income 
#> 
#> Risk Assessment:
#>   RAPID score: 0.1500 
#>   Records at risk: 75 / 500 (15.00%) 
#>   Risk level: HIGH 
#> 
#> Model Performance:
#>   MAE: 18392.4849 
#>   RMSE: 25265.2739 
#>   Relative MAE: 0.4022 
#>   Relative RMSE: 1.0065 
#> 
#> Interpretation Guidelines:
#>   High risk: Significant disclosure risk detected. Consider additional protection.
plot(rapid_res, which = c(1, 3))

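With independently generated synthetic data, we expect the RAPID score to be close to the baseline. The score of 0.15 is nevertheless labelled HIGH, and the model performance block explains why this label should not be read as leakage: a relative RMSE of about 1.0 suggests that the linear model predicts income no better than a constant, so the flagged records are those whose income happens to lie within the 10% threshold of the prediction. A raw score should therefore be judged against a baseline, for example with rapid_test(). The threshold sensitivity plot (which = 3) shows how the at-risk fraction changes across threshold values.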

RAPID complements the CAP family in two ways. First, it captures non-linear relationships and variable interactions. Second, it provides inferential tools: rapid_test() computes a permutation-based \(p\)-value, confint() provides bootstrap confidence intervals, and rapid_threshold_select() optimizes the threshold in a data-driven manner.

RAPID model backends.

Model  Package  Numeric  Categorical  Interactions
lm     stats    Yes      No           Manual
rf     ranger   Yes      Yes          Automatic
cart   rpart    Yes      Yes          Automatic
gbm    xgboost  Yes      Yes          Automatic
logit  stats    No       Yes          Manual

RAPID Score  Risk Level  Interpretation
< 0.05       Low         ML model cannot predict target much better than baseline
0.05–0.15    Moderate    Some predictive signal from synthetic data
0.15–0.30    Elevated    Significant predictive leakage
> 0.30       High        Strong evidence of disclosure risk

Distance-Based Risk

Distance-based methods detect memorization: the failure mode where a generative model reproduces training records verbatim or near-verbatim. The key idea is the holdout method — split the original data into a training set \(T\) (used for synthesis) and a holdout set \(H\) (unseen by the generator). For each synthetic record \(y_j\), compute the Distance to Closest Record in \(T\) (\(d_T\)) and in \(H\) (\(d_H\)). If the generator has generalized, \(d_T\) and \(d_H\) should be comparable. If it has memorized, \(d_T\) will be systematically smaller:

\[\text{DCR\_share} = n^{-1} \sum_j \mathbf{1}\bigl(d_T(y_j) < d_H(y_j)\bigr), \qquad \text{DCR\_ratio} = \frac{\bar{d}_T}{\bar{d}_H}\]

A DCR share meaningfully above 0.5 (the package flags shares above 0.55), or a ratio below about 1, suggests memorization. The NNDR (Nearest Neighbor Distance Ratio) provides a complementary view: for each synthetic record, it is the ratio of the distance to its nearest neighbor over the distance to its second-nearest neighbor. A ratio near 0 indicates a single dominant match; near 1 indicates no distinctive match. IMS (Identical Match Share) counts exact copies. When an explicit holdout is unavailable, holdout_fraction automatically splits the original data.
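The two summary statistics are easy to compute by hand, and doing so exposes a property worth knowing. The base R sketch below uses Euclidean distance on simulated numeric data, which is a simplifying assumption (the package output reports Gower distance), and nn_dist() is a hypothetical helper. All three datasets are independent draws from the same distribution, so nothing has been memorized.

```r
set.seed(1)

# Distance from each row of `from` to its nearest row in `to` (Euclidean).
nn_dist <- function(from, to) {
  apply(from, 1, function(r) sqrt(min(colSums((t(to) - r)^2))))
}

p <- 3
train   <- matrix(rnorm(400 * p), ncol = p)
holdout <- matrix(rnorm(100 * p), ncol = p)
synth   <- matrix(rnorm(500 * p), ncol = p)   # independent of train and holdout

d_T <- nn_dist(synth, train)     # DCR to training set
d_H <- nn_dist(synth, holdout)   # DCR to holdout set

dcr_share <- mean(d_T < d_H)
dcr_ratio <- mean(d_T) / mean(d_H)

# Same comparison with equally sized reference sets (first 100 training rows).
d_T_bal            <- nn_dist(synth, train[1:100, , drop = FALSE])
dcr_share_balanced <- mean(d_T_bal < d_H)

c(share = dcr_share, ratio = dcr_ratio, share_balanced = dcr_share_balanced)
```

With a 400/100 split, a synthetic record's nearest reference point falls in the training set with probability 400/500 = 0.8 even without any memorization, so the share sits near 0.8 and the ratio below 1. Only with equally sized reference sets does the share return to roughly one half. The nominal 50% benchmark therefore applies only to balanced splits; otherwise the observed share has to be compared with a null distribution.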

dcr_res <- dcr(pair, holdout_fraction = 0.2)
summary(dcr_res)
#> Summary: Distance to Closest Record (DCR)
#> ==========================================
#> WARNING: DCR has known limitations - see arXiv:2505.01524
#> 
#> Method: gower 
#> Variables: 5 
#> 
#> Dataset Sizes:
#>   Training: 400 | Holdout: 100 | Synthetic: 500 
#> 
#> Key Metrics:
#>   DCR ratio: 0.4869 (ideal: ~1.0)
#>   DCR share: 81.4% (ideal: ~50%)
#>   Privacy: WARNING 
#> 
#> Statistical Tests:
#>   Wilcoxon p-value: < 2.22e-16 
#>   Null comparison p-value: 0.5 
#>   Null share distribution: mean = 0.805 , 95% CI = [ 0.68 , 0.921 ]
#> 
#> DCR to Training:
#>   Mean: 0.022 | Median: 0.017 | SD: 0.0177 
#>   Quantiles:
#>     0%     5%    25%    50%    75%    95%   100% 
#> 0.0001 0.0040 0.0102 0.0170 0.0294 0.0570 0.1402 
#> 
#> DCR to Holdout:
#>   Mean: 0.0452 | Median: 0.0378 | SD: 0.0281 
#>   Quantiles:
#>     0%     5%    25%    50%    75%    95%   100% 
#> 0.0023 0.0106 0.0241 0.0378 0.0634 0.0993 0.2014 
#> 
#> Proximity Analysis:
#>   Closer to training: 407 (81.4%) 
#>   Closer to holdout: 93 (18.6%) 
#>   Identical to training (DCR=0): 0
plot(dcr_res, which = 1)

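The DCR Delusion. Yao et al. (2025) show that DCR can fail to detect privacy leakage: datasets deemed “private” by DCR may still be vulnerable to membership inference attacks. Their central recommendation is that DCR be interpreted relative to a proper null distribution rather than in absolute terms; dcr() implements exactly this, comparing the observed share against a permutation null (null_test) and reporting a Wilcoxon test alongside the point estimate. The example above shows why this matters: the observed share of 81.4% is far from the nominal 50%, but with 400 training and 100 holdout records a synthetic record is more likely to find its nearest neighbor in the larger set. The null distribution has mean 0.805 and the null comparison p-value is 0.5, so the share gives no evidence of memorization. In any case, distance-based metrics should always be complemented with other risk families.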

Distance-based and proximity risk measures.

Metric          Holdout  Detects                    Low risk
DCR             Yes      Memorization               share < 0.55
NNDR            Yes      Memorization               share < 0.55
IMS             No       Exact copies               < 0.01
RF proximity    Yes      Memorization (non-linear)  ratio near 1
dRisk           No       Close records              < 0.05
Hitting rate    No       Close records              < 0.05
Epsilon ID      No       Identifiability            < 0.01
Delta-presence  No       Membership bounds          > 0.5

RF proximity offers a data-adaptive alternative: it trains a random forest to discriminate original from synthetic records and measures how often synthetic records share terminal nodes with training vs. holdout records, capturing non-linear proximity that fixed distance metrics may miss. Use rf_privacy() when complex interactions are expected.

Record Linkage Risk

The recordLinkage() function directly simulates a re-identification attack by linking each original record to the most similar record(s) in the anonymized dataset. Eight methods are implemented: deterministic (Gower distance), probabilistic (Fellegi-Sunter; Fellegi and Sunter 1969), PRAM-aware, predictive (propensity score), random forest proximity, rank-based (RBRL; Muralidhar and Domingo-Ferrer 2016), robust Mahalanobis (Templ and Meindl 2008), and autoencoder embedding (Guo and Berkhahn 2016). Three matching modes are available: independent (many-to-one), bijective (one-to-one via the Hungarian algorithm; Herranz et al. 2016), and optimal transport (Sinkhorn). RBRL supports independent matching only; see the table below.

For full details on all methods and matching modes, see ?recordLinkage.

Record linkage methods. All 3 = independent, bijective, OT.

| Method | Distance | Mixed types | Matching |
|---|---|---|---|
| Deterministic | Gower | Yes | All 3 |
| Probabilistic | Fellegi-Sunter | Yes | All 3 |
| PRAM | Transition prob. | Categorical | All 3 |
| Predictive | Propensity | Yes | All 3 |
| RF | RF proximity | Yes | All 3 |
| RBRL | Rank-based | Yes | Independent |
| Mahalanobis | Mahalanobis | Numeric | All 3 |
| Embedding | Autoencoder | Yes | All 3 |

Membership Inference and Anonymization Failure Criteria

This section groups two related but distinct concerns. The first three measures (mia_classifier(), domias(), nnaa()) assess membership disclosure — whether a membership inference attack (MIA) can determine if a specific individual’s data was used during synthesis. The singling out and linkability attacks operationalize two of the GDPR anonymization criteria (Article 29 Data Protection Working Party 2014), following the attack-based approach of Giomi et al. (2023).

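NNAA (Nearest Neighbor Adversarial Accuracy; Yale et al. 2020) is based on the adversarial accuracy of a nearest-neighbour two-sample comparison, \(\text{AA}(A,S) = \tfrac{1}{2}\bigl[\Pr(d_{AS} > d_{AA}) + \Pr(d_{SA} > d_{SS})\bigr]\), where \(d_{AS}\) is the distance from a record in \(A\) to its nearest neighbour in \(S\) and \(d_{AA}\) is the distance to its nearest other record in \(A\) (\(d_{SA}\) and \(d_{SS}\) are defined analogously). A value near 0.5 means the two samples cannot be told apart. The reported privacy loss is \(\text{AA}(\text{holdout}, S) - \text{AA}(\text{train}, S)\); a positive value means synthetic records resemble training records more closely than holdout records, indicating memorization: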

nnaa_res <- nnaa(train_data, synthetic, holdout = holdout_data,
                 method = "gower", seed = 42)
print(nnaa_res)
#> Nearest-Neighbor Adversarial Accuracy (NNAA)
#> =============================================
#> Method: gower 
#> Variables used: 5 
#> 
#> Dataset Sizes:
#>   Training records: 350 
#>   Holdout records: 150 
#>   Synthetic records: 500 
#> 
#> Adversarial Accuracy (ideal ~ 0.5):
#>   AA (train vs synth): 0.4599 
#>     Left  (real NN > synth NN): 0.3457 
#>     Right (synth NN > real NN): 0.574 
#>   AA (holdout vs synth): 0.4743 
#>     Left  (hold NN > synth NN): 0.2467 
#>     Right (synth NN > hold NN): 0.702 
#> 
#> Privacy Loss (ideal ~ 0):
#>   Privacy Loss: 0.0145  
#> 
#> Privacy Assessment: PASS
#>   No evidence of training data memorization.

| Privacy Loss | Interpretation |
|---|---|
| Near 0 | No detectable leakage (ideal) |
| 0.01–0.05 | Minor leakage, likely acceptable |
| > 0.10 | Significant memorization |
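
The adversarial accuracy formula can be written out in a few lines. The sketch below is an illustration under stated assumptions: it uses Euclidean distance on simulated numeric data (the nnaa() call above used Gower distance), and nn_dist() and adversarial_accuracy() are hypothetical helper names, not package functions.

```r
set.seed(2)
train   <- matrix(rnorm(150 * 2), ncol = 2)
holdout <- matrix(rnorm(150 * 2), ncol = 2)
# Assumed memorizing generator: every training row plus tiny noise
synth   <- train + matrix(rnorm(150 * 2, sd = 0.01), ncol = 2)

# Nearest-neighbour distance from each query row to the reference rows
nn_dist <- function(query, ref, exclude_self = FALSE) {
  vapply(seq_len(nrow(query)), function(i) {
    d <- sqrt(colSums((t(ref) - query[i, ])^2))
    if (exclude_self) d <- d[-i]  # drop the record's zero distance to itself
    min(d)
  }, numeric(1))
}

# AA(A, S) = 0.5 * [Pr(d_AS > d_AA) + Pr(d_SA > d_SS)]
adversarial_accuracy <- function(a, s) {
  d_as <- nn_dist(a, s)
  d_sa <- nn_dist(s, a)
  d_aa <- nn_dist(a, a, exclude_self = TRUE)
  d_ss <- nn_dist(s, s, exclude_self = TRUE)
  0.5 * (mean(d_as > d_aa) + mean(d_sa > d_ss))
}

aa_train     <- adversarial_accuracy(train, synth)
aa_holdout   <- adversarial_accuracy(holdout, synth)
privacy_loss <- aa_holdout - aa_train
c(aa_train = aa_train, aa_holdout = aa_holdout, privacy_loss = privacy_loss)
```

With this deliberately memorizing generator, the training-side accuracy falls well below 0.5 while the holdout-side accuracy stays near 0.5, so the privacy loss is clearly positive.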

Singling out and linkability operationalize two of the Article 29 Working Party’s three anonymization criteria (the third, inference, is addressed by the attribution-based CAP and RAPID measures):

so_res <- singling_out(original, synthetic,
                       n_attacks = 500, n_cols = 3,
                       mode = "multivariate", seed = 42)
print(so_res)
#> Singling Out Risk Assessment
#> ============================
#> Mode: multivariate | Columns per predicate: 3 
#> Variables used: 5 
#> 
#> Dataset Sizes:
#>   Original (training): 250 
#>   Holdout: 250 
#>   Synthetic: 500 
#> 
#> Attack Results (500 predicates):
#>   Singling out in original: 52 (10.4%) 
#>   Singling out in holdout: 48 (9.6%) 
#> 
#> Risk Score:
#>   Residual risk: 0.0088 [0.0000, 0.0655] (95% CI)
#> 
#> Privacy Assessment: PASS
#>   Singling out risk is within acceptable bounds (<= 0.1).

link_res <- linkability(original, synthetic,
                        n_attacks = 500, n_neighbors = 1, seed = 42)
print(link_res)
#> Linkability Risk Assessment
#> ===========================
#> Auxiliary columns: age, income 
#> Secret columns: sex, education, region 
#> Neighbors considered: 1 
#> 
#> Dataset Sizes:
#>   Original (training): 250 
#>   Holdout: 250 
#>   Synthetic: 500 
#> 
#> Attack Results (500 attacks):
#>   Successful links in original: 1 (0.2%) 
#>   Successful links in holdout: 3 (0.6%) 
#> 
#> Risk Score:
#>   Residual risk: 0 [0.0000, 0.0092] (95% CI)
#> 
#> Privacy Assessment: PASS
#>   Linkability risk is within acceptable bounds (<= 0.1).

Membership inference and GDPR measures.

| Metric | Attack type | Holdout | GDPR criterion | Low risk |
|---|---|---|---|---|
| MIA classifier | Shadow model | Yes | | < 0.55 |
| DOMIAS | Density overfitting | Yes | | < 0.6 |
| NNAA | Nearest neighbor | Yes | | < 0.05 |
| Singling out | Predicate-based | Yes | Art. 29 WP | < 0.1 |
| Linkability | Record linkage | Yes | Art. 29 WP | < 0.1 |
| delta-Presence | Membership bounds | No | | > 0.5 |
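
The residual risk values printed for the singling out and linkability attacks above are consistent with the baseline-corrected estimator used in the attack-based framework of Giomi et al. (2023): the success rate against the training data is compared with the success rate against the holdout (control) data and rescaled. The sketch below recomputes the point estimates from the printed counts. That the package uses exactly this estimator, for example without a small-sample correction of the rates, is an assumption, and residual_risk() is a hypothetical helper name.

```r
# Residual risk: excess success on training data relative to the control,
# rescaled by the room left above the control rate and truncated at 0
residual_risk <- function(n_success_train, n_success_control, n_attacks) {
  r_train   <- n_success_train / n_attacks
  r_control <- n_success_control / n_attacks
  max(0, (r_train - r_control) / (1 - r_control))
}

so_risk   <- residual_risk(52, 48, 500)  # singling out counts from the output above
link_risk <- residual_risk(1, 3, 500)    # linkability counts from the output above
round(c(singling_out = so_risk, linkability = link_risk), 4)
```

For singling out this gives (0.104 - 0.096) / (1 - 0.096), about 0.0088. For linkability the control attack succeeds more often than the training attack, so the estimate is truncated at 0. The confidence intervals in the output are not reproduced by this sketch.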

Cross-Family Comparison

No single metric tells the full story. Applying measures from several families to the same dataset reveals complementary and sometimes contradictory information:

# Near-copy: original + small noise (high risk expected)
set.seed(99)
near_copy <- original
near_copy$age <- near_copy$age + sample(-1:1, n, replace = TRUE)
near_copy$income <- near_copy$income + round(rnorm(n, 0, 500))
pair_risky <- synth_pair(original, near_copy,
                         key_vars = key_vars, target_var = target_var)

# Compare key metrics across the two datasets
comparison <- data.frame(
  Metric = c("DCAP", "RAPID (lm)", "IMS"),
  Safe = c(
    dcap(pair)$dcap,
    rapid(pair, model_type = "lm", verbose = FALSE)$rapid,
    ims(pair)$ims
  ),
  Risky = c(
    dcap(pair_risky)$dcap,
    rapid(pair_risky, model_type = "lm", verbose = FALSE)$rapid,
    ims(pair_risky)$ims
  )
)
comparison$Safe <- round(comparison$Safe, 4)
comparison$Risky <- round(comparison$Risky, 4)
knitr::kable(comparison,
             caption = "Cross-family comparison: safe vs. risky synthetic data.")

Cross-family comparison: safe vs. risky synthetic data.

| Metric | Safe | Risky |
|---|---|---|
| DCAP | -0.0137 | 0.4863 |
| RAPID (lm) | 0.1500 | 0.1240 |
| IMS | 0.0000 | 0.0000 |

In this example the families disagree. DCAP clearly flags the near-copy (0.4863 vs. -0.0137), RAPID (lm) is of similar size for both datasets (0.1240 vs. 0.1500), and IMS is 0 for both because the added noise leaves essentially no exact copies. Attribution measures quantify information leakage; distance-based measures quantify memorization. These complementary perspectives mean that a dataset can pass one family’s tests while failing another’s, so a thorough evaluation uses at least one measure from each family.

Data Utility Measures

A dataset that passes all risk checks but destroys the analytical value of the data is useless. Utility measures quantify how well the released data preserves the statistical properties of the original.

Global Utility: Propensity Scores

Global utility measures give a single-number verdict by asking: can a classifier tell original and synthetic records apart?

The propensity score method (Woo et al. 2009; Snoke et al. 2018) pools original (\(X\), \(n_X\) records) and synthetic (\(Y\), \(n_Y\) records) data, labels them (0/1), and fits a classifier. The pMSE (propensity score Mean Squared Error) measures how well the model discriminates:

\[\text{pMSE} = \frac{1}{N} \sum_{i=1}^{N} \left(\hat{p}_i - c\right)^2\]

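where \(\hat{p}_i\) is the estimated probability that record \(i\) is synthetic, \(N = n_X + n_Y\), and \(c = n_Y / N\). If original and synthetic records are indistinguishable, pMSE \(\approx 0\); for \(c = 0.5\) the maximum, reached under perfect separation, is 0.25.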
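
As a minimal illustration of the formula, the sketch below computes the pMSE by hand on simulated data. It assumes a logistic regression as the propensity model, which is a simplification chosen for a dependency-free example (the propscore() run in this section reports a random forest), and all object names are illustrative.

```r
set.seed(3)
n_x <- 300
n_y <- 300
X <- data.frame(a = rnorm(n_x), b = rnorm(n_x))              # stand-in for original data
Y <- data.frame(a = rnorm(n_y), b = rnorm(n_y, mean = 0.5))  # stand-in for synthetic data

# Pool the records and label them: 0 = original, 1 = synthetic
pooled <- rbind(X, Y)
pooled$is_synth <- rep(c(0, 1), times = c(n_x, n_y))

# Propensity model: probability that a record is synthetic
fit   <- glm(is_synth ~ a + b, data = pooled, family = binomial())
p_hat <- fitted(fit)

# pMSE = mean squared deviation of the propensities from the synthetic share c
c_share <- n_y / (n_x + n_y)
pmse    <- mean((p_hat - c_share)^2)
pmse
```

The shift in the mean of b makes the two samples partly distinguishable, so the pMSE is above 0 but far below the 0.25 ceiling.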

prop_res <- propscore(pair)
summary(prop_res)
#> Propensity Score Utility Summary
#> ================================
#> Method:           rf 
#> Sample sizes:     n_original = 500 , n_synthetic = 500 
#> Class ratio (cr): 0.5 
#> 
#> Propensity Score Statistic (pMSE): 0.02519 
#> PS ratio (below/above cr):        0.8868 
#> 
#> Mean propensity (original):  0.4926 
#> Mean propensity (synthetic): 0.4853 
#> 
#> 
#> Density diagnostics:
#>   KL divergence:               0.005848 
#>   KL divergence (Bayes space): 0.005803 
#>   Mean density ratio:          1.0014  (sd: 0.0034 )
#>   Mean density ratio (Bayes):  0.0136  (sd: 0.0034 )

| pMSE Value | Interpretation |
|---|---|
| < 0.01 | Excellent fidelity |
| 0.01–0.05 | Good fidelity |
| 0.05–0.10 | Moderate differences |
| > 0.10 | Poor fidelity |

Univariate Diagnostics

When global utility is poor, per-variable measures identify which variables are responsible. For numeric variables, the Wasserstein distance measures the cost of transforming one distribution into another. For categorical variables, the Hellinger distance measures distributional overlap:

\[H(p, q) = \frac{1}{\sqrt{2}} \sqrt{\sum_{k=1}^{K} \left(\sqrt{p_k} - \sqrt{q_k}\right)^2}\]
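
The Hellinger formula translates directly into code. The following sketch is an illustrative base R implementation for two categorical vectors; hellinger_dist() is a hypothetical helper, not the package's hellinger() function, and the small input vectors are made up.

```r
hellinger_dist <- function(x, y) {
  # Use a common set of categories so that p and q are aligned
  lev <- union(unique(as.character(x)), unique(as.character(y)))
  p <- as.vector(table(factor(x, levels = lev))) / length(x)
  q <- as.vector(table(factor(y, levels = lev))) / length(y)
  sqrt(sum((sqrt(p) - sqrt(q))^2)) / sqrt(2)
}

orig_sex  <- c("F", "F", "M", "M")
synth_sex <- c("F", "M", "M", "M")

h_mixed    <- hellinger_dist(orig_sex, synth_sex)
h_same     <- hellinger_dist(orig_sex, orig_sex)        # identical distributions
h_disjoint <- hellinger_dist(c("F", "F"), c("M", "M"))  # no shared category
c(mixed = h_mixed, same = h_same, disjoint = h_disjoint)
```

The two boundary cases show the range of the measure: identical distributions give 0 and distributions with disjoint support give 1.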

# Hellinger distance for categorical variables
h_res <- hellinger(original, synthetic, vars = c("sex", "education"))
print(h_res)
#> Hellinger Distance - Categorical Distribution Comparison
#> =========================================================
#> 
#> Dataset Sizes:
#>   Original (X): 500 | Synthetic (Y): 500 
#>   Variables compared: 2 
#> 
#> Summary:
#>   Mean Hellinger distance: 0.0283 
#>   Max Hellinger distance:  0.0467 
#>   Min Hellinger distance:  0.0098 
#>   Utility score (1-mean): 0.9717 
#> 
#> Interpretation:
#>   EXCELLENT: Categorical distributions are very similar.

# CI proximity: confidence interval overlap for means
cip_res <- ci_proximity(original, synthetic, vars = c("age", "income"))
print(cip_res)
#> Confidence Interval Proximity - Statistical Inference Preservation
#> ===================================================================
#> 
#> Configuration:
#>   Confidence level: 95 %
#>   Variables compared: 2 
#>   Sample sizes: X = 500 , Y = 500 
#> 
#> Summary Metrics:
#>   Mean proximity score:     0.7936 (1 = perfect)
#>   Mean CI overlap:          0.6075 (1 = complete)
#>   Mean relative error:      0.0212 (0 = perfect)
#>   CIs containing orig mean: 50.0% 
#> 
#> Interpretation:
#>   MODERATE: Some degradation of statistical properties.

The CI proximity measure (Karr et al. 2006) compares confidence intervals of summary statistics (means) between original and synthetic data. An overlap near 1 means the intervals coincide; a relative error near 0 means point estimates are close.
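
A common way to quantify interval overlap, following Karr et al. (2006), averages the shared length relative to each interval's own length. The sketch below implements that definition for confidence intervals of a mean. Whether ci_proximity() uses exactly this form is an assumption, truncating disjoint intervals at 0 is a choice made here, and mean_ci() and ci_overlap() are hypothetical helper names.

```r
# t-based confidence interval for a mean
mean_ci <- function(x, level = 0.95) {
  se <- sd(x) / sqrt(length(x))
  mean(x) + c(-1, 1) * qt(1 - (1 - level) / 2, df = length(x) - 1) * se
}

# Average of the shared length relative to each interval's length
ci_overlap <- function(ci_o, ci_s) {
  inter <- min(ci_o[2], ci_s[2]) - max(ci_o[1], ci_s[1])
  if (inter <= 0) return(0)  # disjoint intervals
  0.5 * (inter / diff(ci_o) + inter / diff(ci_s))
}

ov_same     <- ci_overlap(c(0, 2), c(0, 2))  # identical intervals: 1
ov_half     <- ci_overlap(c(0, 2), c(1, 3))  # half of each interval shared: 0.5
ov_disjoint <- ci_overlap(c(0, 1), c(2, 3))  # no shared range: 0

set.seed(4)
age_orig  <- rnorm(500, mean = 50, sd = 10)
age_synth <- rnorm(500, mean = 50.5, sd = 10)
ov_sim <- ci_overlap(mean_ci(age_orig), mean_ci(age_synth))
```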

Multivariate and Structural Utility

Marginal distributions can match perfectly while joint distributions diverge. The energy distance (Székely and Rizzo 2013) is a multivariate two-sample statistic sensitive to differences in both location and scale (lower values indicate closer joint distributions):

e_res <- energy_distance(original[, c("age", "income")],
                         synthetic[, c("age", "income")],
                         seed = 42)
print(e_res)
#> Energy Distance - Multivariate Numeric Distribution Comparison
#> ===============================================================
#> 
#> Dataset Sizes:
#>   Original (X): 500 
#>   Synthetic (Y): 500 
#>   Variables: 2 
#>   Standardized: TRUE 
#> 
#> Energy Distance:
#>   Raw:        0.0033 
#>   Normalized: 0.0019 
#>   Utility:    0.9967 (exp(-E), higher=better)
#> 
#> Distance Components:
#>   Mean dist(X,Y): 1.7317 
#>   Mean dist(X,X): 1.6917 
#>   Mean dist(Y,Y): 1.7685 
#> 
#> Interpretation:
#>   EXCELLENT: Multivariate distributions are very similar.
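
The three distance components in the output correspond to the usual definition of the energy distance, \(E = 2\,\mathrm{E}\lVert X - Y\rVert - \mathrm{E}\lVert X - X'\rVert - \mathrm{E}\lVert Y - Y'\rVert\). The sketch below is a plain base R version of that definition on simulated data; energy_dist() is a hypothetical helper, and the standardization and normalization steps applied by energy_distance() are not reproduced.

```r
energy_dist <- function(x, y) {
  x <- as.matrix(x)
  y <- as.matrix(y)
  n <- nrow(x)
  m <- nrow(y)
  d <- as.matrix(dist(rbind(x, y)))  # all pairwise Euclidean distances
  d_xy <- mean(d[1:n, n + 1:m])      # between-sample distances
  d_xx <- mean(d[1:n, 1:n])          # within X (zero diagonal included)
  d_yy <- mean(d[n + 1:m, n + 1:m])  # within Y (zero diagonal included)
  2 * d_xy - d_xx - d_yy
}

set.seed(5)
x <- matrix(rnorm(200 * 2), ncol = 2)
y <- matrix(rnorm(200 * 2, mean = 0.3), ncol = 2)

e_shifted <- energy_dist(x, y)
e_same    <- energy_dist(x, x)

# Plugging the printed components into the same formula
printed_check <- 2 * 1.7317 - 1.6917 - 1.7685
```

Plugging in the printed components gives about 0.0032, which agrees with the reported raw value of 0.0033 up to the rounding of the components.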

The MMD (Maximum Mean Discrepancy; Gretton et al. 2012) provides a kernel-based alternative that supports both exact computation and a random Fourier features (RFF) approximation for large datasets:

mmd_res <- mmd(original[, c("age", "income")],
               synthetic[, c("age", "income")],
               kernel = "gaussian", method = "rff",
               n_features = 500, seed = 42)
print(mmd_res)
#> Maximum Mean Discrepancy (MMD)
#> ==============================
#> 
#> Dataset Sizes:
#>   Original (X): 500 
#>   Synthetic (Y): 500 
#>   Variables: 2 
#>   Standardized: TRUE 
#> 
#> Settings:
#>   Kernel: gaussian 
#>   Method: rff 
#>   Features: 500 
#>   Bandwidth (sigma): 1.6006 
#> 
#> Results:
#>   MMD^2:     0.002979 
#>   Utility:   0.9988 (exp(-MMD^2/sigma^2), higher=better)
#> 
#> Interpretation:
#>   EXCELLENT: Distributions are very similar.
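
To show what the RFF approximation does, the following sketch maps both samples through random cosine features whose inner products approximate a Gaussian kernel with bandwidth sigma, and then compares the feature means. It is an illustrative implementation under those assumptions, not the package's code; mmd2_rff() is a hypothetical helper, and bandwidth selection and standardization are left out.

```r
mmd2_rff <- function(x, y, sigma = 1, n_features = 500, seed = 42) {
  x <- as.matrix(x)
  y <- as.matrix(y)
  set.seed(seed)
  p <- ncol(x)
  # Random frequencies for the kernel exp(-||u - v||^2 / (2 * sigma^2))
  W <- matrix(rnorm(p * n_features, sd = 1 / sigma), nrow = p)
  b <- runif(n_features, 0, 2 * pi)  # random phases
  feat <- function(z) sqrt(2 / n_features) * cos(sweep(z %*% W, 2, b, "+"))
  # Squared distance between the mean feature embeddings (biased estimate)
  sum((colMeans(feat(x)) - colMeans(feat(y)))^2)
}

set.seed(6)
x       <- matrix(rnorm(300 * 2), ncol = 2)
y_close <- matrix(rnorm(300 * 2), ncol = 2)  # same distribution as x
y_far   <- x + 2                             # shifted copy of x

m_same  <- mmd2_rff(x, x)
m_close <- mmd2_rff(x, y_close)
m_far   <- mmd2_rff(x, y_far)
```

The cost grows linearly in the number of records and features because no pairwise kernel matrix is formed, which is the reason RFF is attractive for large datasets.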

Copula fidelity compares the empirical copula (rank dependence structure) using the Cramér-von Mises statistic on pairwise copula CDFs. Contingency fidelity (Snoke et al. 2018) is its categorical complement, computing total variation distance between bivariate contingency tables:

cop_res <- copula_fidelity(original, synthetic, vars = c("age", "income"))
print(cop_res)
#> Copula Fidelity - Empirical Copula Dependence Comparison
#> ========================================================
#> 
#> Dataset Sizes:
#>   Original (X): 500 
#>   Synthetic (Y): 500 
#>   Variables: 2 
#>   Grid resolution: 50 
#> 
#> Results:
#>   Mean CvM distance: 0.000124 
#>   Utility score:     0.9877 (1/(1+CvM*100), higher=better)
#> 
#> Pairwise CvM Distances:
#>   age          vs income      : 0.000124
#> 
#> Interpretation:
#>   EXCELLENT: Dependence structure is very well preserved.

ctf_res <- contingency_fidelity(original, synthetic,
                                vars = c("sex", "education", "region"))
print(ctf_res)
#> Contingency Fidelity - Categorical Dependence Comparison
#> ========================================================
#> 
#> Dataset Sizes:
#>   Original (X): 500 
#>   Synthetic (Y): 500 
#>   Categorical variables: 3 
#> 
#> Results:
#>   Mean TV distance: 0.074667 
#>   Utility score:    0.9253 (1 - mean_tv, higher=better)
#> 
#> Pairwise TV Distances (3/3 pairs):
#>   sex          vs education   : 0.066000
#>   sex          vs region      : 0.084000
#>   education    vs region      : 0.074000
#> 
#> Interpretation:
#>   GOOD: Categorical dependence structure is reasonably preserved.
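
The pairwise total variation distance reported above can be sketched as half the sum of absolute differences between the two bivariate relative-frequency tables. The code below is an illustration on simulated data with made-up category labels; tv_pair() is a hypothetical helper and not the package's implementation.

```r
tv_pair <- function(orig, synth, v1, v2) {
  # Shared category sets keep the two tables aligned cell by cell
  lev1 <- union(levels(factor(orig[[v1]])), levels(factor(synth[[v1]])))
  lev2 <- union(levels(factor(orig[[v2]])), levels(factor(synth[[v2]])))
  tab <- function(d) {
    table(factor(d[[v1]], levels = lev1), factor(d[[v2]], levels = lev2)) / nrow(d)
  }
  0.5 * sum(abs(tab(orig) - tab(synth)))
}

set.seed(7)
make_cat <- function(n) {
  data.frame(
    sex       = sample(c("F", "M"), n, replace = TRUE),
    education = sample(c("Primary", "Secondary", "Tertiary"), n, replace = TRUE),
    region    = sample(paste0("R", 1:5), n, replace = TRUE)
  )
}
orig_cat  <- make_cat(500)
synth_cat <- make_cat(500)

# Average the distance over all variable pairs
var_pairs <- combn(c("sex", "education", "region"), 2)
tv_values <- apply(var_pairs, 2, function(p) tv_pair(orig_cat, synth_cat, p[1], p[2]))
mean_tv   <- mean(tv_values)
tv_self   <- tv_pair(orig_cat, orig_cat, "sex", "education")
```

Averaging over the three pairs mirrors how the mean TV distance in the output relates to the pairwise values (0.066, 0.084, and 0.074 average to 0.0747).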

Predictive Utility

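TSTR (Train on Synthetic, Test on Real; Zhao et al. 2021) trains a predictive model on the synthetic data and evaluates its performance on held-out real data. The ratio of TSTR to TRTR (Train on Real, Test on Real) performance quantifies how well predictive relationships are preserved. The ratio is informative only when the TRTR baseline is clearly positive: in the run below both \(R^2\) values are slightly negative, meaning neither model beats a constant predictor, so the ratio of 1.6994 and the accompanying label say little about preserved structure: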

set.seed(42)
tstr_res <- tstr(pair, target_var = "income", model = "rf",
                 test_fraction = 0.3, seed = 42)
print(tstr_res)
#> Train on Synthetic, Test on Real (TSTR)
#> ========================================
#> 
#> Dataset Sizes:
#>   Original training: 350 
#>   Original test:     150 
#>   Synthetic:         500 
#>   Test fraction:     30.0% 
#> 
#> Settings:
#>   Model:     rf 
#>   Target(s): income 
#>   Metric:    R2 
#> 
#> Results:
#>   TRTR performance: -0.0115 (baseline)
#>   TSTR performance: -0.0196 
#>   TSTR ratio:       1.6994 
#>   Utility score:    1.0000 (higher = better)
#> 
#> Interpretation:
#>   EXCELLENT: Synthetic data preserves predictive structure very well.
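
The TSTR logic can be sketched with a linear model on simulated data in which the target genuinely depends on the predictor, so that the TRTR baseline is clearly positive. Everything here is illustrative: the data generator stands in for a synthesizer that preserves the relationship, lm() replaces the random forest used above, and r2() is a hypothetical helper.

```r
set.seed(8)
make_data <- function(n) {
  x <- rnorm(n)
  data.frame(x = x, y = 2 * x + rnorm(n))
}
real  <- make_data(500)
synth <- make_data(500)  # stand-in for synthetic data preserving y ~ x

# Hold out 30% of the real data for testing
test_idx   <- sample(nrow(real), size = 0.3 * nrow(real))
real_test  <- real[test_idx, ]
real_train <- real[-test_idx, ]

# Out-of-sample R^2 on the real test set
r2 <- function(fit, newdata) {
  pred <- predict(fit, newdata = newdata)
  1 - sum((newdata$y - pred)^2) / sum((newdata$y - mean(newdata$y))^2)
}

trtr <- r2(lm(y ~ x, data = real_train), real_test)  # train on real
tstr <- r2(lm(y ~ x, data = synth), real_test)       # train on synthetic
tstr_ratio <- tstr / trtr
c(trtr = trtr, tstr = tstr, ratio = tstr_ratio)
```

With a strong signal both models reach a similar positive \(R^2\) and the ratio is close to 1. When the baseline is near zero, as in the printed run, the ratio becomes unstable.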

Regression fidelity (Karr et al. 2006) fits the same regression model on both datasets and compares coefficient estimates via CI overlap, standardized bias, and significance agreement:

reg_res <- regression_fidelity(original, synthetic,
                               formula = income ~ age + sex + education)
summary(reg_res)
#> Summary: Regression Fidelity
#> ============================
#> 
#> Formula: income ~ age + sex + education 
#> Model:   lm  | Conf. level: 0.95 
#> Samples: X = 500 , Y = 500 
#> 
#> Coefficient Comparison:
#>  Term               Est.Orig   Est.Synth  Bias       Std.Bias CI.Overlap
#>  (Intercept)        40156.3727 45857.2451 5700.8724  1.5177   0.6045    
#>  age                70.3695    -1.0538    -71.4233   -1.1833  0.6887    
#>  sexM               47.8233    2994.6773  2946.8541  1.3061   0.6686    
#>  educationSecondary 2162.0872  -3140.4233 -5302.5105 -2.0300  0.4815    
#>  educationTertiary  4343.8707  574.6118   -3769.2589 -1.1864  0.6971    
#>  Sig.Agree
#>  yes      
#>  yes      
#>  yes      
#>  yes      
#>  yes      
#> 
#> Summary Statistics:
#>   Utility score (mean CI overlap): 0.6281 
#>   Mean |standardized bias|:        1.4447 
#>   Significance agreement rate:     100.0%
plot(reg_res, which = 1)

Tail fidelity assesses how well extreme values are preserved — critical for applications where tail behavior matters (financial risk, rare diseases):

tail_res <- tail_fidelity(original, synthetic, vars = c("age", "income"),
                          percentile = 95, tails = "both")
print(tail_res)
#> Tail Fidelity - Tail Preservation Utility Measure
#> ==================================================
#> 
#> Dataset Sizes:
#>   Original (X): 500 
#>   Synthetic (Y): 500 
#>   Numeric variables: 2 
#> 
#> Settings:
#>   Percentile: 95 
#>   Tails: both 
#>   Hill estimator: FALSE 
#> 
#> Results:
#>   QQ tail divergence: 0.0481 
#>   Utility score:      0.9531 (exp(-QQ_div), higher=better)
#> 
#> Per-variable QQ tail divergence:
#>   age                  qq=0.0250  jsd=-19.7770
#>   income               qq=0.0712  jsd=-14.6344
#> 
#> Interpretation:
#>   EXCELLENT: Tail distributions are very well preserved.

Subgroup utility (Snoke et al. 2018) applies any utility measure to each subgroup defined by a grouping variable, identifying groups with low utility:

su_res <- subgroup_utility(original, synthetic, group_var = "region",
                           utility_fun = energy_distance,
                           threshold = 0.5, seed = 42)
print(su_res)
#> Subgroup Utility Assessment
#> ===========================
#> 
#> Group variable: region 
#> Number of subgroups: 5 
#> Threshold: 0.5 
#> 
#> Overall utility score: 0.9967 
#> Worst subgroup score: 0.9847 (R2)
#> Worst / Overall ratio: 0.9880 
#> 
#> No subgroups flagged (all above threshold 0.5 ).

The conservative utility_score is the worst subgroup score. A ratio near 1 indicates homogeneous utility; below 0.5 indicates substantial disparity.

Utility measures by use case.

| Use case | Function | Data type | Interpretation |
|---|---|---|---|
| Quick assessment | propscore() | Mixed | < 0.1: good |
| Quick assessment | specks() | Mixed | < 0.05: good |
| Univariate | compare_wasserstein() | Numeric | Lower = better |
| Univariate | hellinger() | Categorical | < 0.1: good |
| Univariate | ci_proximity() | Numeric | > 0.8: good |
| Multivariate | energy_distance() | Numeric | Lower = better |
| Multivariate | mmd() | Numeric | Lower = better |
| Multivariate | copula_fidelity() | Numeric | < 0.1: good |
| Multivariate | contingency_fidelity() | Categorical | < 0.05: good |
| Predictive | tstr() | Mixed | ratio near 1: good |
| Predictive | regression_fidelity() | Mixed | overlap > 0.8: good |
| Predictive | compare_feature_importance() | Mixed | High corr: good |
| Subgroup | subgroup_utility() | Mixed | min > 0.5: good |

Comprehensive Assessment: A Case Study

This section demonstrates the complete practitioner workflow on a realistic dataset, comparing three synthesis approaches with different privacy-utility trade-offs.

Scenario and Data

Consider a statistical agency that wants to release a survey dataset (\(n = 1000\)) containing demographic variables (age, sex, education, region) and a sensitive income variable.

set.seed(123)
N <- 1000
edu_levels <- c("Primary", "Secondary", "Tertiary")
age_groups <- c("20-29", "30-39", "40-49", "50-59", "60-69")
orig <- data.frame(
  age_group = factor(sample(age_groups, N, replace = TRUE)),
  sex = factor(sample(c("M", "F"), N, replace = TRUE)),
  education = factor(sample(edu_levels, N, replace = TRUE,
                            prob = c(0.25, 0.50, 0.25))),
  region = factor(sample(paste0("R", 1:4), N, replace = TRUE))
)
edu_effect <- c(Primary = 0, Secondary = 0.3, Tertiary = 0.7)
age_effect <- c("20-29" = 0, "30-39" = 0.15, "40-49" = 0.3,
                "50-59" = 0.4, "60-69" = 0.35)
orig$income <- round(exp(
  10 + age_effect[as.character(orig$age_group)] +
    edu_effect[as.character(orig$education)] + rnorm(N, 0, 0.4)
))

qi <- c("age_group", "sex", "education", "region")
sens <- "income"

We create three synthetic datasets spanning the privacy-utility spectrum:

set.seed(456)

# Method A: Independent marginals (safest, but destroys correlations)
synA <- data.frame(
  age_group = factor(sample(age_groups, N, replace = TRUE)),
  sex = factor(sample(c("M", "F"), N, replace = TRUE)),
  education = factor(sample(edu_levels, N, replace = TRUE,
                            prob = c(0.25, 0.50, 0.25))),
  region = factor(sample(paste0("R", 1:4), N, replace = TRUE)),
  income = sample(orig$income, N, replace = TRUE)
)

# Method B: Category-preserving bootstrap with income noise
idx_B <- sample(N, N, replace = TRUE)
synB <- orig[idx_B, ]
rownames(synB) <- NULL
synB$income <- round(synB$income * exp(rnorm(N, 0, 0.15)))
swap_idx <- sample(N, round(0.2 * N))
synB$age_group[swap_idx] <- factor(sample(age_groups,
                                          length(swap_idx), replace = TRUE))

# Method C: Near-copy with minimal perturbation (risky)
synC <- orig
synC$income <- round(synC$income * exp(rnorm(N, 0, 0.03)))

Step 1: Quick Risk Screening with disclosure_report()

The disclosure_report() function computes multiple risk measures, evaluates each against a threshold, and produces a pass/fail assessment:

pair_A <- synth_pair(orig, synA, key_vars = qi, target_var = sens)
pair_B <- synth_pair(orig, synB, key_vars = qi, target_var = sens)
pair_C <- synth_pair(orig, synC, key_vars = qi, target_var = sens)

rep_A <- disclosure_report(pair_A, compute = c("attribution", "privacy"),
                           seed = 42, verbose = FALSE)
rep_B <- disclosure_report(pair_B, compute = c("attribution", "privacy"),
                           seed = 42, verbose = FALSE)
rep_C <- disclosure_report(pair_C, compute = c("attribution", "privacy"),
                           seed = 42, verbose = FALSE)

verdicts <- data.frame(
  Method = c("A: Independent", "B: Bootstrap+noise", "C: Near-copy"),
  Overall = c(rep_A$overall_risk, rep_B$overall_risk, rep_C$overall_risk),
  Pass = c(rep_A$n_pass, rep_B$n_pass, rep_C$n_pass),
  Warn = c(rep_A$n_warn, rep_B$n_warn, rep_C$n_warn)
)
knitr::kable(verdicts, caption = "Quick risk screening across three methods.")

Quick risk screening across three methods.

| Method | Overall | Pass | Warn |
|---|---|---|---|
| A: Independent | HIGH | 6 | 5 |
| B: Bootstrap+noise | HIGH | 6 | 5 |
| C: Near-copy | HIGH | 6 | 6 |

Two patterns emerge. First, attribution metrics differentiate the three methods in the expected order: Method A (independent) has the lowest attribution risk, Method C (near-copy) the highest. Second, privacy models flag all three methods because they evaluate the released data’s structure alone — with 120 possible QI combinations and \(n = 1000\) records, some equivalence classes are small regardless of synthesis method. This illustrates a key lesson: privacy models and attribution metrics answer different questions and should be interpreted together.
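
The point about small equivalence classes can be checked directly by tabulating the quasi-identifiers. The sketch below simulates quasi-identifiers with the same category structure as the case study (5 x 2 x 3 x 4 = 120 combinations) and counts the class sizes. It is a base R illustration that does not use the package's privacy-model functions, and the cut-off of 5 for a small class is an arbitrary choice made here.

```r
set.seed(123)
N <- 1000
qi_data <- data.frame(
  age_group = sample(c("20-29", "30-39", "40-49", "50-59", "60-69"), N, replace = TRUE),
  sex       = sample(c("M", "F"), N, replace = TRUE),
  education = sample(c("Primary", "Secondary", "Tertiary"), N, replace = TRUE,
                     prob = c(0.25, 0.50, 0.25)),
  region    = sample(paste0("R", 1:4), N, replace = TRUE)
)

all_cells   <- as.vector(table(qi_data))  # one cell per QI combination
class_sizes <- all_cells[all_cells > 0]   # observed equivalence classes only

k_achieved  <- min(class_sizes)                       # data are k-anonymous for this k
share_small <- sum(class_sizes[class_sizes < 5]) / N  # records in classes smaller than 5
c(n_classes = length(class_sizes), k = k_achieved, share_small = share_small)
```

With 1000 records spread over 120 cells the average class holds about eight records, and the cells involving the rarer education levels hold fewer, which is why frequency-based privacy models can raise warnings for any release that keeps this level of detail.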

Step 2: Comparative Assessment with rumap()

The rumap() function implements the multivariate Risk-Utility framework of Thees et al. (2026). Traditional R-U analysis plots a single risk measure against a single utility measure, producing a two-dimensional trade-off curve. This can be misleading: a method may appear optimal on one pair of measures while performing poorly on another. rumap() computes multiple risk and utility measures simultaneously, normalizes to \([0, 1]\), and identifies Pareto-optimal methods.

set.seed(42)
ru <- rumap(orig,
            list("A: Independent" = synA,
                 "B: Bootstrap+noise" = synB,
                 "C: Near-copy" = synC),
            risk_measures = c("dcap", "tcap", "ims"),
            utility_measures = c("pmse", "wasserstein"),
            key_vars = qi, target_var = sens,
            seed = 42)
print(ru)
#> Multivariate Risk-Utility Map (RU-Map)
#> ======================================
#> 
#> Configuration:
#>   Synthetic datasets (SDGs): 3 
#>   Risk measures:    dcap, tcap, ims 
#>   Utility measures: pmse, wasserstein 
#>   Original data size: 1000 
#> 
#> Composite Scores:
#>                 SDG   Risk Utility Pareto
#>      A: Independent 0.6667  0.3548       
#>  B: Bootstrap+noise 0.2606  0.3444      *
#>        C: Near-copy 0.3333  1.0000      *
#> 
#> * = Pareto-optimal
#> 
#> Pareto-optimal SDGs: B: Bootstrap+noise, C: Near-copy
plot(ru, which = 1)  # R-U scatterplot with Pareto front

The R-U scatterplot places each method in the composite risk-utility plane. Methods in the lower-right corner (low risk, high utility) are preferred.
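
Pareto optimality on the composite scores can be verified by hand. The sketch below takes the composite risk and utility values printed by rumap() above and marks a method as dominated when another method is at least as good on both dimensions and strictly better on one. is_dominated() is a hypothetical helper, and the sketch assumes that Pareto status is determined from these two composite scores.

```r
scores <- data.frame(
  sdg     = c("A: Independent", "B: Bootstrap+noise", "C: Near-copy"),
  risk    = c(0.6667, 0.2606, 0.3333),  # composite risk (lower is better)
  utility = c(0.3548, 0.3444, 1.0000)   # composite utility (higher is better)
)

# TRUE if some other method is no worse on both scores and better on one
is_dominated <- function(i, s) {
  any(s$risk <= s$risk[i] & s$utility >= s$utility[i] &
        (s$risk < s$risk[i] | s$utility > s$utility[i]))
}

scores$pareto <- !vapply(seq_len(nrow(scores)), is_dominated, logical(1), s = scores)
pareto_set <- scores$sdg[scores$pareto]
pareto_set
```

Method A is dominated by Method C, which has both lower composite risk and higher composite utility, so the Pareto set consists of Methods B and C, matching the printed result.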

plot(ru, which = 2)  # Heatmap of individual measures

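The heatmap shows which individual measures drive the composite scores. On the composites printed above, Method C combines the highest utility (1.0000) with a composite risk of 0.3333, Method B has the lowest composite risk (0.2606) but also the lowest utility (0.3444), and Method A is dominated by Method C: its composite risk (0.6667) is the highest of the three and its utility (0.3548) is far lower. Because the measures are normalized to \([0, 1]\) for comparison, these scores rank the candidates against each other and should not be read as absolute risk levels.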

Step 3: Decision

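The composite scores are one input to a structured decision. They are read together with the screening results and with how each dataset was generated; Method C, for example, keeps every original quasi-identifier combination and perturbs income by only about 3%: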

  • Method A is appropriate when data is released to the general public and any re-identification would be unacceptable.
  • Method B is appropriate for controlled-access research environments where moderate risk is acceptable.
  • Method C should be rejected — its risk profile is too close to the original data.

This iterative workflow — screen with disclosure_report(), compare with rumap(), and refine synthesis parameters — is the core use case that riskutility is designed to support.

Summary and Discussion

Contributions

The riskutility package makes five contributions to the R ecosystem for statistical disclosure control:

  1. Comprehensive coverage. It is the first R package to unify all six risk assessment paradigms — frequency-based privacy models, attribution (CAP), ML-based (RAPID), distance-based, record linkage, and membership inference — under a single API.

  2. Novel implementations. It provides the first R implementations of RAPID, attack-based tests for two of the three anonymization criteria of the Article 29 Working Party (Article 29 Data Protection Working Party 2014), namely singling out and linkability, following Giomi et al. (2023), \(t\)-closeness, DOMIAS density-based membership inference, and eight-method record linkage with bijective and optimal transport matching.

  3. Unified API. The synth_pair container and consistent S3 class pattern eliminate parameter repetition and ensure practitioners can switch between risk measures without learning new interfaces.

  4. Multivariate R-U mapping. The rumap() function implements the framework of Thees et al. (2026) for comparing multiple synthesis approaches on multiple risk and utility dimensions simultaneously.

  5. Ecosystem integration. The from_sdcMicro(), from_synthpop(), and from_simPop() constructors allow practitioners to evaluate data produced by any of the three main R packages.

Partially vs Fully Synthetic Data

The privacy-utility evaluation differs depending on the synthesis approach (Drechsler 2011):

  • Fully synthetic data: All records are synthetic; all risk metrics are directly applicable.
  • Partially synthetic data: Only sensitive values are replaced. Attribution metrics (DCAP, RAPID) are particularly relevant because original quasi-identifiers provide a direct matching key. Distance-based metrics should focus on the synthesized variables.
  • Multiple synthetic datasets: When \(m > 1\) datasets are generated, evaluate each separately and report the worst-case risk across all \(m\) releases.

Remediation

If disclosure risk is too high:

  1. Add more noise to the synthesis process
  2. Reduce granularity of quasi-identifiers (e.g., age groups instead of exact age)
  3. Apply additional anonymization techniques (suppression, generalization)
  4. Re-synthesize with different model settings
  5. Re-evaluate with riskutility — iterate until risk-utility balance is acceptable

If utility is too low:

  1. Use a more flexible synthesizer (CART, Bayesian network, GAN)
  2. Reduce the amount of perturbation
  3. Identify affected variables/subgroups (subgroup_utility(), per-variable Hellinger)
  4. Consider partially synthetic data (synthesize only sensitive variables)

Limitations and Recommendations

No formal privacy guarantees. All measures provide empirical risk assessment. A low DCAP score does not prove that no attacker can succeed. Empirical and formal approaches (differential privacy) are complementary.

Key variable selection. Results depend heavily on the choice of quasi-identifiers. Practitioners should base QI selection on a realistic threat model, not on convenience.

Threshold interpretation. The pass/fail thresholds used by disclosure_report() are pragmatic defaults. Different contexts require different thresholds.

We recommend the following minimal evaluation protocol:

  1. Always run disclosure_report() with compute = "all" as a first screening.
  2. For publication-quality assessment, use rumap() to compare multiple synthesis approaches and identify Pareto-optimal methods.
  3. For regulatory compliance (GDPR), include singling out and linkability tests.
  4. Interpret distance-based metrics cautiously. Following Yao et al. (2025), do not rely on DCR/NNDR alone.

Future Work

Four extensions are planned: (i) a Shiny dashboard for interactive evaluation; (ii) integration with differential privacy frameworks; (iii) computational optimizations for large datasets (\(n > 50{,}000\)); and (iv) population-level risk estimation from sample data.

Computational Details

All computations were performed using R 4.6.0 (R Core Team 2025) with the riskutility package version 0.1.0. Core dependencies include data.table for efficient data manipulation and ggplot2 for all visualizations. ML-based methods require optional packages: ranger, rpart, xgboost, and caret, loaded conditionally via requireNamespace().

Approximate runtimes for key metrics.

| Metric | n=1000 | n=10000 | n=100000 | Scaling |
|---|---|---|---|---|
| dcap() | < 1 s | ~5 s | ~60 s | O(n*k) |
| dcr() | < 1 s | ~10 s | ~5 min | O(n^2) |
| kanonymity() | < 1 s | ~1 s | ~5 s | O(n log n) |
| energy_distance() | < 1 s | ~2 s | ~30 s | O(n^2) |
| mmd(method = "rff") | < 1 s | ~1 s | ~5 s | O(n*D) |
| propscore() | ~1 s | ~5 s | ~30 s | O(n*p) |
| rumap() | ~10 s | ~60 s | depends | Sum of components |

sessionInfo()
#> R version 4.6.0 (2026-04-24)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.4 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] riskutility_0.1.0 rmarkdown_2.31   
#> 
#> loaded via a namespace (and not attached):
#>   [1] Rdpack_2.6.6         pROC_1.19.0.1        rlang_1.2.0         
#>   [4] magrittr_2.0.5       otel_0.2.0           matrixStats_1.5.0   
#>   [7] e1071_1.7-17         compiler_4.6.0       vctrs_0.7.3         
#>  [10] reshape2_1.4.5       stringr_1.6.0        pkgconfig_2.0.3     
#>  [13] crayon_1.5.3         fastmap_1.2.0        backports_1.5.1     
#>  [16] prodlim_2026.03.11   mlr3_1.7.1           purrr_1.2.2         
#>  [19] xfun_0.59            randomForest_4.7-1.2 cachem_1.1.0        
#>  [22] mlr3misc_0.22.0      jsonlite_2.0.0       recipes_1.3.3       
#>  [25] moocore_0.3.1        uuid_1.2-2           parallel_4.6.0      
#>  [28] R6_2.6.1             bslib_0.11.0         stringi_1.8.7       
#>  [31] vcd_1.4-13           RColorBrewer_1.1-3   ranger_0.18.0       
#>  [34] parallelly_1.47.0    car_3.1-5            boot_1.3-32         
#>  [37] rpart_4.1.27         lmtest_0.9-40        lubridate_1.9.5     
#>  [40] jquerylib_0.1.4      Rcpp_1.1.1-1.1       iterators_1.0.14    
#>  [43] knitr_1.51           future.apply_1.20.2  zoo_1.8-15          
#>  [46] Matrix_1.7-5         splines_4.6.0        nnet_7.3-20         
#>  [49] timechange_0.4.0     tidyselect_1.2.1     abind_1.4-8         
#>  [52] yaml_2.3.12          mlr3tuning_1.6.0     timeDate_4052.112   
#>  [55] codetools_0.2-20     listenv_1.0.0        lattice_0.22-9      
#>  [58] tibble_3.3.1         plyr_1.8.9           withr_3.0.3         
#>  [61] S7_0.2.2             evaluate_1.0.5       future_1.70.0       
#>  [64] survival_3.8-6       proxy_0.4-29         pillar_1.11.1       
#>  [67] carData_3.0-6        stats4_4.6.0         checkmate_2.3.4     
#>  [70] VIM_7.0.0            foreach_1.5.2        generics_0.1.4      
#>  [73] bbotk_1.10.1         sp_2.2-1             ggplot2_4.0.3       
#>  [76] scales_1.4.0         laeken_0.5.3         globals_0.19.1      
#>  [79] class_7.3-23         glue_1.8.1           maketools_1.3.2     
#>  [82] tools_4.6.0          mlr3pipelines_0.11.0 robustbase_0.99-7   
#>  [85] sys_3.4.3            data.table_1.18.4    ModelMetrics_1.2.2.2
#>  [88] gower_1.0.2          buildtools_1.0.0     grid_4.6.0          
#>  [91] rbibutils_2.4.1      ipred_0.9-15         colorspace_2.1-2    
#>  [94] paradox_1.0.1        nlme_3.1-169         palmerpenguins_0.1.1
#>  [97] Formula_1.2-5        cli_3.6.6            lava_1.9.1          
#> [100] dplyr_1.2.1          gtable_0.3.6         DEoptimR_1.2-0      
#> [103] sass_0.4.10          digest_0.6.39        caret_7.0-1         
#> [106] lgr_0.5.2            farver_2.1.2         htmltools_0.5.9     
#> [109] lifecycle_1.0.5      mlr3learners_0.15.0  hardhat_1.4.3       
#> [112] MASS_7.3-65

References

Article 29 Data Protection Working Party. 2014. Opinion 05/2014 on Anonymisation Techniques. Article 29 Data Protection Working Party.
Breugel, Boris van, Hao Sun, Zhaozhi Qian, and Mihaela van der Schaar. 2023. “DOMIAS: Membership Inference Attacks Against Synthetic Data Through Overfitting Detection.” International Conference on Artificial Intelligence and Statistics (AISTATS).
Domingo-Ferrer, Josep, and Vicenç Torra. 2003. “Disclosure Risk Assessment in Statistical Microdata Protection via Advanced Record Linkage.” Statistics and Computing 13 (4): 343–54. https://doi.org/10.1023/A:1025666923033.
Drechsler, Jörg. 2011. Synthetic Datasets for Statistical Disclosure Control: Theory and Implementation. Springer. https://doi.org/10.1007/978-1-4614-0326-5.
Fellegi, Ivan P, and Alan B Sunter. 1969. “A Theory for Record Linkage.” Journal of the American Statistical Association 64 (328): 1183–210. https://doi.org/10.1080/01621459.1969.10501049.
Franconi, Luisa, and Silvia Polettini. 2004. “Individual Risk Estimation in \(\mu\)-Argus: A Review.” Privacy in Statistical Databases, 262–72. https://doi.org/10.1007/978-3-540-25955-8_20.
Giomi, Matteo, Franziska Boenisch, Christoph Wehmeyer, and Borbala Tasnadi. 2023. “A Unified Framework for Quantifying Privacy Risk in Synthetic Data.” Proceedings on Privacy Enhancing Technologies 2023: 312–28. https://doi.org/10.56553/popets-2023-0055.
Gretton, Arthur, Karsten M Borgwardt, Malte J Rasch, Bernhard Schölkopf, and Alexander Smola. 2012. “A Kernel Two-Sample Test.” Journal of Machine Learning Research 13: 723–73.
Guo, Cheng, and Felix Berkhahn. 2016. “Entity Embeddings of Categorical Variables.” arXiv Preprint arXiv:1604.06737.
Herranz, Javier, Jordi Nin, Pablo Rodríguez, and Tamir Tassa. 2016. “Revisiting Distance-Based Record Linkage for Privacy-Preserving Release of Statistical Datasets.” Data & Knowledge Engineering 100: 78–93. https://doi.org/10.1016/j.datak.2015.06.007.
Hundepool, Anco, Josep Domingo-Ferrer, Luisa Franconi, et al. 2012. Statistical Disclosure Control. John Wiley & Sons. https://doi.org/10.1002/9781118348239.
Karr, Alan F, Christine N Kohnen, Anna Oganian, Jerome P Reiter, and Ashish P Sanil. 2006. “A Framework for Evaluating the Utility of Data Altered to Protect Confidentiality.” The American Statistician 60 (3): 224–32. https://doi.org/10.1198/000313006X124640.
Li, Ninghui, Tiancheng Li, and Suresh Venkatasubramanian. 2007. “\(t\)-Closeness: Privacy Beyond \(k\)-Anonymity and \(l\)-Diversity.” IEEE International Conference on Data Engineering, 106–15. https://doi.org/10.1109/ICDE.2007.367856.
Machanavajjhala, Ashwin, Daniel Kifer, Johannes Gehrke, and Muthuramakrishnan Venkitasubramaniam. 2007. “\(l\)-Diversity: Privacy Beyond \(k\)-Anonymity.” ACM Transactions on Knowledge Discovery from Data 1 (1): 3. https://doi.org/10.1145/1217299.1217302.
Muralidhar, Krishnamurty, and Josep Domingo-Ferrer. 2016. “Rank-Based Record Linkage for Re-Identification Risk Assessment.” Privacy in Statistical Databases, Lecture Notes in Computer Science, vol. 9867: 225–36. https://doi.org/10.1007/978-3-319-45381-1_17.
Nowok, Beata, Gillian M Raab, and Chris Dibben. 2016. “Synthpop: Bespoke Creation of Synthetic Data in R.” Journal of Statistical Software 74 (11): 1–26. https://doi.org/10.18637/jss.v074.i11.
R Core Team. 2025. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing. https://www.R-project.org/.
Reiter, Jerome P. 2005. “Estimating Risks of Identification Disclosure in Microdata.” Journal of the American Statistical Association 100 (472): 1103–12. https://doi.org/10.1198/016214505000000619.
Samarati, Pierangela, and Latanya Sweeney. 1998. “Protecting Respondents’ Identities in Microdata Release.” IEEE Transactions on Knowledge and Data Engineering 10 (6): 1010–27.
Shokri, Reza, Marco Stronati, Congzheng Song, and Vitaly Shmatikov. 2017. “Membership Inference Attacks Against Machine Learning Models.” IEEE Symposium on Security and Privacy, 3–18. https://doi.org/10.1109/SP.2017.41.
Skinner, Chris, and Mark Elliot. 2002. “A Measure of Disclosure Risk for Microdata.” Journal of the Royal Statistical Society: Series B 64 (4): 855–67. https://doi.org/10.1111/1467-9868.00365.
Snoke, Joshua, Gillian M Raab, Beata Nowok, Chris Dibben, and Aleksandra Slavkovic. 2018. “General and Specific Utility Measures for Synthetic Data.” Journal of the Royal Statistical Society: Series A 181 (3): 663–88. https://doi.org/10.1111/rssa.12358.
Stadler, Theresa, Bristena Oprisanu, and Carmela Troncoso. 2022. “Synthetic Data – Anonymisation Groundhog Day.” USENIX Security Symposium, 1451–68. https://doi.org/10.48550/arXiv.2011.07018.
Székely, Gábor J, and Maria L Rizzo. 2013. “Energy Statistics: A Class of Statistics Based on Distances.” Journal of Statistical Planning and Inference 143 (8): 1249–72. https://doi.org/10.1016/j.jspi.2013.03.018.
Taub, Jennifer, Mark Elliot, Maria Pampaka, and Duncan Smith. 2018. “Differential Correct Attribution Probability for Synthetic Data: An Exploration.” Privacy in Statistical Databases, 122–37. https://doi.org/10.1007/978-3-319-99771-1_9.
Templ, Matthias. 2017. Statistical Disclosure Control for Microdata: Methods and Applications in R. Springer. https://doi.org/10.1007/978-3-319-50272-4.
Templ, Matthias, Alexander Kowarik, and Bernhard Meindl. 2015. “sdcMicro: Statistical Disclosure Control Methods for Anonymization of Data and Risk Estimation.” Journal of Statistical Software 67 (4): 1–36. https://doi.org/10.18637/jss.v067.i04.
Templ, Matthias, and Bernhard Meindl. 2008. “Robust Statistics Meets SDC: New Disclosure Risk Measures for Continuous Microdata Masking.” Privacy in Statistical Databases, Lecture Notes in Computer Science, vol. 5262: 113–26. https://doi.org/10.1007/978-3-540-87471-3_15.
Templ, Matthias, Bernhard Meindl, Alexander Kowarik, and Olivier Dupriez. 2017. “Simulation of Synthetic Complex Data: The R Package simPop.” Journal of Statistical Software 79 (10): 1–38. https://doi.org/10.18637/jss.v079.i10.
Thees, Oscar, Fabian Müller, and Matthias Templ. 2026. “Beyond the Trade-Off Curve: Multivariate and Advanced Risk-Utility Maps for Evaluating Anonymized and Synthetic Data.” Journal of Official Statistics.
Woo, Mi-Ja, Jerome P Reiter, Anna Oganian, and Alan F Karr. 2009. “Global Measures of Data Utility for Microdata Masked for Disclosure Limitation.” Journal of Privacy and Confidentiality 1 (1): 111–24. https://doi.org/10.29012/jpc.v1i1.568.
Yale, Andrew, Saloni Dash, Ritik Dutta, Isabelle Guyon, Adrien Pavao, and Kristin P Bennett. 2020. “Generation and Evaluation of Privacy Preserving Synthetic Health Data.” Neurocomputing 416: 244–55. https://doi.org/10.1016/j.neucom.2019.12.136.
Yao, Zexi, Nataša Krčo, Georgi Ganev, and Yves-Alexandre de Montjoye. 2025. “The DCR Delusion: Measuring the Privacy Risk of Synthetic Data.” arXiv Preprint arXiv:2505.01524.
Zhao, Zilong, Aditya Kunar, Robert Birke, and Lydia Y Chen. 2021. “CTAB-GAN: Effective Table Data Synthesizing.” Proceedings of Machine Learning Research 157: 97–112.