Training-based versus training-free
differential privacy for data synthesis
Mentor: Yu-Xiang Wang
Differentially private synthetic data generation promises to resolve the tension between data utility and individual privacy, enabling the release of datasets that preserve the statistical properties analysts need while bounding what any adversary can learn about a single record. Two paradigms have emerged. Training-based methods inject calibrated noise during model optimization. Training-free methods leverage foundation models through black-box API access. We investigate both approaches on Intel’s Driver and Client Applications (DCA) telemetry corpus, evaluating against a benchmark of 21 analytical SQL queries representative of production business intelligence workloads.
Introduction
Why does this matter?
Device telemetry, including hardware specs, usage patterns, and network activity, helps product teams improve software, but it contains sensitive information about individuals. Releasing raw telemetry poses privacy risks. Differential privacy (DP) lets us publish synthetic data that preserves statistical utility while mathematically bounding the information leaked about any single record.
The core question we investigate is which DP synthesis approach best preserves the utility of production SQL workloads on multi-table relational data.
The privacy-utility tension
Differential privacy adds carefully calibrated noise so that an algorithm’s output is nearly identical whether or not any single record is in the dataset. The privacy “cost” is measured by a parameter $\epsilon$ (epsilon). Smaller $\epsilon$ means stronger privacy but potentially lower utility. We use $(\epsilon, \delta)$-differential privacy with $\epsilon = 4.0$ and $\delta = 10^{-5}$, matched across all methods for fair comparison.
Formal definition
A randomized mechanism $M$ satisfies $(\epsilon, \delta)$-DP if, for all neighboring datasets $D$, $D'$ (differing in one record) and all measurable sets $S$,
$$\Pr[M(D) \in S] \le e^{\epsilon} \cdot \Pr[M(D') \in S] + \delta$$
In practice, DP-SGD tracks privacy loss via Rényi Differential Privacy (RDP) accountants and converts to $(\epsilon, \delta)$ at the end of training.
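To make the calibration concrete, the classical Gaussian mechanism ties the noise scale directly to $(\epsilon, \delta)$ and the query's sensitivity. A minimal sketch follows; the closed form below is strictly proven only for $\epsilon < 1$, and RDP accountants give tighter, more general bounds for composed mechanisms like DP-SGD:

```python
import math

def gaussian_sigma(epsilon: float, delta: float, sensitivity: float = 1.0) -> float:
    """Noise scale for the classical Gaussian mechanism.

    Calibrates sigma so that adding N(0, sigma^2) noise to a query with the
    given L2 sensitivity satisfies (epsilon, delta)-DP (the Dwork-Roth bound).
    """
    return sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / epsilon

# Stronger privacy (smaller epsilon) demands proportionally more noise.
sigma_strict = gaussian_sigma(epsilon=1.0, delta=1e-5)
sigma_paper = gaussian_sigma(epsilon=4.0, delta=1e-5)  # our evaluation setting
```

At our setting of $\epsilon = 4.0$, $\delta = 10^{-5}$, this rule of thumb gives a noise scale of roughly $1.2$ per unit of sensitivity.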
Two paradigms for synthetic data
Training-based methods inject calibrated noise during model optimization, as in DP-SGD with VAEs. Training-free methods leverage foundation models through black-box API access, as in Private Evolution with LLMs. We compare four concrete instantiations of these paradigms on a realistic multi-table telemetry benchmark.
Data
Intel DCA telemetry corpus
We use the Intel DCA corpus, a multi-table telemetry dataset with 8 relational tables covering hardware specifications, software configurations, network activity, power consumption, and user behavior across thousands of devices.
| Table | Rows | Columns | Description |
|---|---|---|---|
| System | ~7,700 | 12 | Device hardware and OS metadata |
| Battery | ~4,800 | 8 | Battery health and charge cycles |
| Display | ~7,700 | 6 | Display vendor, resolution, type |
| Network | ~7,700 | 10 | Network usage (bytes sent/received) |
| Power | ~7,700 | 8 | Power consumption metrics |
| Application | ~83,000 | 5 | Installed applications per device |
| Detection | ~12,000 | 4 | Security detections per device |
| WaitState | ~5,400 | 6 | CPU wait state analysis |
Tables are linked by a shared system_id foreign key. This relational structure is central
to our evaluation, and many benchmark queries require joining 2 to 3 tables.
The 21-query SQL benchmark
We designed a benchmark of 21 SQL queries spanning five types, each with type-appropriate metrics and pass/fail thresholds.
| Type | Count | Primary metric | Example |
|---|---|---|---|
| Agg+Join | 6 | Relative Error (RE) | Avg power by country and CPU family |
| Geo/Demo | 4 | RE + Group Coverage | Battery health by geography |
| Histogram | 2 | Total Variation (TV) | Browser usage distribution |
| Pivot | 2 | RE + Jaccard | Persona breakdown by web category |
| Top-k | 7 | Spearman + Jaccard | Top 10 applications by system count |
Formal benchmark definition
| Metric | Applies to | Pass threshold | Interpretation |
|---|---|---|---|
| Relative Error | Aggregates | $\le 0.3$ (median) | Lower is better. Synthetic aggregates close to real. |
| Total Variation | Distributions | $\le 0.15$ | Lower is better. Distributions overlap well. |
| Jaccard Similarity | Group coverage | $\ge 0.5$ | Higher is better. Same groups appear in both. |
| Spearman Rho | Rankings | $\ge 0.7$ | Higher is better. Same items ranked similarly. |
A query passes only if all its relevant metrics meet their thresholds.
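The pass rule can be expressed as a small checker. The thresholds mirror the table above; the metric keys and function name are our own illustration, not the actual harness's API:

```python
# Thresholds mirroring the benchmark table (direction, bound).
THRESHOLDS = {
    "relative_error":  ("max", 0.30),   # lower is better
    "total_variation": ("max", 0.15),   # lower is better
    "jaccard":         ("min", 0.50),   # higher is better
    "spearman":        ("min", 0.70),   # higher is better
}

def query_passes(metrics: dict) -> bool:
    """A query passes only if every metric it reports meets its threshold."""
    for name, value in metrics.items():
        direction, bound = THRESHOLDS[name]
        if direction == "max" and value > bound:
            return False
        if direction == "min" and value < bound:
            return False
    return True
```

Note that a query reporting several metrics (e.g. Top-k with both Spearman and Jaccard) fails as soon as any one of them misses its bound.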
Data handling
Our setup has no train-test leakage concern: synthetic data is generated from the full dataset, and the benchmark queries measure statistical fidelity rather than predictive generalization, so no held-out split is needed. All published results are aggregate statistics or synthetic records; no individual record is ever exposed. During preprocessing, continuous columns are min-max normalized to [0, 1], categorical columns are label-encoded, and missing values are imputed per table.
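A minimal sketch of that per-column preprocessing (imputation omitted; pure Python for illustration, not the actual pipeline code):

```python
def preprocess_column(values):
    """Min-max normalize numeric columns to [0, 1]; label-encode strings.

    Simplified sketch of the per-table preprocessing described above.
    """
    if all(isinstance(v, (int, float)) for v in values):
        lo, hi = min(values), max(values)
        span = (hi - lo) or 1.0          # guard against constant columns
        return [(v - lo) / span for v in values]
    # Categorical: stable label encoding via sorted unique values.
    mapping = {v: i for i, v in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values]
```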
Methods
1. Wide-table DP-VAE
Flatten all tables into a single wide table (left-join on system_id), then train a
Variational Autoencoder (VAE) with DP-SGD. The VAE learns a compressed latent representation and
generates synthetic records that preserve cross-table structure.
This approach preserves foreign-key relationships by design. However, extreme sparsity from flattening (most columns are null) causes zero-inflation and mode collapse.
- **Approach:** Training-based
- **Strategy:** Flatten all tables, train single model
- **Mechanism:** VAE encoder-decoder + DP-SGD
- **Strength:** Preserves cross-table joins
- **Weakness:** Extreme sparsity, mode collapse
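Both training-based methods share the same underlying DP-SGD mechanism: clip each per-example gradient, average, and add Gaussian noise scaled to the clipping bound. A pure-Python sketch of one step (this illustrates the mechanism Opacus implements, not its API):

```python
import math
import random

def clip(grad, clip_norm):
    """Scale a gradient vector down so its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, clip_norm / max(norm, 1e-12))
    return [g * scale for g in grad]

def dp_sgd_grad(per_example_grads, clip_norm=1.0, noise_multiplier=1.1, seed=0):
    """One DP-SGD step's noisy gradient estimate: clip per-example gradients,
    average, then add Gaussian noise calibrated to the clipping bound."""
    rng = random.Random(seed)
    n = len(per_example_grads)
    clipped = [clip(g, clip_norm) for g in per_example_grads]
    dim = len(clipped[0])
    mean = [sum(g[i] for g in clipped) / n for i in range(dim)]
    sigma = noise_multiplier * clip_norm / n
    return [m + rng.gauss(0.0, sigma) for m in mean]
```

Because the noise scales with `clip_norm / n`, larger batches see proportionally less per-step noise, which is one reason DP-SGD favors large batch sizes.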
2. Per-table DP-SGD
Train a separate VAE for each table independently using DP-SGD. This avoids the sparsity problem of wide-table flattening but breaks cross-table relationships.
This approach achieves high within-table fidelity, and each model is simpler and faster to train. The downside is foreign-key mismatches, where a synthetic CPU record may reference a non-existent system.
- **Approach:** Training-based
- **Strategy:** Independent model per table
- **Mechanism:** Per-table VAE + DP-SGD
- **Strength:** High within-table fidelity
- **Weakness:** Breaks foreign-key joins
3. MST (Maximum Spanning Tree)
MST is a histogram-based baseline. It estimates low-dimensional marginals under DP, builds a dependency graph (Chow-Liu tree), and samples synthetic records that are consistent with the noisy marginals. It is interpretable, fast, and works well for categorical data with clear dependencies. On the other hand, it scales exponentially in table width and struggles with many continuous or high-cardinality columns.
- **Approach:** Statistical (training-free)
- **Strategy:** Marginal estimation + dependency tree
- **Mechanism:** Noisy marginals + Chow-Liu tree
- **Strength:** Fast, interpretable
- **Weakness:** Scales poorly with high-cardinality
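The unit MST stitches into its dependency tree is a noisy low-dimensional marginal. A sketch for a single 1-way marginal, with Laplace noise drawn as a difference of exponentials (simplified; not the private-pgm API):

```python
import random

def noisy_marginal(values, categories, epsilon, seed=0):
    """Laplace-noised 1-way marginal, normalized into a sampling distribution."""
    rng = random.Random(seed)
    counts = {c: 0 for c in categories}
    for v in values:
        counts[v] += 1
    # Counting histograms have L1 sensitivity 1, so Laplace(1/epsilon)
    # noise suffices; Exp(eps) - Exp(eps) is distributed Laplace(0, 1/eps).
    noisy = {c: n + rng.expovariate(epsilon) - rng.expovariate(epsilon)
             for c, n in counts.items()}
    # Clamp negatives and renormalize so the result is a valid distribution.
    total = sum(max(v, 0.0) for v in noisy.values()) or 1.0
    return {c: max(v, 0.0) / total for c, v in noisy.items()}
```

MST then measures a tree of such marginals (1-way and 2-way) and samples records consistent with them, which is why it handles low-cardinality categorical structure well but degrades as the domain grows.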
4. Private Evolution
Private Evolution is a training-free approach. A large language model (LLM) generates candidate synthetic records, and a DP selection mechanism keeps only the candidates that best match real data distributions. No gradient-based training is required, and the method can target specific query workloads.
We evaluate two variants. Vanilla PE generates records randomly and applies DP histogram voting to select the closest candidates. Conditional PE partitions the privacy budget across query groups using zero-concentrated DP (zCDP) composition, then generates records conditioned on each group's schema and statistics. This allows the model to focus its budget on the metrics that matter for each query type rather than spreading noise uniformly.
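The vanilla selection step can be sketched as nearest-candidate voting over a noised histogram (a simplification of the published algorithm; names and signatures are our own):

```python
import random

def pe_select(candidates, real_records, distance, epsilon, k, seed=0):
    """One Private Evolution round (simplified): each real record votes for
    its nearest candidate, Laplace noise protects the vote histogram, and
    the top-k noisy vote-getters seed the next LLM generation."""
    rng = random.Random(seed)
    votes = [0.0] * len(candidates)
    for r in real_records:
        nearest = min(range(len(candidates)),
                      key=lambda i: distance(r, candidates[i]))
        votes[nearest] += 1
    # Each record casts exactly one vote, so the histogram has sensitivity 1:
    # Laplace(1/epsilon) noise suffices (Exp(eps) - Exp(eps) is Laplace).
    noisy = [v + rng.expovariate(epsilon) - rng.expovariate(epsilon)
             for v in votes]
    ranked = sorted(range(len(candidates)), key=lambda i: noisy[i], reverse=True)
    return [candidates[i] for i in ranked[:k]]
```

Only the voting touches real data, which is what makes the method training-free: the LLM itself never sees a private record.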
The main limitation of both variants is dependence on the LLM's training data. Categories or values unseen by the model cannot be synthesized, which creates coverage gaps for rare or domain-specific values.
Vanilla PE:

- **Approach:** LLM-based (training-free)
- **Strategy:** Random generation + DP selection
- **Mechanism:** LLM candidates + histogram voting
- **Strength:** No training required
- **Weakness:** Noise spread uniformly

Conditional PE:

- **Approach:** LLM-based (training-free)
- **Strategy:** Query-aware generation + zCDP split
- **Mechanism:** Conditioned LLM + budget partitioning
- **Strength:** Workload-aware, lowest norm. error
- **Weakness:** LLM knowledge ceiling
Privacy budget comparison
All methods use $\epsilon = 4.0$, $\delta = 10^{-5}$. Training-based methods (DP-SGD) track cumulative privacy loss via Rényi DP accountants. Training-free methods (PE) spend their budget in the DP selection step's noisy histogram voting; conditional PE additionally composes across query groups via zCDP.
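The budget arithmetic behind conditional PE's split can be sketched with the standard zCDP conversion $\epsilon = \rho + 2\sqrt{\rho \ln(1/\delta)}$; because zCDP composes additively, the total $\rho$ divides cleanly across query groups (the five-group split below is illustrative, not the exact partition we used):

```python
import math

def rho_from_eps_delta(epsilon, delta):
    """Invert eps = rho + 2*sqrt(rho * log(1/delta)) for the zCDP budget rho.

    Substituting x = sqrt(rho) gives (x + sqrt(L))^2 = eps + L with
    L = log(1/delta), so x = sqrt(eps + L) - sqrt(L).
    """
    log_inv_delta = math.log(1.0 / delta)
    sqrt_rho = math.sqrt(log_inv_delta + epsilon) - math.sqrt(log_inv_delta)
    return sqrt_rho ** 2

def eps_from_rho(rho, delta):
    """Standard zCDP -> (eps, delta)-DP conversion."""
    return rho + 2.0 * math.sqrt(rho * math.log(1.0 / delta))

# zCDP composes additively, so an even split over 5 query groups
# simply gives each group one fifth of the total rho.
rho_total = rho_from_eps_delta(4.0, 1e-5)
per_group = rho_total / 5
```

Splitting in $\rho$-space rather than $\epsilon$-space is what makes the accounting tight: converting each group's $\rho$ back to $\epsilon$ individually would overstate the total cost.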
What the team built vs. reused
The team built the 21-query benchmark, the per-table DP-SGD pipeline, the Private Evolution integration, and the evaluation harness. We reused the MST implementation from the private-pgm library, the VAE architecture adapted from SDV, and Opacus for DP-SGD gradient clipping.
Results
Overall pass rates
Wide-table DP-VAE and both PE variants were evaluated on 10 single-table queries. Per-table DP-SGD and MST were evaluated on all 21.
| Method | Pass rate | Highlight |
|---|---|---|
| Per-table DP-SGD | 6 / 21 | Best on aggregates ($\text{RE} = 0.54$) |
| MST (Marginal) | 6 / 21 | Best on distributions ($\text{TV} = 0.105$) |
| Wide-table DP-VAE | 1 / 10 | Sparsity collapse |
| PE (Conditional) | 3 / 10 | Vanilla baseline: 1 / 10 |
Continuous performance metrics
Beyond pass/fail, we report average scores and error metrics to understand how close each method gets.
| Method | Avg score | Median norm. error | Best query type |
|---|---|---|---|
| Per-table DP-SGD | 0.303 | 2.93 | Agg+Join (RE = 0.54) |
| MST | 0.328 | 10.76 | Histogram (TV = 0.105) |
| PE (Conditional) | 0.306 | 3.79 | Pivot (SleepPivot = 0.818) |
| Wide-table DP-VAE | 0.206 | 16.42 | Top-k (BrowserRank = 1.0) |
| PE (Vanilla) | 0.137 | N/A | Histogram (RamHist = 0.5) |
Avg score ranges $0$–$1$ (higher is better, $1$ = perfect fidelity). Median normalized error measures how far off the synthetic data is on average (lower is better).
Per-query breakdown
Full scores for all 21 queries, including both PE variants (vanilla baseline and conditional), are reported per query: green indicates a pass, red a fail, amber a partial result, and gray a query that was not evaluated.
Key insights
What worked
Simple distributions like histogram queries for browser usage and RAM utilization passed for MST and per-table DP-SGD, with Total Variation distances below 0.15. Single-table aggregates on independent features within one table were well-preserved by per-table DP-SGD. MST also excelled at queries with high-cardinality group-by keys, where marginal-based synthesis proved effective. Conditional PE achieved the second-highest average score (0.306), matching per-table DP-SGD, and produced the lowest normalized error on 4 of 9 evaluated queries (RamHist, XeonGeo, SleepPivot, BatteryDemo).
What failed
Join-heavy aggregates requiring cross-table joins failed across all methods due to foreign-key mismatches. DP-VAE on the wide table suffered from zero-inflation, and continuous distributions collapsed to near-zero values. Vanilla PE achieved only 1 of 10 passes, showing that random generation without query-aware budget allocation spreads noise too thinly. Both PE variants could not synthesize categories unseen in the LLM's training data, leaving knowledge gaps for rare values.
Differences in pass rates of 2 or more queries are significant, while score differences below 0.05 are within noise. The gap between Per-table/MST (6/21) and Wide/PE (1 to 3 out of 10) reflects a fundamental structural limitation, not just noise. Conditional PE's improvement over vanilla (3/10 vs. 1/10) shows that query-aware budget allocation meaningfully boosts utility.
Failure modes
1. Sparsity collapse (Wide-table DP-VAE)
Flattening 8 tables into one wide table creates extreme sparsity. The VAE encodes most columns as near-zero, collapsing continuous metrics (power, network bytes, RAM) to degenerate distributions.
2. Independence assumption (Per-table DP-SGD, MST)
Generating each table independently destroys foreign-key relationships. Cross-table joins produce mismatched groups, inflating relative error and reducing Jaccard similarity below thresholds.
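A toy illustration of the breakage (all numbers are synthetic, not from the DCA corpus): sampling each table's keys independently leaves many foreign keys without a join partner:

```python
import random

def join_coverage(fk_values, pk_values):
    """Fraction of foreign keys in one table that find a partner
    in another table's primary keys."""
    pks = set(pk_values)
    return sum(1 for v in fk_values if v in pks) / len(fk_values)

# Real data: every Battery row references an existing System row.
real_systems = list(range(100))
real_battery = random.Random(0).sample(real_systems, 60)

# Independent per-table synthesis draws each table's IDs from its own
# (noisy) marginal over a shared key space, so the keys no longer align.
synth_systems = random.Random(1).sample(range(200), 100)
synth_battery = random.Random(2).sample(range(200), 60)
```

With a shared key space of 200 and each synthetic table covering only half of it, roughly half the synthetic Battery rows join to nothing, which is exactly the inflated-error, low-Jaccard pattern we observe on join queries.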
3. LLM knowledge ceiling (Private Evolution)
The LLM does not know all device vendors, OS versions, or geographic regions in the real dataset. This causes 0% group coverage on queries involving rare categories. Conditional PE mitigates this partially by focusing generation on specific column groups, but fundamental coverage gaps remain.
4. DP noise masking signal (all methods)
At $\epsilon = 4.0$, privacy noise is large enough to mask subtle correlations (e.g., CPU frequency vs. power consumption) and rare subgroups.
(Charts in the interactive version: pass rate by method; average score by query type.)
What changed after early attempts
We started with only the wide-table DP-VAE, which achieved a pass rate of just 1 out of 10 and prompted us to explore per-table splitting. Splitting improved single-table queries dramatically but revealed the join problem. Adding the MST histogram baseline matched per-table DP-SGD on pass rate (6/21) but with different query strengths. Vanilla PE showed promise but was limited by uniform noise allocation (1/10 passes). Conditional PE, which partitions the privacy budget across query groups using zCDP composition, tripled the pass rate to 3/10 and achieved the lowest normalized error on 4 of 9 evaluated queries. The overarching takeaway is that no single method dominates, and a hybrid routing strategy using different methods for different query types is the most promising path forward.
Discussion
Key takeaways
1. Per-table independence breaks cross-table queries
Per-table DP-SGD and MST generated high-fidelity data within individual tables but failed on queries joining multiple tables. Foreign-key mismatches caused inflated relative errors and low Jaccard overlap.
Protecting per-table privacy is not sufficient for relational workloads. Future methods must reason about foreign-key relationships directly.
2. Wide-table sparsity collapses continuous metrics
Flattening all tables into one wide table introduced extreme sparsity. The VAE encoded most columns as near-zero, destroying continuous distributions (power consumption, network throughput, RAM usage).
3. Training-free methods have a knowledge ceiling
Both PE variants relied on an LLM's pre-trained knowledge. When real data contained categories unseen by the model (geographic regions, rare OS versions, niche applications), the method produced 0% group coverage on those queries. Conditional PE improved utility through smarter budget allocation but could not overcome fundamental coverage gaps in the LLM's training data.
4. Simple distributions are tractable
Queries on categorical distributions with few categories (browser usage, vendor percentages) passed across multiple methods. This suggests distribution synthesis is largely solved for non-sparse data.
5. DP noise is necessary but limiting
Even at $\epsilon = 4.0$ (moderate privacy), noise masked subtle correlations and rare subgroups. Query-specific privacy budgets via advanced composition may help allocate noise more efficiently.
Limitations
We only evaluated at a single privacy budget of $\epsilon = 4.0$, and performance at stricter budgets ($\epsilon < 1$) may differ substantially. Our results are also specific to the Intel DCA corpus, and other domains like healthcare or finance may show different method rankings. We did not implement methods that reason about foreign keys during synthesis, such as DP joins or relational GANs. Private Evolution used a general-purpose LLM, and domain-specific fine-tuning could improve category coverage. Finally, our 21 queries, while diverse, may not cover all production workload patterns, so results should be interpreted as indicative rather than exhaustive.
Method comparison
| Aspect | Per-table DP-SGD | Wide-table DP-VAE | MST | PE (Conditional) |
|---|---|---|---|---|
| Within-table quality | High | Low | High | Medium |
| Cross-table fidelity | Poor | Medium | Poor | N/A |
| Training required | Yes (per table) | Yes (large model) | No | No |
| Speed | Medium | Slow | Fast | Fast (if LLM cached) |
| Interpretability | Low | Low | High | Low |
| Workload awareness | No | No | No | Yes (zCDP split) |
Recommendations for practitioners
No single method dominates. We recommend routing queries to different methods based on type. For simple aggregates and distributions, MST is fast, interpretable, and achieves good TV scores. For row-level analytics, per-table DP-SGD offers the best within-table fidelity. For ranking queries, MST provides good Jaccard overlap on top-k items. For targeted single-table workloads, conditional PE offers competitive accuracy with the lowest median normalized error (3.79) of any method we tested. Cross-table joins remain an open challenge that requires future relational DP methods, as current approaches are insufficient.
Future directions
Relational differential privacy
Develop methods that reason about foreign-key relationships directly, applying DP at the join level. This is the most impactful open problem identified by our work.
Domain-adapted LLMs for PE
Fine-tune LLMs on domain-specific schemas and category lists to improve PE's coverage of rare values. Conditional PE's query-aware budget partitioning could be extended with more granular group definitions and adaptive allocation strategies.
Multi-epsilon evaluation
Characterize the privacy-utility frontier by sweeping $\epsilon$ from $0.1$ to $10$ for each method.
Broader datasets
Validate findings on healthcare, financial, and census datasets to test generalizability.
Conclusion
Training-based and training-free DP approaches each have strengths and weaknesses, and there is no silver bullet. Our evaluation shows that simple distributions are tractable while complex joins are not yet solved. Per-table and wide-table synthesis represent opposite trade-offs. Conditional PE demonstrates that query-aware budget allocation can meaningfully improve training-free methods, tripling the pass rate from 1/10 to 3/10 and achieving competitive normalized error. Hybrid query-aware routing significantly improves overall performance. Future work in relational DP and workload-aware composition will be critical for closing the remaining gaps.
References
Dwork, C., and Roth, A. (2014). The Algorithmic Foundations of Differential Privacy. Foundations and Trends in Theoretical Computer Science.
Abadi, M., et al. (2016). Deep Learning with Differential Privacy. CCS 2016.
McKenna, R., Miklau, G., and Sheldon, D. (2021). Winning the NIST Contest: A Scalable and General Approach to Differentially Private Synthetic Data. VLDB 2021.
Xie, Z., et al. (2024). Private Evolution: Training-Free Differentially Private Synthetic Data via LLMs.