Training-based versus training-free
differential privacy for data synthesis
Mentor: Yu-Xiang Wang
Differentially private synthetic data generation promises to resolve the tension between data utility and individual privacy, enabling the release of datasets that preserve the statistical properties analysts need while bounding what any adversary can learn about a single record. Two paradigms have emerged. Training-based methods inject calibrated noise during model optimization. Training-free methods leverage foundation models through black-box API access. We investigate both approaches on Intel’s Driver and Client Applications (DCA) telemetry corpus, evaluating against a benchmark of 21 analytical SQL queries representative of production business intelligence workloads.
Introduction
Why does this matter?
Device telemetry, including hardware specs, usage patterns, and network activity, helps product teams improve software, but it contains sensitive information about individuals. Releasing raw telemetry poses privacy risks. Differential privacy (DP) lets us publish synthetic data that preserves statistical utility while mathematically bounding the information leaked about any single record.
The core question we investigate is which DP synthesis approach best preserves the utility of production SQL workloads on multi-table relational data.
The privacy-utility tension
Differential privacy adds carefully calibrated noise so that an algorithm’s output is nearly identical whether or not any single record is in the dataset. The privacy “cost” is measured by a parameter $\epsilon$ (epsilon). Smaller $\epsilon$ means stronger privacy but potentially lower utility. We use $(\epsilon, \delta)$-differential privacy with $\epsilon = 4.0$ and $\delta = 10^{-5}$, matched across all methods for fair comparison.
Formal definition
A randomized mechanism $M$ satisfies $(\epsilon, \delta)$-DP if, for all neighboring datasets $D$, $D'$ (differing in one record) and all measurable sets $S$,
$$\Pr[M(D) \in S] \le e^{\epsilon} \cdot \Pr[M(D') \in S] + \delta$$
In practice, DP-SGD tracks privacy loss via Rényi Differential Privacy (RDP) accountants and converts to $(\epsilon, \delta)$ at the end of training.
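To make the calibration concrete, the classical Gaussian mechanism ties the noise scale directly to $(\epsilon, \delta)$ and the query's sensitivity. A minimal sketch follows; the closed form below is strictly proven only for $\epsilon < 1$, and RDP accountants give tighter, more general bounds for composed mechanisms like DP-SGD:

```python
import math

def gaussian_sigma(epsilon: float, delta: float, sensitivity: float = 1.0) -> float:
    """Noise scale for the classical Gaussian mechanism.

    Calibrates sigma so that adding N(0, sigma^2) noise to a query with the
    given L2 sensitivity satisfies (epsilon, delta)-DP (the Dwork-Roth bound).
    """
    return sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / epsilon

# Stronger privacy (smaller epsilon) demands proportionally more noise.
sigma_strict = gaussian_sigma(epsilon=1.0, delta=1e-5)
sigma_paper = gaussian_sigma(epsilon=4.0, delta=1e-5)  # our evaluation setting
```

At our setting of $\epsilon = 4.0$, $\delta = 10^{-5}$, this rule of thumb gives a noise scale of roughly $1.2$ per unit of sensitivity.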
Two paradigms for synthetic data
Training-based methods inject calibrated noise during model optimization, as in DP-SGD with VAEs. Training-free methods leverage foundation models through black-box API access, as in Private Evolution with LLMs. We compare four concrete instantiations of these paradigms on a realistic multi-table telemetry benchmark.
Data
Intel DCA telemetry corpus
We use the Intel DCA corpus, a multi-table telemetry dataset with 8 relational tables covering hardware specifications, software configurations, network activity, power consumption, and user behavior across thousands of devices.
| Table | Rows | Columns | Description |
|---|---|---|---|
| System | ~7,700 | 12 | Device hardware and OS metadata |
| Battery | ~4,800 | 8 | Battery health and charge cycles |
| Display | ~7,700 | 6 | Display vendor, resolution, type |
| Network | ~7,700 | 10 | Network usage (bytes sent/received) |
| Power | ~7,700 | 8 | Power consumption metrics |
| Application | ~83,000 | 5 | Installed applications per device |
| Detection | ~12,000 | 4 | Security detections per device |
| WaitState | ~5,400 | 6 | CPU wait state analysis |
Tables are linked by a shared system_id foreign key. This relational structure is central
to our evaluation, and many benchmark queries require joining 2 to 3 tables.
The 21-query SQL benchmark
We designed a benchmark of 21 SQL queries spanning five types, each with type-appropriate metrics and pass/fail thresholds.
| Type | Count | Primary metric | Example |
|---|---|---|---|
| Agg+Join | 6 | Relative Error (RE) | Avg power by country and CPU family |
| Geo/Demo | 4 | RE + Group Coverage | Battery health by geography |
| Histogram | 2 | Total Variation (TV) | Browser usage distribution |
| Pivot | 2 | RE + Jaccard | Persona breakdown by web category |
| Top-k | 7 | Spearman + Jaccard | Top 10 applications by system count |
Formal benchmark definition
| Metric | Applies to | Pass threshold | Interpretation |
|---|---|---|---|
| Relative Error | Aggregates | $\le 0.3$ (median) | Lower is better. Synthetic aggregates close to real. |
| Total Variation | Distributions | $\le 0.15$ | Lower is better. Distributions overlap well. |
| Jaccard Similarity | Group coverage | $\ge 0.5$ | Higher is better. Same groups appear in both. |
| Spearman Rho | Rankings | $\ge 0.7$ | Higher is better. Same items ranked similarly. |
A query passes only if all its relevant metrics meet their thresholds.
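The pass rule can be expressed as a small checker. The thresholds mirror the table above; the metric keys and function name are our own illustration, not the actual harness's API:

```python
# Thresholds mirroring the benchmark table (direction, bound).
THRESHOLDS = {
    "relative_error":  ("max", 0.30),   # lower is better
    "total_variation": ("max", 0.15),   # lower is better
    "jaccard":         ("min", 0.50),   # higher is better
    "spearman":        ("min", 0.70),   # higher is better
}

def query_passes(metrics: dict) -> bool:
    """A query passes only if every metric it reports meets its threshold."""
    for name, value in metrics.items():
        direction, bound = THRESHOLDS[name]
        if direction == "max" and value > bound:
            return False
        if direction == "min" and value < bound:
            return False
    return True
```

Note that a query reporting several metrics (e.g. Top-k with both Spearman and Jaccard) fails as soon as any one of them misses its bound.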
Data handling
Our setup has no train-test leakage concern: synthetic data is generated from the full dataset, and the benchmark queries measure statistical fidelity rather than predictive generalization, so no held-out split is needed. All published results are aggregate statistics or synthetic records; no individual record is ever exposed. During preprocessing, continuous columns are min-max normalized to [0, 1], categorical columns are label-encoded, and missing values are imputed per table.
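A minimal sketch of that per-column preprocessing (imputation omitted; pure Python for illustration, not the actual pipeline code):

```python
def preprocess_column(values):
    """Min-max normalize numeric columns to [0, 1]; label-encode strings.

    Simplified sketch of the per-table preprocessing described above.
    """
    if all(isinstance(v, (int, float)) for v in values):
        lo, hi = min(values), max(values)
        span = (hi - lo) or 1.0          # guard against constant columns
        return [(v - lo) / span for v in values]
    # Categorical: stable label encoding via sorted unique values.
    mapping = {v: i for i, v in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values]
```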
Methods
1. Wide-table DP-VAE
Flatten all tables into a single wide table (left-join on system_id), then train a
Variational Autoencoder (VAE) with DP-SGD. The VAE learns a compressed latent representation and
generates synthetic records that preserve cross-table structure.
This approach preserves foreign-key relationships by design. However, extreme sparsity from flattening (most columns are null) causes zero-inflation and mode collapse.
- **Approach:** Training-based
- **Strategy:** Flatten all tables, train single model
- **Mechanism:** VAE encoder-decoder + DP-SGD
- **Strength:** Preserves cross-table joins
- **Weakness:** Extreme sparsity, mode collapse
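Both training-based methods share the same underlying DP-SGD mechanism: clip each per-example gradient, average, and add Gaussian noise scaled to the clipping bound. A pure-Python sketch of one step (this illustrates the mechanism Opacus implements, not its API):

```python
import math
import random

def clip(grad, clip_norm):
    """Scale a gradient vector down so its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, clip_norm / max(norm, 1e-12))
    return [g * scale for g in grad]

def dp_sgd_grad(per_example_grads, clip_norm=1.0, noise_multiplier=1.1, seed=0):
    """One DP-SGD step's noisy gradient estimate: clip per-example gradients,
    average, then add Gaussian noise calibrated to the clipping bound."""
    rng = random.Random(seed)
    n = len(per_example_grads)
    clipped = [clip(g, clip_norm) for g in per_example_grads]
    dim = len(clipped[0])
    mean = [sum(g[i] for g in clipped) / n for i in range(dim)]
    sigma = noise_multiplier * clip_norm / n
    return [m + rng.gauss(0.0, sigma) for m in mean]
```

Because the noise scales with `clip_norm / n`, larger batches see proportionally less per-step noise, which is one reason DP-SGD favors large batch sizes.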
2. Per-table DP-SGD
Train a separate VAE for each table independently using DP-SGD. This avoids the sparsity problem of wide-table flattening but breaks cross-table relationships.
This approach achieves high within-table fidelity, and each model is simpler and faster to train. The downside is foreign-key mismatches, where a synthetic CPU record may reference a non-existent system.
- **Approach:** Training-based
- **Strategy:** Independent model per table
- **Mechanism:** Per-table VAE + DP-SGD
- **Strength:** High within-table fidelity
- **Weakness:** Breaks foreign-key joins
3. MST (Maximum Spanning Tree)
MST is a histogram-based baseline. It estimates low-dimensional marginals under DP, builds a dependency graph (Chow-Liu tree), and samples synthetic records that are consistent with the noisy marginals. It is interpretable, fast, and works well for categorical data with clear dependencies. On the other hand, it scales exponentially in table width and struggles with many continuous or high-cardinality columns.
- **Approach:** Statistical (training-free)
- **Strategy:** Marginal estimation + dependency tree
- **Mechanism:** Noisy marginals + Chow-Liu tree
- **Strength:** Fast, interpretable
- **Weakness:** Scales poorly with high-cardinality
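The unit MST stitches into its dependency tree is a noisy low-dimensional marginal. A sketch for a single 1-way marginal, with Laplace noise drawn as a difference of exponentials (simplified; not the private-pgm API):

```python
import random

def noisy_marginal(values, categories, epsilon, seed=0):
    """Laplace-noised 1-way marginal, normalized into a sampling distribution."""
    rng = random.Random(seed)
    counts = {c: 0 for c in categories}
    for v in values:
        counts[v] += 1
    # Counting histograms have L1 sensitivity 1, so Laplace(1/epsilon)
    # noise suffices; Exp(eps) - Exp(eps) is distributed Laplace(0, 1/eps).
    noisy = {c: n + rng.expovariate(epsilon) - rng.expovariate(epsilon)
             for c, n in counts.items()}
    # Clamp negatives and renormalize so the result is a valid distribution.
    total = sum(max(v, 0.0) for v in noisy.values()) or 1.0
    return {c: max(v, 0.0) / total for c, v in noisy.items()}
```

MST then measures a tree of such marginals (1-way and 2-way) and samples records consistent with them, which is why it handles low-cardinality categorical structure well but degrades as the domain grows.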
4. Private Evolution
Private Evolution is a training-free approach. A large language model (LLM) generates candidate synthetic records, and a DP selection mechanism keeps only the candidates that best match real data distributions. No gradient-based training is required, and the method can target specific query workloads.
We evaluate two variants. Vanilla PE generates records randomly and applies DP histogram voting to select the closest candidates. Conditional PE partitions the privacy budget across query groups using zero-concentrated DP (zCDP) composition, then generates records conditioned on each group's schema and statistics. This allows the model to focus its budget on the metrics that matter for each query type rather than spreading noise uniformly.
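The vanilla selection step can be sketched as nearest-candidate voting over a noised histogram (a simplification of the published algorithm; names and signatures are our own):

```python
import random

def pe_select(candidates, real_records, distance, epsilon, k, seed=0):
    """One Private Evolution round (simplified): each real record votes for
    its nearest candidate, Laplace noise protects the vote histogram, and
    the top-k noisy vote-getters seed the next LLM generation."""
    rng = random.Random(seed)
    votes = [0.0] * len(candidates)
    for r in real_records:
        nearest = min(range(len(candidates)),
                      key=lambda i: distance(r, candidates[i]))
        votes[nearest] += 1
    # Each record casts exactly one vote, so the histogram has sensitivity 1:
    # Laplace(1/epsilon) noise suffices (Exp(eps) - Exp(eps) is Laplace).
    noisy = [v + rng.expovariate(epsilon) - rng.expovariate(epsilon)
             for v in votes]
    ranked = sorted(range(len(candidates)), key=lambda i: noisy[i], reverse=True)
    return [candidates[i] for i in ranked[:k]]
```

Only the voting touches real data, which is what makes the method training-free: the LLM itself never sees a private record.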
The main limitation of both variants is dependence on the LLM's training data. Categories or values unseen by the model cannot be synthesized, which creates coverage gaps for rare or domain-specific values.
Vanilla PE:

- **Approach:** LLM-based (training-free)
- **Strategy:** Random generation + DP selection
- **Mechanism:** LLM candidates + histogram voting
- **Strength:** No training required
- **Weakness:** Noise spread uniformly

Conditional PE:

- **Approach:** LLM-based (training-free)
- **Strategy:** Query-aware generation + zCDP split
- **Mechanism:** Conditioned LLM + budget partitioning
- **Strength:** Workload-aware, lowest norm. error
- **Weakness:** LLM knowledge ceiling
Privacy budget comparison
All methods use $\epsilon = 4.0$, $\delta = 10^{-5}$. Training-based methods (DP-SGD) track cumulative privacy loss via Rényi DP accountants. Training-free methods (PE) spend their budget in the DP selection step's noisy histogram voting; conditional PE additionally composes across query groups via zCDP.
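The budget arithmetic behind conditional PE's split can be sketched with the standard zCDP conversion $\epsilon = \rho + 2\sqrt{\rho \ln(1/\delta)}$; because zCDP composes additively, the total $\rho$ divides cleanly across query groups (the five-group split below is illustrative, not the exact partition we used):

```python
import math

def rho_from_eps_delta(epsilon, delta):
    """Invert eps = rho + 2*sqrt(rho * log(1/delta)) for the zCDP budget rho.

    Substituting x = sqrt(rho) gives (x + sqrt(L))^2 = eps + L with
    L = log(1/delta), so x = sqrt(eps + L) - sqrt(L).
    """
    log_inv_delta = math.log(1.0 / delta)
    sqrt_rho = math.sqrt(log_inv_delta + epsilon) - math.sqrt(log_inv_delta)
    return sqrt_rho ** 2

def eps_from_rho(rho, delta):
    """Standard zCDP -> (eps, delta)-DP conversion."""
    return rho + 2.0 * math.sqrt(rho * math.log(1.0 / delta))

# zCDP composes additively, so an even split over 5 query groups
# simply gives each group one fifth of the total rho.
rho_total = rho_from_eps_delta(4.0, 1e-5)
per_group = rho_total / 5
```

Splitting in $\rho$-space rather than $\epsilon$-space is what makes the accounting tight: converting each group's $\rho$ back to $\epsilon$ individually would overstate the total cost.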
What the team built vs. reused
The team built the 21-query benchmark, the per-table DP-SGD pipeline, the Private Evolution integration, and the evaluation harness. We reused the MST implementation from the private-pgm library, the VAE architecture adapted from SDV, and Opacus for DP-SGD gradient clipping.
Results
Overall pass rates
Wide-table DP-VAE and both PE variants were evaluated on 10 single-table queries. Per-table DP-SGD and MST were evaluated on all 21.
| Method | Pass rate | Highlight |
|---|---|---|
| Per-table DP-SGD | 6 / 21 | Best on aggregates ($\text{RE} = 0.54$) |
| MST (Marginal) | 6 / 21 | Best on distributions ($\text{TV} = 0.105$) |
| Wide-table DP-VAE | 1 / 10 | Sparsity collapse |
| PE (Conditional) | 3 / 10 | Vanilla baseline: 1 / 10 |
Continuous performance metrics
Beyond pass/fail, we report average scores and error metrics to understand how close each method gets.
| Method | Avg score | Median norm. error | Best query type |
|---|---|---|---|
| Per-table DP-SGD | 0.303 | 2.93 | Agg+Join (RE = 0.54) |
| MST | 0.328 | 10.76 | Histogram (TV = 0.105) |
| PE (Conditional) | 0.306 | 3.79 | Pivot (SleepPivot = 0.818) |
| Wide-table DP-VAE | 0.206 | 16.42 | Top-k (BrowserRank = 1.0) |
| PE (Vanilla) | 0.137 | N/A | Histogram (RamHist = 0.5) |
Avg score ranges $0$–$1$ (higher is better, $1$ = perfect fidelity). Median normalized error measures how far off the synthetic data is on average (lower is better).
Per-query breakdown
Full scores for all 21 queries, including both PE variants (vanilla baseline and conditional), are reported per query: green indicates a pass, red a fail, amber a partial result, and gray a query that was not evaluated.
Key insights
What worked
Simple distributions like histogram queries for browser usage and RAM utilization passed for MST and per-table DP-SGD, with Total Variation distances below 0.15. Single-table aggregates on independent features within one table were well-preserved by per-table DP-SGD. MST also excelled at queries with high-cardinality group-by keys, where marginal-based synthesis proved effective. Conditional PE achieved the second-highest average score (0.306), matching per-table DP-SGD, and produced the lowest normalized error on 4 of 9 evaluated queries (RamHist, XeonGeo, SleepPivot, BatteryDemo).
What failed
Join-heavy aggregates requiring cross-table joins failed across all methods due to foreign-key mismatches. DP-VAE on the wide table suffered from zero-inflation, and continuous distributions collapsed to near-zero values. Vanilla PE achieved only 1 of 10 passes, showing that random generation without query-aware budget allocation spreads noise too thinly. Both PE variants could not synthesize categories unseen in the LLM's training data, leaving knowledge gaps for rare values.
Differences in pass rates of 2 or more queries are significant, while score differences below 0.05 are within noise. The gap between Per-table/MST (6/21) and Wide/PE (1 to 3 out of 10) reflects a fundamental structural limitation, not just noise. Conditional PE's improvement over vanilla (3/10 vs. 1/10) shows that query-aware budget allocation meaningfully boosts utility.
Failure modes
1. Sparsity collapse (Wide-table DP-VAE)
Flattening 8 tables into one wide table creates extreme sparsity. The VAE encodes most columns as near-zero, collapsing continuous metrics (power, network bytes, RAM) to degenerate distributions.
2. Independence assumption (Per-table DP-SGD, MST)
Generating each table independently destroys foreign-key relationships. Cross-table joins produce mismatched groups, inflating relative error and reducing Jaccard similarity below thresholds.
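A toy illustration of the breakage (all numbers are synthetic, not from the DCA corpus): sampling each table's keys independently leaves many foreign keys without a join partner:

```python
import random

def join_coverage(fk_values, pk_values):
    """Fraction of foreign keys in one table that find a partner
    in another table's primary keys."""
    pks = set(pk_values)
    return sum(1 for v in fk_values if v in pks) / len(fk_values)

# Real data: every Battery row references an existing System row.
real_systems = list(range(100))
real_battery = random.Random(0).sample(real_systems, 60)

# Independent per-table synthesis draws each table's IDs from its own
# (noisy) marginal over a shared key space, so the keys no longer align.
synth_systems = random.Random(1).sample(range(200), 100)
synth_battery = random.Random(2).sample(range(200), 60)
```

With a shared key space of 200 and each synthetic table covering only half of it, roughly half the synthetic Battery rows join to nothing, which is exactly the inflated-error, low-Jaccard pattern we observe on join queries.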
3. LLM knowledge ceiling (Private Evolution)
The LLM does not know all device vendors, OS versions, or geographic regions in the real dataset. This causes 0% group coverage on queries involving rare categories. Conditional PE mitigates this partially by focusing generation on specific column groups, but fundamental coverage gaps remain.
4. DP noise masking signal (all methods)
At $\epsilon = 4.0$, privacy noise is large enough to mask subtle correlations (e.g., CPU frequency vs. power consumption) and rare subgroups.
(Charts in the interactive version: pass rate by method; average score by query type.)
What changed after early attempts
We started with only the wide-table DP-VAE, which achieved a pass rate of just 1 out of 10 and prompted us to explore per-table splitting. Splitting improved single-table queries dramatically but revealed the join problem. Adding the MST histogram baseline matched per-table DP-SGD on pass rate (6/21) but with different query strengths. Vanilla PE showed promise but was limited by uniform noise allocation (1/10 passes). Conditional PE, which partitions the privacy budget across query groups using zCDP composition, tripled the pass rate to 3/10 and achieved the lowest normalized error on 4 of 9 evaluated queries. The overarching takeaway is that no single method dominates, and a hybrid routing strategy using different methods for different query types is the most promising path forward.
Discussion
Key takeaways
1. Per-table independence breaks cross-table queries
Per-table DP-SGD and MST generated high-fidelity data within individual tables but failed on queries joining multiple tables. Foreign-key mismatches caused inflated relative errors and low Jaccard overlap.
Protecting per-table privacy is not sufficient for relational workloads. Future methods must reason about foreign-key relationships directly.
2. Wide-table sparsity collapses continuous metrics
Flattening all tables into one wide table introduced extreme sparsity. The VAE encoded most columns as near-zero, destroying continuous distributions (power consumption, network throughput, RAM usage).
3. Training-free methods have a knowledge ceiling
Both PE variants relied on an LLM's pre-trained knowledge. When real data contained categories unseen by the model (geographic regions, rare OS versions, niche applications), the method produced 0% group coverage on those queries. Conditional PE improved utility through smarter budget allocation but could not overcome fundamental coverage gaps in the LLM's training data.
4. Simple distributions are tractable
Queries on categorical distributions with few categories (browser usage, vendor percentages) passed across multiple methods. This suggests distribution synthesis is largely solved for non-sparse data.
5. DP noise is necessary but limiting
Even at $\epsilon = 4.0$ (moderate privacy), noise masked subtle correlations and rare subgroups. Query-specific privacy budgets via advanced composition may help allocate noise more efficiently.
Limitations
We only evaluated at a single privacy budget of $\epsilon = 4.0$, and performance at stricter budgets ($\epsilon < 1$) may differ substantially. Our results are also specific to the Intel DCA corpus, and other domains like healthcare or finance may show different method rankings. We did not implement methods that reason about foreign keys during synthesis, such as DP joins or relational GANs. Private Evolution used a general-purpose LLM, and domain-specific fine-tuning could improve category coverage. Finally, our 21 queries, while diverse, may not cover all production workload patterns, so results should be interpreted as indicative rather than exhaustive.
Method comparison
| Aspect | Per-table DP-SGD | Wide-table DP-VAE | MST | PE (Conditional) |
|---|---|---|---|---|
| Within-table quality | High | Low | High | Medium |
| Cross-table fidelity | Poor | Medium | Poor | N/A |
| Training required | Yes (per table) | Yes (large model) | No | No |
| Speed | Medium | Slow | Fast | Fast (if LLM cached) |
| Interpretability | Low | Low | High | Low |
| Workload awareness | No | No | No | Yes (zCDP split) |
Recommendations for practitioners
No single method dominates. We recommend routing queries to different methods based on type. For simple aggregates and distributions, MST is fast, interpretable, and achieves good TV scores. For row-level analytics, per-table DP-SGD offers the best within-table fidelity. For ranking queries, MST provides good Jaccard overlap on top-k items. For targeted single-table workloads, conditional PE offers competitive accuracy with the lowest median normalized error (3.79) of any method we tested. Cross-table joins remain an open challenge that requires future relational DP methods, as current approaches are insufficient.
Future directions
Relational differential privacy
Develop methods that reason about foreign-key relationships directly, applying DP at the join level. This is the most impactful open problem identified by our work.
Domain-adapted LLMs for PE
Fine-tune LLMs on domain-specific schemas and category lists to improve PE's coverage of rare values. Conditional PE's query-aware budget partitioning could be extended with more granular group definitions and adaptive allocation strategies.
Multi-epsilon evaluation
Characterize the privacy-utility frontier by sweeping $\epsilon$ from $0.1$ to $10$ for each method.
Broader datasets
Validate findings on healthcare, financial, and census datasets to test generalizability.
Conclusion
Training-based and training-free DP approaches each have strengths and weaknesses, and there is no silver bullet. Our evaluation shows that simple distributions are tractable while complex joins are not yet solved. Per-table and wide-table synthesis represent opposite trade-offs. Conditional PE demonstrates that query-aware budget allocation can meaningfully improve training-free methods, tripling the pass rate from 1/10 to 3/10 and achieving competitive normalized error. Hybrid query-aware routing significantly improves overall performance. Future work in relational DP and workload-aware composition will be critical for closing the remaining gaps.
References
Dwork, C., and Roth, A. (2014). The Algorithmic Foundations of Differential Privacy. Foundations and Trends in Theoretical Computer Science.
Abadi, M., et al. (2016). Deep Learning with Differential Privacy. CCS 2016.
McKenna, R., Miklau, G., and Sheldon, D. (2021). Winning the NIST Contest: A Scalable and General Approach to Differentially Private Synthetic Data. VLDB 2021.
Xie, Z., et al. (2024). Private Evolution: Training-Free Differentially Private Synthetic Data via LLMs.