Generating Synthetic Data with Formal Privacy Guarantees: State of the Art and the Road Ahead
Schlegel, Viktor, Bharath, Anil A, Zhao, Zilong, Yee, Kevin
arXiv.org Artificial Intelligence
Privacy-preserving synthetic data offers a promising way to harness segregated data in high-stakes domains where information is compartmentalized for regulatory, privacy, or institutional reasons. This survey provides a comprehensive framework for understanding the landscape of privacy-preserving synthetic data, presenting the theoretical foundations of generative models and differential privacy, followed by a review of state-of-the-art methods across tabular data, images, and text. Our synthesis of evaluation approaches highlights the fundamental trade-off between utility for downstream tasks and privacy guarantees, while identifying critical research gaps: the lack of realistic benchmarks representing specialized domains and a shortage of the empirical evaluations required to contextualize formal guarantees. Through empirical analysis of four leading methods on five real-world datasets from specialized domains, we demonstrate significant performance degradation under realistic privacy constraints ($\epsilon \leq 4$), revealing a substantial gap between results reported on general-domain benchmarks and performance on domain-specific data. Our findings highlight key challenges including unaccounted privacy leakage, insufficient empirical verification of formal guarantees, and a critical deficit of realistic benchmarks. These challenges underscore the need for robust evaluation frameworks, standardized benchmarks for specialized domains, and improved techniques to address the unique requirements of privacy-sensitive fields so that this technology can deliver on its considerable potential.
Mar-26-2025
- Country:
- North America > United States (0.93)
- Genre:
- Overview (1.00)
- Research Report > Promising Solution (0.86)
- Industry:
- Banking & Finance (0.67)
- Government (0.67)
- Health & Medicine
- Diagnostic Medicine > Imaging (0.92)
- Health Care Technology (0.92)
- Therapeutic Area (1.00)
- Information Technology > Security & Privacy (1.00)
- Law (0.92)
- Technology:
- Information Technology
- Artificial Intelligence
- Machine Learning
- Learning Graphical Models (0.67)
- Neural Networks > Deep Learning (1.00)
- Performance Analysis > Accuracy (1.00)
- Statistical Learning (1.00)
- Natural Language
- Chatbot (0.67)
- Large Language Model (1.00)
- Representation & Reasoning (1.00)
- Communications (1.00)
- Data Science > Data Mining
- Big Data (0.86)
- Information Management (1.00)
- Knowledge Management (1.00)
- Security & Privacy (1.00)