Generating Synthetic Data with Formal Privacy Guarantees: State of the Art and the Road Ahead

Schlegel, Viktor, Bharath, Anil A, Zhao, Zilong, Yee, Kevin

Mar-26-2025–arXiv.org Artificial Intelligence

Privacy-preserving synthetic data offers a promising solution to harness segregated data in high-stakes domains where information is compartmentalized for regulatory, privacy, or institutional reasons. This survey provides a comprehensive framework for understanding the landscape of privacy-preserving synthetic data, presenting the theoretical foundations of generative models and differential privacy followed by a review of state-of-the-art methods across tabular data, images, and text. Our synthesis of evaluation approaches highlights the fundamental trade-off between utility for downstream tasks and privacy guarantees, while identifying critical research gaps: the lack of realistic benchmarks representing specialized domains and insufficient empirical evaluations required to contextualise formal guarantees. Through empirical analysis of four leading methods on five real-world datasets from specialized domains, we demonstrate significant performance degradation under realistic privacy constraints ($\epsilon \leq 4$), revealing a substantial gap between results reported on general domain benchmarks and performance on domain-specific data. Our findings highlight key challenges including unaccounted privacy leakage, insufficient empirical verification of formal guarantees, and a critical deficit of realistic benchmarks. These challenges underscore the need for robust evaluation frameworks, standardized benchmarks for specialized domains, and improved techniques to address the unique requirements of privacy-sensitive fields so that this technology can deliver on its considerable potential.
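To give a rough sense of what a privacy budget such as $\epsilon \leq 4$ means, the sketch below applies the classic Laplace mechanism to a counting query. This is a textbook illustration of pure $\epsilon$-differential privacy, not one of the methods evaluated in the survey; the function names (`laplace_scale`, `private_count`, `privacy_loss_bound`) and the sensitivity of 1 are assumptions made for the example.

```python
import math
import random


def laplace_scale(sensitivity: float, epsilon: float) -> float:
    """Noise scale b = sensitivity / epsilon used by the Laplace mechanism."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return sensitivity / epsilon


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def private_count(true_count: int, epsilon: float, rng: random.Random) -> float:
    """Release a count with epsilon-DP, assuming one record changes it by at most 1."""
    return true_count + laplace_noise(laplace_scale(1.0, epsilon), rng)


def privacy_loss_bound(epsilon: float) -> float:
    """Worst-case ratio exp(epsilon) between output probabilities on neighbouring datasets."""
    return math.exp(epsilon)


if __name__ == "__main__":
    rng = random.Random(0)
    for eps in (0.5, 1.0, 4.0):
        released = private_count(1000, eps, rng)
        print(f"epsilon={eps}: noise scale={laplace_scale(1.0, eps):.2f}, "
              f"ratio bound={privacy_loss_bound(eps):.1f}, released={released:.2f}")
```

A smaller $\epsilon$ means a larger noise scale and a tighter bound on how much any single record can influence the output, which is one simple view of the utility/privacy trade-off the abstract refers to. Deep generative models are typically trained with other mechanisms, so the numbers here are only indicative.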

knowledge management, large language model, machine learning

Genre:
- Overview (1.00)
- Research Report > Promising Solution (0.86)

Industry:
- Law (1.00)
- Information Technology > Security & Privacy (1.00)
- Government (0.92)
- Banking & Finance (0.67)
- Health & Medicine
  - Therapeutic Area (1.00)
  - Health Care Technology (0.92)
  - Diagnostic Medicine > Imaging (0.92)

Technology:
- Information Technology
  - Security & Privacy (1.00)
  - Knowledge Management (1.00)
  - Information Management (1.00)
  - Communications (1.00)
  - Data Science > Data Mining
    - Big Data (0.86)
  - Artificial Intelligence
    - Representation & Reasoning (1.00)
    - Natural Language
      - Large Language Model (1.00)
      - Chatbot (0.67)
    - Machine Learning
      - Statistical Learning (1.00)
      - Performance Analysis > Accuracy (1.00)
      - Neural Networks > Deep Learning (1.00)
      - Learning Graphical Models (0.67)
