Does Training with Synthetic Data Truly Protect Privacy?

Yunpeng Zhao and Jie Zhang

arXiv.org Artificial Intelligence 

As synthetic data becomes increasingly popular in machine learning tasks, numerous methods--without formal differential privacy guarantees--use synthetic data for training. These methods often claim, either explicitly or implicitly, to protect the privacy of the original training data. In this work, we explore four different training paradigms: coreset selection, dataset distillation, data-free knowledge distillation, and synthetic data generated from diffusion models. While all these methods utilize synthetic data for training, they lead to vastly different conclusions regarding privacy preservation. We caution that empirical approaches to preserving data privacy require careful and rigorous evaluation; otherwise, they risk providing a false sense of privacy.

Synthetic data is increasingly utilized for training machine learning (ML) models, especially in situations where real-world data is scarce, sensitive, costly to obtain, or subject to regulations such as GDPR (GDPR.eu). Synthetic data is particularly beneficial in scenarios where data distributions are atypical, such as in federated learning with non-IID data (Zhang et al., 2023c), long-tailed learning (Shin et al., 2023), and continual learning (Meng et al., 2024). It enables the creation of diverse datasets that include edge cases or rare events that may be underrepresented in real-world data. Consequently, training models with synthetic data has proven beneficial for enhancing model robustness and adaptability across a wide range of real-world scenarios.

Many empirical methods--without formal differential privacy guarantees--rely on synthetic data for training, such as coreset selection (Feldman, 2020), dataset distillation (Wang et al., 2018), data-free knowledge distillation (Yin et al., 2020), and synthetic data generated from diffusion models (Yuan et al., 2024). This proxy data can be directly sampled from private sources (Guo et al., 2022; Mirzasoleiman et al., 2020) or from out-of-distribution sources (Wang et al., 2023), iteratively optimized (Zhang et al., 2023d; Zhao et al., 2020), or generated with GANs (Karras et al., 2019) and diffusion models (Rombach et al., 2022). Since the trained model may never encounter the private training data directly, and the synthetic images are often visually distinct from the original private data, these methods often claim to preserve privacy while maintaining satisfactory performance.

In this work, we aim to address the following question: Does training with synthetic data truly protect privacy? To rigorously measure the privacy leakage of empirical methods trained on synthetic data, we use membership inference attacks (Shokri et al., 2017) as a privacy auditing tool.
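As a rough illustration of how such an audit can be scored, the sketch below implements a simple loss-threshold membership inference attack in PyTorch, ranking examples by how confidently a model fits them. This is a minimal sketch under stated assumptions, not the authors' auditing pipeline: `model`, `member_loader`, and `nonmember_loader` are hypothetical placeholders for a network trained on synthetic data, the original private training samples, and held-out non-member samples, and stronger attack variants are typically used for rigorous auditing.

```python
# Minimal loss-threshold membership inference sketch in the spirit of
# Shokri et al. (2017); an illustrative assumption, not the paper's exact attack.
import torch
import torch.nn.functional as F
from sklearn.metrics import roc_auc_score


@torch.no_grad()
def per_sample_losses(model, loader, device="cpu"):
    """Collect per-example cross-entropy losses, the attack's membership signal."""
    model.eval()
    losses = []
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        logits = model(x)
        losses.append(F.cross_entropy(logits, y, reduction="none").cpu())
    return torch.cat(losses)


def membership_auc(model, member_loader, nonmember_loader, device="cpu"):
    """Score members vs. non-members by negative loss and report the attack AUC."""
    member_scores = -per_sample_losses(model, member_loader, device)
    nonmember_scores = -per_sample_losses(model, nonmember_loader, device)
    scores = torch.cat([member_scores, nonmember_scores]).numpy()
    labels = torch.cat([torch.ones_like(member_scores),
                        torch.zeros_like(nonmember_scores)]).numpy()
    # AUC near 0.5 suggests little measurable leakage under this weak attack;
    # values well above 0.5 indicate the model reveals membership information.
    return roc_auc_score(labels, scores)
```

In this toy setup, `member_loader` would iterate over the private data that produced the synthetic training set, and `nonmember_loader` over fresh samples from the same distribution; the gap between their loss distributions is what the attack exploits.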
