NeuroMixGDP: A Neural Collapse-Inspired Random Mixup for Private Data Release
Li, Donghao, Cao, Yang, Yao, Yuan
arXiv.org Artificial Intelligence
Private data publishing is a technique that involves releasing a modified dataset to preserve user privacy while enabling downstream machine learning tasks. While many private data publishing algorithms exist, traditional algorithms (e.g., DPPro [1], PrivBayes [2], etc.) based on releasing tabular data are not suitable for modern machine learning tasks involving complex structures such as images, videos, and texts. To tackle this, a series of deep learning algorithms have emerged, such as DP-GAN [3] and PATE-GAN [4], which are based on training a Deep Generative Model (DGM) to generate data with complex structures, such as images, texts, and audio. These methods generate fake data from the trained DGM and publish it instead of the raw data to respect users' privacy. However, as empirically observed by Takagi et al. [5], these DGM-based methods often suffer from training instability (e.g., mode collapse) and high computational costs, leading to low utility, which is defined as the usefulness of the private data. For example, in the case of classification datasets, utility can be measured by classification accuracy. DPMix -- a data publishing technique proposed by Lee et al. [6] -- does not rely on training deep generative models and has the potential to improve utility. DPMix, as opposed to DGM-based methods, directly adds noise to the raw dataset -- thereby accounting for users' privacy -- and publishes the noisy version of the dataset. Concretely, inspired by Zhang et al. [7], DPMix first mixes the data points by averaging groups of raw data (with group size m), then adds noise to each individual mixture of data points to respect privacy concerns, and finally publishes the noisy mixtures.
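The mix-then-noise release described above can be sketched in a few lines. The following is a minimal illustration, assuming a simple Gaussian mechanism; the function name, parameters, and noise calibration are illustrative and do not reproduce the paper's exact privacy accounting:

```python
import numpy as np

def dpmix_release(data, m, noise_scale, seed=None):
    """Sketch of a DPMix-style release: average disjoint groups of m raw
    records, then add Gaussian noise to each mixture before publishing.
    (Illustrative only; noise_scale is not calibrated to a specific
    differential-privacy budget here.)"""
    rng = np.random.default_rng(seed)
    n, d = data.shape
    # Shuffle and drop any remainder so every group has exactly m records.
    idx = rng.permutation(n)[: (n // m) * m]
    groups = data[idx].reshape(-1, m, d)
    mixtures = groups.mean(axis=1)  # one averaged record per group of m
    # Perturb each mixture; only the noisy mixtures are released.
    noisy = mixtures + rng.normal(scale=noise_scale, size=mixtures.shape)
    return noisy

# Example: 100 raw records of dimension 8, mixed in groups of m=4
# yields 25 noisy records to publish.
raw = np.random.default_rng(0).normal(size=(100, 8))
released = dpmix_release(raw, m=4, noise_scale=0.1, seed=0)
print(released.shape)  # (25, 8)
```

Note that the released dataset is m times smaller than the raw one, which is part of the utility trade-off the paper studies.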
Dec-5-2023