tabular generative model
Synth-MIA: A Testbed for Auditing Privacy Leakage in Tabular Data Synthesis
Ward, Joshua, Lin, Xiaofeng, Wang, Chi-Hua, Cheng, Guang
Tabular Generative Models are often argued to preserve privacy by creating synthetic datasets that resemble training data. However, auditing their empirical privacy remains challenging, as commonly used similarity metrics fail to effectively characterize privacy risk. Membership Inference Attacks (MIAs) have recently emerged as a method for evaluating privacy leakage in synthetic data, but their practical effectiveness is limited. Numerous attacks exist across different threat models, each with distinct implementations targeting various sources of privacy leakage, making them difficult to apply consistently. Moreover, no single attack consistently outperforms the others, leading to a routine underestimation of privacy risk. To address these issues, we propose a unified, model-agnostic threat framework that deploys a collection of attacks to estimate the maximum empirical privacy leakage in synthetic datasets. We introduce Synth-MIA, an open-source Python library that streamlines this auditing process through a novel testbed that integrates seamlessly into existing synthetic data evaluation pipelines through a Scikit-Learn-like API. Our software implements 13 attack methods through a Scikit-Learn-like API, designed to enable fast systematic estimation of privacy leakage for practitioners as well as facilitate the development of new attacks and experiments for researchers. We demonstrate our framework's utility in the largest tabular synthesis privacy benchmark to date, revealing that higher synthetic data quality corresponds to greater privacy leakage, that similarity-based privacy metrics show weak correlation with MIA results, and that the differentially private generator PATEGAN can fail to preserve privacy under such attacks. This underscores the necessity of MIA-based auditing when designing and deploying Tabular Generative Models.
Privacy Auditing Synthetic Data Release through Local Likelihood Attacks
Ward, Joshua, Wang, Chi-Hua, Cheng, Guang
Auditing the privacy leakage of synthetic data is an important but unresolved problem. Most existing privacy auditing frameworks for synthetic data rely on heuristics and unreasonable assumptions to attack the failure modes of generative models, exhibiting limited capability to describe and detect the privacy exposure of training data through synthetic data release. In this paper, we study designing Membership Inference Attacks (MIAs) that specifically exploit the observation that tabular generative models tend to significantly overfit to certain regions of the training distribution. Here, we propose Generative Likelihood Ratio Attack (Gen-LRA), a novel, computationally efficient No-Box MIA that, with no assumption of model knowledge or access, formulates its attack by evaluating the influence a test observation has in a surrogate model's estimation of a local likelihood ratio over the synthetic data. Assessed over a comprehensive benchmark spanning diverse datasets, model architectures, and attack parameters, we find that Gen-LRA consistently dominates other MIAs for generative models across multiple performance metrics. These results underscore Gen-LRA's effectiveness as a privacy auditing tool for the release of synthetic data, highlighting the significant privacy risks posed by generative model overfitting in real-world applications.
How Well Does Your Tabular Generator Learn the Structure of Tabular Data?
Jiang, Xiangjian, Simidjievski, Nikola, Jamnik, Mateja
Heterogeneous tabular data poses unique challenges in generative modelling due to its fundamentally different underlying data structure compared to homogeneous modalities, such as images and text. Although previous research has sought to adapt the successes of generative modelling in homogeneous modalities to the tabular domain, defining an effective generator for tabular data remains an open problem. One major reason is that the evaluation criteria inherited from other modalities often fail to adequately assess whether tabular generative models effectively capture or utilise the unique structural information encoded in tabular data. In this paper, we carefully examine the limitations of the prevailing evaluation framework and introduce $\textbf{TabStruct}$, a novel evaluation benchmark that positions structural fidelity as a core evaluation dimension. Specifically, TabStruct evaluates the alignment of causal structures in real and synthetic data, providing a direct measure of how effectively tabular generative models learn the structure of tabular data. Through extensive experiments using generators from eight categories on seven datasets with expert-validated causal graphical structures, we show that structural fidelity offers a task-independent, domain-agnostic evaluation dimension. Our findings highlight the importance of tabular data structure and offer practical guidance for developing more effective and robust tabular generative models. Code is available at https://github.com/SilenceX12138/TabStruct.
Data Plagiarism Index: Characterizing the Privacy Risk of Data-Copying in Tabular Generative Models
Ward, Joshua, Wang, Chi-Hua, Cheng, Guang
The promise of tabular generative models is to produce realistic synthetic data that can be shared and safely used without dangerous leakage of information from the training set. In evaluating these models, a variety of methods have been proposed to measure the tendency to copy data from the training dataset when generating a sample. However, these methods suffer from either not considering data-copying from a privacy threat perspective, not being motivated by recent results in the data-copying literature or being difficult to make compatible with the high dimensional, mixed type nature of tabular data. This paper proposes a new similarity metric and Membership Inference Attack called Data Plagiarism Index (DPI) for tabular data. We show that DPI evaluates a new intuitive definition of data-copying and characterizes the corresponding privacy risk. We show that the data-copying identified by DPI poses both privacy and fairness threats to common, high performing architectures; underscoring the necessity for more sophisticated generative modeling techniques to mitigate this issue.