Diffusion-nested Auto-Regressive Synthesis of Heterogeneous Tabular Data

Hengrui Zhang, Liancheng Fang, Qitian Wu, Philip S. Yu

arXiv.org Artificial Intelligence 

Autoregressive models are predominant in natural language generation, yet their application to tabular data remains underexplored. We posit that this can be attributed to two factors: 1) tabular data contains heterogeneous data types, while autoregressive models are primarily designed to model discrete-valued data; 2) tabular data is column permutation-invariant, requiring a generative model to generate columns in arbitrary order. To address these issues, we propose a Diffusion-nested Auto-Regressive model (DAR). DAR employs a diffusion model to parameterize the conditional distribution of continuous features. In addition, DAR resorts to masked transformers with bi-directional attention, which simulate various permutations of column order, enabling it to learn the conditional distribution of a target column given an arbitrary combination of other columns. These designs enable DAR not only to freely handle heterogeneous tabular data but also to support convenient and flexible unconditional/conditional sampling. DAR outperforms previous state-of-the-art methods by 18% to 45% on eight metrics across three distinct aspects. The code is available at https://github.com/fangliancheng/TabDAR.

Figure 1: Challenges in Auto-Regressive tabular data generation.

Due to the widespread application of synthetic tabular data in real-world scenarios, such as data augmentation, privacy protection, and missing value prediction (Fonseca & Bacao, 2023; Assefa et al., 2021; Hernandez et al., 2022), an increasing number of studies have begun to focus on deep generative models for synthetic tabular data generation. In this domain, various approaches, including Variational Autoencoders (VAEs) (Liu et al., 2023), Generative Adversarial Networks (GANs) (Xu et al., 2019), Diffusion Models (Zhang et al., 2024b), and even Large Language Models (LLMs) (Borisov et al., 2023), have demonstrated significant progress.
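The column-permutation idea described in the abstract — training on randomly ordered columns so the model learns the conditional of any target column given any subset of observed columns — can be sketched as follows. This is a simplified illustration under our own assumptions, not the authors' implementation; `sample_permutation_mask` is a hypothetical helper name.

```python
import random

def sample_permutation_mask(num_cols, rng=random):
    """Simulate one autoregressive ordering over table columns.

    Draws a random permutation of the column indices, picks a random
    prefix as the 'observed' set, and uses the next column in the
    permutation as the prediction target. Training over many such draws
    exposes the model to arbitrary (observed-subset, target) pairs,
    which is the effect bi-directional masked attention achieves.
    """
    order = list(range(num_cols))
    rng.shuffle(order)
    k = rng.randrange(num_cols)        # size of the observed prefix (may be 0)
    observed, target = order[:k], order[k]
    return observed, target

# In a full model (sketch only): the transformer would attend to the
# columns in `observed`, and the head for `target` would be either a
# categorical distribution (discrete column) or a nested diffusion
# model conditioned on the transformer output (continuous column).
```

A usage note: calling this once per row per training step approximates uniform coverage of all column orderings, which is what permits sampling columns in any order at generation time.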