TabDiff: a Multi-Modal Diffusion Model for Tabular Data Generation
Shi, Juntong, Xu, Minkai, Hua, Harper, Zhang, Hengrui, Ermon, Stefano, Leskovec, Jure
arXiv.org Artificial Intelligence
Synthesizing high-quality tabular data is an important topic in many data science tasks, ranging from dataset augmentation to privacy protection. However, developing expressive generative models for tabular data is challenging because of its heterogeneous data types, complex inter-correlations, and intricate column-wise distributions. Our key innovation is the development of a joint continuous-time diffusion process for numerical and categorical data, where we propose feature-wise learnable diffusion processes to counter the high disparity of different feature distributions. We further introduce a multi-modal stochastic sampler to automatically correct the accumulated decoding error during sampling, and propose classifier-free guidance for conditional missing column value imputation. Code is available at https://github.com/MinkaiXu/TabDiff.

Tabular data is ubiquitous in various databases, and developing effective generative models for it is a fundamental problem in many data processing and analysis tasks, ranging from training data augmentation (Fonseca & Bacao, 2023) and data privacy protection (Assefa et al., 2021; Hernandez et al., 2022) to missing value imputation (You et al., 2020; Zheng & Charoenphakdee, 2022). With versatile synthetic tabular data that share the same format and statistical properties as the existing dataset, we can completely replace real data in a workflow, or supplement it to enhance its utility, making the data easier to share and use. The ability to anonymize data and enlarge sample sizes without compromising overall data quality gives synthetic data the potential to revolutionize the field of data science.
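To make the idea of a joint diffusion process over numerical and categorical columns more concrete, the following is a minimal sketch of a forward (noising) step on a single table row. It is not taken from the paper or its repository: the text above does not specify the corruption kernels or schedules, so the Gaussian noise for numerical columns, the masking for categorical columns, the power-law schedules, and all names (`noise_row`, `MASK`, `num_rates`, `cat_rates`) are illustrative assumptions. The per-column exponents stand in for the "feature-wise learnable" parameters, which in a real model would be trained rather than fixed.

```python
import random

MASK = "<mask>"  # hypothetical placeholder for a corrupted categorical value


def noise_row(num_values, cat_values, t, num_rates, cat_rates, rng):
    """Corrupt one table row at diffusion time t in [0, 1].

    num_rates and cat_rates hold one assumed schedule exponent per column;
    a learnable variant would treat them as trained parameters.
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    # Numerical columns: add Gaussian noise with per-column scale t ** rho.
    noisy_num = [
        x + (t ** rho) * rng.gauss(0.0, 1.0)
        for x, rho in zip(num_values, num_rates)
    ]
    # Categorical columns: mask each value with per-column probability t ** k.
    noisy_cat = [
        MASK if rng.random() < t ** k else c
        for c, k in zip(cat_values, cat_rates)
    ]
    return noisy_num, noisy_cat


if __name__ == "__main__":
    rng = random.Random(0)
    row_num, row_cat = [35.0, 52000.0], ["married", "engineer"]
    for t in (0.0, 0.5, 1.0):
        print(t, noise_row(row_num, row_cat, t, [1.0, 2.0], [1.0, 3.0], rng))
```

Under these assumptions, a row is untouched at `t = 0` and every categorical value is masked at `t = 1`, while a column with a larger exponent is corrupted more slowly at intermediate times. That is one simple way different columns could follow different noise schedules within a single shared time variable.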
Oct-29-2024
- Country:
  - North America > United States > California (0.28)
- Genre:
  - Research Report > New Finding (0.46)
- Industry:
  - Information Technology > Security & Privacy (1.00)
- Technology:
  - Information Technology
    - Artificial Intelligence > Machine Learning
      - Neural Networks (0.67)
      - Performance Analysis > Accuracy (0.46)
      - Statistical Learning > Regression (0.46)
    - Data Science (1.00)
    - Security & Privacy (1.00)