Synthcity: a benchmark framework for diverse use cases of tabular synthetic data

Neural Information Processing Systems 

Accessible high-quality data is the bread and butter of machine learning research, and the demand for data has exploded as larger and more advanced ML models are built across different domains. Yet real data often contain sensitive information, are subject to various biases, and are costly to acquire, all of which compromise their quality and accessibility. Synthetic data have thus emerged as a complement to, and sometimes even a replacement for, real data in ML training. However, the landscape of synthetic data research has been fragmented due to the large number of data modalities (e.g., tabular data, time series data, images) and use cases (e.g., privacy, fairness, data augmentation). This poses practical challenges in comparing and selecting synthetic data generators in different problem settings. To this end, we develop Synthcity, an open-source Python library that allows researchers and practitioners to perform one-click benchmarking of synthetic data generators across data modalities and use cases. In addition, Synthcity's plug-in style API makes it easy to incorporate additional data generators into the framework. Beyond benchmarking, it also offers a single access point to a diverse range of cutting-edge data generators. Through examples on tabular data generation and data augmentation, we illustrate the general applicability of Synthcity and the insights one can obtain.
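
The plug-in idea described above can be illustrated with a minimal, self-contained Python sketch. It does not use or reproduce Synthcity's actual API. Every name in it (`GENERATORS`, `register`, `benchmark`, `mean_gap`, and the two toy generators) is hypothetical and chosen only for illustration, and the fidelity score is a deliberately simple stand-in for real evaluation metrics.

```python
import random
import statistics
from typing import Dict, List

Row = Dict[str, float]

# Hypothetical registry mapping a generator name to its class.
GENERATORS: Dict[str, type] = {}


def register(name: str):
    """Class decorator that adds a generator to the registry under `name`."""
    def wrap(cls: type) -> type:
        GENERATORS[name] = cls
        return cls
    return wrap


@register("bootstrap")
class BootstrapGenerator:
    """Toy baseline: resamples real rows with replacement."""

    def fit(self, rows: List[Row], seed: int = 0) -> "BootstrapGenerator":
        self._rows = list(rows)
        self._rng = random.Random(seed)
        return self

    def generate(self, count: int) -> List[Row]:
        return [dict(self._rng.choice(self._rows)) for _ in range(count)]


@register("gaussian")
class IndependentGaussianGenerator:
    """Toy baseline: samples each column independently from a fitted normal."""

    def fit(self, rows: List[Row], seed: int = 0) -> "IndependentGaussianGenerator":
        self._rng = random.Random(seed)
        self._stats = {
            col: (
                statistics.mean([r[col] for r in rows]),
                statistics.pstdev([r[col] for r in rows]),
            )
            for col in rows[0]
        }
        return self

    def generate(self, count: int) -> List[Row]:
        return [
            {col: self._rng.gauss(mu, sd) for col, (mu, sd) in self._stats.items()}
            for _ in range(count)
        ]


def mean_gap(real: List[Row], synth: List[Row]) -> float:
    """Toy fidelity score: average absolute gap between column means (lower is better)."""
    cols = list(real[0])
    gaps = [
        abs(statistics.mean([r[c] for r in real]) - statistics.mean([s[c] for s in synth]))
        for c in cols
    ]
    return sum(gaps) / len(cols)


def benchmark(real: List[Row], count: int = 200, seed: int = 0) -> Dict[str, float]:
    """Fit every registered generator on `real` and score its synthetic output."""
    return {
        name: mean_gap(real, cls().fit(real, seed).generate(count))
        for name, cls in GENERATORS.items()
    }


if __name__ == "__main__":
    real_rows = [
        {"age": 30.0, "income": 50.0},
        {"age": 40.0, "income": 70.0},
        {"age": 50.0, "income": 90.0},
    ]
    for name, score in benchmark(real_rows).items():
        print(f"{name}: {score:.3f}")
```

In this sketch, adding a generator only requires defining a class with `fit` and `generate` and decorating it with `register`; the `benchmark` loop picks it up without further changes, which is the property a plug-in style design aims for.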
