MalDataGen: A Modular Framework for Synthetic Tabular Data Generation in Malware Detection

Kayua Oleques Paim, Angelo Gaspar Diniz Nogueira, Diego Kreutz, Weverton Cordeiro, Rodrigo Brandao Mansilha

arXiv.org Artificial Intelligence 

The scarcity of high-quality data hinders malware detection and limits the performance of machine learning (ML) models. We introduce MalDataGen, an open-source modular framework for generating high-fidelity synthetic tabular data using deep learning models (e.g., WGAN-GP, VQ-VAE). Evaluated via dual validation (TR-TS/TS-TR), seven classifiers, and utility metrics, MalDataGen outperforms benchmarks such as SDV while preserving data utility. Its flexible design enables seamless integration into detection pipelines, offering a practical solution for cybersecurity applications.

I. Introduction

Modern machine learning algorithms, particularly deep learning architectures, depend on large-scale datasets with reliable annotations to achieve optimal performance.