Data and Context Matter: Towards Generalizing AI-based Software Vulnerability Detection

Rijha Safdar, Danyail Mateen, Syed Taha Ali, M. Umer Ashfaq, Wajahat Hussain

arXiv.org Artificial Intelligence 

Abstract: AI-based solutions demonstrate remarkable results in identifying vulnerabilities in software, but research has consistently found that this performance does not generalize to unseen codebases. In this paper, we specifically investigate the impact of model architecture, parameter configuration, and quality of training data on the ability of these systems to generalize. For this purpose, we introduce VulGate, a high-quality, state-of-the-art dataset that mitigates the shortcomings of prior datasets by removing mislabeled and duplicate samples, adding newly reported vulnerabilities, incorporating additional metadata, integrating hard samples, and including dedicated test sets. We undertake a series of experiments to demonstrate that improved dataset diversity and quality substantially enhance vulnerability detection. We also introduce and benchmark multiple encoder-only and decoder-only models. We find that encoder-based models outperform other models in terms of accuracy and generalization. Our model achieves a 6.8% improvement in recall on the benchmark BigVul dataset and outperforms others on unseen projects, demonstrating enhanced generalizability. Our results highlight the role of data quality and model selection in the development of robust vulnerability detection systems. Our findings suggest a direction for future systems with high cross-project effectiveness.

With the rapid growth in digitization and in software applications and systems in recent years, software vulnerabilities have become a critical concern. In 2024, a record-breaking 40,000 Common Vulnerabilities and Exposures (CVEs) were published, an average of 108 per day, marking a 38% increase over 2023 (with 28,818 CVEs) [1]. The rate continues to climb: the first half of 2025 has witnessed an average of 131 CVEs per day [2].
In the open-source software ecosystem, which underpins a wide range of industries, including finance, energy, aerospace, and healthcare, a recent study found a surge of 98% per year in reported vulnerabilities [3].

R. Safdar, S.T. Ali and W. Hussain are with the School of Electrical Engineering and Computer Science, National University of Sciences and Technology, Islamabad, Pakistan, 44000.