SeedAIchemy: LLM-Driven Seed Corpus Generation for Fuzzing

Aidan Wen, Norah A. Alzahrani, Jingzhi Jiang, Andrew Joe, Karen Shieh, Andy Zhang, Basel Alomair, David Wagner

arXiv.org Artificial Intelligence 

Abstract--We introduce SeedAIchemy, an automated LLM-driven corpus generation tool that makes it easier for developers to implement fuzzing effectively. SeedAIchemy consists of five modules, each implementing a different approach to collecting publicly available files from the internet. Four of the five modules use large language model (LLM) workflows to construct search terms designed to maximize corpus quality. Corpora generated by SeedAIchemy perform significantly better than a naive corpus and similarly to a manually curated corpus on a diverse range of target programs and libraries.

Fuzz testing is a widely used method for improving software security. One of its attractions is that it is relatively easy to adopt. However, one road bump in adopting fuzz testing is that, to be most effective, it requires developers to provide a corpus of seed files. Ideally, these seed files would include many tricky cases and difficult inputs, and would ensure good branch coverage of the targets. Constructing such a corpus can be difficult for developers who are new to fuzz testing or who do not have a strong security background.