BugGen: A Self-Correcting Multi-Agent LLM Pipeline for Realistic RTL Bug Synthesis

Jasper, Surya; Luu, Minh; Pan, Evan; Tyagi, Aakash; Quinn, Michael; Hu, Jiang; Houngninou, David Kebo

arXiv.org Artificial Intelligence 

Hardware complexity continues to strain verification resources, motivating the adoption of machine learning (ML) methods to improve debug efficiency. However, ML-assisted debugging critically depends on diverse and scalable bug datasets, which existing manual or automated bug insertion methods fail to reliably produce. We introduce BugGen, a first-of-its-kind, fully autonomous, multi-agent pipeline leveraging Large Language Models (LLMs) to systematically generate, insert, and validate realistic functional bugs in RTL. BugGen partitions modules, selects mutation targets via a closed-loop agentic architecture, and employs iterative refinement and rollback mechanisms to ensure syntactic correctness and functional detectability. Evaluated across five OpenTitan IP blocks, BugGen produced 500 unique bugs with 94% functional accuracy and achieved a throughput of 17.7 validated bugs per hour, over five times faster than typical manual expert insertion. Additionally, BugGen identified 104 previously undetected bugs in OpenTitan regressions, highlighting its utility in exposing verification coverage gaps. Compared against Certitude, BugGen demonstrated over twice the syntactic accuracy, deeper exposure of testbench blind spots, and more functionally meaningful and complex bug scenarios. Furthermore, when these BugGen-generated datasets were employed to train ML-based failure triage models, we achieved high classification accuracy (88.1%-93.2%). BugGen thus provides a scalable solution for generating high-quality bug datasets, significantly enhancing verification efficiency and ML-assisted debugging.

Modern hardware systems continue to grow in complexity, often incorporating billions of transistors on a single chip. This increased complexity significantly expands the scope and difficulty of design verification tasks. Verification efforts already consume more than half of the total hardware development time [1], with this fraction steadily increasing each year.
The heightened complexity also results in large volumes of test failures, particularly during early front-end development and initial volume regressions, placing severe demands on debugging resources to pinpoint root causes or effectively triage these failures.
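
The abstract describes iterative refinement with rollback to ensure syntactic correctness and functional detectability. A minimal Python sketch of one way such a loop could be structured follows. It is not BugGen's implementation: the function name `insert_bug_with_rollback`, the callbacks `propose`, `compiles`, and `detected`, the feedback strings, and the `max_attempts` limit are all hypothetical stand-ins for an LLM mutation agent, a syntax check, and a regression run.

```python
from typing import Callable, Optional


def insert_bug_with_rollback(
    golden: str,
    propose: Callable[[str, Optional[str]], str],
    compiles: Callable[[str], bool],
    detected: Callable[[str], bool],
    max_attempts: int = 3,
) -> Optional[str]:
    """Return a validated mutant of `golden`, or None if every attempt fails.

    All callbacks are hypothetical placeholders:
      propose(source, feedback) -> mutated source text
      compiles(source)          -> True if the mutant passes a syntax check
      detected(source)          -> True if some test fails on the mutant
    """
    feedback: Optional[str] = None
    for _ in range(max_attempts):
        # Rollback: every attempt starts again from the unmodified golden
        # source, so a rejected mutation never leaks into the next try.
        candidate = propose(golden, feedback)

        if not compiles(candidate):
            # Refinement: pass the failure reason to the next proposal.
            feedback = "syntax check failed"
            continue

        if not detected(candidate):
            feedback = "no test failed on the mutant"
            continue

        return candidate  # syntactically valid and functionally detectable
    return None
```

In this sketch, a mutant that compiles but is not detected is simply retried. A real flow might instead record such mutants separately, since the abstract reports that undetected bugs helped expose verification coverage gaps.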
