A benchmark for vericoding: formally verified program synthesis

Bursuc, Sergiu, Ehrenborg, Theodore, Lin, Shaowei, Astefanoaei, Lacramioara, Chiosa, Ionel Emilian, Kukovec, Jure, Singh, Alok, Butterley, Oliver, Bizid, Adem, Dougherty, Quinn, Zhao, Miranda, Tan, Max, Tegmark, Max

Sep-30-2025–arXiv.org Artificial Intelligence

We present and test the largest benchmark for vericoding, LLM-generation of formally verified code from formal specifications -- in contrast to vibe coding, which generates potentially buggy code from a natural language description. Our benchmark contains 12,504 formal specifications, with 3,029 in Dafny, 2,334 in V erus/Rust and 7,141 in Lean. Of these, 6,174 are new unseen problems. We find vericoding success rates of 27% in Lean, 44% in V erus/Rust and 82% in Dafny using off-the-shelf LLMs. Adding natural-language descriptions does not significantly improve performance. We also find that LLM progress has improved progress on pure Dafny verification from 68% to 96% over the past year. The benchmark and vericoding results are shared in this GitHub repo. Rapid AI progress has popularized vibe coding, which generates computer programs from natural language descriptions. For example, Google has reported that over 30% of its software is created this way (Google Earnings Call). Unfortunately, the resulting code can be buggy, and traditional bug hunting with test cases can typically only demonstrate the presence and not the absence of bugs, since there are too many test cases to try them all. For example, major code-testing efforts failed to prevent bugs causing an Ariane-V rocket explosion (Ariane 5 Failure) and an embarrassing security vulnerability in the Bash shell (Shellshock Bug) that was built into the Unix operating system for 25 years before being discovered. The 2024 CrowdStrike outage disrupted 8.5 million devices globally, harming airlines, hospitals, banking, broadcasting, emergency services (CrowdStrike Outage). Fortunately, rigorous correctness guarantees can be created via formal verification, by generating a machine-checkable proof that code meets its human-written specifications.

large language model, logic & formal reasoning, machine learning, (20 more...)

arXiv.org Artificial Intelligence

Sep-30-2025

arXiv.org PDF

Add feedback

Country:
- Africa > Senegal
  - Dakar Region > Dakar (0.04)
- Asia > China
  - Hong Kong (0.04)
- Europe
  - France > Île-de-France
    - Paris > Paris (0.04)
  - Switzerland (0.04)
- North America > United States
  - California > San Francisco County
    - San Francisco (0.14)
  - Massachusetts (0.04)
  - Texas > Travis County
    - Austin (0.04)

Genre:
- Research Report > New Finding (0.46)

Industry:
- Information Technology > Security & Privacy (1.00)

Technology:
- Information Technology > Artificial Intelligence
  - Machine Learning > Neural Networks
    - Deep Learning (0.68)
  - Natural Language > Large Language Model (1.00)
  - Representation & Reasoning > Logic & Formal Reasoning (1.00)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found