dafny


A benchmark for vericoding: formally verified program synthesis

Bursuc, Sergiu, Ehrenborg, Theodore, Lin, Shaowei, Astefanoaei, Lacramioara, Chiosa, Ionel Emilian, Kukovec, Jure, Singh, Alok, Butterley, Oliver, Bizid, Adem, Dougherty, Quinn, Zhao, Miranda, Tan, Max, Tegmark, Max

arXiv.org Artificial Intelligence

We present and test the largest benchmark for vericoding -- LLM generation of formally verified code from formal specifications, in contrast to vibe coding, which generates potentially buggy code from a natural-language description. Our benchmark contains 12,504 formal specifications: 3,029 in Dafny, 2,334 in Verus/Rust and 7,141 in Lean. Of these, 6,174 are new, unseen problems. We find vericoding success rates of 27% in Lean, 44% in Verus/Rust and 82% in Dafny using off-the-shelf LLMs. Adding natural-language descriptions does not significantly improve performance. We also find that LLM progress has raised success on pure Dafny verification from 68% to 96% over the past year. The benchmark and vericoding results are shared in this GitHub repo.

Rapid AI progress has popularized vibe coding, which generates computer programs from natural language descriptions. For example, Google has reported that over 30% of its software is created this way (Google Earnings Call). Unfortunately, the resulting code can be buggy, and traditional bug hunting with test cases can typically demonstrate only the presence of bugs, not their absence, since there are too many inputs to try them all. For example, major code-testing efforts failed to prevent the bugs that caused the Ariane 5 rocket explosion (Ariane 5 Failure) and an embarrassing security vulnerability in the Bash shell (Shellshock Bug) that sat in Unix systems for 25 years before being discovered. The 2024 CrowdStrike outage disrupted 8.5 million devices globally, harming airlines, hospitals, banking, broadcasting and emergency services (CrowdStrike Outage). Fortunately, rigorous correctness guarantees can be obtained via formal verification: generating a machine-checkable proof that code meets its human-written specifications.
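As a minimal sketch (our own illustration, not a problem from the benchmark) of what a vericoding task looks like: the human-written formal specification is the method signature with its `requires`/`ensures` clauses, and the LLM must synthesize a body, including any loop invariants, that the Dafny verifier accepts.

```dafny
// Specification (given): find the maximum element of a non-empty array.
// Body and invariants (to be synthesized by the LLM) shown below.
method MaxOfArray(a: array<int>) returns (m: int)
  requires a.Length > 0
  ensures forall i :: 0 <= i < a.Length ==> a[i] <= m
  ensures exists i :: 0 <= i < a.Length && a[i] == m
{
  m := a[0];
  var j := 1;
  while j < a.Length
    invariant 1 <= j <= a.Length
    invariant forall i :: 0 <= i < j ==> a[i] <= m
    invariant exists i :: 0 <= i < j && a[i] == m
  {
    if a[j] > m { m := a[j]; }
    j := j + 1;
  }
}
```

A submission counts as a success only if the verifier proves the body against the unchanged specification, which is what makes the benchmark's pass rates machine-checkable rather than test-based.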


Formal Verification of Minimax Algorithms

Wesselink, Wieger, Huizing, Kees, van de Wetering, Huub

arXiv.org Artificial Intelligence

Using the Dafny verification system, we formally verify a range of minimax search algorithms, including variations with alpha-beta pruning and transposition tables. For depth-limited search with transposition tables, we introduce a witness-based correctness criterion and apply it to two representative algorithms. All verification artifacts, including proofs and Python implementations, are publicly available.


A Formally Verified Robustness Certifier for Neural Networks (Extended Version)

Tobler, James, Syeda, Hira Taqdees, Murray, Toby

arXiv.org Artificial Intelligence

Neural networks are often susceptible to minor perturbations in input that cause them to misclassify. A recent solution to this problem is the use of globally-robust neural networks, which employ a function to certify that the classification of an input cannot be altered by such a perturbation. Outputs that pass this test are called certified robust. However, to the authors' knowledge, these certification functions have not yet been verified at the implementation level. We demonstrate how previous unverified implementations are exploitably unsound in certain circumstances. Moreover, they often rely on approximation-based algorithms, such as power iteration, that (perhaps surprisingly) do not guarantee soundness. To provide assurance that a given output is robust, we implemented and formally verified a certification function for globally-robust neural networks in Dafny. We describe the program, its specifications, and the important design decisions taken for its implementation and verification, as well as our experience applying it in practice.


Dafny as Verification-Aware Intermediate Language for Code Generation

Li, Yue Chen, Zetzsche, Stefan, Somayyajula, Siva

arXiv.org Artificial Intelligence

Using large language models (LLMs) to generate source code from natural language prompts is a popular and promising idea with a wide range of applications. One of its limitations is that the generated code can be faulty at times, often in a subtle way, despite being presented to the user as correct. In this paper, we explore ways in which formal methods can assist with increasing the quality of code generated by an LLM. Instead of emitting code in a target language directly, we propose that the user guides the LLM to first generate an opaque intermediate representation, in the verification-aware language Dafny, that can be automatically validated for correctness against agreed-on specifications. The correct Dafny program is then compiled to the target language. We propose to utilise Dafny as an intermediate technology within a code-generating chatbot prototype, on the way to higher-quality mainstream-language code. Dafny is particularly well-suited for this purpose because it allows for a clear distinction between the specification of a program -- derived from a natural language prompt using an LLM -- and its implementation, which can again be dialogically derived using an LLM with verification feedback in the loop.


From Scientific Texts to Verifiable Code: Automating the Process with Transformers

Wang, Changjie, Scazzariello, Mariano, Chiesa, Marco

arXiv.org Artificial Intelligence

Despite the vast body of research literature proposing algorithms with formal guarantees, the amount of verifiable code in today's systems remains minimal. This discrepancy stems from the inherent difficulty of verifying code, particularly due to the time-consuming nature and strict formalism of proof details that formal verification tools require. However, the emergence of transformers in Large Language Models presents a promising solution to this challenge. In this position paper, we believe that transformers have the potential to read research papers that propose algorithms with formal proofs and translate these proofs into verifiable code. We leverage transformers to first build a formal structure of the proof using the original text from the paper, and then to handle the tedious, low-level aspects of proofs that are often omitted by humans. We argue that this approach can significantly reduce the barrier to formal verification. The above idea of reading papers to write verifiable code opens new avenues for automating the verification of complex systems, enabling a future where formally verified algorithms from academic research can more seamlessly transition into real-world software systems, thereby improving code reliability and security.


dafny-annotator: AI-Assisted Verification of Dafny Programs

Poesia, Gabriel, Loughridge, Chloe, Amin, Nada

arXiv.org Artificial Intelligence

Formal verification has the potential to drastically reduce software bugs, but its high additional cost has hindered large-scale adoption. While Dafny presents a promise to significantly reduce the effort to write verified programs, users are often required to provide logical annotations to aid the verifier. Here, we explore using a combination of Large Language Models and search to build dafny-annotator: a tool that adds logical annotations to a Dafny method until the verifier can prove it correct. On a test set from the DafnyBench collection of programs, greedy search guided by LLaMa 3.1 8B successfully annotates only 15.7% of the methods. Since this data-driven approach is hindered by the lack of large-scale training data, we propose a method for open-ended synthesis of new Dafny programs in a flexible pipeline where LLMs formulate high-level ideas, implement them, and incrementally propose changes to existing programs, which Dafny validates. This gives us a synthetic dataset, DafnySynth, which we use to augment DafnyBench for training. Fine-tuning on both datasets boosts LLaMa 8B's success rate to 50.6% -- significantly better than the base model, or training on either dataset alone. Our results suggest a path towards capable AI assistants for languages that don't yet have large-scale human-generated examples. In turn, such assistants might reduce friction for users and ultimately drive adoption.
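To make the annotation task concrete, here is a hypothetical example of the kind of method dafny-annotator targets (our own sketch, not from the paper): the body is already written, but Dafny cannot prove the postcondition until the marked annotations are inserted.

```dafny
// Recursive specification of the sum of a sequence.
function Sum(a: seq<int>): int {
  if |a| == 0 then 0 else Sum(a[..|a|-1]) + a[|a|-1]
}

method SumArray(a: array<int>) returns (s: int)
  ensures s == Sum(a[..])
{
  s := 0;
  var i := 0;
  while i < a.Length
    invariant 0 <= i <= a.Length    // annotation added by the tool
    invariant s == Sum(a[..i])      // annotation added by the tool
  {
    assert a[..i+1][..i] == a[..i]; // hint annotation for the verifier
    s := s + a[i];
    i := i + 1;
  }
}
```

The search described in the paper proposes candidate annotations like these one at a time and keeps the ones that move the verifier closer to accepting the method.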


Assured Automatic Programming via Large Language Models

Mirchev, Martin, Costea, Andreea, Singh, Abhishek Kr, Roychoudhury, Abhik

arXiv.org Artificial Intelligence

With the advent of AI-based coding engines, it is possible to convert natural language requirements to executable code in standard programming languages. However, AI-generated code can be unreliable, and the natural language requirements driving this code may be ambiguous. In other words, the intent may not be accurately captured in the code generated from AI-coding engines like Copilot. The goal of our work is to discover the programmer intent, while generating code which conforms to the intent and a proof of this conformance. Our approach to intent discovery is powered by a novel repair engine called program-proof co-evolution, where the object of repair is a tuple (code, logical specification, test) generated by an LLM from the same natural language description. The program and the specification capture the initial operational and declarative description of intent, while the test represents a concrete, albeit partial, understanding of the intent. Our objective is to achieve consistency between the program, the specification, and the test by incrementally refining our understanding of the user intent. Reaching consistency through this repair process provides us with a formal, logical description of the intent, which is then translated back into natural language for the developer's inspection. The resultant intent description is now unambiguous, though expressed in natural language. We demonstrate how the unambiguous intent discovered through our approach increases the percentage of verifiable auto-generated programs on a recently proposed dataset in the Dafny programming language.


Evaluating LLM-driven User-Intent Formalization for Verification-Aware Languages

Lahiri, Shuvendu K.

arXiv.org Artificial Intelligence

Verification-aware programming languages such as Dafny and F* provide means to formally specify and prove properties of programs. Although the problem of checking an implementation against a specification can be defined mechanically, there is no algorithmic way of ensuring the correctness of the user-intent formalization for programs -- that a specification adheres to the user's intent behind the program. The intent or requirement is expressed informally in natural language, while the specification is a formal artefact. The advent of large language models (LLMs) has recently made strides in bridging the gap between informal intent and formal program implementations, driven in large part by benchmarks and automated metrics for evaluation. Recent work has proposed evaluating the user-intent formalization problem for mainstream programming languages [endres-fse24]. However, such an approach does not readily extend to verification-aware languages that support rich specifications (containing quantifiers and ghost variables) that cannot be evaluated through dynamic execution. Previous work also required generating program mutants using LLMs to create the benchmark. We advocate an alternative approach of symbolically testing specifications to provide an intuitive metric for evaluating the quality of specifications for verification-aware languages. We demonstrate that our automated metric agrees closely with a mostly GPT-4-generated, human-labeled dataset of roughly 150 Dafny specifications for the popular MBPP code-generation benchmark, yet uncovers cases where the human labeling is not perfect. We believe our work provides a stepping stone toward establishing a benchmark and research agenda for the problem of user-intent formalization for programs.
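As an illustration of the general idea (our own hypothetical example, not drawn from the paper's dataset), a specification can be "tested" symbolically by asking the verifier to confirm that it accepts an intended input/output pair and rejects a wrong one:

```dafny
// Candidate specification for a sorting routine.
predicate SortedSpec(input: seq<int>, output: seq<int>) {
  multiset(output) == multiset(input) &&
  forall i, j :: 0 <= i < j < |output| ==> output[i] <= output[j]
}

// Symbolic "tests" of the specification: the verifier must prove that the
// spec admits the correct output and rules out an incorrect one.
lemma TestSortedSpec()
  ensures SortedSpec([3, 1, 2], [1, 2, 3])
  ensures !SortedSpec([3, 1, 2], [1, 2, 2])   // wrong multiset: rejected
{
}
```

A spec that passed the first check but also accepted the second would be too weak, signaling a mismatch with the user's intent even though it is formally well-defined.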


VerMCTS: Synthesizing Multi-Step Programs using a Verifier, a Large Language Model, and Tree Search

Brandfonbrener, David, Henniger, Simon, Raja, Sibi, Prasad, Tarun, Loughridge, Chloe, Cassano, Federico, Hu, Sabrina Ruixin, Yang, Jianang, Byrd, William E., Zinkov, Robert, Amin, Nada

arXiv.org Artificial Intelligence

Large Language Models (LLMs) can generate useful code, but often the code they generate cannot be trusted to be sound. In this paper, we present VerMCTS, an approach to begin to resolve this issue by generating verified programs in Dafny and Coq. VerMCTS uses a logical verifier in concert with an LLM to guide a modified Monte Carlo Tree Search (MCTS). This approach leverages the verifier to gain intermediate feedback inside the search algorithm by checking partial programs at each step to estimate an upper bound on the value function. To measure the performance of VerMCTS, we develop a new suite of multi-step verified programming problems in Dafny and Coq. In terms of pass@T, a new metric which computes the pass rate given a budget of T tokens sampled from the LLM, VerMCTS leads to more than a 30% absolute increase in average pass@5000 across the suite over repeated sampling from the base language model.


Clover: Closed-Loop Verifiable Code Generation

Sun, Chuyue, Sheng, Ying, Padon, Oded, Barrett, Clark

arXiv.org Artificial Intelligence

The use of large language models for code generation is a rapidly growing trend in software development. However, without effective methods for ensuring the correctness of generated code, this trend could lead to any number of undesirable outcomes. In this paper, we lay out a vision for addressing this challenge: the Clover paradigm, short for Closed-Loop Verifiable Code Generation, which reduces correctness checking to the more accessible problem of consistency checking. At the core of Clover lies a checker that performs consistency checks among code, docstrings, and formal annotations. The checker is implemented using a novel integration of formal verification tools and large language models. We provide a theoretical analysis to support our thesis that Clover should be effective at consistency checking. We also empirically investigate its feasibility on a hand-designed dataset (CloverBench) featuring annotated Dafny programs at a textbook level of difficulty. Experimental results show that for this dataset, (i) LLMs are reasonably successful at automatically generating formal specifications; and (ii) our consistency checker achieves a promising acceptance rate (up to 87%) for correct instances while maintaining zero tolerance for incorrect ones (no false positives).
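A hypothetical instance of the triple Clover checks (our own sketch, not from CloverBench): the checker tests pairwise consistency among (i) the natural-language docstring, (ii) the formal annotations, and (iii) the code body.

```dafny
// Docstring: "Returns the absolute value of x."
method Abs(x: int) returns (y: int)
  ensures y >= 0               // formal annotations: the result is
  ensures y == x || y == -x    // non-negative and is x or -x
{
  if x < 0 { y := -x; } else { y := x; }
}
```

Here code-vs-annotation consistency is discharged by the Dafny verifier, while docstring-vs-annotation and docstring-vs-code consistency are checked with LLM assistance; an instance is accepted only when all checks pass.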