precond
VERINA: Benchmarking Verifiable Code Generation
Ye, Zhe, Yan, Zhengxu, He, Jingxuan, Kasriel, Timothe, Yang, Kaiyu, Song, Dawn
Large language models (LLMs) are increasingly integrated in software development, but ensuring correctness in LLM-generated code remains challenging and often requires costly manual review. Verifiable code generation -- jointly generating code, specifications, and proofs of code-specification alignment -- offers a promising path to address this limitation and further unleash LLMs' benefits in coding. Yet, there exists a significant gap in evaluation: current benchmarks often focus on only individual components rather than providing a holistic evaluation framework of all tasks. In this paper, we introduce Verina (Verifiable Code Generation Arena), a high-quality benchmark enabling a comprehensive and modular evaluation of code, specification, and proof generation as well as their compositions. Verina consists of 189 manually curated coding tasks in Lean, with detailed problem descriptions, reference implementations, formal specifications, and extensive test suites. Our extensive evaluation of state-of-the-art LLMs reveals significant challenges in verifiable code generation, especially in proof generation, underscoring the need for improving LLM-based theorem provers in verification domains. The best model, OpenAI o4-mini, achieves a 61.4\% code correctness rate, 51.0\% for specification soundness and completeness, and a mere 3.6\% proof success rate (based on one trial per task). We hope Verina will catalyze progress in verifiable code generation by providing a rigorous and comprehensive benchmark. We release our dataset on https://huggingface.co/datasets/sunblaze-ucb/verina and our evaluation code on https://github.com/sunblaze-ucb/verina.
Planification par fusions incr\'ementales de graphes
Pellier, Damien, Belaidi, lias.
In this paper, we introduce a generic and fresh model for distributed planning called "Distributed Planning Through Graph Merging" ({\sf DPGM}). This model unifies the different steps of the distributed planning process into a single step. Our approach is based on a planning graph structure for the agent reasoning and a CSP mechanism for the individual plan extraction and the coordination. We assume that no agent can reach the global goal alone. Therefore the agents must cooperate, {\it i.e.,} take in into account potential positive interactions between their activities to reach their common shared goal. The originality of our model consists in considering as soon as possible, {\it i.e.,} in the individual planning process, the positive and the negative interactions between agents activities in order to reduce the search cost of a global coordinated solution plan.
Integration of Online Learning into HTN Planning for Robotic Tasks
Magnenat, Stéphane (ETH Zurich) | Chappelier, Jean-Cédric (EPFL) | Mondada, Francesco (EPFL)
This paper extends hierarchical task network (HTN) planning with lightweight learning, considering that in robotics, actions have a non-zero probability of failing. Our work applies to A*-based HTN planners with lifting. We prove that the planner finds the plan of maximal expected utility, while retaining its lifting capability and efficient heuristic-based search. We show how to learn the probabilities online, which allows a robot to adapt by replanning on execution failures. The idea behind this work is to use the HTN domain to constrain the space of possibilities, and then to learn on the constrained space in a way requiring few training samples, rendering the method applicable to autonomous mobile robots.
Translating HTNs to PDDL: A Small Amount of Domain Knowledge Can Go a Long Way
Alford, Ronald Wayne (University of Maryland, College Park) | Kuter, Ugur (University of Maryland, College Park) | Nau, Dana (University of Maryland, College Park)
We show how to translate HTN domain descriptions (if they satisfy certain restrictions) into PDDL so that they can be used by classical planners. We provide correctness results for our translation algorithm, and show that it runs in linear time and space. We also show that even small and incomplete amounts of HTN knowledge, when translated into PDDL using our algorithm, can greatly improve a classical planner's performance. In experiments on several thousand randomly generated problems in three different planning domains, such knowledge speeded up the well-known Fast-Forward planner by several orders of magnitude, and enabled it to solve much larger problems than it could otherwise solve.
Accelerating Partial-Order Planners: Some Techniques for Effective Search Control and Pruning
We propose some domain-independent techniques for bringing well-founded partial-order planners closer to practicality. The first two techniques are aimed at improving search control while keeping overhead costs low. One is based on a simple adjustment to the default A* heuristic used by UCPOP to select plans for refinement. The other is based on preferring ``zero commitment'' (forced) plan refinements whenever possible, and using LIFO prioritization otherwise. A more radical technique is the use of operator parameter domains to prune search. These domains are initially computed from the definitions of the operators and the initial and goal conditions, using a polynomial-time algorithm that propagates sets of constants through the operator graph, starting in the initial conditions. During planning, parameter domains can be used to prune nonviable operator instances and to remove spurious clobbering threats. In experiments based on modifications of UCPOP, our improved plan and goal selection strategies gave speedups by factors ranging from 5 to more than 1000 for a variety of problems that are nontrivial for the unmodified version. Crucially, the hardest problems gave the greatest improvements. The pruning technique based on parameter domains often gave speedups by an order of magnitude or more for difficult problems, both with the default UCPOP search strategy and with our improved strategy. The Lisp code for our techniques and for the test problems is provided in on-line appendices.