RLMEval: Evaluating Research-Level Neural Theorem Proving
Poiroux, Auguste, Bosselut, Antoine, Kunčak, Viktor
Despite impressive results on curated benchmarks, the practical impact of large language models (LLMs) on research-level neural theorem proving and proof autoformalization remains limited. We introduce RLMEval, an evaluation suite that targets these tasks on challenging research-level theorems drawn from real-world Lean Blueprint formalization projects. Our evaluation of state-of-the-art models on RLMEval, comprising 613 theorems from 6 Lean projects, reveals a significant gap: progress on existing benchmarks does not readily translate to these more realistic settings, with the best model achieving only a 10.3% pass rate. RLMEval provides a new, challenging benchmark designed to guide and accelerate progress in automated reasoning for formal mathematics.
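To make the evaluation protocol concrete, here is a minimal sketch of a pass-rate harness in the style the abstract implies; `generate_proofs` and `lean_check` are hypothetical stand-ins, not RLMEval's actual API.

```python
# Hypothetical sketch of a pass-rate harness: sample k candidate proofs per
# theorem and count a theorem as solved if any candidate is accepted by the
# Lean checker. All interfaces here are assumptions.
from typing import Callable, Sequence

def pass_rate(theorems: Sequence[str],
              generate_proofs: Callable[[str, int], list[str]],
              lean_check: Callable[[str, str], bool],
              k: int = 8) -> float:
    solved = 0
    for stmt in theorems:
        candidates = generate_proofs(stmt, k)          # k model samples
        if any(lean_check(stmt, proof) for proof in candidates):
            solved += 1
    return solved / len(theorems)
```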
ProofOptimizer: Training Language Models to Simplify Proofs without Human Demonstrations
Gu, Alex, Piotrowski, Bartosz, Gloeckle, Fabian, Yang, Kaiyu, Markosyan, Aram H.
Neural theorem proving has advanced rapidly in the past year, reaching IMO gold-medalist capabilities and producing formal proofs that span thousands of lines. Although such proofs are mechanically verified by formal systems like Lean, their excessive length renders them difficult for humans to comprehend and limits their usefulness for mathematical insight. Proof simplification is therefore a critical bottleneck. Yet, training data for this task is scarce, and existing methods -- mainly agentic scaffolding with off-the-shelf LLMs -- struggle with the extremely long proofs generated by RL-trained provers. We introduce ProofOptimizer, the first language model trained to simplify Lean proofs without requiring additional human supervision. ProofOptimizer is trained via expert iteration and reinforcement learning, using Lean to verify simplifications and provide training signal. At inference time, it operates within an iterative proof-shortening workflow, progressively reducing proof length. Experiments show that ProofOptimizer substantially compresses proofs generated by state-of-the-art RL-trained provers on standard benchmarks, reducing proof length by 87% on miniF2F, 57% on PutnamBench, and 49% on Seed-Prover's IMO 2025 proofs. Beyond conciseness, the simplified proofs check faster in Lean and further improve downstream prover performance when reused as training data for supervised finetuning.
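The iterative proof-shortening workflow can be pictured as a verify-and-keep loop. The sketch below is a simplified illustration under assumed interfaces (`propose_simpler` standing in for the trained model, `lean_verifies` for the Lean check), not ProofOptimizer's actual code.

```python
# A minimal sketch of an iterative proof-shortening loop: keep a candidate
# rewrite only if Lean still accepts it and it is strictly shorter.
def shorten(theorem: str, proof: str,
            propose_simpler, lean_verifies,
            rounds: int = 5) -> str:
    best = proof
    for _ in range(rounds):
        candidate = propose_simpler(theorem, best)
        # Lean provides the training/acceptance signal: a simplification
        # counts only if it still verifies.
        if lean_verifies(theorem, candidate) and len(candidate) < len(best):
            best = candidate            # progressively reduce proof length
    return best
```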
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- North America > United States > California (0.04)
- Europe > Germany > Berlin (0.04)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Logic & Formal Reasoning (0.93)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)
Scaling up Multi-Turn Off-Policy RL and Multi-Agent Tree Search for LLM Step-Provers
Xin, Ran, Zheng, Zeyu, Nie, Yanchen, Yuan, Kun, Xiao, Xia
The integration of Large Language Models (LLMs) into automated theorem proving has shown immense promise, yet is fundamentally constrained by challenges in scaling up both training-time reinforcement learning (RL) and inference-time compute. This paper introduces \texttt{BFS-Prover-V2}, a system designed to address this dual scaling problem. We present two primary innovations. The first is a novel multi-turn off-policy RL framework for continually improving the performance of an LLM step-prover at training time. This framework, inspired by the principles of AlphaZero, utilizes a multi-stage expert iteration pipeline featuring adaptive tactic-level data filtering and periodic retraining to surmount the performance plateaus that typically curtail long-term RL in LLM-based agents. The second innovation is a planner-enhanced multi-agent search architecture that scales reasoning capabilities at inference time. This architecture employs a general reasoning model as a high-level planner to iteratively decompose complex theorems into a sequence of simpler subgoals. This hierarchical approach substantially reduces the search space, enabling a team of parallel prover agents to collaborate efficiently by leveraging a shared proof cache. We demonstrate that this dual approach to scaling yields state-of-the-art results on established formal mathematics benchmarks. \texttt{BFS-Prover-V2} achieves 95.08\% and 41.4\% on the MiniF2F and ProofNet test sets, respectively. While demonstrated in the domain of formal mathematics, the RL and inference techniques presented in this work are of broader interest and may be applied to other domains requiring long-horizon multi-turn reasoning and complex search.
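As a rough illustration of the search half of the system, the sketch below runs best-first tactic search over one subgoal while consulting a shared proof cache. The planner that produces subgoals, `propose_tactics`, and `apply_tactic` are all assumed interfaces, and the real system runs many such prover agents in parallel.

```python
# Best-first tactic search over a single subgoal with a shared proof cache.
# Conventions assumed for this sketch: apply_tactic returns None on failure,
# otherwise the list of remaining goals (empty list = subgoal closed).
import heapq, itertools

def best_first_prove(subgoal, propose_tactics, apply_tactic,
                     shared_cache: dict, budget: int = 1000):
    if subgoal in shared_cache:              # reuse proofs found by peer agents
        return shared_cache[subgoal]
    tie = itertools.count()                  # tie-breaker so states never compare
    frontier = [(0.0, next(tie), subgoal, [])]
    for _ in range(budget):
        if not frontier:
            break
        cost, _, state, path = heapq.heappop(frontier)
        for tactic, logprob in propose_tactics(state):
            goals = apply_tactic(state, tactic)   # run the tactic in Lean
            if goals is None:                     # tactic failed to apply
                continue
            if not goals:                         # no goals left: subgoal proved
                shared_cache[subgoal] = path + [tactic]
                return shared_cache[subgoal]
            heapq.heappush(frontier,
                           (cost - logprob, next(tie), goals, path + [tactic]))
    return None
```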
Autoformalizer with Tool Feedback
Guo, Qi, Wang, Jianing, Zhang, Jianfei, Kong, Deyang, Huang, Xiangzhou, Xi, Xiangyu, Wang, Wei, Wang, Jingang, Cai, Xunliang, Zhang, Shikun, Ye, Wei
Autoformalization addresses the scarcity of data for Automated Theorem Proving (ATP) by translating mathematical problems from natural language into formal statements. However, existing formalizers still struggle to consistently generate statements that are both syntactically valid and semantically consistent. To address this issue, we propose the Autoformalizer with Tool Feedback (ATF), a novel approach that incorporates syntactic and consistency information as tools into the formalization process. By integrating Lean 4 compilers for syntax corrections and employing a multi-LLMs-as-judge approach for consistency validation, the model is able to adaptively refine generated statements according to the tool feedback, enhancing both syntactic validity and semantic consistency. The training of ATF involves a cold-start phase on synthetic tool-calling data, an expert iteration phase to improve formalization capabilities, and Direct Preference Optimization to alleviate ineffective revisions. Experimental results show that ATF markedly outperforms a range of baseline formalizer models, with its superior performance further validated by human evaluations. Subsequent analysis reveals that ATF demonstrates excellent inference scaling properties.

Recent advancements in the reasoning capabilities of large language models have significantly accelerated progress in the field of Automated Theorem Proving (ATP) (Yang et al., 2024). Unlike traditional mathematical tasks, ATP requires models to start from a formalized theorem statement and construct rigorous logical proofs that can be verified within formal languages such as Lean (De Moura et al., 2015) and Isabelle (Paulson, 1994). However, the training of recent massive provers, such as DeepSeek-Prover (Ren et al., 2025) and Kimina-Prover (Wang et al., 2025), is hindered by the scarcity of formalized mathematical queries. Autoformalization addresses this by translating mathematical problems expressed in natural language into verifiable formal statements. A significant challenge in autoformalization is the absence of a universal automatic evaluation standard (Li et al., 2024b).
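The tool-feedback loop the abstract describes reduces to: formalize, compile, judge, revise. The sketch below is a hypothetical rendering with stand-in functions for the formalizer model, the Lean 4 compiler check, and the multi-LLMs-as-judge consistency check; it is not ATF's actual code.

```python
# Hypothetical tool-feedback refinement loop: compiler errors drive syntax
# fixes, judge disagreement drives semantic revisions.
def autoformalize_with_tools(nl_problem: str, formalize, lean_compile,
                             judges_agree, max_rounds: int = 4) -> str | None:
    feedback = ""
    for _ in range(max_rounds):
        stmt = formalize(nl_problem, feedback)
        ok, error_msg = lean_compile(stmt)
        if not ok:                       # syntax tool: feed compiler errors back
            feedback = f"Compiler error: {error_msg}"
            continue
        if not judges_agree(nl_problem, stmt):   # consistency tool
            feedback = "Statement compiles but does not match the problem."
            continue
        return stmt                      # syntactically valid and consistent
    return None
```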
- Information Technology > Artificial Intelligence > Representation & Reasoning > Logic & Formal Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
From Axioms to Algorithms: Mechanized Proofs of the vNM Utility Theorem
This paper presents a comprehensive formalization of the von Neumann-Morgenstern (vNM) expected utility theorem using the Lean 4 interactive theorem prover. We implement the classical axioms of preference (completeness, transitivity, continuity, and independence), enabling machine-verified proofs of both the existence and uniqueness of utility representations. Our formalization captures the mathematical structure of preference relations over lotteries, verifying that preferences satisfying the vNM axioms can be represented by expected utility maximization. Our contributions include a granular implementation of the independence axiom, formally verified proofs of fundamental claims about mixture lotteries, constructive demonstrations of utility existence, and computational experiments validating the results. We prove equivalence to classical presentations while offering greater precision at decision boundaries. This formalization provides a rigorous foundation for applications in economic modeling, AI alignment, and management decision systems, bridging the gap between theoretical decision theory and computational implementation.
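For flavor, here is a minimal Lean 4 sketch of how such axioms can be packaged as a structure over a preference relation on lotteries; the names and the exact phrasing of continuity are hypothetical simplifications, not the paper's actual development.

```lean
import Mathlib

-- A minimal sketch, not the paper's development: lotteries as real-valued
-- weight functions, with the vNM axioms as fields of a Prop-valued
-- structure; normalization side conditions are omitted for brevity.

variable {α : Type}

/-- Convex mixture of two lotteries with weight `t`. -/
def mix (t : ℝ) (p q : α → ℝ) : α → ℝ :=
  fun x => t * p x + (1 - t) * q x

/-- A weak-preference relation on lotteries satisfying the vNM axioms. -/
structure VNMAxioms (pref : (α → ℝ) → (α → ℝ) → Prop) : Prop where
  complete : ∀ p q, pref p q ∨ pref q p
  transitive : ∀ p q r, pref p q → pref q r → pref p r
  independence : ∀ (p q r : α → ℝ) (t : ℝ), 0 < t → t ≤ 1 →
    (pref p q ↔ pref (mix t p r) (mix t q r))
  continuity : ∀ (p q r : α → ℝ), pref p q → pref q r →
    ∃ t : ℝ, 0 ≤ t ∧ t ≤ 1 ∧ pref (mix t p r) q ∧ pref q (mix t p r)
```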
- Information Technology (0.46)
- Leisure & Entertainment (0.46)
CombiBench: Benchmarking LLM Capability for Combinatorial Mathematics
Liu, Junqi, Lin, Xiaohan, Bayer, Jonas, Dillies, Yael, Jiang, Weijie, Liang, Xiaodan, Soletskyi, Roman, Wang, Haiming, Xie, Yunzhou, Xiong, Beibei, Yang, Zhengfeng, Zhang, Jujian, Zhi, Lihong, Li, Jia, Liu, Zhengying
Neurosymbolic approaches integrating large language models with formal reasoning have recently achieved human-level performance on mathematics competition problems in algebra, geometry and number theory. In comparison, combinatorics remains a challenging domain, characterized by a lack of appropriate benchmarks and theorem libraries. To address this gap, we introduce CombiBench, a comprehensive benchmark comprising 100 combinatorial problems, each formalized in Lean~4 and paired with its corresponding informal statement. The problem set covers a wide spectrum of difficulty levels, ranging from middle school to IMO and university level, and spans more than ten combinatorial topics. CombiBench is suitable for testing IMO solving capabilities since it includes all IMO combinatorial problems since 2000 (except IMO 2004 P3, whose statement contains an image). Furthermore, we provide a comprehensive and standardized evaluation framework, dubbed Fine-Eval (for $\textbf{F}$ill-in-the-blank $\textbf{in}$ L$\textbf{e}$an Evaluation), for formal mathematics. It accommodates not only proof-based problems but also, for the first time, the evaluation of fill-in-the-blank questions. Using Fine-Eval as the evaluation method and Kimina Lean Server as the backend, we benchmark several LLMs on CombiBench and observe that their capabilities for formally solving combinatorial problems remain limited. Among all models tested (none of which has been trained for this particular task), Kimina-Prover attains the best results, solving 7 problems (out of 100) under both ``with solution'' and ``without solution'' scenarios. We open-source the benchmark dataset alongside the code of the proposed evaluation method at https://github.com/MoonshotAI/CombiBench/.
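A toy example of the fill-in-the-blank format that Fine-Eval evaluates might look as follows; this is a hypothetical illustration, not an actual CombiBench problem. In the ``without solution'' scenario the model must fill in both `sorry`s, while ``with solution'' reveals the intended value of `answer`.

```lean
import Mathlib

-- Hypothetical fill-in-the-blank problem: supply the numeric answer and
-- then prove the statement about it. (A correct completion here would be
-- `answer := 16`, since there are 2^4 functions `Fin 4 → Bool`.)
def answer : ℕ := sorry

theorem card_bool_strings :
    Fintype.card (Fin 4 → Bool) = answer := by
  sorry
```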
- North America > Canada > Quebec > Montreal (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Europe > Sweden > Stockholm > Stockholm (0.04)
- (3 more...)
- Research Report (0.50)
- Instructional Material > Course Syllabus & Notes (0.34)
Diverse Inference and Verification for Advanced Reasoning
Drori, Iddo, Longhitano, Gaston, Mao, Mao, Hyun, Seunghwan, Zhang, Yuke, Park, Sungjun, Meeks, Zachary, Zhang, Xin-Yu, Segev, Ben, Yong, Howard, Verma, Nakul, Shporer, Avi, Amit, Alon, Udell, Madeleine
Reasoning LLMs such as OpenAI o1, o3 and DeepSeek R1 have made significant progress in mathematics and coding, yet still struggle with advanced tasks such as International Mathematical Olympiad (IMO) combinatorics problems, Abstraction and Reasoning Corpus (ARC) puzzles, and Humanity's Last Exam (HLE) questions. We use a diverse inference approach that combines multiple models and methods at test time. We find that verifying mathematics and code problems, and rejection sampling on other problems, is simple and effective. We automatically verify the correctness of solutions to IMO problems with Lean and of ARC puzzles with code, and find that best-of-N effectively answers HLE questions. Our approach increases answer accuracy on IMO combinatorics problems from 33.3% to 77.8%, accuracy on HLE questions from 8% to 37%, and solves 80% of ARC puzzles that 948 humans could not and 26.5% of ARC puzzles that o3 high compute does not. Test-time simulations, reinforcement learning, and meta-learning with inference feedback improve generalization by adapting agent graph representations and varying prompts, code, and datasets. Our approach is reliable, robust, and scalable, and in the spirit of reproducible research, we will make it publicly available upon publication.
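The core recipe, verify formally where a verifier exists and fall back to best-of-N elsewhere, can be sketched as below; the interfaces are assumptions, not the paper's code.

```python
# Hypothetical sketch: run several models, keep the first machine-verified
# answer when a verifier exists (Lean for IMO, code execution for ARC),
# otherwise fall back to best-of-N by majority vote (as for HLE questions).
from collections import Counter

def diverse_inference(problem, models, verifier=None, n_per_model: int = 8):
    candidates = [m(problem) for m in models for _ in range(n_per_model)]
    if verifier is not None:
        for c in candidates:             # rejection sampling on verified domains
            if verifier(problem, c):
                return c
        return None
    return Counter(candidates).most_common(1)[0][0]   # majority vote
```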
- North America > United States > Massachusetts (0.04)
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Workflow (0.93)
- Research Report (0.63)
- Leisure & Entertainment > Games (1.00)
- Education > Educational Setting (0.92)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Logic & Formal Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
- (2 more...)
An In-Context Learning Agent for Formal Theorem-Proving
Thakur, Amitayush, Tsoukalas, George, Wen, Yeming, Xin, Jimmy, Chaudhuri, Swarat
We present an in-context learning agent for formal theorem-proving in environments like Lean and Coq. Current state-of-the-art models for the problem are finetuned on environment-specific proof data. By contrast, our approach, called COPRA, repeatedly asks a high-capacity, general-purpose large language model (GPT-4) to propose tactic applications from within a stateful backtracking search. Proposed tactics are executed in the underlying proof environment. Feedback from the execution is used to build the prompt for the next model query, along with selected information from the search history and lemmas retrieved from an external database. We evaluate our implementation of COPRA on the miniF2F benchmark for Lean and a set of Coq tasks from the CompCert project. On these benchmarks, COPRA significantly outperforms few-shot invocations of GPT-4. It also compares favorably against finetuning-based approaches, outperforming REPROVER, a state-of-the-art finetuned approach for Lean, in terms of the pass@1 metric. Our code and data are available at https://github.com/trishullab/copra.
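A simplified rendering of the COPRA loop follows, with hypothetical stand-ins for the GPT-4 query (`llm_propose`), the proof-environment step (`env_step`), and the lemma retriever; the real agent is more careful about when it backtracks, so treat this as a sketch under those assumptions.

```python
# Hypothetical COPRA-style loop: prompt a general-purpose LLM with the
# current proof state, recent failures, and retrieved lemmas; execute the
# proposed tactic; backtrack via an explicit stack when it fails.
def copra_search(goal, llm_propose, env_step, retrieve_lemmas,
                 max_steps: int = 60):
    stack = [(goal, [])]                 # (proof state, tactic path)
    failures = []                        # execution feedback for later prompts
    for _ in range(max_steps):
        if not stack:
            return None
        state, path = stack[-1]
        prompt = {"state": state, "failures": failures[-5:],
                  "lemmas": retrieve_lemmas(state)}
        tactic = llm_propose(prompt)
        ok, new_state = env_step(state, tactic)   # run tactic in Lean/Coq
        if not ok:
            failures.append((state, tactic))
            stack.pop()                  # backtrack (real agent retries first)
            continue
        if new_state is None:            # no goals remain: proof found
            return path + [tactic]
        stack.append((new_state, path + [tactic]))
    return None
```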
- North America > United States > Texas > Travis County > Austin (0.04)
- North America > United States > Illinois (0.04)
- Europe > United Kingdom > North Sea > Central North Sea (0.04)
- (2 more...)