Logic & Formal Reasoning
PutnamBench: Evaluating Neural Theorem-Provers on the Putnam Mathematical Competition
We present PutnamBench, a new multi-language benchmark for evaluating the ability of neural theorem-provers to solve competition mathematics problems. PutnamBench consists of 1692 hand-constructed formalizations of 640 theorems sourced from the William Lowell Putnam Mathematical Competition, the premier undergraduate-level mathematics competition in North America. All the problems have formalizations in Lean 4 and Isabelle; a substantial subset also has Coq formalizations. PutnamBench requires significant problem-solving ability and proficiency in a broad range of topics taught in undergraduate mathematics courses. We use PutnamBench to evaluate several established neural and symbolic theorem-provers.
Automata Learning of Preferences over Temporal Logic Formulas from Pairwise Comparisons
Many preference elicitation algorithms consider preference over propositional logic formulas or items with different attributes. In sequential decision making, a user's preference can be a preorder over possible outcomes, each of which is a temporal sequence of events. This paper considers a class of preference inference problems where the user's unknown preference is represented by a preorder over regular languages (sets of temporal sequences), referred to as temporal goals. Given a finite set of pairwise comparisons between finite words, the objective is to learn both the set of temporal goals and the preorder over these goals. We first show that a preference relation over temporal goals can be modeled by a Preference Deterministic Finite Automaton (PDFA), which is a deterministic finite automaton augmented with a preorder over acceptance conditions. The problem of preference inference reduces to learning the PDFA. This problem is shown to be computationally challenging: determining whether there exists a PDFA of size smaller than a given integer $k$ that is consistent with the sample is NP-complete. We formalize the properties of characteristic samples and develop an algorithm that, given a characteristic sample, is guaranteed to learn the minimal PDFA equivalent to the true PDFA from which the sample is drawn. We present the method through a running example and provide detailed analysis using a robotic motion planning problem.
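To make the PDFA idea concrete, here is a minimal Python sketch of a deterministic finite automaton whose states are labelled with acceptance classes, together with a preorder over those classes. The automaton, the class names (`none`, `visited_a`, `visited_b`), and the preference pairs are all hypothetical and chosen only for illustration; the sketch shows how two finite words could be compared, not the learning algorithm from the paper.

```python
from itertools import product

# Hypothetical DFA over the alphabet {"a", "b"} with states 0, 1, 2.
TRANSITIONS = {
    (0, "a"): 1, (0, "b"): 2,
    (1, "a"): 1, (1, "b"): 2,
    (2, "a"): 2, (2, "b"): 2,
}
INITIAL = 0

# Each state carries one acceptance class (standing in for a temporal goal).
GOAL_OF_STATE = {0: "none", 1: "visited_a", 2: "visited_b"}

# Assumed generating pairs (x, y), read as "x is weakly preferred to y".
GENERATORS = {("visited_b", "visited_a"), ("visited_a", "none")}


def reflexive_transitive_closure(pairs, elements):
    """Smallest preorder on `elements` that contains `pairs`."""
    rel = set(pairs) | {(e, e) for e in elements}
    changed = True
    while changed:
        changed = False
        for (x, y), (u, v) in product(list(rel), repeat=2):
            if y == u and (x, v) not in rel:
                rel.add((x, v))
                changed = True
    return rel


PREORDER = reflexive_transitive_closure(GENERATORS, set(GOAL_OF_STATE.values()))


def goal_of(word):
    """Run the DFA on `word` and return the acceptance class it ends in."""
    state = INITIAL
    for symbol in word:
        state = TRANSITIONS[(state, symbol)]
    return GOAL_OF_STATE[state]


def compare(w1, w2):
    """Compare two words through the preorder on their acceptance classes."""
    g1, g2 = goal_of(w1), goal_of(w2)
    first_ge = (g1, g2) in PREORDER
    second_ge = (g2, g1) in PREORDER
    if first_ge and second_ge:
        return "indifferent"
    if first_ge:
        return "first preferred"
    if second_ge:
        return "second preferred"
    return "incomparable"


print(compare("ab", "aa"))  # first preferred
print(compare("", "a"))     # second preferred
```

Because the relation is only a preorder, `compare` can also return `incomparable`, which a total ranking over words could not express.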
Do Large Language Models Excel in Complex Logical Reasoning with Formal Language?
Jiang, Jin, Wang, Jianing, Yan, Yuchen, Liu, Yang, Zhu, Jianhua, Zhang, Mengdi, Cai, Xunliang, Gao, Liangcai
Large Language Models (LLMs) have been shown to achieve breakthrough performance on complex logical reasoning tasks. Nevertheless, most existing research focuses on employing formal language to guide LLMs to derive reliable reasoning paths, while systematic evaluations of these capabilities are still limited. In this paper, we aim to conduct a comprehensive evaluation of LLMs across various logical reasoning problems utilizing formal languages. Along three dimensions, i.e., spectrum of LLMs, taxonomy of tasks, and format of trajectories, our key findings are: 1) Thinking models significantly outperform Instruct models, especially when formal language is employed; 2) All LLMs exhibit limitations in inductive reasoning capability, irrespective of whether they use a formal language; 3) Data with PoT format achieves the best generalization performance across other languages. We also curate formal-related training data to further enhance small language models, and the experimental results indicate that a simple rejected fine-tuning method can better enable LLMs to generalize across formal languages and achieve the best overall performance. Our codes and reports are available at https://github.com/jiangjin1999/FormalEval.
Logic-of-Thought: Empowering Large Language Models with Logic Programs for Solving Puzzles in Natural Language
Li, Naiqi, Liu, Peiyuan, Liu, Zheng, Dai, Tao, Jiang, Yong, Xia, Shu-Tao
Solving puzzles in natural language poses a long-standing challenge in AI. While large language models (LLMs) have recently shown impressive capabilities in a variety of tasks, they continue to struggle with complex puzzles that demand precise reasoning and exhaustive search. In this paper, we propose Logic-of-Thought (Logot), a novel framework that bridges LLMs with logic programming to address this problem. Our method leverages LLMs to translate puzzle rules and states into answer set programs (ASPs), whose solutions are then accurately and efficiently inferred by an ASP interpreter. This hybrid approach combines the natural language understanding of LLMs with the precise reasoning capabilities of logic programs. We evaluate our method on various grid puzzles and dynamic puzzles involving actions, demonstrating near-perfect accuracy across all tasks. Our code and data are available at: https://github.com/naiqili/Logic-of-Thought.
Data-driven Verification of Procedural Programs with Integer Arrays
Bouajjani, Ahmed, Boutglay, Wael-Amine, Habermehl, Peter
We address the problem of automatically verifying procedural programs manipulating parametric-size arrays of integers, encoded as a constrained Horn clauses solving problem. We propose a new algorithmic method for synthesizing loop invariants and procedure pre/post-conditions represented as universally quantified first-order formulas constraining the array elements and program variables. We adopt a data-driven approach that extends the decision tree Horn-ICE framework to handle arrays. We provide a powerful learning technique based on reducing a complex classification problem of vectors of integer arrays to a simpler classification problem of vectors of integers. The obtained classifier is generalized to get universally quantified invariants and procedure pre/post-conditions. We have implemented our method and shown its efficiency and competitiveness w.r.t.
Transductively Informed Inductive Program Synthesis
Zenkner, Janis, Sesterhenn, Tobias, Bartelt, Christian
Abstraction and reasoning in program synthesis have seen significant progress through both inductive and transductive paradigms. Inductive approaches generate a program or latent function from input-output examples, which can then be applied to new inputs. Transductive approaches directly predict output values for given inputs, effectively serving as the functions themselves. Current approaches combine inductive and transductive models via isolated ensembling, but they do not explicitly model the interaction between both paradigms. In this work, we introduce TIIPS, a novel framework that unifies transductive and inductive strategies by explicitly modeling their interactions through a cooperative mechanism: an inductive model generates programs, while a transductive model constrains, guides, and refines the search to improve synthesis accuracy and generalization. We evaluate TIIPS on two widely studied program synthesis domains: string and list manipulation. Our results show that TIIPS solves more tasks and yields functions that more closely match optimal solutions in syntax and semantics, particularly in out-of-distribution settings, yielding state-of-the-art performance. We believe that explicitly modeling the synergy between inductive and transductive reasoning opens promising avenues for general-purpose program synthesis and broader applications.
HybridProver: Augmenting Theorem Proving with LLM-Driven Proof Synthesis and Refinement
Hu, Jilin, Zhang, Jianyu, Zhao, Yongwang, Ringer, Talia
Formal methods is pivotal for verifying the reliability of critical systems through rigorous mathematical proofs. However, its adoption is hindered by labor-intensive manual proofs and the expertise required to use theorem provers. Recent advancements in large language models (LLMs) offer new opportunities for automated theorem proving. Two promising approaches are generating tactics step by step and generating a whole proof directly with an LLM. However, existing work makes no attempt to combine the two approaches. In this work, we introduce HybridProver, a dual-model proof synthesis framework that combines tactic-based generation and whole-proof synthesis to harness the benefits of both approaches. HybridProver generates whole proof candidates for evaluation directly, then extracts proof sketches from those candidates. It then uses a tactic-based generation model that integrates automated tools to complete the sketches via stepwise refinement. We implement HybridProver for the Isabelle theorem prover and fine-tune LLMs on our optimized Isabelle datasets. Evaluation on the miniF2F dataset illustrates HybridProver's effectiveness. We achieve a 59.4% success rate on miniF2F, where the previous SOTA is 56.1%. Our ablation studies show that this SOTA result is attributable to combining whole-proof and tactic-based generation. Additionally, we show how the dataset quality, training parameters, and sampling diversity affect the final result during automated theorem proving with LLMs. All of our code, datasets, and LLMs are open source.
On the Boolean Network Theory of Datalog$^\neg$
Trinh, Van-Giang, Benhamou, Belaid, Soliman, Sylvain, Fages, François
Datalog$^\neg$ is a central formalism used in a variety of domains ranging from deductive databases and abstract argumentation frameworks to answer set programming. Its model theory is the finite counterpart of the logical semantics developed for normal logic programs, mainly based on the notions of Clark's completion and two-valued or three-valued canonical models including supported, stable, regular and well-founded models. In this paper we establish a formal link between Datalog$^\neg$ and Boolean network theory first introduced for gene regulatory networks. We show that in the absence of odd cycles in a Datalog$^\neg$ program, the regular models coincide with the stable models, which entails the existence of stable models, and in the absence of even cycles, we prove the uniqueness of stable partial models and regular models. This connection also gives new upper bounds on the numbers of stable partial, regular, and stable models of a Datalog$^\neg$ program using the cardinality of a feedback vertex set in its atom dependency graph. Interestingly, our connection to Boolean network theory also points us to the notion of trap spaces. In particular we show the equivalence between subset-minimal stable trap spaces and regular models.
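The role of odd and even cycles can be seen on two tiny ground programs. The Python sketch below is a brute-force stable-model enumerator based on the standard Gelfond-Lifschitz reduct; it is not the Boolean-network machinery of the paper, and the rule encoding as `(head, positive_body, negative_body)` triples is an assumption made here for illustration.

```python
from itertools import combinations


def least_model(positive_rules):
    """Least model of a negation-free ground program given as (head, body) pairs."""
    model = set()
    changed = True
    while changed:
        changed = False
        for head, body in positive_rules:
            if head not in model and set(body) <= model:
                model.add(head)
                changed = True
    return model


def stable_models(rules):
    """Enumerate stable models of a ground normal program by brute force.

    Each rule is a triple (head, positive_body, negative_body).
    A candidate set M is stable if it equals the least model of the
    reduct of the program with respect to M.
    """
    atoms = sorted(
        {h for h, _, _ in rules}
        | {a for _, pos, _ in rules for a in pos}
        | {a for _, _, neg in rules for a in neg}
    )
    found = []
    for size in range(len(atoms) + 1):
        for candidate in combinations(atoms, size):
            m = set(candidate)
            # Reduct: drop rules blocked by m, then drop the negative literals.
            reduct = [(h, pos) for h, pos, neg in rules if not (set(neg) & m)]
            if least_model(reduct) == m:
                found.append(m)
    return found


# Even negative cycle:  p :- not q.   q :- not p.
EVEN_CYCLE = [("p", [], ["q"]), ("q", [], ["p"])]

# Odd negative cycle:   p :- not p.
ODD_CYCLE = [("p", [], ["p"])]

print(stable_models(EVEN_CYCLE))  # two stable models: {'p'} and {'q'}
print(stable_models(ODD_CYCLE))   # [] (no stable model)
```

The odd cycle has no stable model while the even cycle has two, which matches the direction of the abstract's statements: existence is tied to the absence of odd cycles, and uniqueness to the absence of even cycles. The enumeration is exponential in the number of atoms and is meant only for toy programs.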
Modelling Real-time Systems with Bigraphs
Albalwe, Maram, Archibald, Blair, Sevegnani, Michele
Bigraphical Reactive Systems (BRSs) are a graph-rewriting formalism describing systems evolving in two dimensions: spatially, e.g. a person in a room, and non-spatially, e.g. mobile phones communicating regardless of location. Despite use in domains including communication protocols, agent programming, biology, and security, there is no support for real-time systems. We extend BRSs to support real-time systems with a modelling approach that uses multiple perspectives to represent digital clocks. We use Action BRSs, a recent extension of BRSs, where the resulting transition system is a Markov Decision Process (MDP). This allows a natural representation of the choices in each system state: to either allow time to pass or perform a specific action. We implement our proposed approach using the BigraphER toolkit, and demonstrate the effectiveness through multiple examples including modelling cloud system requests.
Learning Probabilistic Temporal Logic Specifications for Stochastic Systems
Roy, Rajarshi, Pote, Yash, Parker, David, Kwiatkowska, Marta
There has been substantial progress in the inference of formal behavioural specifications from sample trajectories, for example using Linear Temporal Logic (LTL). However, these techniques cannot handle specifications that correctly characterise systems with stochastic behaviour, which occur commonly in reinforcement learning and formal verification. We consider the passive learning problem of inferring a Boolean combination of probabilistic LTL (PLTL) formulas from a set of Markov chains, classified as either positive or negative. We propose a novel learning algorithm that infers concise PLTL specifications, leveraging grammar-based enumeration, search heuristics, probabilistic model checking and Boolean set-cover procedures. We demonstrate the effectiveness of our algorithm in two use cases: learning from policies induced by RL algorithms and learning from variants of a probabilistic model. In both cases, our method automatically and efficiently extracts PLTL specifications that succinctly characterize the temporal differences between the policies or model variants.
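As a simplified illustration of what it means for a probabilistic temporal formula to separate positive from negative Markov chains, the Python sketch below evaluates a single step-bounded reachability property of the form "the probability of reaching `goal` within k steps is at least a threshold". The two chains, the state names, and the threshold are hypothetical; the paper's algorithm searches over full PLTL formulas with a probabilistic model checker, which this sketch does not attempt.

```python
def prob_reach_within(chain, start, goal, steps):
    """Probability of reaching a state in `goal` from `start` in at most `steps` steps.

    `chain` maps each state to a dict {successor: probability}.
    Uses backward value iteration with goal states treated as absorbing.
    """
    values = {s: (1.0 if s in goal else 0.0) for s in chain}
    for _ in range(steps):
        values = {
            s: 1.0 if s in goal
            else sum(p * values[t] for t, p in chain[s].items())
            for s in chain
        }
    return values[start]


def satisfies(chain, start, goal, steps, threshold):
    """Check the bounded property: P(reach goal within `steps`) >= threshold."""
    return prob_reach_within(chain, start, goal, steps) >= threshold


# Hypothetical positive example: reaches the goal quickly.
POSITIVE = {"s0": {"s0": 0.5, "goal": 0.5}, "goal": {"goal": 1.0}}

# Hypothetical negative example: reaches the goal slowly.
NEGATIVE = {"s0": {"s0": 0.9, "goal": 0.1}, "goal": {"goal": 1.0}}

for name, chain in [("positive", POSITIVE), ("negative", NEGATIVE)]:
    p = prob_reach_within(chain, "s0", {"goal"}, steps=2)
    print(name, round(p, 4), satisfies(chain, "s0", {"goal"}, 2, 0.5))
```

With a bound of two steps the positive chain reaches the goal with probability 0.75 and the negative chain with probability 0.19, so a threshold of 0.5 classifies both correctly.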