Goto

Collaborating Authors

 Logic & Formal Reasoning


ASSESS: A Semantic and Structural Evaluation Framework for Statement Similarity

arXiv.org Artificial Intelligence

Statement autoformalization, the automated translation of statements from natural language into formal languages, has seen significant advancements, yet the development of automated evaluation metrics remains limited. Existing metrics for formal statement similarity often fail to balance semantic and structural information. String-based approaches capture syntactic structure but ignore semantic meaning, whereas proof-based methods validate semantic equivalence but disregard structural nuances and, critically, provide no graded similarity score in the event of proof failure. To address these issues, we introduce ASSESS (A Semantic and Structural Evaluation Framework for Statement Similarity), which comprehensively integrates semantic and structural information to provide a continuous similarity score. Our framework first transforms formal statements into Operator Trees to capture their syntactic structure and then computes a similarity score using our novel TransTED (Transformation Tree Edit Distance) Similarity metric, which enhances traditional Tree Edit Distance by incorporating semantic awareness through transformations. For rigorous validation, we present EPLA (Evaluating Provability and Likeness for Autoformalization), a new benchmark of 524 expert-annotated formal statement pairs derived from miniF2F and ProofNet, with labels for both semantic provability and structural likeness. Experiments on EPLA demonstrate that TransTED Similarity outperforms existing methods, achieving state-of-the-art accuracy and the highest Kappa coefficient. The benchmark, and implementation code will be made public soon.


Generative Logic: A New Computer Architecture for Deterministic Reasoning and Knowledge Generation

arXiv.org Artificial Intelligence

We present Generative Logic (GL), a deterministic architecture that starts from user-supplied axiomatic definitions (and, optionally, a list of simple facts for counterexample (CE) construction), written in a minimalist Mathematical Programming Language (MPL), and systematically explores their deductive neighborhood. Definitions are compiled into a distributed grid of simple Logic Blocks (LBs) that exchange messages; whenever the premises of an inference rule unify, a new fact is emitted with full provenance to its sources, yielding replayable, auditable proof graphs. A prototype software implementation instantiates the workflow on first-order Peano arithmetic. Starting only from the Peano axioms, GL enumerates conjectures, applies normalization, type, and CE filter, and automatically reconstructs machine-checkable proofs of foundational arithmetic laws, including associativity and commutativity of addition, associativity and commutativity of multiplication, and distributivity. On commodity hardware, the prover phase requires approximately 7 seconds; a complete run finishes in about 5 minutes. Generated proofs export to navigable HTML so that every inference step can be inspected independently. We outline a hardware-software co-design path toward massively parallel realizations and describe prospective integration with probabilistic models (e.g., large language models) for auto-formalization and conjecture seeding. The Python, C++, and MPL code to reproduce the Peano experiments, along with the full proof graphs in HTML as well as machine-readable text format, are available in the project's GitHub repository at github.com/Generative-Logic/GL commit 56c9233 and are permanently archived at doi:10.5281/zenodo.17206386.


GenesisGeo: Technical Report

arXiv.org Artificial Intelligence

We present GenesisGeo, an automated theorem prover in Euclidean geometry. We have open-sourced a large-scale geometry dataset of 21.8 million geometric problems, over 3 million of which contain auxiliary constructions. Specially, we significantly accelerate the symbolic deduction engine DDARN by 120x through theorem matching, combined with a C++ implementation of its core components. Furthermore, we build our neuro-symbolic prover, GenesisGeo, upon Qwen3-0.6B-Base, which solves 24 of 30 problems (IMO silver medal level) in the IMO-AG-30 benchmark using a single model, and achieves 26 problems (IMO gold medal level) with a dual-model ensemble.


Abductive Logical Rule Induction by Bridging Inductive Logic Programming and Multimodal Large Language Models

arXiv.org Artificial Intelligence

We propose ILP-CoT, a method that bridges Inductive Logic Programming (ILP) and Multimodal Large Language Models (MLLMs) for abductive logical rule induction. The task involves both discovering logical facts and inducing logical rules from a small number of unstructured textual or visual inputs, which still remain challenging when solely relying on ILP, due to the requirement of specified background knowledge and high computational cost, or MLLMs, due to the appearance of perceptual hallucinations. Based on the key observation that MLLMs could propose structure-correct rules even under hallucinations, our approach automatically builds ILP tasks with pruned search spaces based on the rule structure proposals from MLLMs, and utilizes ILP system to output rules built upon rectified logical facts and formal inductive reasoning. Its effectiveness is verified through challenging logical induction benchmarks, as well as a potential application of our approach, namely text-to-image customized generation with rule induction. Our code and data are released at https://github.com/future-item/ILP-CoT.


Compiling by Proving: Language-Agnostic Automatic Optimization from Formal Semantics

arXiv.org Artificial Intelligence

Verification proofs encode complete program behavior, yet we discard them after checking correctness. We present compiling by proving, a paradigm that transforms these proofs into optimized execution rules. By constructing All-Path Reachability Proofs through symbolic execution and compiling their graph structure, we consolidate many semantic rewrites into single rules while preserving correctness by construction. We implement this as a language-agnostic extension to the K framework. Evaluation demonstrates performance improvements across different compilation scopes: opcode-level optimizations show consistent speedups, while whole-program compilation achieves orders of magnitude greater performance gains.


Revisiting Formal Methods for Autonomous Robots: A Structured Survey

arXiv.org Artificial Intelligence

This paper presents the initial results from our structured literature review on applications of Formal Methods (FM) to Robotic Autonomous Systems (RAS). We describe our structured survey methodology; including database selection and associated search strings, search filters and collaborative review of identified papers. We categorise and enumerate the FM approaches and formalisms that have been used for specification and verification of RAS. We investigate FM in the context of sub-symbolic AI-enabled RAS and examine the evolution of how FM is used over time in this field. This work complements a pre-existing survey in this area and we examine how this research area has matured over time. Specifically, our survey demonstrates that some trends have persisted as observed in a previous survey. Additionally, it recognized new trends that were not considered previously including a noticeable increase in adopting Formal Synthesis approaches as well as Probabilistic Verification Techniques.


Philosophy-informed Machine Learning

arXiv.org Artificial Intelligence

A deep dive into the open literature shows that there are t hree fundamental limitations to current ML approaches, namely blackbox brittleness (which renders models uninterpretable and unreliable under distribution shift [2]), causal blindness (which conflates correlation with causation [3]), and alignment failures (which produce systems optimizing objectives misaligned with human values [4]) . These deficiencies stem from a profound philosophical poverty in how ML conceptualizes knowledge, reasoning, and values. The first fundamental limitation, b lackbox brittleness, manifests when trained models fail on seemingly trivial variations of their training distribution. For example, a vision model that accurately identifies stop signs under normal conditions might misclassify them entirely when small adversarial perturbations are applied [5] . Not surprisingly, t h e same brittleness extends beyond adversarial examples to everyday distribution shifts (e.g., natural language processing models exhibit performance degradation when processing text from different cultural contexts, etc.) [6] .


REAMS: Reasoning Enhanced Algorithm for Maths Solving

arXiv.org Artificial Intelligence

The challenges of solving complex university-level mathematics problems, particularly those from MIT, and Columbia University courses, and selected tasks from the MATH dataset, remain a significant obstacle in the field of artificial intelligence. Conventional methods have consistently fallen short in this domain, highlighting the need for more advanced approaches. In this paper, we introduce a language-based solution that leverages zero-shot learning and mathematical reasoning to effectively solve, explain, and generate solutions for these advanced math problems. By integrating program synthesis, our method reduces reliance on large-scale training data while significantly improving problem-solving accuracy. Our approach achieves an accuracy of 90.15%, representing a substantial improvement over the previous benchmark of 81% and setting a new standard in automated mathematical problem-solving. These findings highlight the significant potential of advanced AI methodologies to address and overcome the challenges presented by some of the most complex mathematical courses and datasets.


Out-of-Distribution Generalization in the ARC-AGI Domain: Comparing Execution-Guided Neural Program Synthesis and Test-Time Fine-Tuning

arXiv.org Artificial Intelligence

We run a controlled compositional generalization experiment in the ARC-AGI domain: an open-world problem domain in which the ability to generalize out-of-distribution is, by design, an essential characteristic for success. We compare neural program synthesis and test-time fine-tuning approaches on this experiment. We find that execution-guided neural program synthesis outperforms all reference algorithms in its ability to compose novel solutions. Our empirical findings also suggest that the success of TTFT on ARC-AGI lies mainly in eliciting in-distribution knowledge that the LLM otherwise fails to rely on directly.


Substrate-Timing-Independence for Meta-State Stability of Distributed Robotic Swarms

arXiv.org Artificial Intelligence

Emergent properties in distributed systems arise due to timing unpredictability; asynchronous state evolution within each sub-system may lead the macro-system to faulty meta-states. Empirical validation of correctness is often prohibitively expensive, as the size of the state-space is too large to be tractable. In robotic swarms this problem is exacerbated, when compared to software systems, by the variability of the implementation substrate across the design, or even the deployment, process. We present an approach for formally reasoning about the correctness of robotic swarm design in a substrate-timing-independent way. By leveraging concurrent process calculi (namely, Communicating Sequential Processes), we introduce a methodology that can automatically identify possible causes of faulty meta-states and correct such designs such that meta-states are consistently stable, even in the presence of timing variability due to substrate changes. We evaluate this approach on a robotic swarm with a clearly identified fault, realized in both simulation and reality. Results support the research hypothesis, showing that the swarm reaches an illegal meta-state before the correction is applied, but behaves consistently correctly after the correction. Our techniques are transferable across different design methodologies, contributing to the toolbox of formal methods for roboticists.