compile
APOLLO: Automated LLM and Lean Collaboration for Advanced Formal Reasoning
Formal reasoning and automated theorem proving constitute a challenging subfield of machine learning, in which machines are tasked with proving mathematical theorems using formal languages like Lean. A formal verification system can check whether a formal proof is correct or not almost instantaneously, but generating a completely correct formal proof with large language models (LLMs) remains a formidable task. The usual approach in the literature is to prompt the LLM many times (up to several thousands) until one of the generated proofs passes the verification system. In this work, we present APOLLO (Automated PrOof repair via LLM and Lean cOllaboration), a modular, model-agnostic agentic framework that combines the strengths of the Lean compiler with an LLM's reasoning abilities to achieve better proof-generation results at a low token and sampling budgets. Apollo directs a fully automated process in which the LLM generates proofs for theorems, a set of agents analyze the proofs, fix the syntax errors, identify the mistakes in the proofs using Lean, isolate failing sub-lemmas, utilize automated solvers, and invoke an LLM on each remaining goal with a low top-K budget. The repaired sub-proofs are recombined and reverified, iterating up to a user-controlled maximum number of attempts. On the miniF2F benchmark, we establish a new state-of-the-art accuracy of 84.9% among sub 8B-parameter models (as of August 2025) while keeping the sampling budget below one hundred. Moreover, Apollo raises the state-of-the-art accuracy for Goedel-Prover-SFT to 65.6% while cutting sample complexity from 25,600 to a few hundred.
c182ec594f38926b7fcb827635b9a8f4-Supplemental-Conference.pdf
Let q(Y;Θ) and cK(Y,X) be two smooth, decomposable circuits that are compatible overY then computing their product as a circuit rΘ,K(X,Y) = q(Y;Θ) cK(Y,X) that is decomposable overY can be done inO(|q||c|). Letr(X,Y)beacircuitthat is smooth and decomposable and deterministic overY then for a configurationx its MAP state argmaxyr(x,y)canbecomputedintimeO(|r|). For our experiments we use standard compilation tools toobtain aconstraint circuit starting from a propositional logical formula in conjunctive normal form. We now illustrate step-by-step one example of such a compilation for a simple logical formula. Deterministic sum units representdisjoint solutions to the logical formula, meaning there exists distinct assignments, characterized by the children, that satisfy the logical constraint e.g.
SupplementaryMaterial
A sitting is a meeting of parliament members. While in the virtual environment, you will need to install the specific Gensim1 version needed for theCompassapproach. Inotherinstances,thebeginning of the line that specifies the speaker consists of the role of the parliament member, for example "SPEAKEROFTHEPARLIAMENT" (meaning the member of parliament presiding), followed, but not always, by the actual full name of the person in parenthesis. Theidisa unique number we assigned to each file. Themainchallenge of translating the files from Greek to English was the conversion of the Greek alphabetic numeralstoindo-arabicnumerals.
Pogosim -- a Simulator for Pogobot robots
Cazenille, Leo, Macabre, Loona, Bredeche, Nicolas
Pogobots are a new type of open-source/open-hardware robots specifically designed for swarm robotics research. Their cost-effective and modular design, complemented by vibration-based and wheel-based locomotion, fast infrared communication and extensive software architecture facilitate the implementation of swarm intelligence algorithms. However, testing even simple distributed algorithms directly on robots is particularly labor-intensive. Scaling to more complex problems or calibrate user code parameters will have a prohibitively high strain on available resources. In this article we present Pogosim, a fast and scalable simulator for Pogobots, designed to reduce as much as possible algorithm development costs. The exact same code will be used in both simulation and to experimentally drive real robots. This article details the software architecture of Pogosim, explain how to write configuration files and user programs and how simulations approximate or differ from experiments. We describe how a large set of simulations can be launched in parallel, how to retrieve and analyze the simulation results, and how to optimize user code parameters using optimization algorithms.
Toward Reproducible Cross-Backend Compatibility for Deep Learning: A Configuration-First Framework with Three-Tier Verification
This paper presents a configuration-first framework for evaluating cross-backend compatibility in deep learning systems deployed on CPU, GPU, and compiled runtimes. The framework decouples experiments from code using YAML, supports both library and repository models, and employs a three-tier verification protocol covering tensor-level closeness, activation alignment, and task-level metrics. Through 672 checks across multiple models and tolerance settings, we observe that 72.0% of runs pass, with most discrepancies occurring under stricter thresholds. Our results show that detection models and compiled backends are particularly prone to drift, often due to nondeterministic post-processing. We further demonstrate that deterministic adapters and selective fallbacks can substantially improve agreement without significant performance loss. To our knowledge, this is the first unified framework that systematically quantifies and mitigates cross-backend drift in deep learning, providing a reproducible methodology for dependable deployment across heterogeneous runtimes.
A Proofs
The proof directly follows from Theorem 3.2 from V ergari et al. [75]. Note that O ( |q ||c|) is a loose upperbound and the size of r is in practice smaller [75]. Analogously, the second statement of Theorem 3.1 follows from Proposition A.1 and by recalling For our experiments we use standard compilation tools to obtain a constraint circuit starting from a propositional logical formula in conjunctive normal form. We now illustrate step-by-step one example of such a compilation for a simple logical formula. Deterministic sum units represent disjoint solutions to the logical formula, meaning there exists distinct assignments, characterized by the children, that satisfy the logical constraint e.g.
Supplementary Material
The dataset includes 1,280,918 speech fragments of Greek parliament members in debate order exported from 5,355 parliamentary sitting record files, with a total volume of 2.12 GB. The speeches extend chronologically from July 1989 up to July 2020. Table 1 shows the contents of the dataset. The names of the speakers are provided in the format "last_name patronym first_name (nickname)". In cases with more than one first or last names, the names that belong to the same category (first or last) are connected with a dash, e.g., "merk-ouri stamatiou amalia-maria (melina)". A parliamentary period is defined as the time span between one general election and the next. A parliamentary period includes multiple parliamentary sessions. A session is a time span of usually 10 months within a parliamentary period during which the parliament can convene and function as stipulated by the constitution.
Lobster: A GPU-Accelerated Framework for Neurosymbolic Programming
Biberstein, Paul, Li, Ziyang, Devietti, Joseph, Naik, Mayur
Neurosymbolic programs combine deep learning with symbolic reasoning to achieve better data efficiency, interpretability, and generalizability compared to standalone deep learning approaches. However, existing neurosymbolic learning frameworks implement an uneasy marriage between a highly scalable, GPU-accelerated neural component with a slower symbolic component that runs on CPUs. We propose Lobster, a unified framework for harnessing GPUs in an end-to-end manner for neurosymbolic learning. Lobster maps a general neurosymbolic language based on Datalog to the GPU programming paradigm. This mapping is implemented via compilation to a new intermediate language called APM. The extra abstraction provided by APM allows Lobster to be both flexible, supporting discrete, probabilistic, and differentiable modes of reasoning on GPU hardware with a library of provenance semirings, and performant, implementing new optimization passes. We demonstrate that Lobster programs can solve interesting problems spanning the domains of natural language processing, image processing, program reasoning, bioinformatics, and planning. On a suite of 8 applications, Lobster achieves an average speedup of 5.3x over Scallop, a state-of-the-art neurosymbolic framework, and enables scaling of neurosymbolic solutions to previously infeasible tasks.
SimpleFSDP: Simpler Fully Sharded Data Parallel with torch.compile
Zhang, Ruisi, Liu, Tianyu, Feng, Will, Gu, Andrew, Purandare, Sanket, Liang, Wanchao, Massa, Francisco
Distributed training of large models consumes enormous computation resources and requires substantial engineering efforts to compose various training techniques. This paper presents SimpleFSDP, a PyTorch-native compiler-based Fully Sharded Data Parallel (FSDP) framework, which has a simple implementation for maintenance and composability, allows full computation-communication graph tracing, and brings performance enhancement via compiler backend optimizations. SimpleFSDP's novelty lies in its unique $torch.compile$-friendly implementation of collective communications using existing PyTorch primitives, namely parametrizations, selective activation checkpointing, and DTensor. It also features the first-of-its-kind intermediate representation (IR) nodes bucketing and reordering in the TorchInductor backend for effective computation-communication overlapping. As a result, users can employ the aforementioned optimizations to automatically or manually wrap model components for minimal communication exposure. Extensive evaluations of SimpleFSDP on Llama 3 models (including the ultra-large 405B) using TorchTitan demonstrate up to 28.54% memory reduction and 68.67% throughput improvement compared to the most widely adopted FSDP2 eager framework, when composed with other distributed training techniques.
Characterizing and Efficiently Accelerating Multimodal Generation Model Inference
Lee, Yejin, Sun, Anna, Hosmer, Basil, Acun, Bilge, Balioglu, Can, Wang, Changhan, Hernandez, Charles David, Puhrsch, Christian, Haziza, Daniel, Guessous, Driss, Massa, Francisco, Kahn, Jacob, Wan, Jeffrey, Reizenstein, Jeremy, Zhai, Jiaqi, Isaacson, Joe, Schlosser, Joel, Pino, Juan, Sadagopan, Kaushik Ram, Shamis, Leonid, Ma, Linjian, Hwang, Min-Jae, Chen, Mingda, Elhoushi, Mostafa, Rodriguez, Pedro, Pasunuru, Ram, Yih, Scott, Popuri, Sravya, Liu, Xing, Wu, Carole-Jean
Generative artificial intelligence (AI) technology is revolutionizing the computing industry. Not only its applications have broadened to various sectors but also poses new system design and optimization opportunities. The technology is capable of understanding and responding in multiple modalities. However, the advanced capability currently comes with significant system resource demands. To sustainably scale generative AI capabilities to billions of users in the world, inference must be fast and efficient. This paper pinpoints key system design and optimization opportunities by characterizing a family of emerging multi-modal generation models on real systems. Auto-regressive token generation is a critical latency performance bottleneck, typically dominated by GPU idle time. In addition to memory-intensive attention across the generative AI models, linear operations constitute significant inference latency due to the feed forward networks in Transformer-based models. We demonstrate that state-of-the-art optimization levers, spanning from applications to system software and hardware, set a 3.88x better baseline.