Collaborating Authors: Han, Jesse Michael


Formal Mathematics Statement Curriculum Learning

arXiv.org Artificial Intelligence

We explore the use of expert iteration in the context of language modeling applied to formal mathematics. We show that, at the same compute budget, expert iteration, by which we mean proof search interleaved with learning, dramatically outperforms proof search alone. We also observe that when applied to a collection of formal statements of sufficiently varied difficulty, expert iteration is capable of finding and solving a curriculum of increasingly difficult problems, without the need for associated ground-truth proofs. Finally, by applying this expert iteration to a manually curated set of problem statements, we achieve state-of-the-art on the miniF2F benchmark, automatically solving multiple challenging problems drawn from high school olympiads.
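The abstract's expert-iteration loop can be sketched as follows. This is a minimal illustration, not the paper's implementation: `prove` (model-guided proof search) and `learn` (fine-tuning on found proofs) are hypothetical stand-ins supplied by the caller.

```python
def expert_iteration(statements, prove, learn, model, rounds=3):
    """Alternate proof search and learning: each round, attempt every
    unsolved statement with the current model, then update the model on
    the newly found proofs. No ground-truth proofs are required, so a
    varied pool of statements yields an implicit curriculum: easy
    problems are solved first, and training on their proofs unlocks
    harder ones in later rounds."""
    solved = {}
    for _ in range(rounds):
        new_proofs = []
        for s in statements:
            if s in solved:
                continue
            proof = prove(model, s)   # proof search guided by the model
            if proof is not None:
                solved[s] = proof
                new_proofs.append((s, proof))
        if not new_proofs:            # nothing new found: stop early
            break
        model = learn(model, new_proofs)  # fine-tune on fresh proofs
    return model, solved
```

With stub `prove`/`learn` functions where the model is a scalar "skill" that grows with each batch of solved proofs, the loop visibly climbs a difficulty ladder one rung per round.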


Unsupervised Neural Machine Translation with Generative Language Models Only

arXiv.org Artificial Intelligence

We show how to derive state-of-the-art unsupervised neural machine translation systems from generatively pre-trained language models. Our method consists of three steps: few-shot amplification, distillation, and backtranslation. We first use the zero-shot translation ability of large pre-trained language models to generate translations for a small set of unlabeled sentences. We then amplify these zero-shot translations by using them as few-shot demonstrations for sampling a larger synthetic dataset. This dataset is distilled by discarding the few-shot demonstrations and then fine-tuning. During backtranslation, we repeatedly generate translations for a set of inputs and then fine-tune a single language model on both directions of the translation task at once, ensuring cycle-consistency by swapping the roles of gold monotext and generated translations when fine-tuning. By using our method to leverage GPT-3's zero-shot translation capability, we achieve a new state-of-the-art in unsupervised translation on the WMT14 English-French benchmark, attaining a BLEU score of 42.1.
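The backtranslation step described above can be sketched as a loop over both translation directions with the role swap made explicit. All names here (`translate`, `finetune`) are hypothetical placeholders for the sampling and fine-tuning machinery, not the paper's API.

```python
def backtranslate(model, translate, finetune, en_mono, fr_mono, rounds=2):
    """One model serves both directions. Each round: generate synthetic
    translations of the gold monotext, then fine-tune on both directions
    at once, always placing the gold text on the *target* side so the
    model is trained to map its own (noisy) outputs back to real
    sentences (cycle-consistency)."""
    for _ in range(rounds):
        synth_fr = [translate(model, "en->fr", s) for s in en_mono]
        synth_en = [translate(model, "fr->en", s) for s in fr_mono]
        pairs = (
            # generated source -> gold target, in both directions
            [("fr->en", f, e) for e, f in zip(en_mono, synth_fr)]
            + [("en->fr", e, f) for f, e in zip(fr_mono, synth_en)]
        )
        model = finetune(model, pairs)
    return model
```

The key design point is the swap: English monotext only ever appears as a French-to-English *target*, and vice versa, so the model never trains to reproduce its own generations verbatim.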


MiniF2F: a cross-system benchmark for formal Olympiad-level mathematics

arXiv.org Artificial Intelligence

Shared benchmarks and datasets have historically played a crucial role in driving advances in large-scale applications of deep learning, e.g. in computer vision ([6]) and natural language processing ([19, 13, 11]). Neural theorem proving is a rapidly developing area which aims to apply techniques from deep learning to interactive theorem proving. To date, most contributions in this area have focused on individual theorem proving systems, each with a separately-implemented mathematics library and with results reported on a dataset-specific test split; examples include the HOList [2], CoqGym [24] and LeanStep [7] theorem proving environments and benchmarks. However, benchmarks from this paradigm are not ideal for measuring the mathematical reasoning ability of neural theorem provers for several reasons. Library-specific train/test splits are siloed by construction, dependent on how theorems and lemmas are split in these libraries, and as such are not directly comparable across systems. Moreover, formal mathematics libraries are closer to software repositories than informal mathematical exposition, and many lemmas are implementation-specific artifacts without precise informal mathematical or cross-system translations. To date, the neural theorem proving community has not organized its efforts around a cross-system benchmark. To address this need and to provide a common resource to research groups working on formal theorem proving, we present miniF2F, a unified cross-system benchmark of formal mathematics of progressively increasing difficulty, centering around Olympiad-level problem statements (AMC, AIME, IMO) as well as high-school and undergraduate maths classes.


Proof Artifact Co-training for Theorem Proving with Language Models

arXiv.org Artificial Intelligence

Labeled data for imitation learning of theorem proving in large libraries of formalized mathematics is scarce as such libraries require years of concentrated effort by human specialists to be built. This is particularly challenging when applying large Transformer language models to tactic prediction, because the scaling of performance with respect to model size is quickly disrupted in the data-scarce, easily-overfitted regime. We propose PACT (Proof Artifact Co-Training), a general methodology for extracting abundant self-supervised data from kernel-level proof terms for co-training alongside the usual tactic prediction objective. We apply this methodology to Lean, an interactive proof assistant which hosts some of the most sophisticated formalized mathematics to date. We instrument Lean with a neural theorem prover driven by a Transformer language model and show that PACT improves theorem proving success rate on a held-out suite of test theorems from 32% to 48%.
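One way to picture co-training on the extracted data is as a mixed batch stream: each training batch combines the scarce tactic-prediction examples with abundant self-supervised examples derived from proof terms. The sketch below is an illustrative batching scheme with hypothetical names, not PACT's actual training pipeline.

```python
import random

def cotrain_batches(tactic_data, aux_tasks, mix_ratio=0.5, batch_size=4, seed=0):
    """Interleave scarce tactic-prediction data with abundant
    self-supervised examples (e.g. ones extracted from kernel-level
    proof terms). mix_ratio is the fraction of each batch drawn from
    the auxiliary tasks; the rest comes from the primary objective."""
    rng = random.Random(seed)
    aux_pool = [ex for task in aux_tasks for ex in task]  # flatten tasks
    while True:
        n_aux = int(batch_size * mix_ratio)
        batch = (rng.sample(aux_pool, n_aux)
                 + rng.sample(tactic_data, batch_size - n_aux))
        rng.shuffle(batch)  # avoid a fixed aux-first ordering
        yield batch
```

The auxiliary pool can be far larger than the tactic data, which is the point: the co-training signal regularizes the model in exactly the data-scarce regime where tactic prediction alone overfits.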


Universal Policies for Software-Defined MDPs

arXiv.org Artificial Intelligence

We introduce a new programming paradigm called oracle-guided decision programming in which a program specifies a Markov Decision Process (MDP) and the language provides a universal policy. We prototype a new programming language, Dodona, that manifests this paradigm using a primitive 'choose' representing nondeterministic choice. The Dodona interpreter returns either a value or a choicepoint that includes a lossless encoding of all information necessary in principle to make an optimal decision. Meta-interpreters query Dodona's (neural) oracle on these choicepoints to get policy and value estimates, which they can use to perform heuristic search on the underlying MDP. We demonstrate Dodona's potential for zero-shot heuristic guidance by meta-learning over hundreds of synthetic tasks that simulate basic operations over lists, trees, Church datastructures, polynomials, first-order terms and higher-order terms.
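The 'choose' primitive and choicepoint mechanism can be illustrated with a tiny interpreter. This is a toy sketch of the paradigm, not Dodona itself: a program is a Python generator that yields its options at each choicepoint, and a `policy` function stands in for the neural oracle.

```python
def run(program, policy):
    """Oracle-guided decision programming in miniature: the program
    yields a list of options at each 'choose'; the policy (standing in
    for the oracle's choicepoint query) picks one, which is sent back
    into the program. The generator's return value is the result."""
    gen = program()
    try:
        options = next(gen)                 # advance to first choicepoint
        while True:
            options = gen.send(policy(options))  # resolve it, get the next
    except StopIteration as stop:
        return stop.value                   # program finished: its result

def pick_two():
    """Example program with two nondeterministic choices."""
    a = yield [1, 2, 3]   # 'choose' a
    b = yield [1, 2, 3]   # 'choose' b
    return a + b
```

Different policies plugged into the same program yield different trajectories through the underlying MDP; a meta-interpreter could instead branch on the options to perform heuristic search rather than committing greedily.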


Enhancing SAT solvers with glue variable predictions

arXiv.org Artificial Intelligence

Modern SAT solvers routinely operate at scales that make it impractical to query a neural network for every branching decision. NeuroCore, proposed by Selsam and Bjørner, offered a proof-of-concept that neural networks can still accelerate SAT solvers by only periodically refocusing a score-based branching heuristic. However, that work suffered from several limitations: their modified solvers require GPU acceleration, further ablations showed that they were no better than a random baseline on the SATCOMP 2018 benchmark, and their training target of unsat cores required an expensive data pipeline which only labels relatively easy unsatisfiable problems. We address all these limitations, using a simpler network architecture allowing CPU inference for even large industrial problems with millions of clauses, and training instead to predict glue variables, a target for which it is easier to generate labelled data, and which can also be formulated as a reinforcement learning task. We demonstrate the effectiveness of our approach by modifying the state-of-the-art SAT solver CaDiCaL, improving its performance on SATCOMP 2018 and SATRACE 2019 with supervised learning and its performance on a dataset of SHA-1 preimage attacks with reinforcement learning.
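The "periodic refocusing" idea, querying the network rarely and letting the solver's own score-based heuristic do the per-decision work, can be sketched as two small functions. This is an illustrative simplification with hypothetical names, not the CaDiCaL integration.

```python
def refocus_scores(scores, predicted_vars, bump=1.0):
    """Periodic refocusing: bump the branching scores of variables the
    network flags (here, predicted glue variables). Called rarely, so a
    single (even CPU-only) inference is amortized over many branching
    decisions made by the ordinary heuristic."""
    for v in predicted_vars:
        scores[v] = scores.get(v, 0.0) + bump
    return scores

def pick_branch_variable(scores, assigned):
    """Standard score-based branching between network queries: choose
    the highest-scored currently unassigned variable."""
    free = {v: s for v, s in scores.items() if v not in assigned}
    return max(free, key=free.get) if free else None
```

Between refocusing calls the solver branches exactly as before; the network merely tilts the score order toward the flagged variables, which then decays back under the solver's usual score dynamics.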