MiniF2F: a cross-system benchmark for formal Olympiad-level mathematics

Kunhao Zheng, Jesse Michael Han, Stanislas Polu

arXiv.org Artificial Intelligence 

Shared benchmarks and datasets have historically played a crucial role in driving advances in large-scale applications of deep learning, e.g., in computer vision [6] and natural language processing [19, 13, 11]. Neural theorem proving is a rapidly developing area that aims to apply techniques from deep learning to interactive theorem proving. To date, most contributions in this area have focused on individual theorem proving systems, each with a separately implemented mathematics library and with results reported on a dataset-specific test split; examples include the HOList [2], CoqGym [24], and LeanStep [7] theorem proving environments and benchmarks.

However, benchmarks from this paradigm are not ideal for measuring the mathematical reasoning ability of neural theorem provers, for several reasons. Library-specific train/test splits are siloed by construction, are tied to how theorems and lemmas are split within each library, and as such are not directly comparable across systems. Moreover, formal mathematics libraries are closer to software repositories than to informal mathematical exposition, and many of their lemmas are implementation-specific artifacts without precise informal counterparts or cross-system translations. To date, the neural theorem proving community has not organized its efforts around a cross-system benchmark.

To address this need and to provide a common resource for research groups working on formal theorem proving, we present miniF2F, a unified cross-system benchmark of formal mathematics of progressively increasing difficulty, centered on Olympiad-level problem statements (AMC, AIME, IMO) as well as material from high-school and undergraduate maths classes.
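To make the kind of entry miniF2F targets concrete, the sketch below shows what a formalized competition-style problem statement could look like in Lean with mathlib. The theorem name, the particular problem, and the proof script are illustrative assumptions for this example, not text taken from the benchmark; in a cross-system benchmark only the statement needs to be shared across systems, and the short proof is included here just to make the snippet self-contained.

```lean
-- A hypothetical miniF2F-style entry, written in Lean 3 / mathlib syntax.
-- The name, the specific problem, and the proof are illustrative, not
-- copied from the actual benchmark files.
import data.real.basic
import tactic

theorem amc_style_example (a b : ℝ) (h₁ : a + b = 10) (h₂ : a - b = 4) :
  a * b = 21 :=
begin
  have ha : a = 7, { linarith },  -- solve the linear system for a
  have hb : b = 3, { linarith },  -- and for b
  rw [ha, hb],                    -- reduce the goal to 7 * 3 = 21
  norm_num,                       -- close by numeric evaluation
end
```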