Don't Trust: Verify -- Grounding LLM Quantitative Reasoning with Autoformalization

Jin Peng Zhou, Charles Staats, Wenda Li, Christian Szegedy, Kilian Q. Weinberger, Yuhuai Wu

arXiv.org Artificial Intelligence 

Large language models (LLMs), such as Google's Minerva and OpenAI's GPT families, are becoming increasingly capable of solving mathematical quantitative reasoning problems. However, they still make unjustified logical and computational errors in their reasoning steps and answers. In this paper, we leverage the observation that LLMs whose training corpus contains sufficiently many examples of formal mathematics (e.g., in Isabelle, a formal theorem proving environment) can be prompted to translate, i.e., autoformalize, informal mathematical statements into formal Isabelle code, which can be verified automatically for internal consistency. This provides a mechanism to automatically reject solutions whose formalized versions are inconsistent within themselves or with the formalized problem statement. We evaluate our method on the GSM8K, MATH and MultiArith datasets and demonstrate that our approach consistently outperforms vanilla majority voting, the previously best heuristic for identifying correct answers, by more than 12% on GSM8K. In our experiments, it improves results consistently across all datasets and LLM sizes. The code can be found at https://github.com/jinpz/dtv.

Recently, language models (Devlin et al., 2018; Brown et al., 2020; Chowdhery et al., 2022) have advanced significantly in many natural language processing tasks such as machine translation, question answering, and summarization. More recent large language models (LLMs) such as Minerva (Lewkowycz et al., 2022), GPT-3.5 (OpenAI) and GPT-4 (OpenAI, 2023) have also become increasingly capable of solving quantitative reasoning problems, ranging from middle school math word problems (Cobbe et al., 2021) to challenging high school mathematical competition problems (Hendrycks et al., 2021). By training or fine-tuning on high-quality natural language mathematical and scientific text, these LLMs can generate self-contained step-by-step solutions to quantitative reasoning problems without relying on external tools. However, much like humans, LLMs are prone to simple calculation errors and unjustified logical leaps in the solutions they generate. Following Wang et al. (2022) and Lewkowycz et al. (2022), one can sample many proposed solutions, extract the final answer from each, and select the most common answer. While aggregating answers like this improves performance at the problem level, the most common answer is sometimes wrong. Ideally, we would like a better heuristic to identify the correct answer.
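To make the selection heuristic concrete, the sketch below illustrates the "filter, then vote" idea described above: sample solutions are first screened by a formal check and the final answer is chosen by majority vote over the surviving samples. It is a minimal illustration, not the paper's implementation; in particular, `is_verified` is a hypothetical callback standing in for the paper's autoformalization and Isabelle proof-checking pipeline, and the `Final answer:` convention for extracting answers is an assumption.

```python
from collections import Counter


def extract_answer(solution: str) -> str:
    """Pull the final answer out of a sampled step-by-step solution.

    Assumes (hypothetically) that each solution ends with a line of the
    form 'Final answer: <value>'; otherwise the last line is used.
    """
    lines = solution.strip().splitlines()
    for line in reversed(lines):
        if line.lower().startswith("final answer:"):
            return line.split(":", 1)[1].strip()
    return lines[-1].strip() if lines else ""


def select_answer(solutions, is_verified):
    """Majority vote restricted to formally verified solutions.

    `solutions` is a list of sampled natural-language solutions.
    `is_verified` is a placeholder for the autoformalization +
    formal-verification step (e.g. checking the Isabelle translation);
    if no sample verifies, fall back to plain majority voting.
    """
    verified = [s for s in solutions if is_verified(s)]
    pool = verified if verified else solutions
    answers = [extract_answer(s) for s in pool]
    return Counter(answers).most_common(1)[0][0]
```

The fallback to plain majority voting when nothing verifies reflects that the formal check acts purely as a filter over the same pool of sampled solutions, so the method can only narrow, never empty, the set of candidates it votes over.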
