Algorithmic Thinking Theory

Bateni, MohammadHossein, Cohen-Addad, Vincent, Gu, Yuzhou, Lattanzi, Silvio, Meierhans, Simon, Mohri, Christopher

arXiv.org Artificial Intelligence 

Initial challenges, such as grade-school mathematics (GSM8K) and standard competition math (MATH dataset), have largely been surmounted, pushing the frontier of AI reasoning toward "grand challenge" problems, such as those found in the International Mathematical Olympiad (IMO). These problems, renowned for their demand for deep insight, creativity, and rigorous proof, expose a fascinating weakness in modern LLMs. While a model's performance on a single attempt (termed pass@1) may be very low, its ability to produce a correct answer within k attempts (pass@k) can be significantly higher. This pass@1 versus pass@k gap, especially pronounced when sampling with high temperature to produce diverse outputs, suggests that models possess a vast, latent capability that is not accessible in a single, high-confidence generation. Interestingly, to recover the full power of the model it is not sufficient to simply use multiple attempts. In fact, even the pass@k metric fails to capture the full story. On the most difficult problems, simply sampling k times and selecting the best answer (e.g., "best-of-32") still yields poor results. For instance, Huang and Yang (2025) report that a best-of-32 baseline on the IMO 2025 problems achieved an accuracy of only 31.6-38.1% for leading models [HY25]. This paradox lies at the heart of our work: the latent capability of LLMs is not merely a matter of selection (finding one correct needle in a haystack of k attempts), but one of synthesis.

Duplicate Docs Excel Report

Title
None found

Similar Docs  Excel Report  more

TitleSimilaritySource
None found