Evaluating Large Vision-and-Language Models on Children's Mathematical Olympiads

Neural Information Processing Systems

Recent years have seen significant progress in the general-purpose problem-solving abilities of large vision and language models (LVLMs), such as ChatGPT and Gemini; some of these breakthroughs even seem to enable AI models to outperform human abilities in varied tasks that demand higher-order cognitive skills. Are current large AI models indeed capable of generalized problem solving as humans are? A systematic analysis of AI capabilities for joint vision and text reasoning, however, is missing in the current scientific literature. In this paper, we make an effort towards filling this gap by evaluating state-of-the-art LVLMs on their mathematical and algorithmic reasoning abilities using visuo-linguistic problems from children's Olympiads. Specifically, we consider problems from the Mathematical Kangaroo (MK) Olympiad, a popular international competition targeted at children in grades 1-12 that tests children's deeper mathematical abilities using puzzles appropriately gauged to their age and skills. Using the puzzles from MK, we created a dataset, dubbed SMART-840, consisting of 840 problems from the years 2020-2024. With our dataset, we analyze the power of LVLMs on mathematical reasoning; their responses on our puzzles offer a direct way to compare against those of children. Our results show that modern LVLMs do demonstrate increasingly powerful reasoning skills in solving problems for higher grades, but lack the foundations to correctly answer problems designed for younger children. Further analysis shows that there is no significant correlation between the reasoning capabilities of AI models and those of young children, and that their capabilities appear to be based on a different type of reasoning than the cumulative knowledge that underlies children's mathematical skills.



DOoM: Difficult Olympiads of Math

Kuleshov, Ilya, Ilin, Pavel, Kompanets, Nikolay, Sycheva, Ksenia, Nikolich, Aleksandr

arXiv.org Artificial Intelligence

This paper introduces DOoM, a new open-source benchmark designed to assess the capabilities of language models in solving mathematics and physics problems in Russian. The benchmark includes problems of varying difficulty, ranging from school-level tasks to university Olympiad and entrance exam questions. In this paper, we discuss the motivation behind its creation, describe the dataset's structure and evaluation methodology, and present initial results from testing various models. Analysis of the results shows a correlation between model performance and the number of tokens used, and highlights differences in performance between mathematics and physics tasks.


miniF2F-Lean Revisited: Reviewing Limitations and Charting a Path Forward

Ospanov, Azim, Farnia, Farzan, Yousefzadeh, Roozbeh

arXiv.org Artificial Intelligence

We perform a thorough analysis of the formal and informal statements in the miniF2F benchmark from the perspective of an AI system tasked to participate in a math Olympiad consisting of the problems in miniF2F. In such a setting, the model has to read and comprehend the problems in natural language, formalize them in the Lean language, then proceed with proving the problems; it gets credit for each problem if the formal proof corresponds to the original informal statement presented to the model. Our evaluation results reveal that the best accuracy of such a pipeline is about 36% using the SoTA models in the literature, considerably lower than the individual SoTA accuracies of 97% and 69% reported in the autoformalization and theorem proving literature, respectively. Analyzing the failure modes, we trace a considerable portion of this drop back to discrepancies between the formal and informal statements for more than half of the problems in miniF2F. We proceed with correcting all the errors, discrepancies, and simplifications in the formal and informal statements, and present miniF2F-v2 with fully verified formal and informal statements and proofs. Evaluating the full theorem proving pipeline on miniF2F-v2 leads to a best accuracy of 70%, a significant improvement from the 40% on the original miniF2F, yet still indicating considerable misalignment between autoformalization models and theorem provers. Our deep analysis suggests that a higher-quality benchmark can help the community better evaluate progress in the field of formal reasoning and better diagnose the failure and success modes of autoformalization and theorem proving models. Our dataset is available at https://github.com/roozbeh-yz/miniF2F_v2.


Hard2Verify: A Step-Level Verification Benchmark for Open-Ended Frontier Math

Pandit, Shrey, Xu, Austin, Nguyen, Xuan-Phi, Ming, Yifei, Xiong, Caiming, Joty, Shafiq

arXiv.org Artificial Intelligence

Large language model (LLM)-based reasoning systems have recently achieved gold medal-level performance in the IMO 2025 competition, writing mathematical proofs where, to receive full credit, each step must be not only correct but also sufficiently supported. To train LLM-based reasoners in such challenging, open-ended settings, strong verifiers capable of catching step-level mistakes are necessary prerequisites. We introduce Hard2Verify, a human-annotated, step-level verification benchmark produced with over 500 hours of human labor. Hard2Verify is designed to rigorously assess step-level verifiers at the frontier: verifiers must provide step-level annotations or identify the first error in responses generated by frontier LLMs for very recent, challenging, and open-ended math questions. We evaluate 29 generative critics and process reward models, demonstrating that, beyond a few standouts, open-source verifiers lag closed-source models. We subsequently analyze what drives poor performance in step-level verification, the impacts of scaling verifier compute, as well as fundamental questions such as self-verification and verification-generation dynamics.

[Figure 1: Comparison of models evaluated on both ProcessBench (Zheng et al., 2024a) and our Hard2Verify benchmark. Past benchmarks do not sufficiently evaluate in the frontier-level math settings that Hard2Verify does; on the same error identification task, Qwen2.5-Math-PRM-72B …]

Mathematical reasoning serves as a gold-standard evaluation setting for benchmarking reasoning progress in large language models (LLMs). Over the past half-decade, benchmarks have been introduced to assess LLMs at the grade-school (Cobbe et al., 2021), high-school (Hendrycks et al., 2021), university (Zhang et al., 2023), and competition math levels (MMA, 2025; He et al., 2024a; Gao et al., 2024).
However, the progress of mathematical reasoning ability in LLMs has outpaced benchmark creation, with every subsequent release of a frontier LLM saturating new benchmarks, most recently with GPT-5 Pro achieving 96.5%+ on AIME 2024. As a result, recent efforts (Glazer et al., 2024; Phan et al., 2025) have written novel, unseen mathematical questions to test LLMs. This paradigm requires training data with solutions that are easily verifiable, i.e., solutions that can be easily checked against a known ground truth by string matching or symbolic checkers. Math benchmarks, for the most part, also adopt this verifiable setup, where a model response is considered correct if its final answer matches the established ground truth.



EEFSUVA: A New Mathematical Olympiad Benchmark

Khatibi, Nicole N., Radamovich, Daniil A., Brenner, Michael P.

arXiv.org Artificial Intelligence

Recent breakthroughs have spurred claims that large language models (LLMs) match gold-medal Olympiad to graduate-level proficiency on mathematics benchmarks. In this work, we examine these claims in detail and assess the extent to which current benchmarks capture genuine LLM mathematical reasoning. The composition of these benchmarks, drawing primarily from the International Mathematics Olympiad (IMO) and related competitions, may overstate models' reasoning ability due to potential data contamination and a narrow focus on familiar problem types. To enable a more holistic assessment of mathematical understanding, we introduce EEFSUVA, a novel benchmark curated from under-circulated regional and national Olympiads of Eastern Europe and the countries of the former Soviet Union. These contests feature problems of comparable difficulty to the IMO and are renowned for demanding nonstandard problem-solving techniques, yet their problems are far less prevalent in online corpora. Preliminary results suggest that even state-of-the-art LLMs exhibit a notable performance decline on EEFSUVA relative to other Olympiad-style benchmarks. These findings also suggest the potential importance of broader evaluation datasets for a fuller assessment of mathematical reasoning and for guiding future model development.


RoBiologyDataChoiceQA: A Romanian Dataset for improving Biology understanding of Large Language Models

Ghinea, Dragos-Dumitru, Corbeanu, Adela-Nicoleta, Dumitran, Adrian-Marius

arXiv.org Artificial Intelligence

In recent years, large language models (LLMs) have demonstrated significant potential across various natural language processing (NLP) tasks. However, their performance in domain-specific applications and non-English languages remains less explored. This study introduces a novel Romanian-language dataset for multiple-choice biology questions, carefully curated to assess LLM comprehension and reasoning capabilities in scientific contexts. Containing approximately 14,000 questions, the dataset provides a comprehensive resource for evaluating and improving LLM performance in biology. We benchmark several popular LLMs, analyzing their accuracy, reasoning patterns, and ability to understand domain-specific terminology and linguistic nuances. Additionally, we perform comprehensive experiments to evaluate the impact of prompt engineering, fine-tuning, and other optimization techniques on model performance. Our findings highlight both the strengths and limitations of current LLMs in handling specialized knowledge tasks in low-resource languages, offering valuable insights for future research and development.


Leveraging Generative AI for Enhancing Automated Assessment in Programming Education Contests

Dascalescu, Stefan, Dumitran, Adrian Marius, Vasiluta, Mihai Alexandru

arXiv.org Artificial Intelligence

Competitive programming contests play a crucial role in cultivating computational thinking and algorithmic skills among learners. However, generating comprehensive test cases to effectively assess programming solutions remains resource-intensive and challenging for educators. This paper introduces an innovative NLP-driven method leveraging generative AI (large language models) to automate the creation of high-quality test cases for competitive programming assessments. We extensively evaluated our approach on diverse datasets, including 25 years of Romanian Informatics Olympiad (OJI) data for 5th graders, recent competitions hosted on the Kilonova.ro platform, and the International Informatics Olympiad in Teams (IIOT). Our results demonstrate that AI-generated test cases substantially enhanced assessments, notably identifying previously undetected errors in 67% of the OJI 5th grade programming problems. These improvements underscore the complementary educational value of our technique in formative assessment contexts. By openly sharing our prompts, translated datasets, and methodologies, we offer practical NLP-based tools that educators and contest organizers can readily integrate to enhance assessment quality, reduce workload, and deepen insights into learner performance.


AetherCode: Evaluating LLMs' Ability to Win In Premier Programming Competitions

Wang, Zihan, Chen, Jiaze, Liu, Zhicheng, Mak, Markus, Du, Yidi, Moon, Geonsik, Xu, Luoqi, Tua, Aaron, Peng, Kunshuo, Lu, Jiayi, Xia, Mingfei, Zou, Boqian, Ran, Chenyang, Tian, Guang, Zhu, Shoutai, Duan, Yeheng, Kang, Zhenghui, Lin, Zhenxing, Li, Shangshu, Luo, Qiang, Long, Qingshen, Chen, Zhiyong, Xiao, Yihan, Wu, Yurong, Zan, Daoguang, Fu, Yuyi, Wang, Mingxuan, Ding, Ming

arXiv.org Artificial Intelligence

Competitive programming has emerged as a critical benchmark for evaluating the reasoning and coding capabilities of Large Language Models (LLMs). Despite impressive progress on existing benchmarks, we argue that current evaluations overstate model proficiency, masking a substantial gap between LLMs and elite human programmers. This gap arises from two key limitations: insufficient difficulty and scope of benchmark problems, and evaluation bias from low-quality test cases. To address these shortcomings, we present AetherCode, a new benchmark that draws problems from premier programming competitions such as IOI and ICPC, offering broader coverage and higher difficulty. AetherCode further incorporates comprehensive, expert-validated test suites built through a hybrid of automated generation and human curation, ensuring rigorous and reliable assessment. By combining challenging problem design with robust evaluation, AetherCode provides a more faithful measure of LLM capabilities and sets a new standard for future research in code reasoning.