U-MATH: A University-Level Benchmark for Evaluating Mathematical Skills in LLMs

Konstantin Chernyshev, Vitaliy Polshkov, Ekaterina Artemova, Alex Myasnikov, Vlad Stepanov, Alexei Miasnikov, Sergei Tilga

arXiv.org Artificial Intelligence 

The current evaluation of mathematical skills in LLMs is limited, as existing benchmarks are either relatively small, focus primarily on elementary and high-school problems, or lack topical diversity. Additionally, the inclusion of visual elements in tasks remains largely under-explored. To address these gaps, we introduce U-MATH, a novel benchmark of 1,100 unpublished open-ended university-level problems sourced from teaching materials. It is balanced across six core subjects, with 20% of the problems being multimodal. Given the open-ended nature of U-MATH problems, we employ an LLM to judge the correctness of generated solutions. To this end, we release µ-MATH, a dataset for evaluating LLMs' capabilities in judging solutions. The evaluation of general-domain, math-specific, and multimodal LLMs highlights the challenges presented by U-MATH. Our findings reveal that LLMs achieve a maximum accuracy of only 63% on text-based tasks and an even lower 45% on visual problems. Solution assessment also proves challenging for LLMs, with the best LLM judge reaching an F1-score of 80% on µ-MATH.

Mathematical reasoning is a fundamental domain for assessing the true reasoning capabilities of Large Language Models (LLMs) (Ahn et al., 2024). While existing benchmarks like GSM8K (Cobbe et al., 2021) and MATH (Hendrycks et al., 2021) provide valuable insights, they primarily focus on school-level mathematics. This leaves a significant gap in understanding how LLMs perform on more advanced, university-level problems.
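As a rough illustration of the judge meta-evaluation described above, the sketch below shows how an LLM judge's binary correct/incorrect verdicts could be scored against gold human labels with an F1-score. The labels and data here are purely hypothetical placeholders, not the paper's actual µ-MATH protocol or prompt, and are included only to make the reported metric concrete.

    # Minimal sketch (hypothetical data): scoring an LLM judge's verdicts
    # against human-annotated gold labels, where 1 = solution judged correct
    # and 0 = solution judged incorrect.
    from sklearn.metrics import f1_score

    gold_labels  = [1, 0, 1, 1, 0, 0, 1, 0]   # human annotations of solution correctness
    judge_labels = [1, 0, 0, 1, 0, 1, 1, 0]   # verdicts parsed from the LLM judge's output

    print(f"F1 on the positive (correct) class: {f1_score(gold_labels, judge_labels):.2f}")
    print(f"Macro F1 over both classes: {f1_score(gold_labels, judge_labels, average='macro'):.2f}")

Which F1 variant is reported for µ-MATH is not specified in this excerpt; the snippet simply demonstrates both common choices.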