Benchmarking Large Language Models via Random Variables

Hong, Zijin, Wu, Hao, Dong, Su, Dong, Junnan, Xiao, Yilin, Zhang, Yujing, Wang, Zhu, Huang, Feiran, Li, Linyi, Yang, Hongxia, Huang, Xiao

Feb-17-2025–arXiv.org Artificial Intelligence

Recent studies have raised concerns about the reliability of current mathematical benchmarks, highlighting issues such as simplistic design and potential data contamination. Therefore, creating a reliable benchmark that effectively evaluates the genuine capabilities of large language models (LLMs) in mathematical reasoning remains a significant challenge. To address this, we propose RV-Bench, a framework for Benchmarking LLMs via Random Variables in mathematical reasoning. Specifically, the background content of a random variable question (RV question) mirrors the original problem in existing benchmarks, but the variable combinations are randomized, making it "unseen" by the LLMs. Models must completely understand the question pattern of the original problem to correctly answer RV questions with various variable values. As a result, the LLM's genuine capability in mathematical reasoning is reflected by its accuracy and robustness on RV-Bench. We conducted extensive experiments on over 30 representative LLMs across more than 1000 RV questions. Our findings suggest that LLMs exhibit an imbalance in proficiency between encountered and "unseen" data domains. Proficiency generalization across similar mathematical reasoning tasks is verified to be limited by accuracy and robustness, but it can still be enhanced through test-time scaling.

artificial intelligence, large language model, random variable, (1 more...)

arXiv.org Artificial Intelligence

Feb-17-2025

arXiv.org Web Page

Add feedback

Genre:
- Research Report > New Finding (0.53)

Technology:
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)