StatEval: A Comprehensive Benchmark for Large Language Models in Statistics
Lu, Yuchen; Yang, Run; Zhang, Yichen; Yu, Shuguang; Dai, Runpeng; Wang, Ziwei; Xiang, Jiayi; E, Wenxin; Gao, Siran; Ruan, Xinyao; Huang, Yirui; Xi, Chenjing; Hu, Haibo; Fu, Yueming; Yu, Qinglan; Wei, Xiaobing; Gu, Jiani; Sun, Rui; Jia, Jiaxuan; Zhou, Fan
arXiv.org Artificial Intelligence
Large language models (LLMs) have advanced rapidly in recent years (Brown et al., 2020; Touvron et al., 2023), demonstrating remarkable progress in complex reasoning (Guo et al., 2025), fluent text generation, and even automated proof discovery (Yu et al., 2025). These advances have spurred growing adoption of LLMs across education, data science, and research, where they are increasingly used for tutoring, problem explanation, data analysis, and hypothesis formulation (Wu et al., 2021; Polu and Sutskever, 2020; Khan et al., 2023; Gao et al., 2023). However, despite this broad deployment in quantitative domains, statistics, the foundation of modern data-driven science, has received little attention in LLM evaluation.

Statistics differs fundamentally from other quantitative disciplines. Rather than focusing on symbolic manipulation or fixed-form computation, it emphasizes reasoning under uncertainty, connecting probability theory, inference, regression, Bayesian analysis, multivariate methods, and asymptotic theory into a unified framework. Yet existing large-scale LLM evaluations rarely cover these competencies: statistical problems account for less than 3% of recent reasoning benchmarks (Paster et al., 2025), and when included, they are typically treated as isolated probability puzzles without structured categorization or coverage of inferential reasoning (Gao et al., 2024). This gap makes it impossible to rigorously assess whether LLMs can function as capable statisticians or support data-driven scientific discovery.

To bridge this gap, we introduce StatEval, the first large-scale benchmark dedicated to evaluating large language models on statistical reasoning. With nearly 20,000 meticulously curated problems, StatEval covers the entire spectrum of statistics, from basic undergraduate exercises to advanced research-level challenges, and captures the full breadth and depth of the discipline, as illustrated in Figure 1.
Oct-13-2025