MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs
Yingjia Wan, Jingyao Li
–Neural Information Processing Systems
Large language models (LLMs) have shown increasing capability in problem-solving and decision-making, largely through step-by-step chain-of-thought reasoning. However, evaluating these reasoning abilities has become increasingly challenging. Existing outcome-based benchmarks are beginning to saturate and are becoming less effective at tracking meaningful progress. To address this, we present MR-Ben, a process-based benchmark that demands meta-reasoning skill: LLMs are asked to locate and analyse potential errors in automatically generated reasoning steps. Our meta-reasoning paradigm is especially suited to system-2 slow thinking, mirroring the human cognitive process of carefully examining assumptions, conditions, calculations, and logic to identify mistakes. MR-Ben comprises 5,975 questions curated by human experts across a wide range of subjects, including physics, chemistry, logic, coding, and more. Using metrics we designed for assessing meta-reasoning on this benchmark, we identify limitations and weaknesses of current LLMs, both open-source and closed-source.
Mar-27-2025, 10:33:48 GMT
- Country:
- Asia > Middle East
- UAE (0.14)
- Europe
- Austria > Vienna (0.14)
- Middle East > Malta (0.13)
- North America > United States
- California (0.13)
- Minnesota > Hennepin County
- Minneapolis (0.14)
- Texas (0.13)
- Genre:
- Research Report
- Experimental Study (1.00)
- New Finding (1.00)
- Workflow (0.73)
- Industry:
- Education > Curriculum
- Subject-Specific Education (1.00)
- Health & Medicine
- Diagnostic Medicine (1.00)
- Therapeutic Area > Cardiology/Vascular Diseases (1.00)
- Materials > Chemicals (0.67)
- Technology: