LongReasonArena: A Long Reasoning Benchmark for Large Language Models
Jiayu Ding, Shuming Ma, Lei Cui, Nanning Zheng, Furu Wei
Existing long-context benchmarks for Large Language Models (LLMs) focus on evaluating comprehension of long inputs, while overlooking the evaluation of long reasoning abilities. To address this gap, we introduce LongReasonArena, a benchmark specifically designed to assess the long reasoning capabilities of LLMs. Our tasks require models to solve problems by executing multi-step algorithms that reflect key aspects of long reasoning, such as retrieval and backtracking. By controlling the inputs, the required reasoning length can be arbitrarily scaled, reaching up to 1 million tokens of reasoning for the most challenging tasks. Extensive evaluation results demonstrate that LongReasonArena presents a significant challenge for both open-source and proprietary LLMs. For instance, DeepSeek-R1 achieves only 7.5% accuracy on our task. Further analysis also reveals that the accuracy exhibits a linear decline with respect to the logarithm of the expected number of reasoning steps. Our code and data are available at https://github.com/LongReasonArena/LongReasonArena.
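The reported trend, accuracy declining linearly in the logarithm of the expected number of reasoning steps, can be illustrated with a small least-squares fit. The sketch below uses hypothetical (steps, accuracy) pairs for illustration only; the paper's actual per-task measurements are not reproduced here.

```python
import numpy as np

# Hypothetical (expected reasoning steps, accuracy) pairs, for
# illustration only; not data from the paper.
steps = np.array([10, 100, 1_000, 10_000, 100_000])
acc = np.array([0.90, 0.72, 0.55, 0.38, 0.20])

# Fit accuracy = a + b * log10(steps), the linear-in-log-steps
# trend the paper's analysis reports. np.polyfit returns the
# coefficients highest degree first: [slope, intercept].
b, a = np.polyfit(np.log10(steps), acc, deg=1)
print(f"accuracy ~ {a:.2f} + ({b:.2f}) * log10(steps)")

# Extrapolate (with the usual caveats) toward the benchmark's
# most demanding regime.
predicted = a + b * np.log10(1_000_000)
print(f"predicted accuracy at 1e6 steps: {predicted:.2f}")
```

Under such a fit, each order-of-magnitude increase in required reasoning steps costs a roughly constant amount of accuracy, which is consistent with the scaling behavior the abstract describes.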
arXiv.org Artificial Intelligence
Aug-28-2025