S1-Bench: A Simple Benchmark for Evaluating System 1 Thinking Capability of Large Reasoning Models
Wenyuan Zhang, Shuaiyi Nie, Xinghua Zhang, Zefeng Zhang, Tingwen Liu
arXiv.org Artificial Intelligence
We introduce S1-Bench, a novel benchmark designed to evaluate the performance of Large Reasoning Models (LRMs) on simple tasks that favor intuitive system 1 thinking rather than deliberative system 2 reasoning. While LRMs have achieved significant breakthroughs in complex reasoning tasks through explicit chains of thought, their heavy reliance on system 2 thinking may limit their system 1 thinking capabilities. However, an appropriate benchmark for evaluating LRMs' system 1 thinking capabilities has been lacking. To fill this gap, S1-Bench introduces a suite of simple, diverse, and natural questions across multiple domains and languages, specifically designed to assess LRMs' performance on questions better suited to system 1. We conduct extensive evaluations across 28 LRMs, revealing their inefficiency, inadequate accuracy, and limited robustness when handling simple questions. Additionally, we observe a gap between their difficulty perception and generation length. Overall, this work paves the way toward dual-system compatibility in the development of LRMs.
May-28-2025