Reasoning Is Not a Race: When Stopping Early Beats Going Deeper

Jun-20-2026, 15:47:59 GMT–Neural Information Processing Systems

We study the use of Process Reward Models (PRMs) for guiding Long Chain-of-Thought (CoT) reasoning in large language models. Although PRMs deliver fine-grained feedback in standard tasks, PRM-guided beam search does not consistently outperform PRM-free approaches in long CoT reasoning. We trace this shortfall to "step quality degradation": the expected step quality shows concave behavior, yielding unimodal or monotonically declining trends. To counteract this, we propose Z-Score Guided Early Stopping (ZGES), which halts search at the detected quality peak using local PRM-reward z-scores. Across multiple math benchmarks and model scales, ZGES outperforms both standard PRM-guided beam search and PRM-free methods.
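The abstract does not spell out the stopping rule, so the following is only a minimal sketch of what z-score based early stopping over PRM rewards could look like. It is not the authors' implementation. The function name `zscore_stop_index`, the `window` and `threshold` parameters, and the rule itself (stop once the current reward falls far below the mean of the preceding window) are all assumptions made for illustration.

```python
from statistics import mean, pstdev
from typing import Sequence


def zscore_stop_index(
    rewards: Sequence[float],
    window: int = 3,          # hypothetical: number of preceding steps used as the local baseline
    threshold: float = -1.5,  # hypothetical: z-score below which a step counts as a quality drop
    eps: float = 1e-8,        # avoids division by zero when the window has no variance
) -> int:
    """Return the index of the last reasoning step to keep.

    Assumed rule (not taken from the paper): scan the per-step PRM rewards
    and stop at the first step whose reward lies more than |threshold|
    standard deviations below the mean of the previous `window` rewards.
    The step just before that drop is treated as the detected quality peak.
    """
    for t in range(window, len(rewards)):
        recent = rewards[t - window:t]
        mu = mean(recent)
        sigma = pstdev(recent)
        z = (rewards[t] - mu) / (sigma + eps)
        if z < threshold:
            return t - 1  # keep everything up to the step before the drop
    return len(rewards) - 1  # no drop detected: keep the full chain


# Made-up rewards that rise, peak at index 3, then decline (unimodal trend).
step_rewards = [0.60, 0.70, 0.80, 0.82, 0.50, 0.40]
print(zscore_stop_index(step_rewards))  # -> 3
```

Using a local window rather than a global mean keeps the baseline close to recent step quality, so a chain that improves steadily is never cut short, while a sharp drop after a peak triggers the stop.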

large language model, machine learning, natural language

Genre:
- Workflow (0.68)
- Research Report
  - Experimental Study (1.00)
  - New Finding (0.93)

Technology:
- Information Technology > Artificial Intelligence
  - Machine Learning (1.00)
  - Representation & Reasoning > Search (0.93)
  - Natural Language > Large Language Model (0.69)
