Neural Information Processing Systems
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a powerful approach to enhancing the reasoning capabilities of Large Language Models (LLMs), yet its mechanisms are not well understood. In this work, we explore RLVR from the novel perspective of token entropy patterns, analyzing how different tokens influence reasoning performance. By examining token entropy patterns in Chain-of-Thought (CoT) reasoning, we observe that only a small fraction of tokens exhibit high entropy, and that these tokens act as critical forks that steer the model toward diverse reasoning pathways. Furthermore, studying how entropy patterns evolve during RLVR training reveals that RLVR largely adheres to the base model's entropy patterns, primarily adjusting the entropy of high-entropy tokens.
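To make the notion of a high-entropy "fork" token concrete, the following is a minimal sketch, not the authors' implementation. It assumes that token entropy means the Shannon entropy of the model's next-token distribution at each position, and that "high entropy" is defined by a top-fraction cutoff. The function names, the toy logits, and the 0.2 fraction are all hypothetical choices made for illustration.

```python
import math

def token_entropy(logits):
    """Shannon entropy (in nats) of the softmax distribution over one position's logits."""
    m = max(logits)                              # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def high_entropy_positions(entropies, fraction=0.2):
    """Indices of the top `fraction` of positions by entropy.

    The default fraction is an illustrative assumption, not a value from the text.
    """
    k = max(1, math.ceil(fraction * len(entropies)))
    ranked = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
    return sorted(ranked[:k])

# Toy per-position logits over a 4-token vocabulary (made-up numbers).
toy_logits = [
    [10.0, 0.0, 0.0, 0.0],   # near-deterministic continuation
    [1.0, 1.0, 1.0, 1.0],    # uniform: a candidate "fork"
    [8.0, 0.0, 0.0, 0.0],
    [5.0, 4.5, 0.0, 0.0],    # two plausible continuations
    [9.0, 0.0, 0.0, 0.0],
]
entropies = [token_entropy(row) for row in toy_logits]
forks = high_entropy_positions(entropies, fraction=0.2)
```

In this toy sequence, only the uniform position is selected as a fork, while the near-deterministic positions have entropy close to zero. With a real model, the logits would come from the model's output at each generated position.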
Jun-21-2026, 10:02:44 GMT