a797c2d2e0c1fdabf4d1ab8cd0b465c6-Paper-Conference.pdf

Jun-21-2026, 10:02:44 GMT–Neural Information Processing Systems

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a powerful approach to enhancing the reasoning capabilities of Large Language Models (LLMs), while its mechanisms are not yet well understood. In this work, we undertake a pioneering exploration of RLVR through the novel perspective of token entropy patterns, comprehensively analyzing how different tokens influence reasoning performance. By examining token entropy patterns in Chain-of-Thought (CoT) reasoning, we observe that only a small fraction of tokens exhibit high entropy, and these tokens act as critical forks that steer the model toward diverse reasoning pathways. Furthermore, studying how entropy patterns evolve during RLVR training reveals that RLVR largely adheres to the base models entropy patterns, primarily adjusting the entropy of high-entropy tokens.

large language model, machine learning, median, (21 more...)

Neural Information Processing Systems

Jun-21-2026, 10:02:44 GMT

Conferences PDF

Add feedback

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Large Language Model (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (0.47)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found