How Reinforcement Learning After Next-Token Prediction Facilitates Learning

Tsilivis, Nikolaos, Malach, Eran, Ullrich, Karen, Kempe, Julia

Oct-14-2025–arXiv.org Machine Learning

Recent advances in reasoning domains with neural networks have primarily been enabled by a training recipe that optimizes Large Language Models, previously trained to predict the next-token in a sequence, with reinforcement learning algorithms. We introduce a framework to study the success of this paradigm, and we theoretically expose the optimization mechanisms by which reinforcement learning improves over next-token prediction in this setting. We study learning from mixture distributions of short and long ``chain-of-thought'' sequences encoding a single task. In particular, when the task consists of predicting the parity of $d$ bits and long sequences are rare, we show how reinforcement learning after next-token prediction enables autoregressive transformers to generalize, whereas mere next-token prediction requires extreme statistical or computational resources to do so. We further explain how reinforcement learning leverages increased test-time computation, manifested in longer responses, to facilitate this learning process. In a simplified setting, we theoretically prove that autoregressive linear models following this training recipe can efficiently learn to predict the parity of $d$ bits as long as the proportion of long demonstrations in the data mix is not exponentially small in the input dimension $d$. Finally, we demonstrate these same phenomena in other settings, including the post-training of Llama-series models on mixture variations of common mathematical reasoning benchmarks.

artificial intelligence, machine learning, reinforcement learning, (18 more...)

arXiv.org Machine Learning

Oct-14-2025

arXiv.org PDF

Add feedback

Country:
- Oceania > Australia
  - New South Wales > Sydney (0.04)
- North America
  - United States
    - New York (0.04)
    - Louisiana > Orleans Parish
      - New Orleans (0.04)
    - California
      - San Diego County > San Diego (0.04)
      - Los Angeles County > Long Beach (0.04)
  - Canada > British Columbia
    - Metro Vancouver Regional District > Vancouver (0.04)
- Europe
  - Austria > Vienna (0.14)
  - Germany > Berlin (0.04)
  - United Kingdom > England
    - Cambridgeshire > Cambridge (0.04)
  - France > Auvergne-Rhône-Alpes
    - Lyon > Lyon (0.04)
- Asia > India
  - Karnataka > Bengaluru (0.04)
- Africa > Rwanda
  - Kigali > Kigali (0.04)

Genre:
- Research Report (0.81)

Industry:
- Education > Educational Setting (0.46)

Technology:
- Information Technology > Artificial Intelligence > Machine Learning
  - Reinforcement Learning (1.00)
  - Neural Networks > Deep Learning (1.00)