How Reinforcement Learning After Next-Token Prediction Facilitates Learning
Nikolaos Tsilivis, Eran Malach, Karen Ullrich, Julia Kempe
Recent advances in reasoning domains with neural networks have primarily been enabled by a training recipe that takes Large Language Models, previously trained to predict the next token in a sequence, and optimizes them with reinforcement learning algorithms. We introduce a framework to study the success of this paradigm, and we theoretically expose the optimization mechanisms by which reinforcement learning improves over next-token prediction in this setting. We study learning from mixture distributions of short and long "chain-of-thought" sequences encoding a single task. In particular, when the task consists of predicting the parity of $d$ bits and long sequences are rare, we show how reinforcement learning after next-token prediction enables autoregressive transformers to generalize, whereas next-token prediction alone requires extreme statistical or computational resources to do so. We further explain how reinforcement learning leverages increased test-time computation, manifested in longer responses, to facilitate this learning process. In a simplified setting, we theoretically prove that autoregressive linear models following this training recipe can efficiently learn to predict the parity of $d$ bits as long as the proportion of long demonstrations in the data mix is not exponentially small in the input dimension $d$. Finally, we demonstrate these same phenomena in other settings, including the post-training of Llama-series models on mixture variations of common mathematical reasoning benchmarks.
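As a rough illustration of the data setting described in the abstract, the following minimal sketch builds a mixture of short and long demonstrations for the $d$-bit parity task. The abstract does not specify how sequences are encoded, so the format here is an assumption: a long demonstration lists the running parity after each input bit, and a short demonstration contains only the final answer. The names `parity_example`, `make_mixture`, and `p_long` are hypothetical and chosen for illustration only.

```python
import random


def parity_example(d, p_long, rng):
    """Return (input_bits, completion) for one demonstration.

    Assumed encoding (not taken from the paper):
      - long completion: the running parity after each of the d bits,
        so the last entry is the parity of the whole input;
      - short completion: only the final parity bit.
    """
    bits = [rng.randint(0, 1) for _ in range(d)]

    # Running parities act as the "chain of thought".
    running = []
    acc = 0
    for b in bits:
        acc ^= b
        running.append(acc)

    if rng.random() < p_long:
        completion = running          # long demonstration
    else:
        completion = [running[-1]]    # short demonstration (answer only)
    return bits, completion


def make_mixture(n, d, p_long, seed=0):
    """Sample n demonstrations; p_long is the proportion of long ones."""
    rng = random.Random(seed)
    return [parity_example(d, p_long, rng) for _ in range(n)]


if __name__ == "__main__":
    # With a small p_long, most demonstrations show only the answer.
    for bits, completion in make_mixture(n=5, d=8, p_long=0.2):
        print(bits, "->", completion)
```

Under this assumed encoding, `p_long` plays the role of the proportion of long demonstrations in the data mix; the regime of interest in the abstract is the one where this proportion is small.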
Oct-14-2025
- Country:
  - Africa > Rwanda
  - Asia > India
  - Europe
    - Austria > Vienna (0.14)
    - France > Auvergne-Rhône-Alpes
    - Germany > Berlin (0.04)
    - United Kingdom > England
      - Cambridgeshire > Cambridge (0.04)
  - North America
    - Canada > British Columbia
    - United States
      - California
        - Los Angeles County > Long Beach (0.04)
        - San Diego County > San Diego (0.04)
      - Louisiana > Orleans Parish
        - New Orleans (0.04)
      - New York (0.04)
  - Oceania > Australia
    - New South Wales > Sydney (0.04)
- Genre:
  - Research Report (0.81)
- Industry:
  - Education > Educational Setting (0.46)
- Technology: