Taming the Tail: Stable LLM Reinforcement Learning via Dynamic Vocabulary Pruning

Li, Yingru, Xu, Jiawei, Liu, Jiacai, Tong, Yuxuan, Li, Ziniu, Cai, Tianle, Zhang, Ge, Liu, Qian, Wang, Baoxiang

Dec-30-2025–arXiv.org Machine Learning

Reinforcement learning for large language models (LLMs) faces a fundamental tension: high-throughput inference engines and numerically-precise training systems produce different probability distributions from the same parameters, creating a training-inference mismatch. We prove this mismatch has an asymmetric effect: the bound on log-probability mismatch scales as $(1-p)$ where $p$ is the token probability. For high-probability tokens, this bound vanishes, contributing negligibly to sequence-level mismatch. For low-probability tokens in the tail, the bound remains large, and moreover, when sampled, these tokens exhibit systematically biased mismatches that accumulate over sequences, destabilizing gradient estimation. Rather than applying post-hoc corrections, we propose constraining the RL objective to a dynamically-pruned ``safe'' vocabulary that excludes the extreme tail. By pruning such tokens, we trade large, systematically biased mismatches for a small, bounded optimization bias. Empirically, our method achieves stable training; theoretically, we bound the optimization bias introduced by vocabulary pruning.

large language model, machine learning, reinforcement learning, (18 more...)

arXiv.org Machine Learning

Dec-30-2025

arXiv.org PDF

Add feedback

Genre:
- Research Report (0.50)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Large Language Model (0.94)
  - Machine Learning > Reinforcement Learning (0.61)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found