ReSpec: Towards Optimizing Speculative Decoding in Reinforcement Learning Systems
Qiaoling Chen, Zijun Liu, Peng Sun, Shenggui Li, Guoteng Wang, Ziming Liu, Yonggang Wen, Siyuan Feng, Tianwei Zhang
arXiv.org Artificial Intelligence
We identify three critical gaps that hinder the naïve integration of speculative decoding (SD) into reinforcement learning (RL) systems: (G1) diminishing speedups at large batch sizes, (G2) drafter staleness under continual actor updates, and (G3) drafter-induced policy degradation. Among the stages of RL training, generation is consistently the dominant bottleneck (Zhong et al., 2025a). Classic policy optimization methods such as PPO (Schulman et al., 2015; 2017) combine trajectory-level rewards with policy updates, making generation cost central to end-to-end throughput. A natural optimization to address this bottleneck is speculative decoding (Leviathan et al., 2023; Chen et al., 2023), which has already been widely adopted in LLM serving systems (e.g., SGLang (Zheng et al., 2024)). Among the various SD variants, EAGLE-3 (Li et al., 2025) represents the current state of the art. (Figure: Overview of SD in RL training and our proposed ReSpec system.) However, a single static SD configuration cannot provide reliable speedups across diverse RL workloads, as shown in Figure 3. As the actor (i.e., target) model evolves with each policy update, a fixed drafter rapidly becomes misaligned with the actor (G2). Moreover, this variance increases with drafter staleness (G2), causing a higher ratio of impoverished trajectories.
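As an illustrative sketch only (not the paper's implementation), the draft-and-verify loop at the heart of SD can be written in a few lines. The greedy-matching acceptance rule below is a simplification of the probabilistic accept/reject test of Leviathan et al. (2023), and all function names are hypothetical; it also shows why drafter staleness (G2) hurts: a drafter misaligned with the target has its proposals rejected, shrinking the speedup.

```python
def speculative_decode(target, drafter, prefix, k=4, max_new=16):
    """Toy greedy speculative decoding (illustrative, not ReSpec's code).

    The drafter proposes k tokens per round; the target verifies them.
    The first mismatch is replaced by the target's own greedy token and
    the remainder of the draft is discarded. Returns the generated
    sequence and the fraction of drafted tokens that were accepted.
    `target(ctx)` and `drafter(ctx)` stand in for greedy next-token calls.
    """
    out = list(prefix)
    accepted = proposed = 0
    while len(out) - len(prefix) < max_new:
        # Drafter autoregressively proposes k candidate tokens.
        draft, ctx = [], list(out)
        for _ in range(k):
            t = drafter(ctx)
            draft.append(t)
            ctx.append(t)
        proposed += len(draft)
        # Target verifies the draft; a mismatch yields a correction token.
        for t in draft:
            if target(out) == t:
                out.append(t)
                accepted += 1
            else:
                out.append(target(out))
                break
    return out[:len(prefix) + max_new], accepted / proposed
```

With an aligned drafter every proposal is accepted; a "stale" drafter produces the same final output (verification preserves the target's greedy sequence) but at a far lower acceptance rate, mirroring the staleness gap.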
Oct-31-2025