Mitigating Forgetting Between Supervised and Reinforcement Learning Yields Stronger Reasoners

Yuan, Xiangchi, Chen, Xiang, Yu, Tong, Shi, Dachuan, Jin, Can, Lee, Wenke, Mitra, Saayan

arXiv.org Artificial Intelligence 

Large Language Models (LLMs) show strong reasoning abilities, often amplified by Chain-of-Thought (CoT) prompting and reinforcement learning (RL). Although RL algorithms can substantially improve reasoning, they struggle to expand reasoning boundaries because they learn from their own reasoning trajectories rather than acquiring external knowledge. Supervised fine-tuning (SFT) offers complementary benefits but typically requires large-scale data and risks overfitting. Recent attempts to combine SFT and RL face three main challenges: data inefficiency, algorithm-specific designs, and catastrophic forgetting. We propose a plug-and-play framework that dynamically integrates SFT into RL by selecting challenging examples for SFT. This approach reduces SFT data requirements and remains agnostic to the choice of RL or SFT algorithm. To mitigate catastrophic forgetting of RL-acquired skills during SFT, we select high-entropy tokens for loss calculation and freeze parameters identified as critical for RL. Our method achieves state-of-the-art (SoTA) reasoning performance using only 1.5% of the SFT data and 20.4% of the RL data used by prior SoTA, providing an efficient and plug-and-play solution for combining SFT and RL in reasoning post-training.

Recent Large Language Models (LLMs) have shown strong reasoning capability (Jaech et al., 2024; Guo et al., 2025; Anthropic, 2025). This capability depends heavily on the Chain-of-Thought (CoT) thinking pattern, which is trained by supervised fine-tuning (SFT) or reinforcement learning (RL). Although popular RL algorithms such as PPO (Schulman et al., 2017), GRPO (Guo et al., 2025), and DAPO (Yu et al., 2025) are promising in multiple reasoning tasks, recent studies argue that RL training does not truly extend a model's reasoning boundaries.
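To make the abstract's idea of computing the SFT loss only on high-entropy tokens concrete, the following is a minimal sketch, not the authors' implementation. The text above does not state the selection rule, so the sketch assumes a simple one: keep the top fraction of token positions ranked by predictive entropy. The function names and the `keep_fraction` parameter are hypothetical, and plain Python stands in for a tensor library.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one token's vocabulary logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def token_entropy(probs):
    """Shannon entropy (in nats) of one predictive distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def high_entropy_masked_loss(logits_per_token, target_ids, keep_fraction=0.5):
    """Mean cross-entropy over only the highest-entropy token positions.

    keep_fraction is an assumed hyperparameter for illustration; the
    source does not specify how high-entropy tokens are chosen.
    """
    probs = [softmax(row) for row in logits_per_token]
    entropies = [token_entropy(p) for p in probs]
    n_keep = max(1, math.ceil(keep_fraction * len(entropies)))
    # Rank positions from most to least uncertain and keep the top n_keep.
    ranked = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
    kept = sorted(ranked[:n_keep])
    # Low-entropy (confident) positions contribute nothing to the loss.
    loss = sum(-math.log(probs[i][target_ids[i]]) for i in kept) / len(kept)
    return loss, kept

# Toy sequence of 4 tokens over a 3-word vocabulary.
logits = [[5.0, 0.0, 0.0], [0.1, 0.0, 0.2], [0.0, 4.0, 0.0], [1.0, 1.0, 1.0]]
targets = [0, 2, 1, 0]
loss, kept = high_entropy_masked_loss(logits, targets, keep_fraction=0.5)
print(kept, round(loss, 3))
```

In this toy input, positions 0 and 2 are confident predictions and are masked out, so only the two near-uniform positions drive the update. In a real training loop, the same mask would multiply a per-token loss tensor before reduction. The parameter freezing described in the abstract would be handled separately, for example by disabling gradients on the selected weights.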