Proximal Supervised Fine-Tuning

Zhu, Wenhong, Xie, Ruobing, Wang, Rui, Sun, Xingwu, Wang, Di, Liu, Pengfei

arXiv.org Artificial Intelligence 

Supervised fine-tuning (SFT) of foundation models often leads to poor generalization, where prior capabilities deteriorate after tuning on new tasks or domains. Inspired by trust-region policy optimization (TRPO) and proximal policy optimization (PPO) in reinforcement learning (RL), we propose Proximal SFT (PSFT), a fine-tuning objective that incorporates the benefits of a trust region, effectively constraining policy drift during SFT while maintaining competitive tuning. By viewing SFT as a special case of policy gradient methods with a constant positive advantage, we derive PSFT, which stabilizes optimization and improves generalization while leaving room for further optimization in subsequent post-training stages. Experiments across mathematical and human-value domains show that PSFT matches SFT in-domain, outperforms it in out-of-domain generalization, remains stable under prolonged training without entropy collapse, and provides a stronger foundation for subsequent optimization.

Recently, post-training has become a crucial part of the overall training process. In particular, reinforcement learning (RL) algorithms such as PPO (Schulman et al., 2017) and GRPO (Shao et al., 2024) have demonstrated significant effectiveness when applied to language models (LMs) focused on reasoning tasks. As RL is scaled over time, foundation models gain the capacity to address complex problems through deeper and more extended reasoning (OpenAI, 2024; Guo et al., 2025). These reasoning models produce abundant and valuable latent thoughts (Ruan et al., 2025) across the internet.
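To make the "SFT as policy gradient with a constant positive advantage" view concrete, the following is a minimal per-token sketch of a PPO-style clipped surrogate with the advantage fixed at +1, in the spirit the abstract describes. The function name `psft_token_loss` and the clip range `eps` are illustrative assumptions, not the paper's exact objective or hyperparameters; `logp_old` would come from a frozen snapshot of the policy before the update.

```python
import math

def psft_token_loss(logp_new, logp_old, eps=0.2):
    """Illustrative PPO-style clipped loss for one target token,
    with a constant advantage A = +1 (hypothetical sketch, not the
    paper's exact formulation)."""
    # Probability ratio r = pi_theta(token) / pi_old(token).
    r = math.exp(logp_new - logp_old)
    # clip(r, 1 - eps, 1 + eps)
    r_clipped = max(1.0 - eps, min(r, 1.0 + eps))
    # With A = +1, the clipped surrogate min(r*A, clip(r)*A) caps the
    # upside once r exceeds 1 + eps, limiting how far the policy can
    # drift from pi_old on any single token.
    surrogate = min(r, r_clipped)
    # We minimize the negative surrogate (i.e., maximize the surrogate).
    return -surrogate
```

When the ratio is inside the trust region, this reduces to an importance-weighted SFT loss; once the new policy overshoots (`r > 1 + eps`), the loss flattens and the gradient for that token vanishes, which is the drift-constraining behavior the abstract attributes to PSFT.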