On the Generalization of SFT: A Reinforcement Learning Perspective with Reward Rectification
Wu, Yongliang, Zhou, Yizhou, Ziheng, Zhou, Peng, Yingzhe, Ye, Xinyu, Hu, Xinting, Zhu, Wenbo, Qi, Lu, Yang, Ming-Hsuan, Yang, Xu
In this work, we present a simple yet theoretically motivated improvement to Supervised Fine-Tuning (SFT) for Large Language Models (LLMs), addressing its limited generalization compared to reinforcement learning (RL). Through mathematical analysis, we reveal that standard SFT gradients implicitly encode a problematic reward structure that may severely restrict the generalization capabilities of the model. To rectify this, we propose Dynamic Fine-Tuning (DFT), which stabilizes gradient updates for each token by dynamically rescaling the objective function with the probability of that token. With just a single-line change, the method outperforms standard SFT on multiple challenging benchmarks and base models, spanning math reasoning, code generation, and multi-modal tasks, demonstrating improved generalization. Additionally, DFT achieves competitive results in offline RL settings, providing an effective yet streamlined alternative.

Supervised Fine-Tuning (SFT), which adapts models to expert demonstrations, has become the standard post-training paradigm for Large Language Models (LLMs). It enables efficient task adaptation and capability enhancement (Chung et al., 2024; Zhang et al., 2024b; Sanh et al., 2022; Ouyang et al., 2022), and is popular for its ease of implementation and rapid acquisition of expert-like behaviors (Wei et al., 2022; Zhou et al., 2023). Despite these advantages, SFT often shows limited generalization compared to reinforcement learning (RL) (Chu et al., 2024; Ouyang et al., 2022; Christiano et al., 2017; Bai et al., 2022; Huan et al., 2025; Swamy et al., 2025). RL leverages explicit reward or verification signals to explore diverse strategies and thus generalizes better. However, RL requires substantial computation, careful hyperparameter tuning, and explicit reward signals--conditions often impractical in real-world settings (Schulman et al., 2017; Ouyang et al., 2022; Sheng et al., 2025; Strubell et al., 2019; Liu & Yin, 2024; Winsta, 2025). Moreover, RL can struggle to recover expert-like behaviors that SFT captures efficiently (Mandlekar et al., 2022; Chen et al., 2025b).
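The abstract describes DFT as a single-line change that rescales the SFT objective by each token's probability. As a rough illustration only, and not the authors' released implementation, a minimal PyTorch sketch of such a per-token rescaling might look like the following; the function names, tensor shapes, and the placement of the stop-gradient on the weight are assumptions:

```python
# Hedged sketch of "rescaling the SFT objective by the token's probability".
# Names and shapes are illustrative assumptions, not the paper's official code.
import torch
import torch.nn.functional as F

def sft_loss(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """Standard SFT objective: mean negative log-likelihood of target tokens.

    logits:  (batch, seq_len, vocab_size) model outputs
    targets: (batch, seq_len) ground-truth token ids
    """
    log_probs = F.log_softmax(logits, dim=-1)
    token_logp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return -token_logp.mean()

def dft_loss(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """Dynamic rescaling sketch: weight each token's negative log-likelihood
    by the model's current probability of that token, with the weight detached
    so it is treated as a constant during backpropagation."""
    log_probs = F.log_softmax(logits, dim=-1)
    token_logp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    token_p = token_logp.exp().detach()      # per-token weight: p(y_t | context)
    return -(token_p * token_logp).mean()    # the "single-line" rescaling
```

Detaching the per-token weight is one plausible way to realize the "dynamic rescaling" the abstract mentions, since it scales each token's gradient by its current probability without differentiating through the weight itself.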
arXiv.org Artificial Intelligence
Oct-17-2025
- Country:
- Asia > China
- Hubei Province > Wuhan (0.04)
- Shanghai > Shanghai (0.04)
- North America > United States
- California
- Alameda County > Berkeley (0.04)
- Los Angeles County > Los Angeles (0.14)
- Merced County > Merced (0.04)
- Genre:
- Research Report (0.64)