Group-Relative REINFORCE Is Secretly an Off-Policy Algorithm: Demystifying Some Myths About GRPO and Its Friends

Chaorui Yao, Yanxi Chen, Yuchang Sun, Yushuo Chen, Wenhao Zhang, Xuchen Pan, Yaliang Li, Bolin Ding

arXiv.org Artificial Intelligence 

The past few years have witnessed rapid progress in reinforcement learning (RL) for large language models (LLMs). This began with reinforcement learning from human feedback (RLHF) [Bai et al., 2022, Ouyang et al., 2022], which aligns pre-trained LLMs with human preferences, and was followed by reasoning-oriented RL that enables LLMs to produce long chains of thought [OpenAI, 2024, DeepSeek-AI, 2025, Kimi-Team, 2025b, Zhang et al., 2025b]. More recently, agentic RL [Kimi-Team, 2025a, Gao et al., 2025, Zhang et al., 2025a] aims to train LLMs for agentic capabilities such as tool use, long-horizon planning, and multi-step task execution in dynamic environments. Alongside these developments, off-policy RL has attracted growing interest. In the "era of experience" [Silver and Sutton, 2025], LLM-powered agents must be continually updated through interaction with the environment. Practical constraints in real-world deployment and the complexity of LLM-RL infrastructure often render on-policy training impractical [Noukhovitch et al., 2025]: rollout generation and model training can proceed at mismatched speeds, data may be collected from different policies, reward feedback may be irregular or delayed, and the environment may be too costly or unstable to query for fresh trajectories.
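As background for the group-relative REINFORCE methods named in the title, the following is a minimal sketch of the group-relative advantage computation popularized by GRPO: sample a group of G responses per prompt, then normalize each response's reward by the group's mean and standard deviation. The function name and the epsilon term are illustrative choices, not taken from any specific implementation.

```python
# Sketch of a GRPO-style group-relative baseline (illustrative, not an
# official implementation): advantages are per-group standardized rewards.
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Return (r_i - mean(r)) / (std(r) + eps) for each reward in one group.

    `rewards` holds the scalar rewards of the G responses sampled for a
    single prompt; `eps` guards against a zero standard deviation when all
    rewards in the group are identical.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Example: rewards for G = 4 sampled responses to one prompt.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

Because the baseline is computed from the same group of samples, the resulting advantages always sum to (approximately) zero within each group, which is the property that makes the per-prompt baseline variance-reducing without a learned value function.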