Group-Relative REINFORCE Is Secretly an Off-Policy Algorithm: Demystifying Some Myths About GRPO and Its Friends

Chaorui Yao, Yanxi Chen, Yuchang Sun, Yushuo Chen, Wenhao Zhang, Xuchen Pan, Yaliang Li, Bolin Ding

arXiv.org Artificial Intelligence 

The past few years have witnessed rapid progress in reinforcement learning (RL) for large language models (LLMs). This began with reinforcement learning from human feedback (RLHF) [Bai et al., 2022, Ouyang et al., 2022], which aligns pre-trained LLMs with human preferences, and was followed by reasoning-oriented RL that enables LLMs to produce long chains of thought [OpenAI, 2024, DeepSeek-AI, 2025, Kimi-Team, 2025b, Zhang et al., 2025b]. More recently, agentic RL [Kimi-Team, 2025a, Gao et al., 2025, Zhang et al., 2025a] aims to train LLMs for agentic capabilities such as tool use, long-horizon planning, and multi-step task execution in dynamic environments. Alongside these developments, off-policy RL has attracted growing interest. In the "era of experience" [Silver and Sutton, 2025], LLM-powered agents must be continually updated through interaction with the environment. Practical constraints in real-world deployment and the complexity of LLM-RL infrastructure often render on-policy training impractical [Noukhovitch et al., 2025]: rollout generation and model training can proceed at mismatched speeds, data may be collected from different policies, reward feedback may be irregular or delayed, and the environment may be too costly or unstable to query for fresh trajectories.
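As background for the group-relative REINFORCE methods named in the title, the following is a minimal sketch of the group-relative advantage computation popularized by GRPO: sample a group of G responses per prompt, then normalize each response's reward by the group's mean and standard deviation. The function name and the epsilon term are illustrative choices, not taken from any specific implementation.

```python
# Sketch of a GRPO-style group-relative baseline (illustrative, not an
# official implementation): advantages are per-group standardized rewards.
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Return (r_i - mean(r)) / (std(r) + eps) for each reward in one group.

    `rewards` holds the scalar rewards of the G responses sampled for a
    single prompt; `eps` guards against a zero standard deviation when all
    rewards in the group are identical.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Example: rewards for G = 4 sampled responses to one prompt.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

Because the baseline is computed from the same group of samples, the resulting advantages always sum to (approximately) zero within each group, which is the property that makes the per-prompt baseline variance-reducing without a learned value function.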