Revisiting Group Relative Policy Optimization: Insights into On-Policy and Off-Policy Training

Youssef Mroueh, Nicolas Dupuis, Brian Belgodere, Apoorva Nitsure, Mattia Rigotti, Kristjan Greenewald, Jiri Navratil, Jerret Ross, Jesus Rios

Jun-2-2025–arXiv.org Machine Learning

We revisit Group Relative Policy Optimization (GRPO) in both on-policy and off-policy optimization regimes. Our motivation comes from recent work on off-policy Proximal Policy Optimization (PPO), which improves training stability, sampling efficiency, and memory usage. In addition, a recent analysis of GRPO suggests that estimating the advantage function with off-policy samples could be beneficial. Building on these observations, we adapt GRPO to the off-policy setting. We show that both on-policy and off-policy GRPO objectives yield an improvement in the reward. This result motivates the use of clipped surrogate objectives in the off-policy version of GRPO. We then compare the empirical performance of reinforcement learning with verifiable rewards in post-training using both GRPO variants. Our results show that off-policy GRPO either significantly outperforms or performs on par with its on-policy counterpart.
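As a rough illustration of the two ingredients the abstract names, group-relative advantages and a clipped surrogate objective evaluated on off-policy samples, here is a minimal sketch. It assumes a generic PPO-style clipped ratio with rewards standardized within a group of completions for one prompt; it is not the paper's exact objective, and it omits any KL regularization. The function names, the sequence-level log-probabilities, and the values `clip_eps=0.2` and `eps=1e-8` are all hypothetical choices made for this example.

```python
import math
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of samples drawn for the same prompt.

    Assumption: advantage = (reward - group mean) / (group std + eps).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_surrogate(logp_new, logp_behavior, advantages, clip_eps=0.2):
    """PPO-style clipped surrogate, averaged over the group (to be maximized).

    logp_behavior holds log-probabilities under the policy that generated the
    samples. In the on-policy case it matches the current policy; in the
    off-policy case it may be an older snapshot (an assumption of this sketch).
    """
    terms = []
    for lp_new, lp_old, adv in zip(logp_new, logp_behavior, advantages):
        ratio = math.exp(lp_new - lp_old)  # importance ratio
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        terms.append(min(ratio * adv, clipped * adv))
    return mean(terms)


# Hypothetical group of four completions with binary verifiable rewards.
rewards = [1.0, 0.0, 0.0, 1.0]
adv = group_relative_advantages(rewards)

# Hypothetical sequence-level log-probabilities.
logp_behavior = [-1.0, -1.2, -0.8, -1.5]  # policy that produced the samples
logp_new = [-0.9, -1.3, -0.8, -1.0]       # policy currently being optimized

objective = clipped_surrogate(logp_new, logp_behavior, adv)
on_policy_objective = clipped_surrogate(logp_behavior, logp_behavior, adv)
```

When the current policy equals the sampling policy, every ratio is 1 and the surrogate reduces to the mean of the standardized advantages, which is zero. Clipping only takes effect once the two policies diverge, which is why it matters most in the off-policy setting.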

grpo, machine learning, reinforcement learning

