Group Sequence Policy Optimization

Zheng, Chujie, Liu, Shixuan, Li, Mingze, Chen, Xiong-Hui, Yu, Bowen, Gao, Chang, Dang, Kai, Liu, Yuqiong, Men, Rui, Yang, An, Zhou, Jingren, Lin, Junyang

Jul-29-2025–arXiv.org Artificial Intelligence

This paper introduces Group Sequence Policy Optimization (GSPO), our stable, efficient, and performant reinforcement learning algorithm for training large language models. Unlike previous algorithms that adopt token-level importance ratios, GSPO defines the importance ratio based on sequence likelihood and performs sequence-level clipping, rewarding, and optimization. We demonstrate that GSPO achieves superior training efficiency and performance compared to the GRPO algorithm, notably stabilizes Mixture-of-Experts (MoE) RL training, and has the potential for simplifying the design of RL infrastructure. These merits of GSPO have contributed to the remarkable improvements in the latest Qwen3 models.

large language model, machine learning, reinforcement learning, (14 more...)

arXiv.org Artificial Intelligence

Jul-29-2025

arXiv.org PDF

Add feedback

Genre:
- Research Report (0.40)

Technology:
- Information Technology > Artificial Intelligence
  - Machine Learning > Reinforcement Learning (0.70)
  - Natural Language > Large Language Model (0.68)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found