Group-in-Group Policy Optimization for LLM Agent Training
Neural Information Processing Systems
Recent advances in group-based reinforcement learning (RL) have driven frontier large language models (LLMs) in single-turn tasks like mathematical reasoning. However, the scalability of these methods to multi-turn LLM agent training remains limited. Unlike static tasks, agent-environment interactions unfold over many steps and often yield sparse or delayed rewards, making credit assignment across individual steps significantly more challenging. In this work, we propose Group-in-Group Policy Optimization (GiGPO), a novel RL algorithm that achieves fine-grained credit assignment for LLM agents while preserving the appealing properties of group-based RL: critic-free, low memory, and stable convergence. GiGPO introduces a two-level structure for estimating relative advantage: (i) at the episode level, GiGPO computes macro relative advantages based on groups of complete trajectories; (ii) at the step level, GiGPO introduces an anchor state grouping mechanism that retroactively constructs step-level groups by identifying repeated environment states across trajectories. Actions stemming from the same state are grouped together, enabling micro relative advantage estimation.
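The two-level structure described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. Several details are assumptions made only for illustration, since the abstract does not specify them: the trajectory layout of `(state, action, reward)` tuples, mean/std normalization within each group, the use of a discounted return-to-go (`gamma`) as the step-level score, and the additive combination with a hypothetical weight `omega`. All function names are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def normalize(values, eps=1e-8):
    """Center values on the group mean and scale by the group std (assumed form)."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]


def episode_advantages(trajectories):
    """Macro advantage: one value per trajectory, relative to the whole group."""
    returns = [sum(r for _, _, r in traj) for traj in trajectories]
    return normalize(returns)


def step_advantages(trajectories, gamma=0.95):
    """Micro advantage: group steps that start from the same (anchor) state."""
    groups = defaultdict(list)  # state -> [(traj index, step index, return-to-go)]
    for i, traj in enumerate(trajectories):
        running = 0.0
        returns_to_go = [0.0] * len(traj)
        # Assumption: score each step by its discounted return-to-go.
        for t in range(len(traj) - 1, -1, -1):
            running = traj[t][2] + gamma * running
            returns_to_go[t] = running
        for t, (state, _, _) in enumerate(traj):
            groups[state].append((i, t, returns_to_go[t]))

    adv = [[0.0] * len(traj) for traj in trajectories]
    for members in groups.values():
        if len(members) < 2:
            continue  # a state seen only once has no peers; leave its advantage at 0
        scores = normalize([score for _, _, score in members])
        for (i, t, _), a in zip(members, scores):
            adv[i][t] = a
    return adv


def combined_advantages(trajectories, omega=1.0, gamma=0.95):
    """Assumed combination: episode-level term plus omega times step-level term."""
    ep = episode_advantages(trajectories)
    st = step_advantages(trajectories, gamma)
    return [[ep[i] + omega * a for a in st[i]] for i in range(len(trajectories))]


# Two toy rollouts that share the anchor state "s0" but take different actions.
trajectories = [
    [("s0", "left", 0.0), ("s1", "go", 1.0)],
    [("s0", "right", 0.0), ("s2", "go", 0.0)],
]
advantages = combined_advantages(trajectories)
```

In this toy case the two actions taken from `s0` are compared directly against each other, while the steps from `s1` and `s2` have no peers and receive only the episode-level signal. States must be hashable for the grouping to work, which is a constraint of this sketch rather than a statement about the method.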
Jun-16-2026, 19:43:55 GMT
- Country:
- North America > United States > California (0.28)
- Genre:
- Workflow (1.00)
- Research Report > Experimental Study (1.00)
- Industry:
- Information Technology (0.92)
- Media (0.69)
- Education (0.67)
- Leisure & Entertainment > Games (0.46)
- Technology: