Group-in-Group Policy Optimization for LLMAgent Training

Jun-16-2026, 19:43:55 GMT–Neural Information Processing Systems

Recent advances in group-based reinforcement learning (RL) have driven frontier large language models (LLMs) in single-turn tasks like mathematical reasoning. However, their scalability to multi-turn LLM agent training remains limited. Unlike static tasks, agent-environment interactions unfold over many steps and often yield sparse or delayed rewards, making credit assignment across individual steps significantly more challenging. In this work, we propose Group-in-Group Policy Optimization (GiGPO), a novel RL algorithm that achieves fine-grained credit assignment for LLM agents while preserving the appealing properties of group-based RL: critic-free, low memory, and stable convergence. GiGPO introduces a twolevel structure for estimating relative advantage: (i) At the episode-level, GiGPO computes macro relative advantages based on groups of complete trajectories; (ii) At the step-level, GiGPO introduces an anchor state grouping mechanism that retroactively constructs step-level groups by identifying repeated environment states across trajectories. Actions stemming from the same state are grouped together, enabling micro relative advantage estimation.

large language model, machine learning, natural language, (17 more...)

Neural Information Processing Systems

Jun-16-2026, 19:43:55 GMT

Conferences PDF

Add feedback

Country:
- North America > United States > California (0.28)

Genre:
- Workflow (1.00)
- Research Report > Experimental Study (1.00)

Industry:
- Information Technology (0.92)
- Media (0.69)
- Education (0.67)
- Leisure & Entertainment > Games (0.46)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Large Language Model (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (1.00)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found