Scaling LLM Multi-turn RL with End-to-end Summarization-based Context Management
Lu, Miao; Sun, Weiwei; Du, Weihua; Ling, Zhan; Yao, Xuesong; Liu, Kang; Chen, Jiecao
arXiv.org Artificial Intelligence
We study reinforcement learning (RL) fine-tuning of large language model (LLM) agents for long-horizon multi-turn tool use, where context length quickly becomes a fundamental bottleneck. Existing RL pipelines can suffer from degraded instruction following, excessive rollout costs, and, most importantly, strict context limits. To address these challenges, we introduce summarization-based context management into training. Specifically, it periodically compresses the tool-use history into LLM-generated summaries that retain task-relevant information, keeping the context compact while enabling the agent to scale beyond the fixed context window. Building on this formulation, we derive a policy gradient representation that enables standard LLM RL infrastructures to optimize both tool-use behaviors and summarization strategies in an end-to-end fashion. We instantiate this framework with SUmmarization augmented Policy Optimization (SUPO), an LLM RL algorithm that enables long-horizon training beyond a fixed context limit. Experiments on interactive function calling and searching tasks demonstrate that SUPO significantly improves the success rate while maintaining the same or even lower working context length compared to baselines. We also demonstrate that for complex searching tasks, SUPO can further improve evaluation performance when the maximum number of summarization rounds at test time is scaled beyond that used in training. Our results establish summarization-based context management as a principled and scalable approach for training RL agents beyond a fixed context length limit.
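The summarization step described in the abstract can be pictured with a small rollout loop. The following is only a minimal sketch under stated assumptions, not the SUPO implementation: `count_tokens`, `rollout_with_summaries`, `toy_act`, and `toy_summarize` are hypothetical names, the whitespace token count stands in for a real tokenizer, and the rule of summarizing whenever the working context exceeds a budget is one plausible reading of "periodically compresses". Returning each pre-summary context as its own segment is likewise an illustrative choice.

```python
from typing import Callable, List

def count_tokens(text: str) -> int:
    # Assumption: whitespace splitting as a crude stand-in for a tokenizer.
    return len(text.split())

def rollout_with_summaries(
    task: str,
    act: Callable[[List[str]], str],        # hypothetical policy step (tool call or answer)
    summarize: Callable[[List[str]], str],  # hypothetical summarizer (could be the same LLM)
    max_context_tokens: int,
    max_turns: int,
    max_summaries: int,
) -> List[List[str]]:
    """Run one episode, compressing the history whenever it exceeds the budget.

    Returns a list of segments; each segment is one working context.
    """
    working = [task]
    segments: List[List[str]] = []
    for _ in range(max_turns):
        step = act(working)
        working.append(step)
        if step.startswith("FINAL"):
            break
        if sum(count_tokens(m) for m in working) > max_context_tokens:
            if len(segments) >= max_summaries:
                break  # summarization budget exhausted; stop the episode
            summary = summarize(working)
            segments.append(working)
            # Restart from the task plus the summary of everything so far.
            working = [task, summary]
    segments.append(working)
    return segments

# Toy stand-ins so the sketch runs without a model.
calls = {"n": 0}

def toy_act(context: List[str]) -> str:
    calls["n"] += 1
    if calls["n"] >= 7:
        return "FINAL answer"
    return "obs " + " ".join(["x"] * 10)  # an 11-token fake tool observation

def toy_summarize(context: List[str]) -> str:
    return "SUMMARY: compressed history"

segments = rollout_with_summaries(
    "solve task", toy_act, toy_summarize,
    max_context_tokens=30, max_turns=20, max_summaries=5,
)
for i, seg in enumerate(segments):
    print(i, len(seg), sum(count_tokens(m) for m in seg))
```

In this sketch the total number of tool-use steps can grow with `max_summaries` while each context handed to `toy_act` stays near the token budget, which is the scaling behavior the abstract attributes to summarization-based context management.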
Oct-9-2025
- Country:
  - Africa
    - Ghana > Central Region
      - Cape Coast (0.04)
    - Kenya > Nairobi City County
      - Nairobi (0.04)
  - Asia
    - Bangladesh > Dhaka Division
      - Dhaka District > Dhaka (0.04)
    - India > Tamil Nadu
      - Chennai (0.04)
    - Thailand > Bangkok
      - Bangkok (0.04)
    - Vietnam > Hanoi
      - Hanoi (0.04)
  - North America > United States
    - California > Santa Clara County
      - Palo Alto (0.04)
    - District of Columbia > Washington (0.04)
- Genre:
  - Overview (0.92)
  - Research Report > New Finding (0.34)
  - Workflow (1.00)
- Industry:
  - Banking & Finance (0.94)
  - Education (0.68)
  - Government > Regional Government
    - Asia Government (0.46)
- Technology: