Tree Search for LLM Agent Reinforcement Learning

Yuxiang Ji, Ziyu Ma, Yong Wang, Guanhua Chen, Xiangxiang Chu, Liaoni Wu

arXiv.org Artificial Intelligence

Recent advances in reinforcement learning (RL) have significantly enhanced the agentic capabilities of large language models (LLMs). In long-horizon, multi-turn agent tasks, existing approaches driven solely by outcome rewards often suffer from sparse supervision. To address this challenge, we propose Tree-based Group Relative Policy Optimization (Tree-GRPO), a grouped agent RL method based on tree search, where each tree node represents a complete agent interaction step. By sharing common prefixes, tree-search sampling increases the number of rollouts achievable within a fixed budget of tokens or tool calls. Moreover, we find that the tree-structured trajectory naturally allows the construction of step-wise process supervision signals using only the outcome reward. Based on this, Tree-GRPO estimates grouped relative advantages at both the intra-tree and inter-tree levels. Through theoretical analysis, we demonstrate that the objective of intra-tree group relative policy optimization is equivalent to that of step-level direct preference learning. Experiments across 11 datasets and 3 types of QA tasks demonstrate the superiority of the proposed tree-based RL over chain-based RL methods.

Figure 1: Comparison of chain-based and tree-based sampling strategies in LLM multi-turn agent RL. The tree structure brings two major advantages: (i) a smaller rollout budget (in both tokens and tool calls); (ii) higher performance.

Reinforcement Learning (RL) has emerged as a pivotal post-training paradigm for Large Language Models (LLMs), catalyzing the development of several frontier models (DeepSeek-AI Team, 2025; Yang et al., 2025a; OpenAI, 2024). RL-tuned LLMs trained only with outcome rewards acquire complex reasoning abilities and achieve remarkable gains in single-turn tasks, such as mathematical proof and code generation (Team et al., 2025b; Yu et al., 2025; Chu et al., 2025a; Shao et al., 2024; Xin et al., 2024).
This suggests that LLMs can learn not only through static imitation, but also by actively interacting with dynamic environments. Guided by this prospect, recent works have extended this RL paradigm to more complex agent settings involving dynamic, multi-turn interactions (Feng et al., 2025b; Singh et al., 2025; Wang et al., 2025b; Qian et al., 2025; Feng et al.).

Work done during internship at AMAP, Alibaba Group.

Figure 1, right (ours): tree search with nodes corresponding to complete agent steps.
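To make the two-level advantage estimation concrete, the following is a minimal sketch of how step-wise signals could be derived from outcome rewards on a rollout tree. The `Node` structure, the use of mean leaf reward as a branch value, and the normalization constant are illustrative assumptions, not the paper's exact formulation: each node stands for one complete agent interaction step, intra-tree advantages normalize a branch's mean leaf reward against its siblings (yielding process-level signals from outcome rewards alone), and inter-tree advantages normalize whole-tree returns across the sampled group, GRPO-style.

```python
from dataclasses import dataclass, field
from statistics import mean, pstdev

@dataclass
class Node:
    # One complete agent interaction step (e.g., reasoning + tool call + observation).
    reward: float = 0.0                      # outcome reward, set only on leaves
    children: list = field(default_factory=list)

def leaf_rewards(node):
    """Collect outcome rewards of all leaf trajectories under `node`."""
    if not node.children:
        return [node.reward]
    rewards = []
    for child in node.children:
        rewards.extend(leaf_rewards(child))
    return rewards

def intra_tree_advantages(root):
    """Step-wise signal from outcome rewards: at every branching point,
    score each child branch by its mean leaf reward, normalized within
    the sibling group (mean/std, as in group-relative baselines)."""
    advantages = {}
    stack = [root]
    while stack:
        node = stack.pop()
        if len(node.children) > 1:
            branch_values = [mean(leaf_rewards(c)) for c in node.children]
            mu, sigma = mean(branch_values), pstdev(branch_values)
            for child, v in zip(node.children, branch_values):
                advantages[id(child)] = (v - mu) / (sigma + 1e-8)
        stack.extend(node.children)
    return advantages

def inter_tree_advantages(roots):
    """Group-relative advantage across whole trees sampled for one prompt."""
    tree_values = [mean(leaf_rewards(r)) for r in roots]
    mu, sigma = mean(tree_values), pstdev(tree_values)
    return [(v - mu) / (sigma + 1e-8) for v in tree_values]
```

Because siblings share their prefix, every branching point compares continuations of the same partial trajectory, which is what turns a single terminal reward into a per-step preference signal.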