Unlocking the Power of Multi-Agent LLM for Reasoning: From Lazy Agents to Deliberation
Zhiwei Zhang, Xiaomin Li, Yudi Lin, Hui Liu, Ramraj Chandradevan, Linlin Wu, Minhua Lin, Fali Wang, Xianfeng Tang, Qi He, Suhang Wang
Large Language Models (LLMs) trained with reinforcement learning and verifiable rewards have achieved strong results on complex reasoning tasks. Recent work extends this paradigm to a multi-agent setting, where a meta-thinking agent proposes plans and monitors progress while a reasoning agent executes subtasks through sequential conversational turns. Despite promising performance, we identify a critical limitation: lazy agent behavior, in which one agent dominates while the other contributes little, undermining collaboration and collapsing the setup to an ineffective single agent. In this paper, we first provide a theoretical analysis showing why lazy behavior naturally arises in multi-agent reasoning. We then introduce a stable and efficient method for measuring causal influence, helping mitigate this issue. Finally, as collaboration intensifies, the reasoning agent risks getting lost in multi-turn interactions and becoming trapped by its own earlier noisy responses. To counter this, we propose a verifiable reward mechanism that encourages deliberation by allowing the reasoning agent to discard noisy outputs, consolidate instructions, and restart its reasoning process when necessary. Extensive experiments demonstrate that our framework alleviates lazy agent behavior and unlocks the full potential of multi-agent frameworks for complex reasoning tasks.

Techniques such as chain-of-thought prompting (Wei et al., 2022; Kojima et al., 2022) and structured methods like Tree-of-Thoughts and Graph-of-Thoughts (Yao et al., 2023; Besta et al., 2024) expand the space for deliberation. More recently, multi-agent frameworks enable LLMs with specialized roles to collaborate via planning, delegation, and debate, echoing human team dynamics (Li et al., 2023; Wu et al., 2024a; Chen et al., 2023; Du et al., 2023; Yuan & Xie). To support multi-agent and multi-turn reinforcement learning, multi-turn Group Relative Policy Optimization (GRPO) (Wan et al., 2025; Shi et al., 2025; Wei et al., 2025) and its variants (Guo et al., 2025b; Zhang et al., 2025c; Ning et al., 2025; Xue et al., 2025) compute advantages and importance ratios at the turn level, enabling finer-grained optimization and more precise credit assignment. Building on this foundation, ReMA (Wan et al., 2025) introduces a multi-agent LLM reasoning framework with two specialized roles: a meta-thinking agent, which decomposes tasks, sets intermediate goals, and adapts based on feedback, and a reasoning agent, which performs step-by-step reasoning. The agents alternate sequentially, but since only a final outcome reward is available, ReMA computes a group advantage following GRPO (Shao et al., 2024) and uniformly assigns this trajectory-level signal to every turn in the rollout.
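As context for the credit-assignment problem described above, here is a minimal Python sketch (not from the paper; the function name, the 1e-8 stabilizer, and the example numbers are illustrative assumptions) of GRPO-style group-relative advantages broadcast uniformly to every turn of a rollout. Because all turns of a rollout receive an identical signal, neither agent's individual contribution is ever isolated, which is the condition under which lazy-agent behavior can go unpenalized.

    import numpy as np

    def grpo_group_advantages(rewards, num_turns_per_rollout):
        """Group-relative advantages in the style of GRPO (Shao et al., 2024):
        normalize each rollout's final outcome reward against the sampled group,
        then copy that trajectory-level signal to every turn of the rollout
        (the uniform assignment the excerpt attributes to ReMA)."""
        rewards = np.asarray(rewards, dtype=np.float64)
        # Group baseline: mean/std over the rollouts sampled for one prompt.
        adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
        # Uniform turn-level assignment: every turn gets the same advantage.
        return [np.full(n, a) for a, n in zip(adv, num_turns_per_rollout)]

    # Hypothetical example: 4 rollouts with verifiable 0/1 outcome rewards and
    # varying numbers of alternating meta-thinking/reasoning turns.
    turn_advs = grpo_group_advantages(rewards=[1.0, 0.0, 1.0, 0.0],
                                      num_turns_per_rollout=[3, 5, 4, 2])
    for i, a in enumerate(turn_advs):
        print(f"rollout {i}: per-turn advantage = {a[0]:+.3f} ({len(a)} turns)")

Note that within a rollout the meta-thinking and reasoning turns are indistinguishable under this scheme; the paper's causal-influence measurement is aimed precisely at recovering the per-agent signal this broadcast erases.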
arXiv.org Artificial Intelligence
Nov-5-2025