Process-Supervised Reinforcement Learning for Interactive Multimodal Tool-Use Agents
Tan, Weiting, Qu, Xinghua, Tu, Ming, Ge, Meng, Liu, Andy T., Koehn, Philipp, Lu, Lu
arXiv.org Artificial Intelligence
Effective interactive tool use requires agents to master Tool Integrated Reasoning (TIR): a complex process involving multi-turn planning and long-context dialogue management. To train agents for this dynamic process, particularly in multi-modal contexts, we introduce a sandbox environment for reinforcement learning (RL) that supports interleaved speech-text rollouts. Our core strategy, Turn-level Adjudicated Reinforcement Learning (TARL), addresses the challenge of credit assignment in long-horizon tasks by employing a Large Language Model (LLM) as a judge to provide turn-level evaluation. To enhance exploration, we integrate a mixed-task training curriculum with mathematical reasoning problems. This unified approach boosts the task pass rate on the text-based $τ$-bench by over 6% compared to strong RL baselines. Crucially, we demonstrate our framework's suitability for fine-tuning a multi-modal foundation model for agentic tasks. By training a base multi-modal LLM on interleaved speech-text rollouts, we equip it with tool-use abilities, paving the way for more natural, voice-driven interactive agents.
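The abstract does not spell out how turn-level judge scores feed into the RL objective, so the following is only a minimal sketch of one plausible way to combine them with a final task outcome. It assumes the judge has already produced one score per turn; the function name `turn_level_returns` and the parameters `turn_weight` and `gamma` are hypothetical and do not come from the paper.

```python
from typing import List


def turn_level_returns(
    turn_scores: List[float],
    final_reward: float,
    gamma: float = 1.0,
    turn_weight: float = 0.5,
) -> List[float]:
    """Compute a return-to-go for each turn of one rollout.

    turn_scores:  per-turn scores from a judge (assumed given, e.g. in [-1, 1]).
    final_reward: outcome reward for the whole task (e.g. 1.0 on pass).
    gamma:        discount factor applied between turns (hypothetical).
    turn_weight:  scale of judge scores relative to the outcome (hypothetical).
    """
    # Per-turn reward is the scaled judge score; the task outcome is
    # credited at the last turn only.
    rewards = [turn_weight * s for s in turn_scores]
    if rewards:
        rewards[-1] += final_reward

    # Accumulate discounted rewards backwards so that each turn's return
    # reflects everything that happened from that turn onward.
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in range(len(rewards) - 1, -1, -1):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns
```

With scores available at every turn, a poor intermediate turn lowers the returns of that turn and of the turns before it, rather than the whole trajectory sharing a single end-of-episode signal.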
Sep-19-2025
- Genre:
- Research Report (1.00)
- Industry:
- Information Technology (0.68)
- Leisure & Entertainment > Games (0.46)
- Technology: