Agents
Advancing Audio-Visual Navigation Through Multi-Agent Collaboration in 3D Environments
Zhang, Hailong, Yu, Yinfeng, Wang, Liejun, Sun, Fuchun, Zheng, Wendong
Intelligent agents often require collaborative strategies to achieve complex tasks beyond individual capabilities in real-world scenarios. While existing audio-visual navigation (AVN) research mainly focuses on single-agent systems, their limitations emerge in dynamic 3D environments where rapid multi-agent coordination is critical, especially for time-sensitive applications like emergency response. This paper introduces MASTAVN (Multi-Agent Scalable Transformer Audio-Visual Navigation), a scalable framework enabling two agents to collaboratively localize and navigate toward an audio target in shared 3D environments. By integrating cross-agent communication protocols and joint audio-visual fusion mechanisms, MASTAVN enhances spatial reasoning and temporal synchronization. Through rigorous evaluation in photorealistic 3D simulators (Replica and Matterport3D), MASTAVN achieves significant reductions in task completion time and notable improvements in navigation success rates compared to single-agent and non-collaborative baselines. This highlights the essential role of spatiotemporal coordination in multi-agent systems. Our findings validate MASTAVN's effectiveness in time-sensitive emergency scenarios and establish a paradigm for advancing scalable multi-agent embodied intelligence in complex 3D environments.
Automotive-ENV: Benchmarking Multimodal Agents in Vehicle Interface Systems
Yan, Junfeng, Wu, Biao, Fang, Meng, Chen, Ling
Multimodal agents have demonstrated strong performance in general GUI interactions, but their application in automotive systems has been largely unexplored. In-vehicle GUIs present distinct challenges: drivers' limited attention, strict safety requirements, and complex location-based interaction patterns. To address these challenges, we introduce Automotive-ENV, the first high-fidelity benchmark and interaction environment tailored for vehicle GUIs. This platform defines 185 parameterized tasks spanning explicit control, implicit intent understanding, and safety-aware tasks, and provides structured multimodal observations with precise programmatic checks for reproducible evaluation. Building on this benchmark, we propose ASURADA, a geo-aware multimodal agent that integrates GPS-informed context to dynamically adjust actions based on location, environmental conditions, and regional driving norms. Experiments show that geo-aware information significantly improves success on safety-aware tasks, highlighting the importance of location-based context in automotive environments. We will release Automotive-ENV, complete with all tasks and benchmarking tools, to further the development of safe and adaptive in-vehicle agents.
PiERN: Token-Level Routing for Integrating High-Precision Computation and Reasoning
Xiao, Hengbo, Fan, Jingyuan, Tong, Xin, Zhang, Jingzhao, Lu, Chao, He, Guannan
Tasks on complex systems require high-precision numerical computation to support decisions, but current large language models (LLMs) cannot integrate such computations as an intrinsic and interpretable capability with existing architectures. Multi-agent approaches can leverage external experts, but inevitably introduce communication overhead and suffer from inefficiency caused by limited scalability. To this end, we propose Physically-isolated Experts Routing Network (PiERN), an architecture for integrating computation and reasoning. Instead of tool-use workflows or function calling, PiERN endogenously integrates computational capabilities into neural networks after separately training experts, a text-to-computation module, and a router. At inference, the router directs computation and reasoning at the token level, thereby enabling iterative alternation within a single chain of thought. We evaluate PiERN on representative linear and nonlinear computation-reasoning tasks against LLM finetuning and multi-agent system approaches. Results show that the PiERN architecture achieves not only higher accuracy than directly finetuning LLMs but also significant improvements in response latency, token usage, and GPU energy consumption compared with mainstream multi-agent approaches. PiERN offers an efficient, interpretable, and scalable paradigm for interfacing language models with scientific systems.
GLIDE: A Coordinated Aerial-Ground Framework for Search and Rescue in Unknown Environments
Farrell, Seth, Li, Chenghao, Yu, Hongzhan, Mojtahedi, Hesam, Gao, Sicun, Christensen, Henrik I.
We present a cooperative aerial-ground search-and-rescue (SAR) framework that pairs two unmanned aerial vehicles (UAVs) with an unmanned ground vehicle (UGV) to achieve rapid victim localization and obstacle-aware navigation in unknown environments. In our framework, a goal-searching UAV executes real-time onboard victim detection and georeferencing to nominate goals for the ground platform, while a terrain-scouting UAV flies ahead of the UGV's planned route to provide mid-level traversability updates. The UGV fuses aerial cues with local sensing to perform time-efficient A* planning and continuous replanning as information arrives. Additionally, we present a hardware demonstration (using a GEM e6 golf cart as the UGV and two X500 UAVs) to evaluate end-to-end SAR mission performance and include simulation ablations to assess the planning stack in isolation from detection. Empirical results demonstrate that explicit role separation across UAVs, coupled with terrain scouting and guided planning, improves reach time and navigation safety in time-critical SAR missions. Search and rescue (SAR) operations stand to benefit from recent advances in autonomous aerial and ground robotics. Unmanned Aerial Vehicles (UAVs) enable rapid, large-area coverage due to their agility and mobility. The adoption of drones across civilian and military applications has highlighted advantages in speed and perspective.
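The planning loop described in the GLIDE abstract (A* planning with replanning as aerial information arrives) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the occupancy grid, 4-connected moves with unit cost, the Manhattan heuristic, and the `scout_updates` dictionary standing in for the terrain-scouting UAV's traversability reports. It is not the authors' planner.

```python
import heapq


def astar(grid, start, goal):
    """4-connected A* on a grid of 0 (free) / 1 (blocked). Returns a list of cells or None."""
    rows, cols = len(grid), len(grid[0])

    def h(cell):
        # Manhattan distance: admissible for unit-cost 4-connected moves.
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    open_heap = [(h(start), 0, start)]
    came_from = {}
    best_g = {start: 0}
    while open_heap:
        _, g, cur = heapq.heappop(open_heap)
        if cur == goal:
            path = [cur]
            while cur in came_from:
                cur = came_from[cur]
                path.append(cur)
            return path[::-1]
        if g > best_g.get(cur, float("inf")):
            continue  # stale heap entry
        r, c = cur
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                ng = g + 1
                if ng < best_g.get(nxt, float("inf")):
                    best_g[nxt] = ng
                    came_from[nxt] = cur
                    heapq.heappush(open_heap, (ng + h(nxt), ng, nxt))
    return None


def navigate(grid, start, goal, scout_updates):
    """Move one cell per step, replanning whenever a scout update blocks the current plan.

    scout_updates (hypothetical) maps a step index to a list of cells newly
    reported as blocked. The grid is updated in place.
    """
    pos, trace, step = start, [start], 0
    path = astar(grid, pos, goal)
    while pos != goal:
        for r, c in scout_updates.get(step, []):
            grid[r][c] = 1
        if path is None or any(grid[r][c] for r, c in path[1:]):
            path = astar(grid, pos, goal)  # replan from the current position
        if path is None:
            return None  # goal unreachable with current knowledge
        path = path[1:]
        pos = path[0]
        trace.append(pos)
        step += 1
    return trace
```

For example, on an empty 3x3 grid with start `(0, 0)` and goal `(0, 2)`, a scout report that blocks `(0, 1)` before the first move forces the route through the middle row.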
A Hybrid TDMA/CSMA Protocol for Time-Sensitive Traffic in Robot Applications
Xu, Shiqi, Zhang, Lihao, Du, Yuyang, Yang, Qun, Liew, Soung Chang
Recent progress in robotics has underscored the demand for real-time control in applications such as manufacturing and healthcare systems, where the timely delivery of mission-critical commands under heterogeneous robotic traffic is paramount for operational efficacy and safety. In these scenarios, mission-critical traffic follows a strict deadline-constrained communication pattern: commands must arrive within defined deadlines, otherwise late arrivals can degrade performance or destabilize control loops. In this work, we demonstrate on a real-time software-defined radio (SDR) platform that CSMA, widely adopted in robotic communications, suffers severe degradation with contention-induced collisions and delays disrupting the on-time arrival of mission-critical packets. This degradation arises under a common robotic traffic pattern where non-critical traffic dominates the channel, while lightweight mission-critical commands must be delivered frequently with strict deadlines over the shared medium. To address this, we propose an IEEE 802.11-compatible hybrid TDMA/CSMA protocol that combines TDMA's deterministic slot scheduling with CSMA's adaptability for heterogeneous robot traffic. The protocol achieves collision-free, low-latency mission-critical command delivery and IEEE 802.11 compatibility through the synergistic integration of sub-microsecond PTP-based slot synchronization, a three-section superframe with dynamic TDMA allocation for structured and adaptable traffic management, and beacon-NAV protection to preemptively secure critical communication applications from interference. Emulation experiments on a real-time SDR testbed show that the proposed protocol reduces missed-deadline errors by 93% compared to the CSMA baseline under a robotic traffic setup at an overall aggregate channel load of 77.1%, wherein 99.9% of the traffic is from non-time-critical applications and 0.1% of the traffic is from deadline-constrained applications.
In a high-speed robot path-tracking Robot Operating System (ROS) simulation, the protocol lowers root mean square trajectory error by up to 90% compared with the CSMA baseline, while maintaining throughput for non-critical traffic within 2%. Robotics has undergone remarkable advancements in recent years, playing critical roles in domains such as manufacturing [1], healthcare [2]-[4], and autonomous systems [5]. Multi-robot cooperation has emerged as a key enabler for complex robotic applications that require seamless coordination among multiple devices, such as collaborative assembly [6], warehouse automation [7], and search-and-rescue missions [8]. As the number of robots grows rapidly in a multi-robot system, communications between robots are becoming increasingly data-intensive. The work was partially supported by the Shenzhen-Hong Kong-Macao technical program (Type C) under Grant No. SGDX20230821094359004.
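To make the idea of a superframe with dynamic TDMA allocation more concrete, here is a minimal sketch of how one superframe might be laid out. The abstract does not name the three sections, so the split into a beacon, a TDMA section with one slot per critical flow, and a CSMA contention section is an assumption, as are all durations, the flow names, and the `build_superframe` helper itself.

```python
from dataclasses import dataclass


@dataclass
class Section:
    name: str
    start_us: int
    end_us: int


def build_superframe(critical_flows, superframe_us=10_000, beacon_us=200, slot_us=300):
    """Lay out one superframe: beacon, one TDMA slot per critical flow, then CSMA.

    All durations are hypothetical placeholders, not values from the paper.
    """
    sections = [Section("beacon", 0, beacon_us)]
    t = beacon_us
    for flow in critical_flows:
        # Each deadline-constrained flow gets a dedicated, collision-free slot.
        sections.append(Section(f"tdma:{flow}", t, t + slot_us))
        t += slot_us
    if t > superframe_us:
        raise ValueError("critical flows do not fit in one superframe")
    # Whatever airtime remains is left to ordinary CSMA contention.
    sections.append(Section("csma", t, superframe_us))
    return sections
```

Because the TDMA section grows and shrinks with the number of registered critical flows, the CSMA section absorbs the remainder, which is one way to keep non-critical throughput largely intact when few critical flows are active.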
SQL-of-Thought: Multi-agentic Text-to-SQL with Guided Error Correction
Chaturvedi, Saumya, Chadha, Aman, Bindschaedler, Laurent
Converting natural language queries into SQL queries is a crucial challenge in both industry and academia, aiming to increase access to databases and large-scale applications. This work examines how in-context learning and chain-of-thought can be utilized to develop a robust solution for text-to-SQL systems. We propose SQL-of-Thought: a multi-agent framework that decomposes the Text2SQL task into schema linking, subproblem identification, query plan generation, SQL generation, and a guided correction loop. Unlike prior systems that rely only on execution-based static correction, we introduce taxonomy-guided dynamic error modification informed by in-context learning. SQL-of-Thought achieves state-of-the-art results on the Spider dataset and its variants, combining guided error taxonomy with reasoning-based query planning.
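A guided correction loop of the kind the SQL-of-Thought abstract mentions can be sketched as follows. This is an illustration under stated assumptions, not the authors' system: the three-entry `TAXONOMY`, the string-matching `classify` function, and the `repair` callback (a stand-in for an LLM agent that rewrites the query given an error category) are all hypothetical. Only the standard-library `sqlite3` module is used.

```python
import sqlite3

# Hypothetical, tiny error taxonomy keyed by a substring of the SQLite error message.
TAXONOMY = {
    "no such column": "schema.wrong_column",
    "no such table": "schema.wrong_table",
    "syntax error": "syntax",
}


def classify(error_message):
    """Map a raw database error to a taxonomy category (simple string match)."""
    for needle, category in TAXONOMY.items():
        if needle in error_message:
            return category
    return "unknown"


def correction_loop(conn, sql, repair, max_rounds=3):
    """Run `sql`; on failure, classify the error and ask `repair` for a new query.

    `repair(sql, category)` stands in for a correction agent and must return
    a revised SQL string.
    """
    for _ in range(max_rounds):
        try:
            return sql, conn.execute(sql).fetchall()
        except sqlite3.Error as exc:
            sql = repair(sql, classify(str(exc)))
    raise RuntimeError("query still failing after correction rounds")
```

Passing the error category, rather than only the raw error text, to the repair step is what distinguishes this from a plain retry on failure; in a real system both the classifier and the repair step would presumably be model-driven.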
Moving Out: Physically-grounded Human-AI Collaboration
Kang, Xuhui, Lee, Sung-Wook, Liu, Haolin, Wang, Yuyan, Kuo, Yen-Ling
The ability to adapt to physical actions and constraints in an environment is crucial for embodied agents (e.g., robots) to effectively collaborate with humans. Such physically grounded human-AI collaboration must account for the increased complexity of the continuous state-action space and constrained dynamics caused by physical constraints. In this paper, we introduce Moving Out, a new human-AI collaboration benchmark that resembles a wide range of collaboration modes affected by physical attributes and constraints, such as moving heavy items together and maintaining consistent actions to move a big item around a corner. Using Moving Out, we designed two tasks and collected human-human interaction data to evaluate models' abilities to adapt to diverse human behaviors and unseen physical attributes. To address the challenges in physical environments, we propose a novel method, BASS (Behavior Augmentation, Simulation, and Selection), to enhance the diversity of agents and their understanding of the outcome of actions. Our experiments show that BASS outperforms state-of-the-art models in AI-AI and human-AI collaboration. The project page is available at https://live-robotics-uva.github.io/movingout_ai/.
Tiered Agentic Oversight: A Hierarchical Multi-Agent System for Healthcare Safety
Kim, Yubin, Jeong, Hyewon, Park, Chanwoo, Park, Eugene, Zhang, Haipeng, Liu, Xin, Lee, Hyeonhoon, McDuff, Daniel, Ghassemi, Marzyeh, Breazeal, Cynthia, Tulebaev, Samir, Park, Hae Won
Large language models (LLMs) deployed as agents introduce significant safety risks in clinical settings due to their potential for error and single points of failure. We introduce Tiered Agentic Oversight (TAO), a hierarchical multi-agent system that enhances AI safety through layered, automated supervision. Inspired by clinical hierarchies (e.g., nurse-physician-specialist) in hospitals, TAO routes tasks to specialized agents based on complexity, creating a robust safety framework through automated inter- and intra-tier communication and role-playing. Crucially, this hierarchical structure functions as an effective error-correction mechanism, absorbing up to 24% of individual agent errors before they can compound. Our experiments reveal that TAO outperforms single-agent and other multi-agent systems on 4 out of 5 healthcare safety benchmarks, with up to an 8.2% improvement. Ablation studies confirm key design principles of the system: (i) its adaptive architecture is over 3% safer than static, single-tier configurations, and (ii) its lower tiers are indispensable, as their removal causes the most significant degradation in overall safety. Finally, we validated the system's synergy with human doctors in a user study where a physician, acting as the highest-tier agent, provided corrective feedback that improved medical triage accuracy from 40% to 60%. Project Page: https://tiered-agentic-oversight.github.io/
Communication-Efficient Desire Alignment for Embodied Agent-Human Adaptation
Wang, Yuanfei, Huang, Xinju, Zhong, Fangwei, Yang, Yaodong, Wang, Yizhou, Chen, Yuanpei, Dong, Hao
While embodied agents have made significant progress in performing complex physical tasks, real-world applications demand more than pure task execution. The agents must collaborate with unfamiliar agents and human users, whose goals are often vague and implicit. In such settings, interpreting ambiguous instructions and uncovering underlying desires is essential for effective assistance. Therefore, fast and accurate desire alignment becomes a critical capability for embodied agents. In this work, we first develop a home assistance simulation environment HA-Desire that integrates an LLM-driven proxy human user exhibiting realistic value-driven goal selection and communication. The ego agent must interact with this proxy user to infer and adapt to the user's latent desires. To achieve this, we present a novel framework FAMER for fast desire alignment, which introduces a desire-based mental reasoning mechanism to identify user intent and filter desire-irrelevant actions. We further design a reflection-based communication module that reduces redundant inquiries, and incorporate goal-relevant information extraction with memory persistence to improve information reuse and reduce unnecessary exploration. Extensive experiments demonstrate that our framework significantly enhances both task execution and communication efficiency, enabling embodied agents to quickly adapt to user-specific desires in complex embodied environments.
Long Context Scaling: Divide and Conquer via Multi-Agent Question-driven Collaboration
Xiao, Sibo, Lin, Zixin, Gao, Wenyang, Chen, Hui, Zhang, Yue
Processing long contexts has become a critical capability for modern large language models (LLMs). Existing works leverage agent-based divide-and-conquer methods for processing long contexts, but these methods face crucial limitations, including prohibitive accumulated latency and amplified information loss from excessive agent invocations, and the disruption of inherent textual dependencies by immoderate partitioning. In this paper, we propose a novel multi-agent framework, XpandA (Expand-Agent), coupled with a question-driven workflow and dynamic partitioning for robust long-context processing. XpandA overcomes these limitations through: 1) dynamic partitioning of long texts, which adaptively modulates the filling rate of context windows for input sequences of vastly varying lengths; 2) a question-guided protocol to update flat information ensembles within centralized shared memory, constructing consistent inter-agent knowledge across partitions; and 3) selective replay of specific partitions based on the state-tracking of question-information couples to promote the resolution of inverted-order structures across partitions (e.g., flashbacks). We perform a comprehensive evaluation of XpandA on multiple long-context benchmarks with lengths ranging from 1k to 1M, demonstrating XpandA's feasibility for processing ultra-long sequences and its significant effectiveness in enhancing the long-context capabilities of various LLMs by achieving 20% improvements and a 1.5x inference speedup over baselines of full-context, RAG and previous agent-based methods.
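The abstract does not spell out how XpandA chooses partition sizes, so the following is only a rough sketch of length-aware chunking in the same spirit: the number of chunks is derived from the input length and a target filling rate, and the chunks are then evened out so that none is nearly empty. The `partition` helper, the `window` size, and the fixed `fill_rate` are all assumptions for illustration.

```python
import math


def partition(tokens, window=8000, fill_rate=0.5):
    """Split `tokens` into near-equal chunks, each at most window * fill_rate long.

    `window` and `fill_rate` are hypothetical parameters: the context-window
    size in tokens and the fraction of it a chunk is allowed to occupy.
    """
    if not tokens:
        return []
    budget = max(1, int(window * fill_rate))
    n_chunks = math.ceil(len(tokens) / budget)
    # Even out chunk sizes instead of leaving a short remainder at the end.
    size = math.ceil(len(tokens) / n_chunks)
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]
```

A short input that fits within the budget yields a single chunk and therefore a single agent call, while a very long input is split into many chunks that each leave headroom in the window for shared notes and the question.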