Agents
CortexDebate: Debating Sparsely and Equally for Multi-Agent Debate
Sun, Yiliu, Zhao, Zicheng, Wan, Sheng, Gong, Chen
Nowadays, single Large Language Model (LLM) struggles with critical issues such as hallucination and inadequate reasoning abilities. To mitigate these issues, Multi-Agent Debate (MAD) has emerged as an effective strategy, where LLM agents engage in in-depth debates with others on tasks. However, existing MAD methods face two major issues: (a) too lengthy input contexts, which causes LLM agents to get lost in plenty of input information and experiences performance drop; and (b) the overconfidence dilemma, where self-assured LLM agents dominate the debate, leading to low debating effectiveness. To address these limitations, we propose a novel MAD method called "CortexDebate". Inspired by the human brain's tendency to establish a sparse and dynamically optimized network among cortical areas governed by white matter, CortexDebate constructs a sparse debating graph among LLM agents, where each LLM agent only debates with the ones that are helpful to it. To optimize the graph, we propose a module named McKinsey-based Debate Matter (MDM), which acts as an artificial analog to white matter. By integrating the McKinsey Trust Formula, a well-established measure of trustworthiness from sociology, MDM enables credible evaluations that guide graph optimization. The effectiveness of our CortexDebate has been well demonstrated by extensive experimental results across eight datasets from four task types.
Generating Novelty in Open-World Multi-Agent Strategic Board Games
Kejriwal, Mayank, Thomas, Shilpa
We describe GNOME (Generating Novelty in Open-world Multi-agent Environments), an experimental platform that is designed to test the effectiveness of multi-agent AI systems when faced with \emph{novelty}. GNOME separates the development of AI gameplaying agents with the simulator, allowing \emph{unanticipated} novelty (in essence, novelty that is not subject to model-selection bias). Using a Web GUI, GNOME was recently demonstrated at NeurIPS 2020 using the game of Monopoly to foster an open discussion on AI robustness and the nature of novelty in real-world environments. In this article, we further detail the key elements of the demonstration, and also provide an overview of the experimental design that is being currently used in the DARPA Science of Artificial Intelligence and Learning for Open-World Novelty (SAIL-ON) program to evaluate external teams developing novelty-adaptive gameplaying agents.
CodeAgents: A Token-Efficient Framework for Codified Multi-Agent Reasoning in LLMs
Yang, Bruce, He, Xinfeng, Gao, Huan, Cao, Yifan, Li, Xiaofan, Hsu, David
Effective prompt design is essential for improving the planning capabilities of large language model (LLM)-driven agents. However, existing structured prompting strategies are typically limited to single-agent, plan-only settings, and often evaluate performance solely based on task accuracy - overlooking critical factors such as token efficiency, modularity, and scalability in multi-agent environments. To address these limitations, we introduce CodeAgents, a prompting framework that codifies multi-agent reasoning and enables structured, token-efficient planning in multi-agent systems. In CodeAgents, all components of agent interaction - Task, Plan, Feedback, system roles, and external tool invocations - are codified into modular pseudocode enriched with control structures (e.g., loops, conditionals), boolean logic, and typed variables. This design transforms loosely connected agent plans into cohesive, interpretable, and verifiable multi-agent reasoning programs. We evaluate the proposed framework across three diverse benchmarks - GAIA, HotpotQA, and VirtualHome - using a range of representative LLMs. Results show consistent improvements in planning performance, with absolute gains of 3-36 percentage points over natural language prompting baselines. On VirtualHome, our method achieves a new state-of-the-art success rate of 56%. In addition, our approach reduces input and output token usage by 55-87% and 41-70%, respectively, underscoring the importance of token-aware evaluation metrics in the development of scalable multi-agent LLM systems. The code and resources are available at: https://anonymous.4open.science/r/CodifyingAgent-5A86
TrajFlow: Multi-modal Motion Prediction via Flow Matching
Yan, Qi, Zhang, Brian, Zhang, Yutong, Yang, Daniel, White, Joshua, Chen, Di, Liu, Jiachao, Liu, Langechuan, Zhuang, Binnan, Shi, Shaoshuai, Liao, Renjie
-- Efficient and accurate motion prediction is crucial for ensuring safety and informed decision-making in autonomous driving, particularly under dynamic real-world conditions that necessitate multi-modal forecasts. We introduce TrajFlow, a novel flow matching-based motion prediction framework that addresses the scalability and efficiency challenges of existing generative trajectory prediction methods. Unlike conventional generative approaches that employ i.i.d . Moreover, we propose a ranking loss based on the Plackett-Luce distribution to improve uncertainty estimation of predicted trajectories. Additionally, we design a self-conditioning training technique that reuses the model's own predictions to construct noisy inputs during a second forward pass, thereby improving generalization and accelerating inference. Extensive experiments on the large-scale Waymo Open Motion Dataset (WOMD) demonstrate that TrajFlow achieves state-of-the-art performance across various key metrics, underscoring its effectiveness for safety-critical autonomous driving applications. The code and other details are available on the project website https://traj-flow.github.io/ .
Effective Explanations for Belief-Desire-Intention Robots: When and What to Explain
Wang, Cong, Calandra, Roberto, Klös, Verena
When robots perform complex and context-dependent tasks in our daily lives, deviations from expectations can confuse users. Explanations of the robot's reasoning process can help users to understand the robot intentions. However, when to provide explanations and what they contain are important to avoid user annoyance. We have investigated user preferences for explanation demand and content for a robot that helps with daily cleaning tasks in a kitchen. Our results show that users want explanations in surprising situations and prefer concise explanations that clearly state the intention behind the confusing action and the contextual factors that were relevant to this decision. Based on these findings, we propose two algorithms to identify surprising actions and to construct effective explanations for Belief-Desire-Intention (BDI) robots. Our algorithms can be easily integrated in the BDI reasoning process and pave the way for better human-robot interaction with context- and user-specific explanations.
Curated Collaborative AI Edge with Network Data Analytics for B5G/6G Radio Access Networks
Ali, Sardar Jaffar, Raza, Syed M., Le, Duc-Tai, Challa, Rajesh, Chung, Min Young, Shroff, Ness, Choo, Hyunseung
Despite advancements, Radio Access Networks (RAN) still account for over 50\% of the total power consumption in 5G networks. Existing RAN split options do not fully harness data potential, presenting an opportunity to reduce operational expenditures. This paper addresses this opportunity through a twofold approach. First, highly accurate network traffic and user predictions are achieved using the proposed Curated Collaborative Learning (CCL) framework, which selectively collaborates with relevant correlated data for traffic forecasting. CCL optimally determines whom, when, and what to collaborate with, significantly outperforming state-of-the-art approaches, including global, federated, personalized federated, and cyclic institutional incremental learnings by 43.9%, 39.1%, 40.8%, and 31.35%, respectively. Second, the Distributed Unit Pooling Scheme (DUPS) is proposed, leveraging deep reinforcement learning and prediction inferences from CCL to reduce the number of active DU servers efficiently. DUPS dynamically redirects traffic from underutilized DU servers to optimize resource use, improving energy efficiency by up to 89% over conventional strategies, translating into substantial monetary benefits for operators. By integrating CCL-driven predictions with DUPS, this paper demonstrates a transformative approach for minimizing energy consumption and operational costs in 5G RANs, significantly enhancing efficiency and cost-effectiveness.
Responsibility Gap and Diffusion in Sequential Decision-Making Mechanisms
Responsibility has long been a subject of study in law and philosophy. More recently, it became a focus of AI literature. The article investigates the computational complexity of two important properties of responsibility in collective decision-making: diffusion and gap. It shows that the sets of diffusion-free and gap-free decision-making mechanisms are $Π_2$-complete and $Π_3$-complete, respectively. At the same time, the intersection of these classes is $Π_2$-complete.
Mind2Web 2: Evaluating Agentic Search with Agent-as-a-Judge
Gou, Boyu, Huang, Zanming, Ning, Yuting, Gu, Yu, Lin, Michael, Qi, Weijian, Kopanev, Andrei, Yu, Botao, Gutiérrez, Bernal Jiménez, Shu, Yiheng, Song, Chan Hee, Wu, Jiaman, Chen, Shijie, Moussa, Hanane Nour, Zhang, Tianshu, Xie, Jian, Li, Yifei, Xue, Tianci, Liao, Zeyi, Zhang, Kai, Zheng, Boyuan, Cai, Zhaowei, Rozgic, Viktor, Ziyadi, Morteza, Sun, Huan, Su, Yu
Agentic search such as Deep Research systems-where agents autonomously browse the web, synthesize information, and return comprehensive citation-backed answers-represents a major shift in how users interact with web-scale information. While promising greater efficiency and cognitive offloading, the growing complexity and open-endedness of agentic search have outpaced existing evaluation benchmarks and methodologies, which largely assume short search horizons and static answers. In this paper, we introduce Mind2Web 2, a benchmark of 130 realistic, high-quality, and long-horizon tasks that require real-time web browsing and extensive information synthesis, constructed with over 1000 hours of human labor. To address the challenge of evaluating time-varying and complex answers, we propose a novel Agent-as-a-Judge framework. Our method constructs task-specific judge agents based on a tree-structured rubric design to automatically assess both answer correctness and source attribution. We conduct a comprehensive evaluation of ten frontier agentic search systems and human performance, along with a detailed error analysis to draw insights for future development. The best-performing system, OpenAI Deep Research, can already achieve 50-70% of human performance while spending half the time, highlighting its great potential. Altogether, Mind2Web 2 provides a rigorous foundation for developing and benchmarking the next generation of agentic search systems.
DynamiCare: A Dynamic Multi-Agent Framework for Interactive and Open-Ended Medical Decision-Making
Shang, Tianqi, He, Weiqing, Zheng, Charles, Li, Lingyao, Shen, Li, Zhao, Bingxin
The rise of Large Language Models (LLMs) has enabled the development of specialized AI agents with domain-specific reasoning and interaction capabilities, particularly in healthcare. While recent frameworks simulate medical decision-making, they largely focus on single-turn tasks where a doctor agent receives full case information upfront -- diverging from the real-world diagnostic process, which is inherently uncertain, interactive, and iterative. In this paper, we introduce MIMIC-Patient, a structured dataset built from the MIMIC-III electronic health records (EHRs), designed to support dynamic, patient-level simulations. Building on this, we propose DynamiCare, a novel dynamic multi-agent framework that models clinical diagnosis as a multi-round, interactive loop, where a team of specialist agents iteratively queries the patient system, integrates new information, and dynamically adapts its composition and strategy. We demonstrate the feasibility and effectiveness of DynamiCare through extensive experiments, establishing the first benchmark for dynamic clinical decision-making with LLM-powered agents.
Synergizing Logical Reasoning, Knowledge Management and Collaboration in Multi-Agent LLM System
Kostka, Adam, Chudziak, Jarosław A.
This paper explores the integration of advanced Multi-Agent Systems (MAS) techniques to develop a team of agents with enhanced logical reasoning, long-term knowledge retention, and Theory of Mind (ToM) capabilities. By uniting these core components with optimized communication protocols, we create a novel framework called SynergyMAS, which fosters collaborative teamwork and superior problem-solving skills. The system's effectiveness is demonstrated through a product development team case study, where our approach significantly enhances performance and adaptability. These findings highlight SynergyMAS's potential to tackle complex, real-world challenges.