Agents
Design for One, Deploy for Many: Navigating Tree Mazes with Multiple Agents
Argote-Gerald, Jahir, Miyauchi, Genki, Rau, Julian, Trodden, Paul, Gross, Roderich
Maze-like environments, such as cave and pipe networks, pose unique challenges for multiple robots to coordinate, including communication constraints and congestion. To address these challenges, we propose a distributed multi-agent maze traversal algorithm for environments that can be represented by acyclic graphs. It uses a leader-switching mechanism where one agent, assuming a head role, employs any single-agent maze solver while the other agents each choose an agent to follow. The head role gets transferred to neighboring agents where necessary, ensuring it follows the same path as a single agent would. The multi-agent maze traversal algorithm is evaluated in simulations with groups of up to 300 agents, various maze sizes, and multiple single-agent maze solvers. It is compared against strategies that are naïve, or assume either global communication or full knowledge of the environment. The algorithm outperforms the naïve strategy in terms of makespan and sum-of-fuel. It is superior to the global-communication strategy in terms of makespan but is inferior to it in terms of sum-of-fuel. The findings suggest it is asymptotically equivalent to the full-knowledge strategy with respect to either metric. Moreover, real-world experiments with up to 20 Pi-puck robots confirm the feasibility of the approach.
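The two metrics named in the abstract can be illustrated with a minimal sketch. It assumes the definitions commonly used in multi-agent path finding (makespan as the time at which the last agent finishes, sum-of-fuel as the total number of moves with waiting excluded); the paper may define them differently. The function names and the path representation (one grid cell per time step, ending when the agent arrives) are hypothetical.

```python
from typing import Dict, List, Tuple

Cell = Tuple[int, int]  # hypothetical grid coordinate


def makespan(paths: Dict[str, List[Cell]]) -> int:
    """Time step at which the last agent reaches the end of its path."""
    # A path with k entries spans k - 1 time steps.
    return max(len(path) - 1 for path in paths.values())


def sum_of_fuel(paths: Dict[str, List[Cell]]) -> int:
    """Total number of moves over all agents; waiting in place costs no fuel."""
    return sum(
        1
        for path in paths.values()
        for here, there in zip(path, path[1:])
        if here != there
    )


# Two agents walk the same corridor; the second waits one step to avoid congestion.
paths = {
    "agent_1": [(0, 0), (0, 1), (0, 2)],
    "agent_2": [(0, 0), (0, 0), (0, 1), (0, 2)],
}
print(makespan(paths), sum_of_fuel(paths))
```

In this example the waiting step increases the makespan but not the sum-of-fuel, which is why a strategy can rank differently under the two metrics.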
The Denario project: Deep knowledge AI agents for scientific discovery
Villaescusa-Navarro, Francisco, Bolliet, Boris, Villanueva-Domingo, Pablo, Bayer, Adrian E., Acquah, Aidan, Amancharla, Chetana, Barzilay-Siegal, Almog, Bermejo, Pablo, Bilodeau, Camille, Ramírez, Pablo Cárdenas, Cranmer, Miles, França, Urbano L., Hahn, ChangHoon, Jiang, Yan-Fei, Jimenez, Raul, Lee, Jun-Young, Lerario, Antonio, Mamun, Osman, Meier, Thomas, Ojha, Anupam A., Protopapas, Pavlos, Roy, Shimanto, Spergel, David N., Tarancón-Álvarez, Pedro, Tiwari, Ujjwal, Viel, Matteo, Wadekar, Digvijay, Wang, Chi, Wang, Bonny Y., Xu, Licong, Yovel, Yossi, Yue, Shuwen, Zhou, Wen-Han, Zhu, Qiyao, Zou, Jiajun, Zubeldia, Íñigo
We present Denario, an AI multi-agent system designed to serve as a scientific research assistant. Denario can perform many different tasks, such as generating ideas, checking the literature, developing research plans, writing and executing code, making plots, and drafting and reviewing a scientific paper. The system has a modular architecture, allowing it to handle specific tasks, such as generating an idea, or carrying out end-to-end scientific analysis using Cmbagent as a deep-research backend. In this work, we describe Denario and its modules in detail, and illustrate its capabilities by presenting multiple papers it generated in many different scientific disciplines such as astrophysics, biology, biophysics, biomedical informatics, chemistry, material science, mathematical physics, medicine, neuroscience and planetary science. Denario also excels at combining ideas from different disciplines, and we illustrate this by showing a paper that applies methods from quantum physics and machine learning to astrophysical data. We report the evaluations performed on these papers by domain experts, who provided both numerical scores and review-like feedback. We then highlight the strengths, weaknesses, and limitations of the current system. Finally, we discuss the ethical implications of AI-driven research and reflect on how such technology relates to the philosophy of science. We publicly release the code at https://github.com/AstroPilot-AI/Denario. A Denario demo can also be run directly on the web at https://huggingface.co/spaces/astropilot-ai/Denario, and the full app will be deployed on the cloud.
Artificially intelligent agents in the social and behavioral sciences: A history and outlook
Holme, Petter, Tsvetkova, Milena
We review the historical development and current trends of artificially intelligent agents (agentic AI) in the social and behavioral sciences: from the first programmable computers, and social simulations soon thereafter, to today's experiments with large language models. This overview emphasizes the role of AI in the scientific process and the changes brought about, both through technological advancements and the broader evolution of science from around 1950 to the present. The specific points we cover include, among other topics: the challenges of presenting the first social simulation studies to a world unaware of computers, the rise of social systems science, intelligent game theoretic agents, the age of big data and the epistemic upheaval in its wake, and the current enthusiasm around applications of generative AI. A pervasive theme is how deeply entwined we are with the technologies we use to understand ourselves.
PartnerMAS: An LLM Hierarchical Multi-Agent Framework for Business Partner Selection on High-Dimensional Features
Li, Lingyao, Wu, Haolun, Li, Zhenkun, Hu, Jiabei, Wang, Yu, Huang, Xiaoshan, Hua, Wenyue, Wang, Wenqian
High-dimensional decision-making tasks, such as business partner selection, involve evaluating large candidate pools with heterogeneous numerical, categorical, and textual features. PartnerMAS is a hierarchical multi-agent framework that decomposes evaluation into three layers: a Planner Agent that designs strategies, Specialized Agents that perform role-specific assessments, and a Supervisor Agent that integrates their outputs. To support systematic evaluation, we also introduce a curated benchmark dataset of venture capital co-investments, featuring diverse firm attributes and ground-truth syndicates. PartnerMAS consistently outperforms single-agent and debate-based multi-agent baselines, achieving up to 10-15% higher match rates. Analysis of agent reasoning shows that planners are most responsive to domain-informed prompts, specialists produce complementary feature coverage, and supervisors play an important role in aggregation. Our implementation is available at this anonymous link.
In real-world decision-making, practitioners often navigate high-dimensional data including extensive option sets and numerous evaluative features (Sandanayake et al., 2018; Sigle et al., 2023). Business partner selection, which includes partner shortlisting and strategic alliance formation, exemplifies this challenge (Mindruta et al., 2016): firms often face a vast pool of potential candidates, each described by diverse attributes ranging from quantitative indicators (e.g., financial metrics, geographic presence) to text-rich information (e.g., strategic fit, investment preferences) (Shah & Swaminathan, 2008). The scale and complexity of such data can easily overwhelm human decision-makers, incurring significant costs (Li et al., 2008). This underscores the need for intelligent systems capable of analyzing large candidate sets and diverse features.
Large language models (LLMs) have emerged as promising tools for addressing reasoning tasks in data-rich domains (Lee et al., 2025; Mischler et al., 2024). With appropriate prompting (e.g., few-shot learning) or information retrieval techniques (e.g., RAG), these models can identify salient features using only feature and task descriptions, achieving performance comparable to established methods (Li et al., 2025a; Jeong et al., 2024).
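The three-layer decomposition described in the abstract (planner, specialists, supervisor) can be sketched as plain function composition. This is an illustrative sketch only, not the PartnerMAS implementation: the agents here are deterministic stand-ins rather than LLM calls, and every name (`Candidate`, `planner`, `supervisor`, the feature names, the averaging rule) is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Candidate:
    name: str
    features: Dict[str, float]  # hypothetical, already-normalized scores


# A specialist scores one candidate from the viewpoint of a single feature.
Specialist = Callable[[Candidate], float]


def planner(feature_names: List[str]) -> Dict[str, Specialist]:
    """Planner layer (stand-in): assign one specialist per feature."""
    def make(feature: str) -> Specialist:
        return lambda candidate: candidate.features.get(feature, 0.0)
    return {feature: make(feature) for feature in feature_names}


def supervisor(
    candidates: List[Candidate],
    specialists: Dict[str, Specialist],
    top_k: int,
) -> List[str]:
    """Supervisor layer (stand-in): average specialist scores, keep the top_k."""
    def aggregate(candidate: Candidate) -> float:
        scores = [score(candidate) for score in specialists.values()]
        return sum(scores) / len(scores)
    ranked = sorted(candidates, key=aggregate, reverse=True)
    return [candidate.name for candidate in ranked[:top_k]]


candidates = [
    Candidate("firm_a", {"financial": 0.9, "geographic": 0.2}),
    Candidate("firm_b", {"financial": 0.4, "geographic": 0.9}),
    Candidate("firm_c", {"financial": 0.1, "geographic": 0.3}),
]
specialists = planner(["financial", "geographic"])
shortlist = supervisor(candidates, specialists, top_k=2)
print(shortlist)
```

Keeping the aggregation in a separate supervisor step means the specialists stay independent of one another, which is the property the layered design relies on.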
Mano Technical Report
Fu, Tianyu, Su, Anyang, Zhao, Chenxu, Wang, Hanning, Wu, Minghui, Yu, Zhe, Hu, Fei, Shi, Mingjia, Dong, Wei, Wang, Jiayao, Chen, Yuyang, Yu, Ruiyang, Peng, Siran, Li, Menglin, Huang, Nan, Wei, Haitian, Yu, Jiawei, Xin, Yi, Zhao, Xilin, Gu, Kai, Jiang, Ping, Zhou, Sifan, Wang, Shuo
Graphical user interfaces (GUIs) are the primary medium for human-computer interaction, yet automating GUI interactions remains challenging due to the complexity of visual elements, dynamic environments, and the need for multi-step reasoning. Existing methods based on vision-language models (VLMs) often suffer from limited resolution, domain mismatch, and insufficient sequential decision-making capability. To address these issues, we propose Mano, a robust GUI agent built upon a multi-modal foundation model pre-trained on extensive web and computer system data. Our approach integrates a novel simulated environment for high-fidelity data generation, a three-stage training pipeline (supervised fine-tuning, offline reinforcement learning, and online reinforcement learning), and a verification module for error recovery. Mano demonstrates state-of-the-art performance on multiple GUI benchmarks, including Mind2Web and OSWorld, achieving significant improvements in success rate and operational accuracy. Our work provides new insights into the effective integration of reinforcement learning with VLMs for practical GUI agent deployment, highlighting the importance of domain-specific data, iterative training, and holistic reward design.
RoboOS-NeXT: A Unified Memory-based Framework for Lifelong, Scalable, and Robust Multi-Robot Collaboration
Tan, Huajie, Chi, Cheng, Chen, Xiansheng, Ji, Yuheng, Zhao, Zhongxia, Hao, Xiaoshuai, Lyu, Yaoxu, Cao, Mingyu, Zhao, Junkai, Lyu, Huaihai, Zhou, Enshen, Chen, Ning, Fu, Yankai, Peng, Cheng, Guo, Wei, Liang, Dong, Chen, Zhuo, Lyu, Mengsi, He, Chenrui, Ao, Yulong, Lin, Yonghua, Wang, Pengwei, Wang, Zhongyuan, Zhang, Shanghang
The proliferation of collaborative robots across diverse tasks and embodiments presents a central challenge: achieving lifelong adaptability, scalable coordination, and robust scheduling in multi-agent systems. Existing approaches, from vision-language-action (VLA) models to hierarchical frameworks, fall short due to their reliance on limited or individual-agent memory. This fundamentally constrains their ability to learn over long horizons, scale to heterogeneous teams, or recover from failures, highlighting the need for a unified memory representation. To address these limitations, we introduce RoboOS-NeXT, a unified memory-based framework for lifelong, scalable, and robust multi-robot collaboration. At the core of RoboOS-NeXT is the novel Spatio-Temporal-Embodiment Memory (STEM), which integrates spatial scene geometry, temporal event history, and embodiment profiles into a shared representation. This memory-centric design is integrated into a brain-cerebellum framework, where a high-level brain model performs global planning by retrieving and updating STEM, while low-level controllers execute actions locally. This closed loop between cognition, memory, and execution enables dynamic task allocation, fault-tolerant collaboration, and consistent state synchronization. We conduct extensive experiments spanning complex coordination tasks in restaurants, supermarkets, and households. Our results demonstrate that RoboOS-NeXT achieves superior performance across heterogeneous embodiments, validating its effectiveness in enabling lifelong, scalable, and robust multi-robot collaboration. Project website: https://flagopen.github.io/RoboOS/
Oryx: a Scalable Sequence Model for Many-Agent Coordination in Offline MARL
Formanek, Claude, Mahjoub, Omayma, Nessir, Louay Ben, Abramowitz, Sasha, de Kock, Ruan, Khlifi, Wiem, Rajaonarivonivelomanantsoa, Daniel, Toit, Simon Du, Fokam, Arnol, Singh, Siddarth, Sob, Ulrich Mbou, Chalumeau, Felix, Pretorius, Arnu
A key challenge in offline multi-agent reinforcement learning (MARL) is achieving effective many-agent multi-step coordination in complex environments. In this work, we propose Oryx, a novel algorithm for offline cooperative MARL to directly address this challenge. Oryx adapts the recently proposed retention-based architecture Sable and combines it with a sequential form of implicit constraint Q-learning (ICQ), to develop a novel offline autoregressive policy update scheme. This allows Oryx to solve complex coordination challenges while maintaining temporal coherence over long trajectories. We evaluate Oryx across a diverse set of benchmarks from prior works -- SMAC, RWARE, and Multi-Agent MuJoCo -- covering tasks of both discrete and continuous control, varying in scale and difficulty. Oryx achieves state-of-the-art performance on more than 80% of the 65 tested datasets, outperforming prior offline MARL methods and demonstrating robust generalisation across domains with many agents and long horizons. Finally, we introduce new datasets to push the limits of many-agent coordination in offline MARL, and demonstrate Oryx's superior ability to scale effectively in such settings.
Paper2Poster: Towards Multimodal Poster Automation from Scientific Papers
Pang, Wei, Lin, Kevin Qinghong, Jian, Xiangru, He, Xi, Torr, Philip
Academic poster generation is a crucial yet challenging task in scientific communication, requiring the compression of long-context interleaved documents into a single, visually coherent page. To address this challenge, we introduce the first benchmark and metric suite for poster generation, which pairs recent conference papers with author-designed posters and evaluates outputs on (i) Visual Quality: semantic alignment with human posters, (ii) Textual Coherence: language fluency, (iii) Holistic Assessment: six fine-grained aesthetic and informational criteria scored by a VLM-as-judge, and notably (iv) PaperQuiz: the poster's ability to convey core paper content as measured by VLMs answering generated quizzes. Building on this benchmark, we propose PosterAgent, a top-down, visual-in-the-loop multi-agent pipeline: the (a) Parser distills the paper into a structured asset library; the (b) Planner aligns text-visual pairs into a binary-tree layout that preserves reading order and spatial balance; and the (c) Painter-Commenter loop refines each panel by executing rendering code and using VLM feedback to eliminate overflow and ensure alignment. In our comprehensive evaluation, we find that GPT-4o outputs, though visually appealing at first glance, often exhibit noisy text and poor PaperQuiz scores, and we find that reader engagement is the primary aesthetic bottleneck, as human-designed posters rely largely on visual semantics to convey meaning. Our fully open-source variants (e.g. based on the Qwen-2.5 series) outperform existing 4o-driven multi-agent systems across nearly all metrics, while using 87% fewer tokens. It transforms a 22-page paper into a finalized yet editable .pptx poster, all for just $0.005. These findings chart clear directions for the next generation of fully automated poster-generation models. The code and datasets are available at https://github.com/Paper2Poster/Paper2Poster.
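The idea of a binary-tree layout mentioned in the abstract can be illustrated with a short sketch: each internal node splits its rectangle between two children, and each leaf is a panel. This is not PosterAgent's actual Planner; the `Node` fields, the split convention, and the poster size below are assumptions chosen for the example.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x, y, width, height)


@dataclass
class Node:
    panel: Optional[str] = None      # set only on leaves
    split: str = "h"                 # "h": children side by side, "v": stacked
    ratio: float = 0.5               # share of the box given to the first child
    first: Optional["Node"] = None
    second: Optional["Node"] = None


def layout(node: Node, box: Box) -> Dict[str, Box]:
    """Assign a rectangle to every leaf panel by recursive splitting."""
    x, y, w, h = box
    if node.panel is not None:
        return {node.panel: box}
    if node.split == "h":
        w1 = w * node.ratio
        box_a, box_b = (x, y, w1, h), (x + w1, y, w - w1, h)
    else:
        h1 = h * node.ratio
        box_a, box_b = (x, y, w, h1), (x, y + h1, w, h - h1)
    # Visiting the first child before the second keeps panels in reading order.
    boxes = layout(node.first, box_a)
    boxes.update(layout(node.second, box_b))
    return boxes


tree = Node(
    split="h",
    ratio=0.5,
    first=Node(panel="Introduction"),
    second=Node(
        split="v",
        ratio=0.25,
        first=Node(panel="Method"),
        second=Node(panel="Results"),
    ),
)
boxes = layout(tree, (0.0, 0.0, 48.0, 36.0))
for name, rect in boxes.items():
    print(name, rect)
```

Because every split partitions its parent rectangle exactly, the panels tile the page without gaps or overlap, and changing a single `ratio` rebalances only the subtree below it.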
MedAgentBoard: Benchmarking Multi-Agent Collaboration with Conventional Methods for Diverse Medical Tasks
Zhu, Yinghao, He, Ziyi, Hu, Haoran, Zheng, Xiaochen, Zhang, Xichen, Wang, Zixiang, Gao, Junyi, Ma, Liantao, Yu, Lequan
The rapid advancement of Large Language Models (LLMs) has stimulated interest in multi-agent collaboration for addressing complex medical tasks. However, the practical advantages of multi-agent collaboration approaches remain insufficiently understood. Existing evaluations often lack generalizability, failing to cover diverse tasks reflective of real-world clinical practice, and frequently omit rigorous comparisons against both single-LLM-based and established conventional methods. To address this critical gap, we introduce MedAgentBoard, a comprehensive benchmark for the systematic evaluation of multi-agent collaboration, single-LLM, and conventional approaches. MedAgentBoard encompasses four diverse medical task categories: (1) medical (visual) question answering, (2) lay summary generation, (3) structured Electronic Health Record (EHR) predictive modeling, and (4) clinical workflow automation, across text, medical images, and structured EHR data. Our extensive experiments reveal a nuanced landscape: while multi-agent collaboration demonstrates benefits in specific scenarios, such as enhancing task completeness in clinical workflow automation, it does not consistently outperform advanced single LLMs (e.g., in textual medical QA) or, critically, specialized conventional methods that generally maintain better performance in tasks like medical VQA and EHR-based prediction. MedAgentBoard offers a vital resource and actionable insights, emphasizing the necessity of a task-specific, evidence-based approach to selecting and developing AI solutions in medicine. It underscores that the inherent complexity and overhead of multi-agent collaboration must be carefully weighed against tangible performance gains. All code, datasets, detailed prompts, and experimental results are open-sourced at https://medagentboard.netlify.app/.
Plasticity as the Mirror of Empowerment
Abel, David, Bowling, Michael, Barreto, André, Dabney, Will, Dong, Shi, Hansen, Steven, Harutyunyan, Anna, Khetarpal, Khimya, Lyle, Clare, Pascanu, Razvan, Piliouras, Georgios, Precup, Doina, Richens, Jonathan, Rowland, Mark, Schaul, Tom, Singh, Satinder
Agents are minimally entities that are influenced by their past observations and act to influence future observations. This latter capacity is captured by empowerment, which has served as a vital framing concept across artificial intelligence and cognitive science. This former capacity, however, is equally foundational: In what ways, and to what extent, can an agent be influenced by what it observes? In this paper, we ground this concept in a universal agent-centric measure that we refer to as plasticity, and reveal a fundamental connection to empowerment. Following a set of desiderata on a suitable definition, we define plasticity using a new information-theoretic quantity we call the generalized directed information. We show that this new quantity strictly generalizes the directed information introduced by Massey (1990) while preserving all of its desirable properties. Under this definition, we find that plasticity is well thought of as the mirror of empowerment: The two concepts are defined using the same measure, with only the direction of influence reversed. Our main result establishes a tension between the plasticity and empowerment of an agent, suggesting that agent design needs to be mindful of both characteristics. We explore the implications of these findings, and suggest that plasticity, empowerment, and their relationship are essential to understanding agency.
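For reference, the directed information of Massey (1990) that the abstract builds on is usually written as below, with $X^i = (X_1, \dots, X_i)$ denoting a sequence prefix. The abstract does not state the generalized quantity itself, so it is not reproduced here; the notation is the standard one and may differ from the paper's.

```latex
% Directed information from X^n to Y^n (Massey, 1990)
I(X^n \to Y^n) = \sum_{i=1}^{n} I\left(X^i ; Y_i \mid Y^{i-1}\right)

% For comparison, mutual information by the chain rule conditions on the full sequence X^n
I(X^n ; Y^n) = \sum_{i=1}^{n} I\left(X^n ; Y_i \mid Y^{i-1}\right)
```

The only difference between the two sums is that directed information uses the prefix $X^i$ rather than the full sequence $X^n$, which is what makes the measure sensitive to the direction of influence.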