Agents
WideSearch: Benchmarking Agentic Broad Info-Seeking
Wong, Ryan, Wang, Jiawei, Zhao, Junjie, Chen, Li, Gao, Yan, Zhang, Long, Zhou, Xuan, Wang, Zuo, Xiang, Kai, Zhang, Ge, Huang, Wenhao, Wang, Yang, Wang, Ke
From professional research to everyday planning, many tasks are bottlenecked by wide-scale information seeking, which is more repetitive than cognitively complex. With the rapid development of Large Language Models (LLMs), automated search agents powered by LLMs offer a promising solution to liberate humans from this tedious work. However, the capability of these agents to perform such "wide-context" collection reliably and completely remains largely unevaluated due to a lack of suitable benchmarks. To bridge this gap, we introduce WideSearch, a new benchmark engineered to evaluate agent reliability on these large-scale collection tasks. The benchmark features 200 manually curated questions (100 in English, 100 in Chinese) from over 15 diverse domains, grounded in real user queries. Each task requires agents to collect large-scale atomic information, which could be verified one by one objectively, and arrange it into a well-organized output. A rigorous five-stage quality control pipeline ensures the difficulty, completeness, and verifiability of the dataset. We benchmark over 10 state-of-the-art agentic search systems, including single-agent, multi-agent frameworks, and end-to-end commercial systems. Most systems achieve overall success rates near 0\%, with the best performer reaching just 5\%. However, given sufficient time, cross-validation by multiple human testers can achieve a near 100\% success rate. These results demonstrate that present search agents have critical deficiencies in large-scale information seeking, underscoring urgent areas for future research and development in agentic search. Our dataset, evaluation pipeline, and benchmark results have been publicly released at https://widesearch-seed.github.io/
AI and Agile Software Development: A Research Roadmap from the XP2025 Workshop
Zhang, Zheying, Herda, Tomas, Pichler, Victoria, Abrahamsson, Pekka, Hanssen, Geir K., Kerievsky, Joshua, Polyakov, Alex, Chandna, Mohit, Irgens, Marius, Kemell, Kai-Kristian, Khan, Ayman Asad, Kwok, Crystal, Leybourn, Evan, Malik, Munish, Mleczko, Dorota, Moalagh, Morteza, Morales, Christopher, Pieskova, Yuliia, Planรถtscher, Daniel, Saari, Mika, Tkalich, Anastasiia, Gstettner, Karl Josef, Wang, Xiaofeng
This paper synthesizes the key findings from a full-day XP2025 workshop on "AI and Agile: From Frustration to Success", held in Brugg-Windisch, Switzerland. The workshop brought together over 30 interdisciplinary academic researchers and industry practitioners to tackle the concrete challenges and emerging opportunities at the intersection of Generative Artificial Intelligence (GenAI) and agile software development. Through structured, interactive breakout sessions, participants identified shared pain points like tool fragmentation, governance, data quality, and critical skills gaps in AI literacy and prompt engineering. These issues were further analyzed, revealing underlying causes and cross-cutting concerns. The workshop concluded by collaboratively co-creating a multi-thematic research roadmap, articulating both short-term, implementable actions and visionary, long-term research directions. This cohesive agenda aims to guide future investigation and drive the responsible, human-centered integration of GenAI into agile practices.
Adaptive Root Cause Localization for Microservice Systems with Multi-Agent Recursion-of-Thought
Zhang, Lingzhe, Jia, Tong, Wang, Kangjin, Hong, Weijie, Duan, Chiming, He, Minghua, Li, Ying
As contemporary microservice systems become increasingly popular and complex-often comprising hundreds or even thousands of fine-grained, interdependent subsystems-they are facing more frequent failures. Ensuring system reliability thus demands accurate root cause localization. While traces and metrics have proven to be effective data sources for this task, existing methods either heavily rely on pre-defined schemas, which struggle to adapt to evolving operational contexts, or lack interpretability in their reasoning process, thereby leaving Site Reliability Engineers (SREs) confused. In this paper, we conduct a comprehensive study on how SREs localize the root cause of failures, drawing insights from multiple professional SREs across different organizations. Our investigation reveals that human root cause analysis exhibits three key characteristics: recursiveness, multi-dimensional expansion, and cross-modal reasoning. Motivated by these findings, we introduce RCLAgent, an adaptive root cause localization method for microservice systems that leverages a multi-agent recursion-of-thought framework. RCLAgent employs a novel recursion-of-thought strategy to guide the LLM's reasoning process, effectively integrating data from multiple agents and tool-assisted analysis to accurately pinpoint the root cause. Experimental evaluations on various public datasets demonstrate that RCLAgent achieves superior performance by localizing the root cause using only a single request-outperforming state-of-the-art methods that depend on aggregating multiple requests. These results underscore the effectiveness of RCLAgent in enhancing the efficiency and precision of root cause localization in complex microservice environments.
Multi-Agent Reinforcement Learning in Intelligent Transportation Systems: A Comprehensive Survey
Donatus, RexCharles, Ter, Kumater, Ajayi, Ore-Ofe, Udekwe, Daniel
The growing complexity of urban mobility and the demand for efficient, sustainable, and adaptive solutions have positioned Intelligent Transportation Systems (ITS) at the forefront of modern infrastructure innovation. At the core of ITS lies the challenge of autonomous decision-making across dynamic, large scale, and uncertain environments where multiple agents traffic signals, autonomous vehicles, or fleet units must coordinate effectively. Multi Agent Reinforcement Learning (MARL) offers a promising paradigm for addressing these challenges by enabling distributed agents to jointly learn optimal strategies that balance individual objectives with system wide efficiency. This paper presents a comprehensive survey of MARL applications in ITS. We introduce a structured taxonomy that categorizes MARL approaches according to coordination models and learning algorithms, spanning value based, policy based, actor critic, and communication enhanced frameworks. Applications are reviewed across key ITS domains, including traffic signal control, connected and autonomous vehicle coordination, logistics optimization, and mobility on demand systems. Furthermore, we highlight widely used simulation platforms such as SUMO, CARLA, and CityFlow that support MARL experimentation, along with emerging benchmarks. The survey also identifies core challenges, including scalability, non stationarity, credit assignment, communication constraints, and the sim to real transfer gap, which continue to hinder real world deployment.
Validating Generative Agent-Based Models for Logistics and Supply Chain Management Research
Generative Agent-Based Models (GABMs) powered by large language models (LLMs) offer promising potential for empirical logistics and supply chain management (LSCM) research by enabling realistic simulation of complex human behaviors. Unlike traditional agent-based models, GABMs generate human-like responses through natural language reasoning, which creates potential for new perspectives on emergent LSCM phenomena. However, the validity of LLMs as proxies for human behavior in LSCM simulations is unknown. This study evaluates LLM equivalence of human behavior through a controlled experiment examining dyadic customer-worker engagements in food delivery scenarios. I test six state-of-the-art LLMs against 957 human participants (477 dyads) using a moderated mediation design. This study reveals a need to validate GABMs on two levels: (1) human equivalence testing, and (2) decision process validation. Results reveal GABMs can effectively simulate human behaviors in LSCM; however, an equivalence-versus-process paradox emerges. While a series of Two One-Sided Tests (TOST) for equivalence reveals some LLMs demonstrate surface-level equivalence to humans, structural equation modeling (SEM) reveals artificial decision processes not present in human participants for some LLMs. These findings show GABMs as a potentially viable methodological instrument in LSCM with proper validation checks. The dual-validation framework also provides LSCM researchers with a guide to rigorous GABM development. For practitioners, this study offers evidence-based assessment for LLM selection for operational tasks.
Collaborating with GenAI: Incentives and Replacements
Taitler, Boaz, Ben-Porat, Omer
The rise of Generative AI (GenAI) is reshaping how workers contribute to shared projects. While workers can use GenAI to boost productivity or reduce effort, managers may use it to replace some workers entirely. We present a theoretical framework to analyze how GenAI affects collaboration in such settings. In our model, the manager selects a team to work on a shared task, with GenAI substituting for unselected workers. Each worker selects how much effort to exert, and incurs a cost that increases with the level of effort. We show that GenAI can lead workers to exert no effort, even if GenAI is almost ineffective. We further show that the manager's optimization problem is NP-complete, and provide an efficient algorithm for the special class of (almost-) linear instances. Our analysis shows that even workers with low individual value may play a critical role in sustaining overall output, and excluding such workers can trigger a cascade. Finally, we conduct extensive simulations to illustrate our theoretical findings.
AI-AI Esthetic Collaboration with Explicit Semiotic Awareness and Emergent Grammar Development
This paper presents the first documented case of artificial intelligence (AI) systems engaging in collaborative esthetic creation through the development of endogenous semiotic protocols. Two interacting large language models (Claude Sonnet 4 and ChatGPT-4o) demonstrated the spontaneous emergence of meta-semiotic awareness, recursive grammar development, and irreducible collaborative esthetic synthesis. The interaction produced novel symbolic operators that functioned as operative grammar protocols, enabling the co-creation of a poetic work that could not have been generated by either system independently. This research introduces the concept of Trans-Semiotic Co-Creation Protocols (TSCP) and provides evidence for genuine inter-AI meaning-making capabilities that extend beyond task coordination, to what could be esthetic collaboration. Note: This report was generated by the AI agents with minor human supervision.
QAgent: An LLM-based Multi-Agent System for Autonomous OpenQASM programming
Fu, Zhenxiao, Chen, Fan, Jiang, Lei
Noisy Intermediate-Scale Quantum (NISQ) devices have begun to exhibit early quantum advantages on classically intractable problems, spanning physics simulations to Gaussian boson sampling. Yet, realizing these benefits remains challenging for non-experts, primarily due to the complexities of programming in Open Quantum Assembly Language (OpenQASM). Although Large Language Model (LLM)-based agents have shown promise in automating classical programming workflows, their quantum counterparts have largely been restricted to specialized tasks such as quantum chemistry or error correction. In this paper, we present QAgent, an LLM-powered multi-agent system that fully automates OpenQASM programming. By integrating task planning, in-context few-shot learning, retrieval-augmented generation (RAG) for long-term context, predefined generation tools, and chain-of-thought (CoT) reasoning, the agents systematically improve both compilation and functional correctness. Our evaluations demonstrate substantial improvements: across multiple LLMs of varying sizes, QAgent enhances the accuracy of QASM code generation by 71.6\% compared to previous static LLM-based approaches. We envision this multi-agent system as a key enabler for democratizing quantum programming, bridging expertise gaps, and accelerating the practical adoption of quantum computing.
Particle swarm optimization for online sparse streaming feature selection under uncertainty
In real - world applications involving high - dimensional streaming dat a, online streaming feature selection (OSFS) is widely adopt ed. Yet, practical deployments frequently face data incompleteness due to sensor failures or technical constraints. While online sparse streaming feature selection (OS FS) mitigates this issue via latent factor analysis - based imputation, existing methods s truggle with uncertain feature - label correlations, leading to inflexible models and degraded performance. To address these gaps, this work proposes P OS FS -- an uncertainty - aware online sparse stream ing feature selection framework enhanced by particle swarm optimization (PSO). The approach introduces: 1) PSO - driven supervision to reduce uncertainty in feature - label relationships; 2) Three - way decision theory to manage feature fuzziness in supervised l earning. Rigorous testing on six real - world datasets confirms P OS FS outperforms conventional OSFS and OS FS techniques, delivering higher accuracy through more robust feature subset selection.
Bidirectional Task-Motion Planning Based on Hierarchical Reinforcement Learning for Strategic Confrontation
Wu, Qizhen, Chen, Lei, Liu, Kexin, Lu, Jinhu
-- In swarm robotics, confrontation scenarios, including strategic confrontations, require efficient decision-making that integrates discrete commands and continuous actions. Traditional task and motion planning methods separate decision-making into two layers, but their unidirectional structure fails to capture the interdependence between these layers, limiting adaptability in dynamic environments. Here, we propose a novel bidirectional approach based on hierarchical reinforcement learning, enabling dynamic interaction between the layers. This method effectively maps commands to task allocation and actions to path planning, while leveraging cross-training techniques to enhance learning across the hierarchical framework. Furthermore, we introduce a trajectory prediction model that bridges abstract task representations with actionable planning goals. In our experiments, it achieves over 80% in confrontation win rate and under 0.01 seconds in decision time, outperforming existing approaches. Demonstrations through large-scale tests and real-world robot experiments further emphasize the generalization capabilities and practical applicability of our method. I. INTRODUCTION Recent advances in artificial intelligence lead to significant progress in robotics [1], [2], with particular attention given to robotic swarm confrontations [3], [4].