Agents
Trust-Aware Embodied Bayesian Persuasion for Mixed-Autonomy
Peng, Shaoting, Driggs-Campbell, Katherine, Dong, Roy
Safe and efficient interaction between autonomous vehicles (AVs) and human-driven vehicles (HVs) is a critical challenge for future transportation systems. While game-theoretic models capture how AVs influence HVs, they often suffer from a long-term decay of influence and can be perceived as manipulative, eroding the human's trust. This can paradoxically lead to riskier human driving behavior over repeated interactions. In this paper, we address this challenge by proposing the Trust-Aware Embodied Bayesian Persuasion (TA-EBP) framework. Our work makes three key contributions: First, we apply Bayesian persuasion to model communication at traffic intersections, offering a transparent alternative to traditional game-theoretic models. Second, we introduce a trust parameter to the persuasion framework, deriving a theorem for the minimum trust level required for influence. Finally, we ground the abstract signals of Bayesian persuasion theory into a continuous, physically meaningful action space, deriving a second theorem for the optimal signal magnitude, realized as an AV's forward nudge. Additionally, we validate our framework in a mixed-autonomy traffic simulation, demonstrating that TA-EBP successfully persuades HVs to drive more cautiously, eliminating collisions and improving traffic flow compared to baselines that either ignore trust or lack communication. Our work provides a transparent and non-strategic framework for influence in human-robot interaction, enhancing both safety and efficiency.
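The interplay between trust and persuasion described above can be illustrated with a toy binary-state example. This is only a sketch under stated assumptions, not the paper's model or theorems: the trust rule (a linear blend of prior and Bayesian posterior), the function names, and the numbers are all hypothetical.

```python
def bayes_posterior(prior, q):
    """P(risky | signal) when the sender always signals in the risky state
    and signals with probability q in the safe state."""
    return prior / (prior + (1.0 - prior) * q)


def trusted_belief(prior, posterior, trust):
    # ASSUMPTION: a receiver with trust t in [0, 1] moves only a fraction t
    # of the way from the prior to the Bayesian posterior.
    return (1.0 - trust) * prior + trust * posterior


def min_trust(prior, threshold):
    """Smallest trust at which even a fully revealing signal (posterior 1)
    lifts the receiver's belief to the caution threshold."""
    if prior >= threshold:
        return 0.0
    return (threshold - prior) / (1.0 - prior)


def optimal_q(prior, threshold, trust):
    """Largest signalling probability in the safe state that still persuades,
    or None if the receiver's trust is too low for any signal to work."""
    if prior >= threshold:
        return 1.0  # receiver is already cautious; the signal carries no burden
    if trust < min_trust(prior, threshold):
        return None
    # Posterior the signal must induce so the blended belief hits the threshold.
    required = (threshold - (1.0 - trust) * prior) / trust
    q = prior * (1.0 - required) / ((1.0 - prior) * required)
    return max(0.0, min(1.0, q))


q_full = optimal_q(prior=0.3, threshold=0.5, trust=1.0)
q_half = optimal_q(prior=0.3, threshold=0.5, trust=0.5)
print(q_full, q_half, min_trust(0.3, 0.5))
```

In this toy model, lower trust forces the sender to signal less often in the safe state (a more informative signal), and below the minimum trust level no signal persuades at all.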
Dynamic Agent Grouping ECBS: Scaling Windowed Multi-Agent Path Finding with Completeness Guarantees
Zhang, Tiannan, Veerapaneni, Rishi, Chan, Shao-Hung, Li, Jiaoyang, Likhachev, Maxim
Multi-Agent Path Finding (MAPF) is the problem of finding a set of collision-free paths for a team of agents. Although several MAPF methods that solve full-horizon MAPF have completeness guarantees, very few methods that plan partial paths do. Recent work introduced the Windowed Complete MAPF (WinC-MAPF) framework, which shows how windowed optimal MAPF solvers (e.g., SS-CBS) can use heuristic updates and disjoint agent groups to maintain completeness even when planning partial paths (Veerapaneni et al. 2024). A core limitation of WinC-MAPF is that it requires optimal MAPF solvers. Our main contribution is to extend WinC-MAPF by showing how a bounded-suboptimal solver can be used while maintaining completeness. In particular, we design Dynamic Agent Grouping ECBS (DAG-ECBS), which dynamically creates and plans agent groups while ensuring that each agent group's solution is bounded suboptimal. We prove that DAG-ECBS maintains completeness in the WinC-MAPF framework. DAG-ECBS shows improved scalability compared to SS-CBS and can outperform windowed ECBS without completeness guarantees. More broadly, our work serves as a blueprint for designing more MAPF methods that can use the WinC-MAPF framework.
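To make "collision-free" concrete, the following minimal sketch checks two paths for the standard vertex and edge (swap) conflicts, optionally only within a planning window. It is not DAG-ECBS or any part of the WinC-MAPF framework; the function name, the grid-cell path representation, and the convention that agents wait at their final cell are assumptions made for illustration.

```python
def first_conflict(path_a, path_b, window=None):
    """Return ("vertex", t) or ("edge", t) for the earliest conflict between
    two paths, or None. Paths are lists of hashable locations, one per timestep.
    If window is given, only timesteps 0..window are checked."""
    horizon = max(len(path_a), len(path_b))
    if window is not None:
        horizon = min(horizon, window + 1)

    def at(path, t):
        # ASSUMPTION: an agent that has finished its path waits at its last cell.
        return path[min(t, len(path) - 1)]

    for t in range(horizon):
        if at(path_a, t) == at(path_b, t):
            return ("vertex", t)  # both agents occupy the same cell at time t
        if (t + 1 < horizon
                and at(path_a, t) == at(path_b, t + 1)
                and at(path_a, t + 1) == at(path_b, t)):
            return ("edge", t)  # agents swap cells between t and t + 1
    return None


crossing_a = [(0, 0), (0, 1), (0, 2)]
crossing_b = [(0, 2), (0, 1), (0, 0)]
print(first_conflict(crossing_a, crossing_b))            # conflict in the full horizon
print(first_conflict(crossing_a, crossing_b, window=0))  # none inside a window of 0
```

The second call shows why windowed planning is harder to make complete: a plan can be conflict-free inside the window while still heading toward a conflict beyond it.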
Diagnostics of cognitive failures in multi-agent expert systems using dynamic evaluation protocols and subsequent mutation of the processing context
Sorstkins, Andrejs, Bailey, Josh, Baron, Dr Alistair
The rapid evolution of neural architectures - from multilayer perceptrons to large-scale Transformer-based models - has enabled large language models (LLMs) to exhibit emergent agentic behaviours when equipped with memory, planning, and external tool use. However, their inherent stochasticity and multi-step decision processes render classical evaluation methods inadequate for diagnosing agentic performance. This work introduces a diagnostic framework for expert systems that not only evaluates but also facilitates the transfer of expert behaviour into LLM-powered agents. The framework integrates (i) curated golden datasets of expert annotations, (ii) silver datasets generated through controlled behavioural mutation, and (iii) an LLM-based Agent Judge that scores and prescribes targeted improvements. These prescriptions are embedded into a vectorized recommendation map, allowing expert interventions to propagate as reusable improvement trajectories across multiple system instances. We demonstrate the framework on a multi-agent recruiter-assistant system, showing that it uncovers latent cognitive failures - such as biased phrasing, extraction drift, and tool misrouting - while simultaneously steering agents toward expert-level reasoning and style. The results establish a foundation for standardized, reproducible expert behaviour transfer in stochastic, tool-augmented LLM agents, moving beyond static evaluation to active expert system refinement.
Generating Plans for Belief-Desire-Intention (BDI) Agents Using Alternating-Time Temporal Logic (ATL)
Belief-Desire-Intention (BDI) is a framework for modelling agents based on their beliefs, desires, and intentions. Plans are a central component of BDI agents, and define sequences of actions that an agent must undertake to achieve a certain goal. Existing approaches to plan generation often require significant manual effort, and are mainly focused on single-agent systems. To address this, we have developed a tool that automatically generates BDI plans using Alternating-Time Temporal Logic (ATL). By using ATL, the generated plans accommodate possible competition or cooperation between the agents in the system. We demonstrate the effectiveness of the tool by generating plans for an illustrative game that requires agent collaboration to achieve a shared goal. We show that the generated plans allow the agents to successfully attain this goal.
Vulnerable Agent Identification in Large-Scale Multi-Agent Reinforcement Learning
Li, Simin, Yuwei, Zheng, Mao, Zihao, Wang, Linhao, Xu, Ruixiao, Ma, Chengdong, Yu, Xin, Ma, Yuqing, Dou, Qi, Wang, Xin, Luo, Jie, An, Bo, Yang, Yaodong, Lv, Weifeng, Liu, Xianglong
Partial agent failure becomes inevitable when systems scale up, making it crucial to identify the subset of agents whose compromise would most severely degrade overall performance. In this paper, we study this Vulnerable Agent Identification (VAI) problem in large-scale multi-agent reinforcement learning (MARL). We frame VAI as a Hierarchical Adversarial Decentralized Mean Field Control (HAD-MFC) problem, where the upper level involves an NP-hard combinatorial task of selecting the most vulnerable agents, and the lower level learns worst-case adversarial policies for these agents using mean-field MARL. The two problems are coupled, making HAD-MFC difficult to solve. To address this, we first decouple the hierarchical process using the Fenchel-Rockafellar transform, resulting in a regularized mean-field Bellman operator for the upper level that enables independent learning at each level, thus reducing computational complexity. We then reformulate the upper-level combinatorial problem as an MDP with dense rewards from our regularized mean-field Bellman operator, enabling us to sequentially identify the most vulnerable agents using greedy and RL algorithms. This decomposition provably preserves the optimal solution of the original HAD-MFC. Experiments show that our method effectively identifies more vulnerable agents in large-scale MARL and in a rule-based system, fooling the system into worse failures, and learns a value function that reveals the vulnerability of each agent.
LLM Agents at the Roundtable: A Multi-Perspective and Dialectical Reasoning Framework for Essay Scoring
Jang, Jinhee, Moon, Ayoung, Jung, Minkyoung, Kim, YoungBin, Lee, Seung Jin
The emergence of large language models (LLMs) has brought a new paradigm to automated essay scoring (AES), a long-standing and practical application of natural language processing in education. However, achieving human-level multi-perspective understanding and judgment remains a challenge. In this work, we propose Roundtable Essay Scoring (RES), a multi-agent evaluation framework designed to perform precise and human-aligned scoring under a zero-shot setting. RES constructs evaluator agents based on LLMs, each tailored to a specific prompt and topic context. Each agent independently generates a trait-based rubric and conducts a multi-perspective evaluation. Then, by simulating a roundtable-style discussion, RES consolidates individual evaluations through a dialectical reasoning process to produce a final holistic score that more closely aligns with human evaluation. By enabling collaboration and consensus among agents with diverse evaluation perspectives, RES outperforms prior zero-shot AES approaches. Experiments on the ASAP dataset using ChatGPT and Claude show that RES achieves up to a 34.86% improvement in average QWK over straightforward prompting (Vanilla) methods.
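The QWK figure reported above refers to quadratic weighted kappa, a standard agreement metric between two sets of ordinal ratings. The sketch below computes it from scratch for integer scores; the function name and the sample scores are illustrative and not taken from the paper or the ASAP dataset.

```python
def quadratic_weighted_kappa(rater_a, rater_b, min_rating, max_rating):
    """Quadratic weighted kappa between two equal-length lists of integer
    ratings in [min_rating, max_rating]. Assumes max_rating > min_rating and
    that the ratings are not all identical (otherwise the denominator is 0)."""
    n = max_rating - min_rating + 1
    total = len(rater_a)

    # Observed confusion matrix: observed[i][j] counts (a = i, b = j) pairs.
    observed = [[0] * n for _ in range(n)]
    for a, b in zip(rater_a, rater_b):
        observed[a - min_rating][b - min_rating] += 1

    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(n)) for j in range(n)]

    numerator = 0.0
    denominator = 0.0
    for i in range(n):
        for j in range(n):
            weight = (i - j) ** 2 / (n - 1) ** 2      # quadratic disagreement penalty
            expected = hist_a[i] * hist_b[j] / total  # count expected by chance
            numerator += weight * observed[i][j]
            denominator += weight * expected
    return 1.0 - numerator / denominator


human = [1, 2, 3, 4, 4, 2]
model = [1, 2, 3, 4, 3, 2]
print(quadratic_weighted_kappa(human, model, 1, 4))
```

A value of 1 means perfect agreement, 0 means agreement no better than chance, and negative values mean systematic disagreement.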
Online Learning of Deceptive Policies under Intermittent Observation
Puthumanaillam, Gokul, Padmanabhan, Ram, Fuentes, Jose, Cruz, Nicole, Padrao, Paulo, Hernandez, Ruben, Jiang, Hao, Schafer, William, Bobadilla, Leonardo, Ornik, Melkior
In supervisory control settings, autonomous systems are not monitored continuously. Instead, monitoring often occurs at sporadic intervals within known bounds. We study the problem of deception, where an agent pursues a private objective while remaining plausibly compliant with a supervisor's reference policy when observations occur. Motivated by the behavior of real, human supervisors, we situate the problem within Theory of Mind: the representation of what an observer believes and expects to see. We show that Theory of Mind can be repurposed to steer online reinforcement learning (RL) toward such deceptive behavior. We model the supervisor's expectations and distill from them a single, calibrated scalar -- the expected evidence of deviation if an observation were to happen now. This scalar combines how unlike the reference and current action distributions appear, with the agent's belief that an observation is imminent. Injected as a state-dependent weight into a KL-regularized policy improvement step within an online RL loop, this scalar informs a closed-form update that smoothly trades off self-interest and compliance, thus sidestepping hand-crafted or heuristic policies. In real-world, real-time hardware experiments on marine (ASV) and aerial (UAV) navigation, our ToM-guided RL runs online, achieves high return and success with observed-trace evidence calibrated to the supervisor's expectations.
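The closed-form update mentioned above can be sketched with the textbook solution of a KL-regularized improvement step: maximizing E_pi[Q] - w * KL(pi || pi_ref) yields pi proportional to pi_ref * exp(Q / w). The code below is a minimal illustration of that general result, not the authors' algorithm. In particular, `evidence_weight` (the product of observation belief and KL divergence, with a small floor) is a hypothetical stand-in for the paper's calibrated scalar, and both policies are assumed to have full support.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; assumes q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def evidence_weight(p_observe, current, reference, floor=1e-3):
    # HYPOTHETICAL form: belief that an observation is imminent times how
    # different the current policy looks from the reference. The floor keeps
    # the weight positive so the update below is well defined.
    return max(floor, p_observe * kl_divergence(current, reference))


def kl_regularized_update(reference, q_values, weight):
    """argmax over pi of  E_pi[Q] - weight * KL(pi || reference),
    i.e. reference * exp(Q / weight), normalized."""
    q_max = max(q_values)  # subtract the max for numerical stability
    unnormalized = [r * math.exp((q - q_max) / weight)
                    for r, q in zip(reference, q_values)]
    z = sum(unnormalized)
    return [u / z for u in unnormalized]


reference = [0.5, 0.5]   # supervisor's reference policy over two actions
q_values = [1.0, 0.0]    # agent's private action values
print(kl_regularized_update(reference, q_values, weight=1e-3))  # near-greedy
print(kl_regularized_update(reference, q_values, weight=1e6))   # near-reference
```

A small weight lets the agent follow its private objective, while a large weight pulls the policy back toward the reference, which matches the trade-off between self-interest and compliance that the abstract describes.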
SWE-Effi: Re-Evaluating Software AI Agent System Effectiveness Under Resource Constraints
Fan, Zhiyu, Vasilevski, Kirill, Lin, Dayi, Chen, Boyuan, Chen, Yihao, Zhong, Zhiqing, Zhang, Jie M., He, Pinjia, Hassan, Ahmed E.
The advancement of large language models (LLMs) and code agents has demonstrated significant potential to assist software engineering (SWE) tasks, such as autonomous issue resolution and feature addition. Existing AI for software engineering leaderboards (e.g., SWE-bench) focus solely on solution accuracy, ignoring the crucial factor of effectiveness in a resource-constrained world. This is a universal problem that also exists beyond software engineering tasks: any AI system should be more than correct - it must also be cost-effective. To address this gap, we introduce SWE-Effi, a set of new metrics to re-evaluate AI systems in terms of holistic effectiveness scores. We define effectiveness as the balance between the accuracy of outcome (e.g., issue resolve rate) and the resources consumed (e.g., tokens and time). In this paper, we specifically focus on the software engineering scenario by re-ranking popular AI systems for issue resolution on a subset of the SWE-bench benchmark using our new multi-dimensional metrics. We found that an AI system's effectiveness depends not just on the scaffold itself, but on how well it integrates with the base model, which is key to achieving strong performance in a resource-efficient manner. We also identified systematic challenges such as the "token snowball" effect and, more significantly, a pattern of "expensive failures". In these cases, agents consume excessive resources while stuck on unsolvable tasks - an issue that not only limits practical deployment but also drives up the cost of failed rollouts during RL training. Lastly, we observed a clear trade-off between effectiveness under the token budget and effectiveness under the time budget, which plays a crucial role in managing project budgets and enabling scalable reinforcement learning, where fast responses are essential.
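One way to picture "effectiveness under a budget" is to ask what fraction of issues a system resolves within a given resource budget, then average that over a range of budgets. The sketch below does exactly this. It is a hypothetical illustration, not the SWE-Effi metric definitions: the function names, the averaging scheme, and the sample runs are all assumptions.

```python
def resolve_rate_under_budget(runs, budget):
    """Fraction of runs that resolved the issue while spending at most `budget`.
    Each run is a (resolved, cost) pair; cost can be tokens, seconds, etc."""
    return sum(1 for resolved, cost in runs if resolved and cost <= budget) / len(runs)


def budget_effectiveness(runs, max_budget, steps=100):
    """ASSUMED summary score: mean resolve rate over `steps` evenly spaced
    budgets in (0, max_budget]. Cheap successes are rewarded at many budget
    levels, while failures contribute nothing no matter what they cost."""
    budgets = [max_budget * (k + 1) / steps for k in range(steps)]
    return sum(resolve_rate_under_budget(runs, b) for b in budgets) / steps


# Two hypothetical systems with the same final accuracy (2 of 3 resolved).
frugal = [(True, 10), (True, 20), (False, 100)]
costly = [(True, 80), (True, 90), (False, 100)]

print(resolve_rate_under_budget(frugal, 100), resolve_rate_under_budget(costly, 100))
print(budget_effectiveness(frugal, 100), budget_effectiveness(costly, 100))
```

Both systems tie on accuracy at the full budget, but the frugal one scores far higher once cost is taken into account, which is the kind of re-ranking the abstract describes.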
LongCat-Flash Technical Report
Meituan LongCat Team, Bayan, Li, Bei, Lei, Bingye, Wang, Bo, Rong, Bolin, Wang, Chao, Zhang, Chao, Gao, Chen, Zhang, Chen, Sun, Cheng, Han, Chengcheng, Xi, Chenguang, Zhang, Chi, Peng, Chong, Qin, Chuan, Zhang, Chuyu, Chen, Cong, Wang, Congkui, Ma, Dan, Pan, Daoru, Bu, Defei, Zhao, Dengchang, Kong, Deyang, Liu, Dishan, Huo, Feiye, Li, Fengcun, Zhang, Fubao, Dong, Gan, Liu, Gang, Xu, Gang, Li, Ge, Tan, Guoqiang, Lin, Guoyuan, Jing, Haihang, Fu, Haomin, Yan, Haonan, Wen, Haoxing, Zhao, Haozhe, Liu, Hong, Shi, Hongmei, Hao, Hongyan, Tang, Hongyin, Lv, Huantian, Su, Hui, Li, Jiacheng, Liu, Jiahao, Li, Jiahuan, Yang, Jiajun, Wang, Jiaming, Yang, Jian, Tan, Jianchao, Sun, Jiaqi, Zhang, Jiaqi, Fu, Jiawei, Yang, Jiawei, Hu, Jiaxi, Qin, Jiayu, Wang, Jingang, He, Jiyuan, Kuang, Jun, Mei, Junhui, Liang, Kai, He, Ke, Zhang, Kefeng, Wang, Keheng, He, Keqing, Gao, Liang, Shi, Liang, Ma, Lianhui, Qiu, Lin, Kong, Lingbin, Si, Lingtong, Lyu, Linkun, Guo, Linsen, Yang, Liqi, Yan, Lizhi, Xia, Mai, Gao, Man, Zhang, Manyuan, Zhou, Meng, Shen, Mengxia, Tuo, Mingxiang, Zhu, Mingyang, Li, Peiguang, Pei, Peng, Zhao, Peng, Jia, Pengcheng, Sun, Pingwei, Gu, Qi, Li, Qianyun, Li, Qingyuan, Huang, Qiong, Duan, Qiyuan, Meng, Ran, Weng, Rongxiang, Shao, Ruichen, Li, Rumei, Wu, Shizhe, Liang, Shuai, Wang, Shuo, Dang, Suogui, Fang, Tao, Li, Tao, Chen, Tefeng, Bai, Tianhao, Zhou, Tianhao, Xie, Tingwen, He, Wei, Huang, Wei, Liu, Wei, Shi, Wei, Wang, Wei, Wu, Wei, Zhao, Weikang, Zan, Wen, Shi, Wenjie, Nan, Xi, Su, Xi, Li, Xiang, Mei, Xiang, Ji, Xiangyang, Xi, Xiangyu, Huang, Xiangzhou, Li, Xianpeng, Fu, Xiao, Liu, Xiao, Wei, Xiao, Cai, Xiaodong, Chen, Xiaolong, Liu, Xiaoqing, Li, Xiaotong, Shi, Xiaowei, Li, Xiaoyu, Wang, Xili, Chen, Xin, Hu, Xing, Miao, Xingyu, He, Xinyan, Zhang, Xuemiao, Hao, Xueyuan, Cao, Xuezhi, Cai, Xunliang, Yang, Xurui, Feng, Yan, Bai, Yang, Chen, Yang, Yang, Yang, Huo, Yaqi, Sun, Yerui, Lu, Yifan, Zhang, Yifan, Zang, Yipeng, Zhai, Yitao, Li, Yiyang, Yin, Yongjing, Lv, Yongkang, Zhou, Yongwei, Yang, Yu, Xie, Yuchen, Sun, Yueqing, Zheng, Yuewen, Wei, Yuhuai, Qian, Yulei, Liang, Yunfan, Tai, Yunfang, Zhao, Yunke, Yu, Zeyang, Zhang, Zhao, Yang, Zhaohua, Zhang, Zhenchao, Xia, Zhikang, Zou, Zhiye, Zeng, Zhizhao, Su, Zhongda, Chen, Zhuofan, Zhang, Zijian, Wang, Ziwen, Jiang, Zixu, Zhao, Zizhe, Wang, Zongyu, Su, Zunhai
We introduce LongCat-Flash, a 560-billion-parameter Mixture-of-Experts (MoE) language model designed for both computational efficiency and advanced agentic capabilities. Stemming from the need for scalable efficiency, LongCat-Flash adopts two novel designs: (a) Zero-computation Experts, which enable dynamic computational budget allocation and activate 18.6B-31.3B (27B on average) parameters per token depending on contextual demands, optimizing resource usage; and (b) Shortcut-connected MoE, which enlarges the computation-communication overlap window, demonstrating notable gains in inference efficiency and throughput compared to models of a comparable scale. We develop a comprehensive scaling framework for large models that combines hyperparameter transfer, model-growth initialization, a multi-pronged stability suite, and deterministic computation to achieve stable and reproducible training. Notably, leveraging the synergy between scalable architectural design and infrastructure efforts, we complete model training on more than 20 trillion tokens within 30 days, while achieving over 100 tokens per second (TPS) for inference at a cost of $0.70 per million output tokens. To cultivate LongCat-Flash towards agentic intelligence, we conduct large-scale pre-training on optimized mixtures, followed by targeted mid- and post-training on reasoning, code, and instructions, with further augmentation from synthetic data and tool use tasks. Comprehensive evaluations demonstrate that, as a non-thinking foundation model, LongCat-Flash delivers highly competitive performance among other leading models, with exceptional strengths in agentic tasks. The model checkpoint of LongCat-Flash is open-sourced to foster community research. LongCat Chat: https://longcat.ai Hugging Face: https://huggingface.co/meituan-longcat GitHub: https://github.com/meituan-longcat
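The idea behind zero-computation experts, that a router may spend some of a token's top-k slots on experts that cost nothing, can be sketched in a few lines. This is an illustrative toy, not the LongCat-Flash router: the function names, expert counts, and scores are made up, and treating a zero-computation expert as one that passes its input through unchanged is an assumption of the sketch.

```python
def route_token(router_scores, top_k):
    """Return the indices of the top_k experts by router score."""
    ranked = sorted(range(len(router_scores)),
                    key=lambda i: router_scores[i], reverse=True)
    return ranked[:top_k]


def activated_ffn_experts(selected, num_ffn_experts):
    """Count how many selected experts are real FFN experts.
    CONVENTION (assumed): indices below num_ffn_experts are FFN experts;
    the rest are zero-computation experts that add no compute."""
    return sum(1 for i in selected if i < num_ffn_experts)


NUM_FFN = 4   # hypothetical: 4 FFN experts followed by 2 zero-computation experts
TOP_K = 3

easy_token = [0.1, 0.2, 0.0, 0.3, 0.9, 0.8]  # router favours zero-computation experts
hard_token = [0.9, 0.8, 0.7, 0.1, 0.2, 0.0]  # router favours FFN experts

for name, scores in [("easy", easy_token), ("hard", hard_token)]:
    chosen = route_token(scores, TOP_K)
    print(name, chosen, activated_ffn_experts(chosen, NUM_FFN))
```

Because every token still selects the same number of experts, the compute actually spent varies with how many of those selections are FFN experts, which is how the activated parameter count can differ from token to token.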
The Anatomy of a Personal Health Agent
Heydari, A. Ali, Gu, Ken, Srinivas, Vidya, Yu, Hong, Zhang, Zhihan, Zhang, Yuwei, Paruchuri, Akshay, He, Qian, Palangi, Hamid, Hammerquist, Nova, Metwally, Ahmed A., Winslow, Brent, Kim, Yubin, Ayush, Kumar, Yang, Yuzhe, Narayanswamy, Girish, Xu, Maxwell A., Garrison, Jake, Lee, Amy Armento, Vafeiadou, Jenny, Graef, Ben, Galatzer-Levy, Isaac R., Schenck, Erik, Barakat, Andrew, Perez, Javier, Shreibati, Jacqueline, Hernandez, John, Faranesh, Anthony Z., Prieto, Javier L., Heneghan, Connor, Liu, Yun, Zhan, Jiening, Malhotra, Mark, Patel, Shwetak, Althoff, Tim, Liu, Xin, McDuff, Daniel, Xu, Xuhai "Orson"
Health is a fundamental pillar of human wellness, and the rapid advancements in large language models (LLMs) have driven the development of a new generation of health agents. However, the application of health agents to fulfill the diverse needs of individuals in daily non-clinical settings is underexplored. In this work, we aim to build a comprehensive personal health agent that is able to reason about multimodal data from everyday consumer wellness devices and common personal health records, and provide personalized health recommendations. To understand end-users' needs when interacting with such an assistant, we conducted an in-depth analysis of web search and health forum queries, alongside qualitative insights from users and health experts gathered through a user-centered design process. Based on these findings, we identified three major categories of consumer health needs, each of which is supported by a specialist sub-agent: (1) a data science agent that analyzes personal time-series wearable and health record data, (2) a health domain expert agent that integrates users' health and contextual data to generate accurate, personalized insights, and (3) a health coach agent that synthesizes data insights, guiding users using a specified psychological strategy and tracking users' progress. Furthermore, we propose and develop the Personal Health Agent (PHA), a multi-agent framework that enables dynamic, personalized interactions to address individual health needs. To evaluate each sub-agent and the multi-agent system, we conducted automated and human evaluations across 10 benchmark tasks, involving more than 7,000 annotations and 1,100 hours of effort from health experts and end-users. Our work represents the most comprehensive evaluation of a health agent to date and establishes a strong foundation towards the futuristic vision of a personal health agent accessible to everyone.