AITopics

Safe and efficient interaction between autonomous vehicles (AVs) and human-driven vehicles (HVs) is a critical challenge for future transportation systems. While game-theoretic models capture how AVs influence HVs, they often suffer from a long-term decay of influence and can be perceived as manipulative, eroding the human's trust. This can paradoxically lead to riskier human driving behavior over repeated interactions. In this paper, we address this challenge by proposing the Trust-Aware Embodied Bayesian Persuasion (TA-EBP) framework. Our work makes three key contributions: First, we apply Bayesian persuasion to model communication at traffic intersections, offering a transparent alternative to traditional game-theoretic models. Second, we introduce a trust parameter to the persuasion framework, deriving a theorem for the minimum trust level required for influence. Finally, we ground the abstract signals of Bayesian persuasion theory into a continuous, physically meaningful action space, deriving a second theorem for the optimal signal magnitude, realized as an AV's forward nudge. Additionally, we validate our framework in a mixed-autonomy traffic simulation, demonstrating that TA-EBP successfully persuades HVs to drive more cautiously, eliminating collisions and improving traffic flow compared to baselines that either ignore trust or lack communication. Our work provides a transparent and non-strategic framework for influence in human-robot interaction, enhancing both safety and efficiency.

artificial intelligence, machine learning, receiver

2509.15404

Country: North America > United States > Illinois (0.28)

Genre: Research Report (1.00)

Industry:

Transportation (0.34)
Consumer Products & Services > Travel (0.34)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.46)

Zhang, Tiannan, Veerapaneni, Rishi, Chan, Shao-Hung, Li, Jiaoyang, Likhachev, Maxim

Dynamic Agent Grouping ECBS: Scaling Windowed Multi-Agent Path Finding with Completeness Guarantees

Multi-Agent Path Finding (MAPF) is the problem of finding a set of collision-free paths for a team of agents. Although several MAPF methods that solve full-horizon MAPF have completeness guarantees, very few MAPF methods that plan partial paths have completeness guarantees. Recent work introduced the Windowed Complete MAPF (WinC-MAPF) framework, which shows how windowed optimal MAPF solvers (e.g., SS-CBS) can use heuristic updates and disjoint agent groups to maintain completeness even when planning partial paths (Veerapaneni et al. 2024). A core limitation of WinC-MAPF is that it requires optimal MAPF solvers. Our main contribution is to extend WinC-MAPF by showing how we can use a bounded suboptimal solver while maintaining completeness. In particular, we design Dynamic Agent Grouping ECBS (DAG-ECBS), which dynamically creates and plans agent groups while maintaining that each agent group solution is bounded suboptimal. We prove how DAG-ECBS can maintain completeness in the WinC-MAPF framework. DAG-ECBS shows improved scalability compared to SS-CBS and can outperform windowed ECBS without completeness guarantees. More broadly, our work serves as a blueprint for designing more MAPF methods that can use the WinC-MAPF framework.
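
To make "collision-free paths" concrete, the following is a minimal sketch of the standard MAPF conflict check, not code from the paper. It assumes each path is a list of grid cells indexed by timestep and that an agent waits at its last cell once its path ends; the function names are hypothetical.

```python
from itertools import combinations


def position_at(path, t):
    """Cell occupied at timestep t; agents wait at their final cell (assumption)."""
    return path[t] if t < len(path) else path[-1]


def find_first_conflict(paths):
    """Return the first conflict as (kind, agent_i, agent_j, t), or None if collision-free."""
    horizon = max(len(p) for p in paths)
    for t in range(horizon):
        for i, j in combinations(range(len(paths)), 2):
            a_now, b_now = position_at(paths[i], t), position_at(paths[j], t)
            # Vertex conflict: two agents occupy the same cell at the same time.
            if a_now == b_now:
                return ("vertex", i, j, t)
            a_next, b_next = position_at(paths[i], t + 1), position_at(paths[j], t + 1)
            # Edge conflict: two agents swap cells between t and t + 1.
            if a_now == b_next and b_now == a_next:
                return ("edge", i, j, t)
    return None
```

A windowed planner would run a check like this only over the first few timesteps of each path, which is why completeness is harder to guarantee than in the full-horizon case.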

agent, agent group, artificial intelligence

2509.15381

Country: North America > United States (0.68)

Genre: Research Report (0.82)

Technology: Information Technology > Artificial Intelligence > Representation & Reasoning > Agents > Agent Societies (0.80)

Sorstkins, Andrejs, Bailey, Josh, Baron, Dr Alistair

Diagnostics of cognitive failures in multi-agent expert systems using dynamic evaluation protocols and subsequent mutation of the processing context

The rapid evolution of neural architectures - from multilayer perceptrons to large-scale Transformer-based models - has enabled large language models (LLMs) to exhibit emergent agentic behaviours when equipped with memory, planning, and external tool use. However, their inherent stochasticity and multi-step decision processes render classical evaluation methods inadequate for diagnosing agentic performance. This work introduces a diagnostic framework for expert systems that not only evaluates but also facilitates the transfer of expert behaviour into LLM-powered agents. The framework integrates (i) curated golden datasets of expert annotations, (ii) silver datasets generated through controlled behavioural mutation, and (iii) an LLM-based Agent Judge that scores and prescribes targeted improvements. These prescriptions are embedded into a vectorized recommendation map, allowing expert interventions to propagate as reusable improvement trajectories across multiple system instances. We demonstrate the framework on a multi-agent recruiter-assistant system, showing that it uncovers latent cognitive failures - such as biased phrasing, extraction drift, and tool misrouting - while simultaneously steering agents toward expert-level reasoning and style. The results establish a foundation for standardized, reproducible expert behaviour transfer in stochastic, tool-augmented LLM agents, moving beyond static evaluation to active expert system refinement.

large language model, machine learning, natural language

2509.15366

Country: Europe (0.28)

Genre: Research Report > New Finding (1.00)

Industry: Health & Medicine (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Expert Systems (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Generating Plans for Belief-Desire-Intention (BDI) Agents Using Alternating-Time Temporal Logic (ATL)

Léveillé, Dylan

Belief-Desire-Intention (BDI) is a framework for modelling agents based on their beliefs, desires, and intentions. Plans are a central component of BDI agents, and define sequences of actions that an agent must undertake to achieve a certain goal. Existing approaches to plan generation often require significant manual effort, and are mainly focused on single-agent systems. In this work, we have therefore developed a tool that automatically generates BDI plans using Alternating-Time Temporal Logic (ATL). By using ATL, the generated plans account for possible competition or cooperation between the agents in the system. We demonstrate the effectiveness of the tool by generating plans for an illustrative game that requires agent collaboration to achieve a shared goal. We show that the generated plans allow the agents to successfully attain this goal.

agent, artificial intelligence, planning & scheduling

doi: 10.4204/EPTCS.428.10

2509.15238

Country:

Europe (0.46)
North America > Canada (0.28)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Planning & Scheduling (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)

Vulnerable Agent Identification in Large-Scale Multi-Agent Reinforcement Learning

Li, Simin, Yuwei, Zheng, Mao, Zihao, Wang, Linhao, Xu, Ruixiao, Ma, Chengdong, Yu, Xin, Ma, Yuqing, Dou, Qi, Wang, Xin, Luo, Jie, An, Bo, Yang, Yaodong, Lv, Weifeng, Liu, Xianglong

Partial agent failure becomes inevitable when systems scale up, making it crucial to identify the subset of agents whose compromise would most severely degrade overall performance. In this paper, we study this Vulnerable Agent Identification (VAI) problem in large-scale multi-agent reinforcement learning (MARL). We frame VAI as a Hierarchical Adversarial Decentralized Mean Field Control (HAD-MFC) problem, where the upper level involves an NP-hard combinatorial task of selecting the most vulnerable agents, and the lower level learns worst-case adversarial policies for these agents using mean-field MARL. The two problems are coupled together, making HAD-MFC difficult to solve. To solve this, we first decouple the hierarchical process via the Fenchel-Rockafellar transform, resulting in a regularized mean-field Bellman operator for the upper level that enables independent learning at each level, thus reducing computational complexity. We then reformulate the upper-level combinatorial problem as an MDP with dense rewards from our regularized mean-field Bellman operator, enabling us to sequentially identify the most vulnerable agents using greedy and RL algorithms. This decomposition provably preserves the optimal solution of the original HAD-MFC. Experiments show that our method effectively identifies more vulnerable agents in large-scale MARL and rule-based systems, fooling the system into worse failures, and learns a value function that reveals the vulnerability of each agent.

artificial intelligence, machine learning, reinforcement learning

2509.15103

Genre:

Research Report > Experimental Study (1.00)
Research Report > New Finding (0.67)

Industry: Information Technology > Security & Privacy (0.93)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents > Agent Societies (0.48)

LLM Agents at the Roundtable: A Multi-Perspective and Dialectical Reasoning Framework for Essay Scoring

Jang, Jinhee, Moon, Ayoung, Jung, Minkyoung, Kim, YoungBin, Lee, Seung Jin

The emergence of large language models (LLMs) has brought a new paradigm to automated essay scoring (AES), a long-standing and practical application of natural language processing in education. However, achieving human-level multi-perspective understanding and judgment remains a challenge. In this work, we propose Roundtable Essay Scoring (RES), a multi-agent evaluation framework designed to perform precise and human-aligned scoring under a zero-shot setting. RES constructs evaluator agents based on LLMs, each tailored to a specific prompt and topic context. Each agent independently generates a trait-based rubric and conducts a multi-perspective evaluation. Then, by simulating a roundtable-style discussion, RES consolidates individual evaluations through a dialectical reasoning process to produce a final holistic score that more closely aligns with human evaluation. By enabling collaboration and consensus among agents with diverse evaluation perspectives, RES outperforms prior zero-shot AES approaches. Experiments on the ASAP dataset using ChatGPT and Claude show that RES achieves up to a 34.86% improvement in average QWK over straightforward prompting (Vanilla) methods.
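
For readers unfamiliar with the QWK figure quoted above, here is a minimal sketch of quadratic weighted kappa, the standard agreement metric for essay scoring. It is a generic pure-Python implementation, not the paper's evaluation code; it assumes integer scores, equal-length score lists, and a score range with at least two levels, and the function name is hypothetical.

```python
def quadratic_weighted_kappa(human, model, min_score, max_score):
    """Agreement between two integer score lists; 1.0 means perfect agreement."""
    n = max_score - min_score + 1  # number of distinct score levels (needs n >= 2)
    observed = [[0] * n for _ in range(n)]
    for h, m in zip(human, model):
        observed[h - min_score][m - min_score] += 1

    total = len(human)
    row = [sum(r) for r in observed]                                 # human marginals
    col = [sum(observed[i][j] for i in range(n)) for j in range(n)]  # model marginals

    num = den = 0.0
    for i in range(n):
        for j in range(n):
            weight = (i - j) ** 2 / (n - 1) ** 2   # quadratic disagreement penalty
            num += weight * observed[i][j]
            den += weight * row[i] * col[j] / total  # expected count under independence
    # den is zero when either rater gives every essay the same score; not handled here.
    return 1.0 - num / den
```

Because the penalty grows with the square of the score gap, a model that is off by two points is penalized four times as heavily as one that is off by one.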

large language model, machine learning, natural language

2509.14834

Genre: Research Report (1.00)

Industry:

Education > Educational Setting (1.00)
Education > Assessment & Standards > Student Performance (1.00)
Education > Educational Technology > Educational Software > Computer Based Training (0.49)
Education > Educational Technology > Educational Software > Computer-Aided Assessment (0.35)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.49)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents > Agent Societies (0.48)

Online Learning of Deceptive Policies under Intermittent Observation

Puthumanaillam, Gokul, Padmanabhan, Ram, Fuentes, Jose, Cruz, Nicole, Padrao, Paulo, Hernandez, Ruben, Jiang, Hao, Schafer, William, Bobadilla, Leonardo, Ornik, Melkior

In supervisory control settings, autonomous systems are not monitored continuously. Instead, monitoring often occurs at sporadic intervals within known bounds. We study the problem of deception, where an agent pursues a private objective while remaining plausibly compliant with a supervisor's reference policy when observations occur. Motivated by the behavior of real, human supervisors, we situate the problem within Theory of Mind: the representation of what an observer believes and expects to see. We show that Theory of Mind can be repurposed to steer online reinforcement learning (RL) toward such deceptive behavior. We model the supervisor's expectations and distill from them a single, calibrated scalar -- the expected evidence of deviation if an observation were to happen now. This scalar combines how unlike the reference and current action distributions appear, with the agent's belief that an observation is imminent. Injected as a state-dependent weight into a KL-regularized policy improvement step within an online RL loop, this scalar informs a closed-form update that smoothly trades off self-interest and compliance, thus sidestepping hand-crafted or heuristic policies. In real-world, real-time hardware experiments on marine (ASV) and aerial (UAV) navigation, our ToM-guided RL runs online, achieves high return and success with observed-trace evidence calibrated to the supervisor's expectations.
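
The closed-form update mentioned above can be illustrated with the textbook solution of a KL-regularized policy improvement step for a single state with discrete actions. This is a sketch under assumptions, not the authors' algorithm: the abstract does not give the formula for the state-dependent scalar, so `weight` is taken as an input, and the function name and arguments are hypothetical.

```python
import math


def kl_regularized_update(q_values, reference_policy, weight):
    """Maximize E_pi[Q] - weight * KL(pi || reference) over action distributions.

    The maximizer is pi(a) proportional to reference(a) * exp(Q(a) / weight).
    Assumes weight > 0 and every reference probability is strictly positive.
    """
    logits = [math.log(p) + q / weight for p, q in zip(reference_policy, q_values)]
    shift = max(logits)                      # subtract the max for numerical stability
    unnormalized = [math.exp(l - shift) for l in logits]
    z = sum(unnormalized)
    return [u / z for u in unnormalized]
```

A large `weight` keeps the updated policy close to the reference (compliance), while a small `weight` lets it concentrate on the agent's own highest-value action (self-interest), which is the trade-off the abstract describes.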

artificial intelligence, machine learning, supervisor

2509.14453

Country: North America > United States (0.28)

Genre: Research Report (0.50)

Industry: Education > Educational Setting > Online (0.41)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

SWE-Effi: Re-Evaluating Software AI Agent System Effectiveness Under Resource Constraints

Fan, Zhiyu, Vasilevski, Kirill, Lin, Dayi, Chen, Boyuan, Chen, Yihao, Zhong, Zhiqing, Zhang, Jie M., He, Pinjia, Hassan, Ahmed E.

The advancement of large language models (LLMs) and code agents has demonstrated significant potential to assist software engineering (SWE) tasks, such as autonomous issue resolution and feature addition. Existing AI for software engineering leaderboards (e.g., SWE-bench) focus solely on solution accuracy, ignoring the crucial factor of effectiveness in a resource-constrained world. This is a universal problem that also exists beyond software engineering tasks: any AI system should be more than correct - it must also be cost-effective. To address this gap, we introduce SWE-Effi, a set of new metrics to re-evaluate AI systems in terms of holistic effectiveness scores. We define effectiveness as the balance between the accuracy of outcome (e.g., issue resolve rate) and the resources consumed (e.g., tokens and time). In this paper, we specifically focus on the software engineering scenario by re-ranking popular AI systems for issue resolution on a subset of the SWE-bench benchmark using our new multi-dimensional metrics. We found that an AI system's effectiveness depends not just on the scaffold itself, but on how well it integrates with the base model, which is key to achieving strong performance in a resource-efficient manner. We also identified systematic challenges such as the "token snowball" effect and, more significantly, a pattern of "expensive failures". In these cases, agents consume excessive resources while stuck on unsolvable tasks - an issue that not only limits practical deployment but also drives up the cost of failed rollouts during RL training. Lastly, we observed a clear trade-off between effectiveness under the token budget and effectiveness under the time budget, which plays a crucial role in managing project budgets and enabling scalable reinforcement learning, where fast responses are essential.

large language model, machine learning, scaffold

2509.09853

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.35)

LongCat-Flash Technical Report

Meituan LongCat Team, null, Bayan, null, Li, Bei, Lei, Bingye, Wang, Bo, Rong, Bolin, Wang, Chao, Zhang, Chao, Gao, Chen, Zhang, Chen, Sun, Cheng, Han, Chengcheng, Xi, Chenguang, Zhang, Chi, Peng, Chong, Qin, Chuan, Zhang, Chuyu, Chen, Cong, Wang, Congkui, Ma, Dan, Pan, Daoru, Bu, Defei, Zhao, Dengchang, Kong, Deyang, Liu, Dishan, Huo, Feiye, Li, Fengcun, Zhang, Fubao, Dong, Gan, Liu, Gang, Xu, Gang, Li, Ge, Tan, Guoqiang, Lin, Guoyuan, Jing, Haihang, Fu, Haomin, Yan, Haonan, Wen, Haoxing, Zhao, Haozhe, Liu, Hong, Shi, Hongmei, Hao, Hongyan, Tang, Hongyin, Lv, Huantian, Su, Hui, Li, Jiacheng, Liu, Jiahao, Li, Jiahuan, Yang, Jiajun, Wang, Jiaming, Yang, Jian, Tan, Jianchao, Sun, Jiaqi, Zhang, Jiaqi, Fu, Jiawei, Yang, Jiawei, Hu, Jiaxi, Qin, Jiayu, Wang, Jingang, He, Jiyuan, Kuang, Jun, Mei, Junhui, Liang, Kai, He, Ke, Zhang, Kefeng, Wang, Keheng, He, Keqing, Gao, Liang, Shi, Liang, Ma, Lianhui, Qiu, Lin, Kong, Lingbin, Si, Lingtong, Lyu, Linkun, Guo, Linsen, Yang, Liqi, Yan, Lizhi, Xia, Mai, Gao, Man, Zhang, Manyuan, Zhou, Meng, Shen, Mengxia, Tuo, Mingxiang, Zhu, Mingyang, Li, Peiguang, Pei, Peng, Zhao, Peng, Jia, Pengcheng, Sun, Pingwei, Gu, Qi, Li, Qianyun, Li, Qingyuan, Huang, Qiong, Duan, Qiyuan, Meng, Ran, Weng, Rongxiang, Shao, Ruichen, Li, Rumei, Wu, Shizhe, Liang, Shuai, Wang, Shuo, Dang, Suogui, Fang, Tao, Li, Tao, Chen, Tefeng, Bai, Tianhao, Zhou, Tianhao, Xie, Tingwen, He, Wei, Huang, Wei, Liu, Wei, Shi, Wei, Wang, Wei, Wu, Wei, Zhao, Weikang, Zan, Wen, Shi, Wenjie, Nan, Xi, Su, Xi, Li, Xiang, Mei, Xiang, Ji, Xiangyang, Xi, Xiangyu, Huang, Xiangzhou, Li, Xianpeng, Fu, Xiao, Liu, Xiao, Wei, Xiao, Cai, Xiaodong, Chen, Xiaolong, Liu, Xiaoqing, Li, Xiaotong, Shi, Xiaowei, Li, Xiaoyu, Wang, Xili, Chen, Xin, Hu, Xing, Miao, Xingyu, He, Xinyan, Zhang, Xuemiao, Hao, Xueyuan, Cao, Xuezhi, Cai, Xunliang, Yang, Xurui, Feng, Yan, Bai, Yang, Chen, Yang, Yang, Yang, Huo, Yaqi, Sun, Yerui, Lu, Yifan, Zhang, Yifan, Zang, Yipeng, Zhai, Yitao, Li, Yiyang, Yin, Yongjing, 
Lv, Yongkang, Zhou, Yongwei, Yang, Yu, Xie, Yuchen, Sun, Yueqing, Zheng, Yuewen, Wei, Yuhuai, Qian, Yulei, Liang, Yunfan, Tai, Yunfang, Zhao, Yunke, Yu, Zeyang, Zhang, Zhao, Yang, Zhaohua, Zhang, Zhenchao, Xia, Zhikang, Zou, Zhiye, Zeng, Zhizhao, Su, Zhongda, Chen, Zhuofan, Zhang, Zijian, Wang, Ziwen, Jiang, Zixu, Zhao, Zizhe, Wang, Zongyu, Su, Zunhai

We introduce LongCat-Flash, a 560-billion-parameter Mixture-of-Experts (MoE) language model designed for both computational efficiency and advanced agentic capabilities. Stemming from the need for scalable efficiency, LongCat-Flash adopts two novel designs: (a) Zero-computation Experts, which enable dynamic computational budget allocation and activate 18.6B-31.3B parameters (27B on average) per token depending on contextual demands, optimizing resource usage. (b) Shortcut-connected MoE, which enlarges the computation-communication overlap window, demonstrating notable gains in inference efficiency and throughput compared to models of a comparable scale. We develop a comprehensive scaling framework for large models that combines hyperparameter transfer, model-growth initialization, a multi-pronged stability suite, and deterministic computation to achieve stable and reproducible training. Notably, leveraging the synergy among scalable architectural design and infrastructure efforts, we complete model training on more than 20 trillion tokens within 30 days, while achieving over 100 tokens per second (TPS) for inference at a cost of $0.70 per million output tokens. To cultivate LongCat-Flash towards agentic intelligence, we conduct a large-scale pre-training on optimized mixtures, followed by targeted mid- and post-training on reasoning, code, and instructions, with further augmentation from synthetic data and tool use tasks. Comprehensive evaluations demonstrate that, as a non-thinking foundation model, LongCat-Flash delivers highly competitive performance among other leading models, with exceptional strengths in agentic tasks. The model checkpoint of LongCat-Flash is open-sourced to foster community research. LongCat Chat: https://longcat.ai Hugging Face: https://huggingface.co/meituan-longcat GitHub: https://github.com/meituan-longcat

arxiv preprint arxiv, large language model, machine learning

2509.01322

Country:

North America > United States (0.28)
Europe > Austria (0.28)

Genre: Research Report > New Finding (0.45)

Industry: Education (0.92)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.96)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (0.67)

The Anatomy of a Personal Health Agent

Heydari, A. Ali, Gu, Ken, Srinivas, Vidya, Yu, Hong, Zhang, Zhihan, Zhang, Yuwei, Paruchuri, Akshay, He, Qian, Palangi, Hamid, Hammerquist, Nova, Metwally, Ahmed A., Winslow, Brent, Kim, Yubin, Ayush, Kumar, Yang, Yuzhe, Narayanswamy, Girish, Xu, Maxwell A., Garrison, Jake, Lee, Amy Armento, Vafeiadou, Jenny, Graef, Ben, Galatzer-Levy, Isaac R., Schenck, Erik, Barakat, Andrew, Perez, Javier, Shreibati, Jacqueline, Hernandez, John, Faranesh, Anthony Z., Prieto, Javier L., Heneghan, Connor, Liu, Yun, Zhan, Jiening, Malhotra, Mark, Patel, Shwetak, Althoff, Tim, Liu, Xin, McDuff, Daniel, Xu, Xuhai "Orson"

Health is a fundamental pillar of human wellness, and the rapid advancements in large language models (LLMs) have driven the development of a new generation of health agents. However, the application of health agents to fulfill the diverse needs of individuals in daily non-clinical settings is underexplored. In this work, we aim to build a comprehensive personal health agent that is able to reason about multimodal data from everyday consumer wellness devices and common personal health records, and provide personalized health recommendations. To understand end-users' needs when interacting with such an assistant, we conducted an in-depth analysis of web search and health forum queries, alongside qualitative insights from users and health experts gathered through a user-centered design process. Based on these findings, we identified three major categories of consumer health needs, each of which is supported by a specialist sub-agent: (1) a data science agent that analyzes personal time-series wearable and health record data, (2) a health domain expert agent that integrates users' health and contextual data to generate accurate, personalized insights, and (3) a health coach agent that synthesizes data insights, guiding users using a specified psychological strategy and tracking users' progress. Furthermore, we propose and develop the Personal Health Agent (PHA), a multi-agent framework that enables dynamic, personalized interactions to address individual health needs. To evaluate each sub-agent and the multi-agent system, we conducted automated and human evaluations across 10 benchmark tasks, involving more than 7,000 annotations and 1,100 hours of effort from health experts and end-users. Our work represents the most comprehensive evaluation of a health agent to date and establishes a strong foundation towards the futuristic vision of a personal health agent accessible to everyone.

artificial intelligence, machine learning, natural language

2508.20148

Country: North America > United States (0.46)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)
Questionnaire & Opinion Survey (1.00)

Industry:

Leisure & Entertainment > Sports (1.00)
Information Technology > Security & Privacy (1.00)
Health & Medicine > Therapeutic Area > Cardiology/Vascular Diseases (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)