Agents
CTRL-ALT-DECEIT: Sabotage Evaluations for Automated AI R&D
Ward, Francis Rhys, van der Weij, Teun, Gábor, Hanna, Martin, Sam, Moreno, Raja Mehta, Lidar, Harel, Makower, Louis, Jodrell, Thomas, Robson, Lauren
AI systems are increasingly able to autonomously conduct realistic software engineering tasks, and may soon be deployed to automate machine learning (ML) R&D itself. Frontier AI systems may be deployed in safety-critical settings, including to help ensure the safety of future systems. Unfortunately, frontier and future systems may not be sufficiently trustworthy, and there is evidence that these systems may even be misaligned with their developers or users. Therefore, we investigate the capabilities of AI agents to act against the interests of their users when conducting ML engineering, by sabotaging ML models, sandbagging their performance, and subverting oversight mechanisms. First, we extend MLE-Bench, a benchmark for realistic ML tasks, with code-sabotage tasks such as implanting backdoors and purposefully causing generalisation failures. Frontier agents make meaningful progress on our sabotage tasks. In addition, we study agent capabilities to sandbag on MLE-Bench. Agents can calibrate their performance to specified target levels below their actual capability. To mitigate sabotage, we use LM monitors to detect suspicious agent behaviour, and we measure model capability to sabotage and sandbag without being detected by these monitors. Overall, monitors are capable of detecting code-sabotage attempts, but our results suggest that detecting sandbagging is more difficult. Additionally, aggregating multiple monitor predictions works well, but monitoring may not be sufficiently reliable to mitigate sabotage in high-stakes domains. Our benchmark is implemented in the UK AISI's Inspect framework and we make our code publicly available at https://github.com/TeunvdWeij/ctrl-alt-deceit
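As a rough illustration of what aggregating multiple monitor predictions can look like, the sketch below averages suspicion scores from several monitors and flags a run when the mean reaches a threshold. Averaging is only one simple aggregation choice; the function name `aggregate_monitor_scores`, the 0-to-1 score scale, and the 0.5 threshold are assumptions made for this example and are not taken from the paper or its code.

```python
from statistics import mean


def aggregate_monitor_scores(scores, threshold=0.5):
    """Return (mean_score, flagged) for one agent run.

    scores: suspicion scores in [0, 1], one per monitor (hypothetical scale).
    threshold: hypothetical cut-off at or above which the run is flagged.
    """
    if not scores:
        raise ValueError("at least one monitor score is required")
    avg = mean(scores)
    return avg, avg >= threshold


# Three hypothetical monitors disagree; the mean dampens the single low score.
avg, flagged = aggregate_monitor_scores([0.9, 0.75, 0.15])
print(f"mean suspicion={avg:.2f}, flagged={flagged}")
```

A mean is easy to reason about but lets one confident monitor be outvoted; taking the maximum score instead would trade more false alarms for fewer misses.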
Tele-LLM-Hub: Building Context-Aware Multi-Agent LLM Systems for Telecom Networks
Gajjar, Pranshav, Shen, Cong, Shah, Vijay K
This paper introduces Tele-LLM-Hub, a user-friendly low-code solution for rapid prototyping and deployment of context-aware multi-agent (MA) Large Language Model (LLM) systems tailored for 5G and beyond. As telecom wireless networks become increasingly complex, intelligent LLM applications must share a domain-specific understanding of network state. We propose TeleMCP, the Telecom Model Context Protocol, to enable structured and context-rich communication between agents in telecom environments. Tele-LLM-Hub actualizes TeleMCP through a low-code interface that supports agent creation, workflow composition, and interaction with software stacks such as srsRAN. Key components include a direct chat interface, a repository of pre-built systems, an Agent Maker leveraging finetuning with our RANSTRUCT framework, and an MA-Maker for composing MA workflows. The goal of Tele-LLM-Hub is to democratize the design of context-aware MA systems and accelerate innovation in next-generation wireless networks.
Towards Automatic Evaluation and Selection of PHI De-identification Models via Multi-Agent Collaboration
Wu, Guanchen, Chen, Zuhui, Xie, Yuzhang, Yang, Carl
Protected health information (PHI) de-identification is critical for enabling the safe reuse of clinical notes, yet evaluating and comparing PHI de-identification models typically depends on costly, small-scale expert annotations. We present TEAM-PHI, a multi-agent evaluation and selection framework that uses large language models (LLMs) to automatically measure de-identification quality and select the best-performing model without heavy reliance on gold labels. TEAM-PHI deploys multiple Evaluation Agents, each independently judging the correctness of PHI extractions and outputting structured metrics. Their results are then consolidated through an LLM-based majority voting mechanism that integrates diverse evaluator perspectives into a single, stable, and reproducible ranking. Experiments on a real-world clinical note corpus demonstrate that TEAM-PHI produces consistent and accurate rankings: despite variation across individual evaluators, LLM-based voting reliably converges on the same top-performing systems. Further comparison with ground-truth annotations and human evaluation confirms that the framework's automated rankings closely match supervised evaluation. By combining independent evaluation agents with LLM majority voting, TEAM-PHI offers a practical, secure, and cost-effective solution for automatic evaluation and best-model selection in PHI de-identification, even when ground-truth labels are limited.
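To make the consolidation step concrete, here is a minimal sketch of majority voting over evaluator rankings. It swaps in a plain counting vote rather than the LLM-based voting the abstract describes, and every name in it (`majority_vote_top_model`, the evaluator and model labels, the average-rank tie-break) is hypothetical. It also assumes each evaluator ranks the same non-empty set of models.

```python
from collections import Counter


def majority_vote_top_model(rankings):
    """Pick the model most often ranked first across evaluators.

    rankings: dict mapping evaluator name -> list of model names, best first.
    Ties on first-place votes are broken by the lowest average rank position.
    """
    if not rankings:
        raise ValueError("no rankings supplied")
    first_place = Counter(r[0] for r in rankings.values())
    top_count = max(first_place.values())
    tied = [m for m, c in first_place.items() if c == top_count]

    def avg_position(model):
        # Position 0 is best; assumes every ranking contains every model.
        return sum(r.index(model) for r in rankings.values()) / len(rankings)

    return min(tied, key=avg_position)


# Hypothetical evaluators disagree on order but mostly agree on the winner.
rankings = {
    "evaluator_a": ["model_x", "model_y", "model_z"],
    "evaluator_b": ["model_y", "model_x", "model_z"],
    "evaluator_c": ["model_x", "model_z", "model_y"],
}
print(majority_vote_top_model(rankings))
```

The example shows how individual evaluators can vary while the vote still settles on one top system, which is the behaviour the abstract reports.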
LLM-based Agents Suffer from Hallucinations: A Survey of Taxonomy, Methods, and Directions
Lin, Xixun, Ning, Yucheng, Zhang, Jingwen, Dong, Yan, Liu, Yilong, Wu, Yongxuan, Qi, Xiaohua, Sun, Nan, Shang, Yanmin, Wang, Kun, Cao, Pengfei, Wang, Qingyue, Zou, Lixin, Chen, Xu, Zhou, Chuan, Wu, Jia, Zhang, Peng, Wen, Qingsong, Pan, Shirui, Wang, Bin, Cao, Yanan, Chen, Kai, Hu, Songlin, Guo, Li
Driven by the rapid advancements of Large Language Models (LLMs), LLM-based agents have emerged as powerful intelligent systems capable of human-like cognition, reasoning, and interaction. These agents are increasingly being deployed across diverse real-world applications, including student education, scientific research, and financial analysis. However, despite their remarkable potential, LLM-based agents remain vulnerable to hallucination issues, which can result in erroneous task execution and undermine the reliability of the overall system design. Addressing this critical challenge requires a deep understanding and a systematic consolidation of recent advances on LLM-based agents. To this end, we present the first comprehensive survey of hallucinations in LLM-based agents. By carefully analyzing the complete workflow of agents, we propose a new taxonomy that identifies different types of agent hallucinations occurring at different stages. Furthermore, we conduct an in-depth examination of eighteen triggering causes underlying the emergence of agent hallucinations. Through a detailed review of a large number of existing studies, we summarize approaches for hallucination mitigation and detection, and highlight promising directions for future research. We hope this survey will inspire further efforts toward addressing hallucinations in LLM-based agents, ultimately contributing to the development of more robust and reliable agent systems.
Cao, K. Chen, S. Hu, and L. Guo are with Institute of Information Engineering, Chinese Academy of Sciences, School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China. K. Wang is with Nanyang Technological University, Singapore. Cao is with Institute of Automation, Chinese Academy of Sciences, Beijing, China. Q. Wang is with Hong Kong University of Science and Technology, Hong Kong, China. L. Zou is with School of Cyber Science and Engineering, Wuhan University, Wuhan, China. X. Chen is with Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, China. C. Zhou is with Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, China. J. Wu is with School of Computing, Faculty of Science and Engineering, Macquarie University, Sydney, Australia. Zhang is with the Cyberspace Institute of Advanced Technology, Guangzhou University, Guangzhou, China. Q. Wen is with Squirrel Ai Learning, Bellevue, USA. S. Pan is with School of Information and Communication Technology, Griffith University, Gold Coast, Australia. B. Wang is with Xiaomi Company, Beijing, China.
Skill-Aligned Fairness in Multi-Agent Learning for Collaboration in Healthcare
Ekpo, Promise Osaine, La, Brian, Wiener, Thomas, Agarwal, Saesha, Agrawal, Arshia, Gonzalez-Pumariega, Gonzalo, Molu, Lekan P., Taylor, Angelique
Fairness in multi-agent reinforcement learning (MARL) is often framed as a workload balance problem, overlooking agent expertise and the structured coordination required in real-world domains. In healthcare, equitable task allocation requires workload balance or expertise alignment to prevent burnout and overuse of highly skilled agents. Workload balance refers to distributing an approximately equal number of subtasks or equalised effort across healthcare workers, regardless of their expertise. We make two contributions to address this problem. First, we propose FairSkillMARL, a framework that defines fairness as the dual objective of workload balance and skill-task alignment. Second, we introduce MARLHospital, a customizable healthcare-inspired environment for modeling team compositions and energy-constrained scheduling impacts on fairness, as no existing simulators are well-suited for this problem. We conducted experiments to compare FairSkillMARL in conjunction with four standard MARL methods, and against two state-of-the-art fairness metrics. Our results suggest that fairness based solely on equal workload might lead to task-skill mismatches and highlight the need for more robust metrics that capture skill-task misalignment. Our work provides tools and a foundation for studying fairness in heterogeneous multi-agent systems where aligning effort with expertise is critical.
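The idea of fairness as a dual objective can be sketched with a toy score that combines a workload-balance term and a skill-task alignment term. This is not FairSkillMARL's actual metric: the use of Jain's fairness index for balance, the matched-task fraction for alignment, the `alpha` weighting, and the agent and skill names are all assumptions chosen for illustration.

```python
def jain_index(workloads):
    """Jain's fairness index; 1.0 means perfectly equal workloads."""
    total = sum(workloads)
    squares = sum(w * w for w in workloads)
    if squares == 0:
        return 1.0  # nobody did any work, so workloads are trivially equal
    return total * total / (len(workloads) * squares)


def skill_alignment(assignments, agent_skills):
    """Fraction of assigned tasks whose required skill the agent has.

    assignments: list of (agent, required_skill) pairs.
    agent_skills: dict mapping agent -> set of skills.
    """
    if not assignments:
        return 1.0
    matched = sum(1 for agent, skill in assignments if skill in agent_skills[agent])
    return matched / len(assignments)


def combined_fairness(assignments, agent_skills, alpha=0.5):
    """Hypothetical weighted sum of workload balance and skill alignment."""
    workloads = [sum(1 for a, _ in assignments if a == agent) for agent in agent_skills]
    balance = jain_index(workloads)
    alignment = skill_alignment(assignments, agent_skills)
    return alpha * balance + (1 - alpha) * alignment


# Hypothetical team: workloads are equal, but one task lands on the wrong agent.
agent_skills = {"nurse": {"triage"}, "surgeon": {"triage", "surgery"}}
assignments = [
    ("nurse", "triage"),
    ("nurse", "surgery"),  # mismatch: the nurse lacks the surgery skill
    ("surgeon", "surgery"),
    ("surgeon", "triage"),
]
print(jain_index([2, 2]), skill_alignment(assignments, agent_skills))
print(combined_fairness(assignments, agent_skills))
```

In this toy case the balance term alone reports perfect fairness, while the alignment term exposes the mismatch, which mirrors the abstract's point that equal workload can hide task-skill mismatches.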
GMAT: Grounded Multi-Agent Clinical Description Generation for Text Encoder in Vision-Language MIL for Whole Slide Image Classification
Quang, Ngoc Bui Lam, Binh, Nam Le Nguyen, Nguyen, Thanh-Huy, Nguyen, Le Thien Phuc, Nguyen, Quan, Bagci, Ulas
Multiple Instance Learning (MIL) is the leading approach for whole slide image (WSI) classification, enabling efficient analysis of gigapixel pathology slides. Recent work has introduced vision-language models (VLMs) into MIL pipelines to incorporate medical knowledge through text-based class descriptions rather than simple class names. However, when these methods rely on large language models (LLMs) to generate clinical descriptions or use fixed-length prompts to represent complex pathology concepts, the limited token capacity of VLMs often constrains the expressiveness and richness of the encoded class information. Additionally, descriptions generated solely by LLMs may lack domain grounding and fine-grained medical specificity, leading to suboptimal alignment with visual features. To address these challenges, we propose a vision-language MIL framework with two key contributions: (1) A grounded multi-agent description generation system that leverages curated pathology textbooks and agent specialization (e.g., morphology, spatial context) to produce accurate and diverse clinical descriptions; (2) A text encoding strategy using a list of descriptions rather than a single prompt, capturing fine-grained and complementary clinical signals for better alignment with visual features. Integrated into a VLM-MIL pipeline, our approach shows improved performance over single-prompt class baselines and achieves results comparable to state-of-the-art models, as demonstrated on renal and lung cancer datasets.
AgentArmor: Enforcing Program Analysis on Agent Runtime Trace to Defend Against Prompt Injection
Wang, Peiran, Liu, Yang, Lu, Yunfei, Cai, Yifeng, Chen, Hongbo, Yang, Qingyou, Zhang, Jie, Hong, Jue, Wu, Ye
Large Language Model (LLM) agents offer a powerful new paradigm for solving various problems by combining natural language reasoning with the execution of external tools. However, their dynamic and non-transparent behavior introduces critical security risks, particularly in the presence of prompt injection attacks. In this work, we propose treating agent runtime traces as structured programs with analyzable semantics. Building on this insight, we present AgentArmor, a program analysis framework that converts agent traces into graph-based intermediate representations of program dependencies (e.g., CFG, DFG, and PDG) and enforces security policies via a type system. AgentArmor consists of three key components: (1) a graph constructor that reconstructs the agent's runtime traces as graph-based intermediate representations describing control and data flow; (2) a property registry that attaches security-relevant metadata to the tools and data the agent interacts with; and (3) a type system that performs static inference and checking over the intermediate representation. By representing agent behavior as structured programs, AgentArmor enables program analysis for sensitive data flow, trust boundaries, and policy violations. We evaluate AgentArmor on the AgentDojo benchmark; the results show that AgentArmor can reduce the attack success rate (ASR) to 3%, with a utility drop of only 1%.
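The general idea of checking a policy over a trace's data-flow graph can be sketched as simple taint propagation. This is a deliberately small stand-in, not AgentArmor's type system: the two-level trust labels, the function names `propagate_trust` and `check_policy`, and the node names in the example trace are all hypothetical.

```python
from collections import deque


def propagate_trust(edges, source_labels):
    """Propagate 'untrusted' labels along data-flow edges.

    edges: list of (src, dst) node pairs from a hypothetical agent trace.
    source_labels: dict mapping node -> "trusted" or "untrusted".
    Unlabelled nodes start as trusted; a node becomes untrusted if any
    untrusted node flows into it, directly or transitively.
    """
    nodes = {n for edge in edges for n in edge} | set(source_labels)
    labels = {n: source_labels.get(n, "trusted") for n in nodes}
    successors = {}
    for src, dst in edges:
        successors.setdefault(src, []).append(dst)
    queue = deque(n for n in nodes if labels[n] == "untrusted")
    while queue:
        node = queue.popleft()
        for nxt in successors.get(node, []):
            if labels[nxt] != "untrusted":
                labels[nxt] = "untrusted"
                queue.append(nxt)
    return labels


def check_policy(labels, sensitive_sinks):
    """Return the sensitive sinks that receive untrusted data, sorted by name."""
    return sorted(s for s in sensitive_sinks if labels.get(s) == "untrusted")


# Hypothetical trace: web content flows into an email-sending tool call.
edges = [
    ("user_prompt", "plan"),
    ("plan", "read_calendar"),
    ("web_page", "summary"),
    ("summary", "send_email"),
]
labels = propagate_trust(edges, {"user_prompt": "trusted", "web_page": "untrusted"})
print(check_policy(labels, {"send_email", "read_calendar"}))
```

Here the check reports `send_email` because data derived from the untrusted web page reaches it, while `read_calendar` depends only on the trusted user prompt.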
Model Editing as a Double-Edged Sword: Steering Agent Ethical Behavior Toward Beneficence or Harm
Huang, Baixiang, Tan, Zhen, Wang, Haoran, Liu, Zijie, Li, Dawei, Payani, Ali, Liu, Huan, Chen, Tianlong, Shu, Kai
Agents based on Large Language Models (LLMs) have demonstrated strong capabilities across a wide range of tasks. However, deploying LLM-based agents in high-stakes domains comes with significant safety and ethical risks. Unethical behavior by these agents can directly result in serious real-world consequences, including physical harm and financial loss. To efficiently steer the ethical behavior of agents, we frame agent behavior steering as a model editing task, which we term Behavior Editing. Model editing is an emerging area of research that enables precise and efficient modifications to LLMs while preserving their overall capabilities. To systematically study and evaluate this approach, we introduce BehaviorBench, a multi-tier benchmark grounded in psychological moral theories. This benchmark supports both the evaluation and editing of agent behaviors across a variety of scenarios, with each tier introducing more complex and ambiguous scenarios. We first demonstrate that Behavior Editing can dynamically steer agents toward the target behavior within specific scenarios. Moreover, Behavior Editing enables not only scenario-specific local adjustments but also more extensive shifts in an agent's global moral alignment. We demonstrate that Behavior Editing can be used to promote ethical and benevolent behavior or, conversely, to induce harmful or malicious behavior. Through extensive evaluations of agents built on frontier LLMs, BehaviorBench validates the effectiveness of behavior editing across a wide range of models and scenarios. Our findings offer key insights into a new paradigm for steering agent behavior, highlighting both the promise and perils of Behavior Editing.
Macroprogramming: Concepts, State of the Art, and Opportunities of Macroscopic Behaviour Modelling
Macroprogramming refers to the theory and practice of conveniently expressing the macro(scopic) behaviour of a system using a single program. Macroprogramming approaches are motivated by the need to effectively capture global/system-level aspects and the collective behaviour of a set of interacting components, while abstracting over low-level details. In the past, this style of programming has been primarily adopted to describe the data-processing logic in wireless sensor networks; recently, research forums on spatial computing, collective adaptive systems, and Internet-of-Things have provided renewed interest in macro-approaches. However, related contributions are still fragmented and lack conceptual consistency. Therefore, to foster principled research, an integrated view of the field is provided, together with opportunities and challenges.
Microsoft's Agent 365 Tries to Be the AI Bot Boss
Microsoft still sees AI agents as the future of work, and the enterprise software giant wants companies to be able to manage those agents just like human employees. A new tool from Microsoft called Agent 365 is designed to help businesses control their growing collection of robotic helpers. Agent 365 is not a platform for making enterprise AI tools; it's a way to manage them, as if they were human employees. Companies using generative AI agents in their digital workplace can use Agent 365 to organize their growing sprawl of bots, keep tabs on how they're performing, and tweak their settings. The tool is rolling out today in Microsoft's early access program.