Agents
SentinelAgent: Graph-based Anomaly Detection in Multi-Agent Systems
He, Xu, Wu, Di, Zhai, Yan, Sun, Kun
The rise of large language model (LLM)-based multi-agent systems (MAS) introduces new security and reliability challenges. While these systems show great promise in decomposing and coordinating complex tasks, they also face multi-faceted risks across prompt manipulation, unsafe tool usage, and emergent agent miscoordination. Existing guardrail mechanisms offer only partial protection, primarily at the input-output level, and fall short in addressing systemic or multi-point failures in MAS. In this work, we present a system-level anomaly detection framework tailored for MAS, integrating structural modeling with runtime behavioral oversight. Our approach consists of two components. First, we propose a graph-based framework that models agent interactions as dynamic execution graphs, enabling semantic anomaly detection at node, edge, and path levels. Second, we introduce a pluggable SentinelAgent, an LLM-powered oversight agent that observes, analyzes, and intervenes in MAS execution based on security policies and contextual reasoning. By bridging abstract detection logic with actionable enforcement, our method detects not only single-point faults and prompt injections but also multi-agent collusion and latent exploit paths. We validate our framework through two case studies, including an email assistant and Microsoft's Magentic-One system, demonstrating its ability to detect covert risks and provide explainable root-cause attribution. Our work lays the foundation for more trustworthy, monitorable, and secure agent-based AI ecosystems.
Don't Just Follow MLLM Plans: Robust and Efficient Planning for Open-world Agents
Lee, Seungjoon, Kim, Suhwan, Oh, Minhyeon, Yoon, Youngsik, Ok, Jungseul
Developing autonomous agents capable of mastering complex, multi-step tasks in unpredictable, interactive environments presents a significant challenge. While Large Language Models (LLMs) offer promise for planning, existing approaches often rely on problematic internal knowledge or make unrealistic environmental assumptions. Although recent work explores learning planning knowledge, they still retain limitations due to partial reliance on external knowledge or impractical setups. Indeed, prior research has largely overlooked developing agents capable of acquiring planning knowledge from scratch, directly in realistic settings. While realizing this capability is necessary, it presents significant challenges, primarily achieving robustness given the substantial risk of incorporating LLMs' inaccurate knowledge. Moreover, efficiency is crucial for practicality as learning can demand prohibitive exploration. In response, we introduce Robust and Efficient Planning for Open-world Agents (REPOA), a novel framework designed to tackle these issues. REPOA features three key components: adaptive dependency learning and fine-grained failure-aware operation memory to enhance robustness to knowledge inaccuracies, and difficulty-based exploration to improve learning efficiency. Our evaluation in two established open-world testbeds demonstrates REPOA's robust and efficient planning, showcasing its capability to successfully obtain challenging late-game items that were beyond the reach of prior approaches.
Towards a Generalizable Bimanual Foundation Policy via Flow-based Video Prediction
Fan, Chenyou, Yan, Fangzheng, Bai, Chenjia, Wang, Jiepeng, Zhang, Chi, Wang, Zhen, Li, Xuelong
Learning a generalizable bimanual manipulation policy is extremely challenging for embodied agents due to the large action space and the need for coordinated arm movements. Existing approaches rely on Vision-Language-Action (VLA) models to acquire bimanual policies. However, transferring knowledge from single-arm datasets or pre-trained VLA models often fails to generalize effectively, primarily due to the scarcity of bimanual data and the fundamental differences between single-arm and bimanual manipulation. In this paper, we propose a novel bimanual foundation policy by fine-tuning the leading text-to-video models to predict robot trajectories and training a lightweight diffusion policy for action generation. Given the lack of embodied knowledge in text-to-video models, we introduce a two-stage paradigm that fine-tunes independent text-to-flow and flow-to-video models derived from a pre-trained text-to-video model. Specifically, optical flow serves as an intermediate variable, providing a concise representation of subtle movements between images. The text-to-flow model predicts optical flow to concretize the intent of language instructions, and the flow-to-video model leverages this flow for fine-grained video prediction. Our method mitigates the ambiguity of language in single-stage text-to-video prediction and significantly reduces the robot-data requirement by avoiding direct use of low-level actions. In experiments, we collect high-quality manipulation data for real dual-arm robot, and the results of simulation and real-world experiments demonstrate the effectiveness of our method.
Biological Pathway Guided Gene Selection Through Collaborative Reinforcement Learning
Azim, Ehtesamul, Wang, Dongjie, Hwang, Tae Hyun, Fu, Yanjie, Zhang, Wei
Gene selection in high-dimensional genomic data is essential for understanding disease mechanisms and improving therapeutic outcomes. Traditional feature selection methods effectively identify predictive genes but often ignore complex biological pathways and regulatory networks, leading to unstable and biologically irrelevant signatures. Prior approaches, such as Lasso-based methods and statistical filtering, either focus solely on individual gene-outcome associations or fail to capture pathway-level interactions, presenting a key challenge: how to integrate biological pathway knowledge while maintaining statistical rigor in gene selection? To address this gap, we propose a novel two-stage framework that integrates statistical selection with biological pathway knowledge using multi-agent reinforcement learning (MARL). First, we introduce a pathway-guided pre-filtering strategy that leverages multiple statistical methods alongside KEGG pathway information for initial dimensionality reduction. Next, for refined selection, we model genes as collaborative agents in a MARL framework, where each agent optimizes both predictive power and biological relevance. Our framework incorporates pathway knowledge through Graph Neural Network-based state representations, a reward mechanism combining prediction performance with gene centrality and pathway coverage, and collaborative learning strategies using shared memory and a centralized critic component. Extensive experiments on multiple gene expression datasets demonstrate that our approach significantly improves both prediction accuracy and biological interpretability compared to traditional methods.
Distributed Neural Policy Gradient Algorithm for Global Convergence of Networked Multi-Agent Reinforcement Learning
Dai, Pengcheng, Mo, Yuanqiu, Yu, Wenwu, Ren, Wei
This paper studies the networked multi-agent reinforcement learning (NMARL) problem, where the objective of agents is to collaboratively maximize the discounted average cumulative rewards. Different from the existing methods that suffer from poor expression due to linear function approximation, we propose a distributed neural policy gradient algorithm that features two innovatively designed neural networks, specifically for the approximate Q-functions and policy functions of agents. This distributed neural policy gradient algorithm consists of two key components: the distributed critic step and the decentralized actor step. In the distributed critic step, agents receive the approximate Q-function parameters from their neighboring agents via a time-varying communication networks to collaboratively evaluate the joint policy. In contrast, in the decentralized actor step, each agent updates its local policy parameter solely based on its own approximate Q-function. In the convergence analysis, we first establish the global convergence of agents for the joint policy evaluation in the distributed critic step. Subsequently, we rigorously demonstrate the global convergence of the overall distributed neural policy gradient algorithm with respect to the objective function. Finally, the effectiveness of the proposed algorithm is demonstrated by comparing it with a centralized algorithm through simulation in the robot path planning environment.
Estimating Misreporting in the Presence of Genuine Modification: A Causal Perspective
Zapzalka, Dylan, Chang, Trenton, Warrenburg, Lindsay, Park, Sae-Hwan, Shenfeld, Daniel K., Parikh, Ravi B., Wiens, Jenna, Makar, Maggie
In settings where ML models are used to inform the allocation of resources, agents affected by the allocation decisions might have an incentive to strategically change their features to secure better outcomes. While prior work has studied strategic responses broadly, disentangling misreporting from genuine modification remains a fundamental challenge. In this paper, we propose a causally-motivated approach to identify and quantify how much an agent misreports on average by distinguishing deceptive changes in their features from genuine modification. Our key insight is that, unlike genuine modification, misreported features do not causally affect downstream variables (i.e., causal descendants). We exploit this asymmetry by comparing the causal effect of misreported features on their causal descendants as derived from manipulated datasets against those from unmanipulated datasets. We formally prove identifiability of the misreporting rate and characterize the variance of our estimator. We empirically validate our theoretical results using a semi-synthetic and real Medicare dataset with misreported data, demonstrating that our approach can be employed to identify misreporting in real-world scenarios.
Combining Deep Architectures for Information Gain estimation and Reinforcement Learning for multiagent field exploration
Masiero, Emanuele, Trianni, Vito, Vizzari, Giuseppe, Ognibene, Dimitri
Precision agriculture requires efficient autonomous systems for crop monitoring, where agents must explore large-scale environments while minimizing resource consumption. This work addresses the problem as an active exploration task in a grid environment representing an agricultural field. Each cell may contain targets (e.g., damaged crops) observable from nine predefined points of view (POVs). Agents must infer the number of targets per cell using partial, sequential observations. We propose a two-stage deep learning framework. A pre-trained LSTM serves as a belief model, updating a probabilistic map of the environment and its associated entropy, which defines the expected information gain (IG). This allows agents to prioritize informative regions. A key contribution is the inclusion of a POV visibility mask in the input, preserving the Markov property under partial observability and avoiding revisits to already explored views. Three agent architectures were compared: an untrained IG-based agent selecting actions to maximize entropy reduction; a DQN agent using CNNs over local 3x3 inputs with belief, entropy, and POV mask; and a Double-CNN DQN agent with wider spatial context. Simulations on 20x20 maps showed that the untrained agent performs well despite its simplicity. The DQN agent matches this performance when the POV mask is included, while the Double-CNN agent consistently achieves superior exploration efficiency, especially in larger environments. Results show that uncertainty-aware policies leveraging entropy, belief states, and visibility tracking lead to robust and scalable exploration. Future work includes curriculum learning, multi-agent cooperation with shared rewards, transformer-based models, and intrinsic motivation mechanisms to further enhance learning efficiency and policy generalization.
VideoGameBench: Can Vision-Language Models complete popular video games?
Zhang, Alex L., Griffiths, Thomas L., Narasimhan, Karthik R., Press, Ofir
Vision-language models (VLMs) have achieved strong results on coding and math benchmarks that are challenging for humans, yet their ability to perform tasks that come naturally to humans--such as perception, spatial navigation, and memory management--remains understudied. Real video games are crafted to be intuitive for humans to learn and master by leveraging innate inductive biases, making them an ideal testbed for evaluating such capabilities in VLMs. To this end, we introduce VideoGameBench, a benchmark consisting of 10 popular video games from the 1990s that VLMs directly interact with in real-time. VideoGameBench challenges models to complete entire games with access to only raw visual inputs and a high-level description of objectives and controls, a significant departure from existing setups that rely on game-specific scaffolding and auxiliary information. We keep three of the games secret to encourage solutions that generalize to unseen environments. Our experiments show that frontier vision-language models struggle to progress beyond the beginning of each game. We find inference latency to be a major limitation of frontier models in the real-time setting; therefore, we introduce VideoGameBench Lite, a setting where the game pauses while waiting for the LM's next action. The best performing model, Gemini 2.5 Pro, completes only 0.48% of VideoGameBench and 1.6% of VideoGameBench Lite. We hope that the formalization of the human skills mentioned above into this benchmark motivates progress in these research directions.
LifelongAgentBench: Evaluating LLM Agents as Lifelong Learners
Zheng, Junhao, Cai, Xidi, Li, Qiuke, Zhang, Duzhen, Li, ZhongZhi, Zhang, Yingying, Song, Le, Ma, Qianli
Lifelong learning is essential for intelligent agents operating in dynamic environments. Current large language model (LLM)-based agents, however, remain stateless and unable to accumulate or transfer knowledge over time. Existing benchmarks treat agents as static systems and fail to evaluate lifelong learning capabilities. We present LifelongAgentBench, the first unified benchmark designed to systematically assess the lifelong learning ability of LLM agents. It provides skill-grounded, interdependent tasks across three interactive environments, Database, Operating System, and Knowledge Graph, with automatic label verification, reproducibility, and modular extensibility. Extensive experiments reveal that conventional experience replay has limited effectiveness for LLM agents due to irrelevant information and context length constraints. We further introduce a group self-consistency mechanism that significantly improves lifelong learning performance. We hope LifelongAgentBench will advance the development of adaptive, memory-capable LLM agents.
Compositional Automata Embeddings for Goal-Conditioned Reinforcement Learning
Goal-conditioned reinforcement learning is a powerful way to control an AI agent's behavior at runtime. That said, popular goal representations, e.g., target states or natural language, are either limited to Markovian tasks or rely on ambiguous task semantics. We propose representing temporal goals using compositions of deterministic finite automata (cDFAs) and use cDFAs to guide RL agents. On the other hand, cDFAs form a countably infinite concept class with Boolean semantics, and subtle changes to the automaton can result in very different tasks, making them difficult to condition agent behavior on. To address this, we observe that all paths through a DFA correspond to a series of reach-avoid tasks and propose pre-training graph neural network embeddings on "reach-avoid derived" DFAs.