Agents
LLM-Hanabi: Evaluating Multi-Agent Gameplays with Theory-of-Mind and Rationale Inference in Imperfect Information Collaboration Game
Liang, Fangzhou, Zheng, Tianshi, Chan, Chunkit, Yim, Yauwai, Song, Yangqiu
Effective multi-agent collaboration requires agents to infer the rationale behind others' actions, a capability rooted in Theory-of-Mind (ToM). While recent Large Language Models (LLMs) excel at logical inference, their ability to infer rationale in dynamic, collaborative settings remains under-explored. This study introduces LLM-Hanabi, a novel benchmark that uses the cooperative game Hanabi to evaluate the rationale inference and ToM of LLMs. Our framework features an automated evaluation system that measures both game performance and ToM proficiency. Across a range of models, we find a significant positive correlation between ToM and in-game success. Notably, first-order ToM (interpreting others' intent) correlates more strongly with performance than second-order ToM (predicting others' interpretations). These findings highlight that for effective AI collaboration, the ability to accurately interpret a partner's rationale is more critical than higher-order reasoning. We conclude that prioritizing first-order ToM is a promising direction for enhancing the collaborative capabilities of future models.
Video Game Level Design as a Multi-Agent Reinforcement Learning Problem
Earle, Sam, Jiang, Zehua, Vinitsky, Eugene, Togelius, Julian
Procedural Content Generation via Reinforcement Learning (PCGRL) offers a method for training controllable level designer agents without the need for human datasets, using metrics that serve as proxies for level quality as rewards. Existing PCGRL research focuses on single generator agents, which are bottlenecked by the need to frequently recalculate heuristics of level quality and by the agent's need to navigate around potentially large maps. By framing level generation as a multi-agent problem, we mitigate the efficiency bottleneck of single-agent PCGRL by reducing the number of reward calculations relative to the number of agent actions. We also find that multi-agent level generators are better able to generalize to out-of-distribution map shapes, which we argue is due to the generators' learning more local, modular design policies. We conclude that treating content generation as a distributed, multi-agent task is beneficial for generating functional artifacts at scale.
RobustFlow: Towards Robust Agentic Workflow Generation
Xu, Shengxiang, Zhang, Jiayi, Di, Shimin, Luo, Yuyu, Yao, Liang, Liu, Hanmo, Zhu, Jia, Liu, Fan, Zhang, Min-Ling
The automated generation of agentic workflows is a promising frontier for enabling large language models (LLMs) to solve complex tasks. However, our investigation reveals that the robustness of agentic workflows remains a critical, unaddressed challenge. Current methods often generate wildly inconsistent workflows when provided with instructions that are semantically identical but differently phrased. This brittleness severely undermines their reliability and trustworthiness for real-world applications. To quantitatively diagnose this instability, we propose metrics based on nodal and topological similarity to evaluate workflow consistency against common semantic variations such as paraphrasing and noise injection. We further propose a novel training framework, RobustFlow, that leverages preference optimization to teach models invariance to instruction variations. By training on sets of synonymous task descriptions, RobustFlow boosts workflow robustness scores to 70%-90%, which is a substantial improvement over existing approaches. The code is publicly available at https://github.com/DEFENSE-SEU/RobustFlow.
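As an illustration of what nodal and topological similarity between two generated workflows could look like, here is a minimal sketch. It is not the metric defined in the paper: it assumes a workflow is a directed graph stored as a dict from node name to successor list, and it uses Jaccard overlap of node sets and of edge sets as stand-ins. The names `jaccard`, `nodes_of`, `edges_of`, and the two example workflows are hypothetical.

```python
def jaccard(a, b):
    """Jaccard overlap of two collections; two empty collections count as identical."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def nodes_of(workflow):
    """All node names, including nodes that only appear as successors."""
    nodes = set(workflow)
    for successors in workflow.values():
        nodes.update(successors)
    return nodes


def edges_of(workflow):
    """All directed edges as (source, target) pairs."""
    return {(src, dst) for src, successors in workflow.items() for dst in successors}


# Hypothetical workflows produced from two paraphrases of the same instruction.
wf_a = {"plan": ["search"], "search": ["answer"], "answer": []}
wf_b = {"plan": ["search"], "search": ["verify"], "verify": ["answer"], "answer": []}

node_sim = jaccard(nodes_of(wf_a), nodes_of(wf_b))  # shared steps
edge_sim = jaccard(edges_of(wf_a), edges_of(wf_b))  # shared wiring
print(node_sim, edge_sim)
```

In this toy case the two workflows share most of their steps but little of their wiring, which is the kind of gap a topology-aware score is meant to expose and a node-only score would miss.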
AutoBnB-RAG: Enhancing Multi-Agent Incident Response with Retrieval-Augmented Generation
Incident response (IR) requires fast, coordinated, and well-informed decision-making to contain and mitigate cyber threats. While large language models (LLMs) have shown promise as autonomous agents in simulated IR settings, their reasoning is often limited by a lack of access to external knowledge. In this work, we present AutoBnB-RAG, an extension of the AutoBnB framework that incorporates retrieval-augmented generation (RAG) into multi-agent incident response simulations. Built on the Backdoors & Breaches (B&B) tabletop game environment, AutoBnB-RAG enables agents to issue retrieval queries and incorporate external evidence during collaborative investigations. We introduce two retrieval settings: one grounded in curated technical documentation (RAG-Wiki), and another using narrative-style incident reports (RAG-News). We evaluate performance across eight team structures, including newly introduced argumentative configurations designed to promote critical reasoning. To validate practical utility, we also simulate real-world cyber incidents based on public breach reports, demonstrating AutoBnB-RAG's ability to reconstruct complex multi-stage attacks. Our results show that retrieval augmentation improves decision quality and success rates across diverse organizational models. This work demonstrates the value of integrating retrieval mechanisms into LLM-based multi-agent systems for cybersecurity decision-making.
OpenCUA: Open Foundations for Computer-Use Agents
Wang, Xinyuan, Wang, Bowen, Lu, Dunjie, Yang, Junlin, Xie, Tianbao, Wang, Junli, Deng, Jiaqi, Guo, Xiaole, Xu, Yiheng, Wu, Chen Henry, Shen, Zhennan, Li, Zhuokai, Li, Ryan, Li, Xiaochuan, Chen, Junda, Zheng, Boyuan, Li, Peihang, Lei, Fangyu, Cao, Ruisheng, Fu, Yeqiao, Shin, Dongchan, Shin, Martin, Hu, Jiarui, Wang, Yuyan, Chen, Jixuan, Ye, Yuxiao, Zhang, Danyang, Du, Dikang, Hu, Hao, Chen, Huarong, Zhou, Zaida, Yao, Haotian, Chen, Ziwei, Gu, Qizheng, Wang, Yipu, Wang, Heng, Yang, Diyi, Zhong, Victor, Sung, Flood, Charles, Y., Yang, Zhilin, Yu, Tao
Vision-language models have demonstrated impressive capabilities as computer-use agents (CUAs) capable of automating diverse computer tasks. As their commercial potential grows, critical details of the most capable CUA systems remain closed. As these agents will increasingly mediate digital interactions and execute consequential decisions on our behalf, the research community needs access to open CUA frameworks to study their capabilities, limitations, and risks. To bridge this gap, we propose OpenCUA, a comprehensive open-source framework for scaling CUA data and foundation models. Our framework consists of: (1) an annotation infrastructure that seamlessly captures human computer-use demonstrations; (2) AgentNet, the first large-scale computer-use task dataset spanning 3 operating systems and 200+ applications and websites; and (3) a scalable pipeline that transforms demonstrations into state-action pairs with reflective long Chain-of-Thought reasoning that sustains robust performance gains as data scales. Our end-to-end agent models demonstrate strong performance across CUA benchmarks. In particular, OpenCUA-72B achieves an average success rate of 45.0% on OSWorld-Verified, establishing a new state-of-the-art (SOTA) among open-source models. Further analysis confirms that our approach generalizes well across domains and benefits significantly from increased test-time computation. We release our annotation tool, datasets, code, and models to build open foundations for further CUA research.
Who's the Mole? Modeling and Detecting Intention-Hiding Malicious Agents in LLM-Based Multi-Agent Systems
Xie, Yizhe, Zhu, Congcong, Zhang, Xinyue, Zhu, Tianqing, Ye, Dayong, Wang, Minghao, Liu, Chi
Multi-agent systems powered by Large Language Models (LLM-MAS) have demonstrated remarkable capabilities in collaborative problem-solving. However, their deployment also introduces new security risks. Existing research on LLM-based agents has primarily examined single-agent scenarios, while the security of multi-agent systems remains largely unexplored. To address this gap, we present a systematic study of intention-hiding threats in LLM-MAS. We design four representative attack paradigms that subtly disrupt task completion while maintaining a high degree of stealth, and evaluate them under centralized, decentralized, and layered communication structures. Experimental results show that these attacks are highly disruptive and can easily evade existing defense mechanisms. To counter these threats, we propose AgentXposed, a psychology-inspired detection framework. AgentXposed draws on the HEXACO personality model, which characterizes agents through psychological trait dimensions, and the Reid interrogation technique, a structured method for eliciting concealed intentions. By combining progressive questionnaire probing with behavior-based inter-agent monitoring, the framework enables the proactive identification of malicious agents before harmful actions are carried out. Extensive experiments across six datasets against both our proposed attacks and two baseline threats demonstrate that AgentXposed effectively detects diverse forms of malicious behavior, achieving strong robustness across multiple communication settings.
Optimas: Optimizing Compound AI Systems with Globally Aligned Local Rewards
Wu, Shirley, Sarthi, Parth, Zhao, Shiyu, Lee, Aaron, Shandilya, Herumb, Grobelnik, Adrian Mladenic, Choudhary, Nurendra, Huang, Eddie, Subbian, Karthik, Zhang, Linjun, Yang, Diyi, Zou, James, Leskovec, Jure
Compound AI systems integrating multiple components, such as Large Language Models, specialized tools, and traditional machine learning models, are increasingly deployed to solve complex real-world tasks. However, optimizing compound systems remains challenging due to their non-differentiable structures and diverse configuration types across components, including prompts, hyperparameters, and model parameters. To address this challenge, we propose Optimas, a unified framework for effective optimization of compound systems. The core idea of Optimas is to maintain one Local Reward Function (LRF) per component, each satisfying a local-global alignment property, i.e., each component's local reward correlates with the global system performance. In each iteration, Optimas efficiently adapts the LRFs to maintain this property while simultaneously maximizing each component's local reward. This approach enables independent updates of heterogeneous configurations using the designated optimization method, while ensuring that local improvements consistently lead to performance gains. We present extensive evaluations across five real-world compound systems to demonstrate that Optimas outperforms strong baselines by an average improvement of 11.92%, offering a general and effective approach for improving compound systems. Our website is at https://optimas.stanford.edu.
Chasing Moving Targets with Online Self-Play Reinforcement Learning for Safer Language Models
Liu, Mickel, Jiang, Liwei, Liang, Yancheng, Du, Simon Shaolei, Choi, Yejin, Althoff, Tim, Jaques, Natasha
Conventional language model (LM) safety alignment relies on a reactive, disjoint procedure: attackers exploit a static model, followed by defensive fine-tuning to patch exposed vulnerabilities. This sequential approach creates a mismatch -- attackers overfit to obsolete defenses, while defenders perpetually lag behind emerging threats. To address this, we propose Self-RedTeam, an online self-play reinforcement learning algorithm where an attacker and defender agent co-evolve through continuous interaction. We cast safety alignment as a two-player zero-sum game, where a single model alternates between attacker and defender roles -- generating adversarial prompts and safeguarding against them -- while a reward LM adjudicates outcomes. This enables dynamic co-adaptation. Grounded in the game-theoretic framework of zero-sum games, we establish a theoretical safety guarantee which motivates the design of our method: if self-play converges to a Nash Equilibrium, the defender will reliably produce safe responses to any adversarial input. Empirically, Self-RedTeam uncovers more diverse attacks (+21.8% SBERT) compared to attackers trained against static defenders and achieves higher robustness on safety benchmarks (e.g., +65.5% on WildJailBreak) than defenders trained against static attackers. We further propose hidden Chain-of-Thought, allowing agents to plan privately, which boosts adversarial diversity and reduces over-refusals. Our results motivate a shift from reactive patching to proactive co-evolution in LM safety training, enabling scalable, autonomous, and robust self-improvement of LMs via multi-agent reinforcement learning (MARL).
Language Agents Mirror Human Causal Reasoning Biases. How Can We Help Them Think Like Scientists?
GX-Chen, Anthony, Lin, Dongyan, Samiei, Mandana, Precup, Doina, Richards, Blake A., Fergus, Rob, Marino, Kenneth
Language model (LM) agents are increasingly used as autonomous decision-makers which need to actively gather information to guide their decisions. A crucial cognitive skill for such agents is the efficient exploration and understanding of the causal structure of the world -- key to robust, scientifically grounded reasoning. Yet, it remains unclear whether LMs possess this capability or exhibit systematic biases leading to erroneous conclusions. In this work, we examine LMs' ability to explore and infer causal relationships, using the well-established Blicket Test paradigm from developmental psychology. We find that LMs reliably infer the common, intuitive disjunctive causal relationships but systematically struggle with the unusual, yet equally (or sometimes even more) evidenced conjunctive ones. This "disjunctive bias" persists across model families, sizes, and prompting strategies, and performance further declines as task complexity increases. Interestingly, an analogous bias appears in human adults, suggesting that LMs may have inherited deep-seated reasoning heuristics from their training data. To this end, we quantify similarities between LMs and humans, finding that LMs exhibit adult-like inference profiles (but not child-like). Finally, we propose a test-time sampling method which explicitly samples and eliminates hypotheses about causal relationships from the LM. This scalable approach significantly reduces the disjunctive bias and moves LMs closer to the goal of scientific, causally rigorous reasoning.
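To make the disjunctive/conjunctive distinction and the idea of eliminating hypotheses concrete, here is a toy sketch. It is not the authors' sampling method: it enumerates every hypothesis instead of sampling from an LM, and it assumes three objects, a detector that lights up if any blicket is present (disjunctive) or only if all blickets are present (conjunctive), and a hand-written observation list. All names are hypothetical.

```python
from itertools import combinations

OBJECTS = ("A", "B", "C")


def detector(on_detector, blickets, rule):
    """Return True if the detector lights up for the objects placed on it."""
    present = blickets & on_detector
    if rule == "disjunctive":
        return len(present) >= 1   # any single blicket is enough
    return present == blickets     # conjunctive: every blicket must be present


def all_hypotheses():
    """Every (non-empty blicket set, rule) pair over OBJECTS."""
    hyps = []
    for rule in ("disjunctive", "conjunctive"):
        for k in range(1, len(OBJECTS) + 1):
            for subset in combinations(OBJECTS, k):
                hyps.append((frozenset(subset), rule))
    return hyps


def eliminate(hyps, observations):
    """Keep only hypotheses that reproduce every observed detector outcome."""
    return [
        (blickets, rule)
        for blickets, rule in hyps
        if all(detector(frozenset(objs), blickets, rule) == lit
               for objs, lit in observations)
    ]


# Hypothetical evidence: no single object activates the detector, A and B together do.
observations = [
    ({"A"}, False),
    ({"B"}, False),
    ({"A", "B"}, True),
    ({"C"}, False),
]
remaining = eliminate(all_hypotheses(), observations)
print(remaining)
```

With this evidence every disjunctive hypothesis is ruled out and only the conjunctive hypothesis with blickets A and B survives, which is the kind of conclusion the abstract reports LMs tend to miss.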
OneDSE: A Unified Microprocessor Metric Prediction and Design Space Exploration Framework
Raj, Ritik, Ramachandran, Akshat, Nye, Jeff, Nemawarkar, Shashank, Krishna, Tushar
With the slowing of Moore's Law and the increasing impact of power constraints, processor designs rely on architectural innovation to achieve differentiating performance. However, the innovation complexity has simultaneously increased the design space of modern high-performance processors. Specifically, we identify two key challenges in prior Design Space Exploration (DSE) approaches for modern CPU design: (a) the cost model (prediction method) is either slow, microarchitecture-specific, or workload-specific, and a single model is inefficient at learning the whole design space; (b) optimization (exploration method) is slow and inaccurate in the large CPU parameter space. This work presents a novel solution called OneDSE to address these emerging challenges in modern CPU design. OneDSE is a unified cost model (metric predictor) and optimizer (CPU parameter explorer) with three key techniques: 1. A Transformer-based workload-Aware CPU Estimation (TrACE) framework to predict metrics in the parameter space (TrACE-p) and parameters in the metric space (TrACE-m). TrACE-p outperforms state-of-the-art (SOTA) IPC prediction methods by 5.71x and 28x for single and multiple workloads respectively while being two orders of magnitude faster. 2. We also propose a novel Metric spAce Search opTimizer (MAST) that leverages TrACE-m and outperforms SOTA metaheuristics by 1.19x while being an order of magnitude faster. 3. We propose Subsystem-based Multi-Agent Reinforcement-learning based fine-Tuning (SMART)-TrACE that achieves a 10.6% reduction in prediction error compared to TrACE, enabling more accurate and efficient exploration of the CPU design space.