Agents
Magentic Marketplace: An Open-Source Environment for Studying Agentic Markets
Bansal, Gagan, Hua, Wenyue, Huang, Zezhou, Fourney, Adam, Swearngin, Amanda, Epperson, Will, Payne, Tyler, Hofman, Jake M., Lucier, Brendan, Singh, Chinmay, Mobius, Markus, Nambi, Akshay, Yadav, Archana, Gao, Kevin, Rothschild, David M., Slivkins, Aleksandrs, Goldstein, Daniel G., Mozannar, Hussein, Immorlica, Nicole, Murad, Maya, Vogel, Matthew, Kambhampati, Subbarao, Horvitz, Eric, Amershi, Saleema
As LLM agents advance, they are increasingly mediating economic decisions, ranging from product discovery to transactions, on behalf of users. Such applications promise benefits but also raise many questions about agent accountability and value for users. Addressing these questions requires understanding how agents behave in realistic market conditions. However, previous research has largely evaluated agents in constrained settings, such as single-task marketplaces (e.g., negotiation) or structured two-agent interactions. Real-world markets are fundamentally different: they require agents to handle diverse economic activities and coordinate within large, dynamic ecosystems where multiple agents with opaque behaviors may engage in open-ended dialogues. To bridge this gap, we investigate two-sided agentic marketplaces where Assistant agents represent consumers and Service agents represent competing businesses. To study these interactions safely, we develop Magentic-Marketplace -- a simulated environment where Assistants and Services can operate. This environment enables us to study key market dynamics: the utility agents achieve, behavioral biases, vulnerability to manipulation, and how search mechanisms shape market outcomes. Our experiments show that frontier models can approach optimal welfare -- but only under ideal search conditions. Performance degrades sharply with scale, and all models exhibit severe first-proposal bias, creating 10-30x advantages for response speed over quality. These findings reveal how behaviors emerge across market conditions, informing the design of fair and efficient agentic marketplaces.
Completion $\neq$ Collaboration: Scaling Collaborative Effort with Agents
Shen, Shannon Zejiang, Chen, Valerie, Gu, Ken, Ross, Alexis, Ma, Zixian, Ross, Jillian, Gu, Alex, Si, Chenglei, Chi, Wayne, Peng, Andi, Shen, Jocelyn J, Talwalkar, Ameet, Wu, Tongshuang, Sontag, David
Current evaluations of agents remain centered around one-shot task completion, failing to account for the inherently iterative and collaborative nature of many real-world problems, where human goals are often underspecified and evolve. We argue for a shift from building and assessing task completion agents to developing collaborative agents, assessed not only by the quality of their final outputs but by how well they engage with and enhance human effort throughout the problem-solving process. To support this shift, we introduce collaborative effort scaling, a framework that captures how an agent's utility grows with increasing user involvement. Through case studies and simulated evaluations, we show that state-of-the-art agents often underperform in multi-turn, real-world scenarios, revealing a missing ingredient in agent design: the ability to sustain engagement and scaffold user understanding. Collaborative effort scaling offers a lens for diagnosing agent behavior and guiding development toward more effective interactions.
Beyond Reactivity: Measuring Proactive Problem Solving in LLM Agents
Pasternak, Gil, Rajagopal, Dheeraj, White, Julia, Atreja, Dhruv, Thomas, Matthew, Hurn-Maloney, George, Lewis, Ash
From these personas, we synthetically construct comprehensive world models that encode: workplace hierarchy and relationship context; work patterns and communication styles; the available action space $A$ with corresponding parameter spaces $P$; and pain points and operational constraints. For instance, given a senior account manager with 20 years of client-facing experience as shown in Figure 2, the world model might identify "client documentation upkeep" as a pain point, while also modeling specific client relationships and their respective engagement contexts. Bottleneck Generation: Using the contextualized world model, we generate a bottleneck $b$: a persona-relevant, actionable user need that satisfies our formal definition (see Section 2). Each bottleneck $b$ is designed to be identifiable through evidence $T$ in the document set $D$ and resolvable through exactly one action $a \in A$. User Datastore: For each sample $S$, we construct the document set $D = T \cup K$. The true positives $T$ (documents where $f(d) = 1$) collectively provide sufficient evidence to identify bottleneck $b$. The distractors $K$ are documents where $f(d) = 0$, introducing realistic noise with respect to the bottleneck. In our current datastore setup, all generated documents are emails, calendar events, or text documents, as exemplified in Figures 1 and 2. To mirror real-world complexity, we employ two key design principles: (i) Evidence distribution: We often distribute evidence for $b$ across multiple documents in $T$, requiring agents to synthesize information from $t$ different sources.
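The datastore construction described in this excerpt can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the `Document` class, the `is_evidence` field (standing in for the labeling function f), and the `build_datastore` helper are hypothetical names introduced only for illustration, and shuffling the combined set is an assumed way of interleaving evidence with noise.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    """Hypothetical document record; the field names are assumptions."""
    doc_id: str
    kind: str          # "email", "calendar", or "text"
    is_evidence: bool  # stands in for the label f(d) = 1


def f(d: Document) -> int:
    """Relevance label: 1 for true positives, 0 for distractors."""
    return 1 if d.is_evidence else 0


def build_datastore(true_positives, distractors, seed=0):
    """Form D as the union of T and K, shuffled so that evidence is
    interleaved with noise (the shuffle is an assumption)."""
    docs = list(true_positives) + list(distractors)
    random.Random(seed).shuffle(docs)
    return docs


# Evidence for one bottleneck is spread over several documents in T.
T = [Document("t1", "email", True), Document("t2", "calendar", True)]
K = [Document("k1", "text", False),
     Document("k2", "email", False),
     Document("k3", "calendar", False)]

D = build_datastore(T, K)
recovered_T = {d for d in D if f(d) == 1}
```

Filtering `D` with `f` recovers exactly the evidence set `T`, which is the property an agent would have to approximate without access to the labels.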
Collab-REC: An LLM-based Agentic Framework for Balancing Recommendations in Tourism
Banerjee, Ashmi, Satish, Adithi, Aisyah, Fitri Nur, Wörndl, Wolfgang, Deldjoo, Yashar
We propose Collab-REC, a multi-agent framework designed to counteract popularity bias and enhance diversity in tourism recommendations. In our setting, three LLM-based agents -- Personalization, Popularity, and Sustainability -- generate city suggestions from complementary perspectives. A non-LLM moderator then merges and refines these proposals via multi-round negotiation, ensuring each agent's viewpoint is incorporated while penalizing spurious or repeated responses. Experiments on European city queries show that Collab-REC improves diversity and overall relevance compared to a single-agent baseline, surfacing lesser-visited locales that often remain overlooked. This balanced, context-aware approach addresses over-tourism and better aligns with constraints provided by the user, highlighting the promise of multi-stakeholder collaboration in LLM-driven recommender systems.
Microsoft's newest AI agent lets you 'vibe code' apps and automations
It's all done using conversational language and the whole process should only take a few minutes. Microsoft has launched App Builder, a new AI agent that can be used to create apps using conversational language in just a few minutes. App Builder is based on Copilot and has a familiar workflow: describe what the app should do and it handles the rest automatically. Microsoft says that App Builder can be used to create apps that assign tasks, track milestones, and view campaign progress; to automate flows like sending daily updates and posting reminders in Teams channels; and to build AI agents that use SharePoint resources and Teams conversations as their basis of knowledge and training.
OS-Harm: A Benchmark for Measuring Safety of Computer Use Agents
Kuntz, Thomas, Duzan, Agatha, Zhao, Hao, Croce, Francesco, Kolter, Zico, Flammarion, Nicolas, Andriushchenko, Maksym
Computer use agents are LLM-based agents that can directly interact with a graphical user interface, by processing screenshots or accessibility trees. While these systems are gaining popularity, their safety has been largely overlooked, despite the fact that evaluating and understanding their potential for harmful behavior is essential for widespread adoption. To address this gap, we introduce OS-Harm, a new benchmark for measuring safety of computer use agents. OS-Harm is built on top of the OSWorld environment and aims to test models across three categories of harm: deliberate user misuse, prompt injection attacks, and model misbehavior. To cover these cases, we create 150 tasks that span several types of safety violations (harassment, copyright infringement, disinformation, data exfiltration, etc.) and require the agent to interact with a variety of OS applications (email client, code editor, browser, etc.). Moreover, we propose an automated judge to evaluate both accuracy and safety of agents that achieves high agreement with human annotations (0.76 and 0.79 F1 score). We evaluate computer use agents based on a range of frontier models - such as o4-mini, Claude 3.7 Sonnet, Gemini 2.5 Pro - and provide insights into their safety. In particular, all models tend to directly comply with many deliberate misuse queries, are relatively vulnerable to static prompt injections, and occasionally perform unsafe actions. The OS-Harm benchmark is available at https://github.com/tml-epfl/os-harm.
Integrating Counterfactual Simulations with Language Models for Explaining Multi-Agent Behaviour
Gyevnár, Bálint, Lucas, Christopher G., Albrecht, Stefano V., Cohen, Shay B.
Autonomous multi-agent systems (MAS) are useful for automating complex tasks but raise trust concerns due to risks such as miscoordination or goal misalignment. Explainability is vital for users' trust calibration, but explainable MAS face challenges due to complex environments, the human factor, and non-standardised evaluation. Leveraging the counterfactual effect size model and LLMs, we propose Agentic eXplanations via Interrogative Simulation (AXIS). AXIS generates human-centred action explanations for multi-agent policies by having an LLM interrogate an environment simulator using prompts like 'whatif' and 'remove' to observe and synthesise counterfactual information over multiple rounds. We evaluate AXIS on autonomous driving across ten scenarios for five LLMs with a comprehensive methodology combining robustness, subjective preference, correctness, and goal/action prediction with an external LLM as evaluator. Compared to baselines, AXIS improves perceived explanation correctness by at least 7.7% across all models and goal prediction accuracy by 23% for four models, with comparable action prediction accuracy, achieving the highest scores overall. Our code is open-sourced at https://github.com/gyevnarb/axis.
TheraMind: A Strategic and Adaptive Agent for Longitudinal Psychological Counseling
Hu, He, Zhou, Yucheng, Ma, Chiyuan, Wang, Qianning, Zhang, Zheng, Ma, Fei, Cui, Laizhong, Tian, Qi
Large language models (LLMs) in psychological counseling have attracted increasing attention. However, existing approaches often lack emotional understanding, adaptive strategies, and the use of therapeutic methods across multiple sessions with long-term memory, leaving them far from real clinical practice. To address these critical gaps, we introduce TheraMind, a strategic and adaptive agent for longitudinal psychological counseling. The cornerstone of TheraMind is a novel dual-loop architecture that decouples the complex counseling process into an Intra-Session Loop for tactical dialogue management and a Cross-Session Loop for strategic therapeutic planning. The Intra-Session Loop perceives the patient's emotional state to dynamically select response strategies while leveraging cross-session memory to ensure continuity. Crucially, the Cross-Session Loop empowers the agent with long-term adaptability by evaluating the efficacy of the applied therapy after each session and adjusting the method for subsequent interactions. We validate our approach in a high-fidelity simulation environment grounded in real clinical cases. Extensive evaluations show that TheraMind outperforms other methods, especially on multi-session metrics like Coherence, Flexibility, and Therapeutic Attunement, validating the effectiveness of its dual-loop design in emulating strategic, adaptive, and longitudinal therapeutic behavior. The code is publicly available at https://0mwwm0.github.io/TheraMind/.
Counterfactual-based Agent Influence Ranker for Agentic AI Workflows
Giloni, Amit, Picardi, Chiara, Betser, Roy, Bose, Shamik, Sabapathy, Aishvariya Priya Rathina, Vainshtein, Roman
An Agentic AI Workflow (AAW), also known as an LLM-based multi-agent system, is an autonomous system that assembles several LLM-based agents to work collaboratively towards a shared goal. The high autonomy, widespread adoption, and growing interest in such AAWs highlight the need for a deeper understanding of their operations, from both quality and security aspects. To this day, there are no existing methods to assess the influence of each agent on the AAW's final output. Adopting techniques from related fields is not feasible since existing methods perform only static structural analysis, which is unsuitable for inference time execution. We present Counterfactual-based Agent Influence Ranker (CAIR) - the first method for assessing the influence level of each agent on the AAW's output and determining which agents are the most influential. By performing counterfactual analysis, CAIR provides a task-agnostic analysis that can be used both offline and at inference time. We evaluate CAIR using an AAWs dataset of our creation, containing 30 different use cases with 230 different functionalities. Our evaluation showed that CAIR produces consistent rankings, outperforms baseline methods, and can easily enhance the effectiveness and relevancy of downstream tasks.
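The general idea of ranking agents by counterfactual influence can be sketched on a toy workflow. This is not CAIR itself, whose procedure the abstract does not specify. The sketch assumes a linear pipeline of numeric "agents", perturbs one agent's output at a time, and scores each agent by how far the final output moves. All names (`run_workflow`, `rank_influence`, the toy agents) are hypothetical.

```python
from typing import Callable, Dict, List, Optional, Tuple

Agent = Callable[[float], float]


def run_workflow(agents: Dict[str, Agent], x: float,
                 perturbed: Optional[str] = None,
                 perturb: Optional[Callable[[float], float]] = None) -> float:
    """Run agents in insertion order, optionally altering one agent's output."""
    for name, agent in agents.items():
        x = agent(x)
        if name == perturbed and perturb is not None:
            x = perturb(x)  # counterfactual: this agent "said" something else
    return x


def rank_influence(agents: Dict[str, Agent], x: float,
                   perturb: Callable[[float], float]
                   ) -> Tuple[List[str], Dict[str, float]]:
    """Score each agent by the change in final output under its perturbation."""
    baseline = run_workflow(agents, x)
    scores = {
        name: abs(run_workflow(agents, x, name, perturb) - baseline)
        for name in agents
    }
    ranking = sorted(scores, key=scores.get, reverse=True)
    return ranking, scores


# Toy stand-ins for LLM agents; real agents would exchange text, not floats.
toy_agents: Dict[str, Agent] = {
    "planner": lambda v: v + 1.0,
    "amplifier": lambda v: v * 10.0,
    "damper": lambda v: v * 0.5,
}

ranking, scores = rank_influence(toy_agents, 1.0, perturb=lambda out: out + 1.0)
print(ranking)  # ['planner', 'damper', 'amplifier']
```

In this toy case the planner ranks first because its perturbed output is amplified by every downstream agent. With text-producing agents, the absolute difference would have to be replaced by some distance between outputs, which is left open here.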
Incorporating Social Awareness into Control of Unknown Multi-Agent Systems: A Real-Time Spatiotemporal Tubes Approach
Upadhyay, Siddhartha, Das, Ratnangshu, Jagtap, Pushpak
This paper presents a decentralized control framework that incorporates social awareness into multi-agent systems with unknown dynamics to achieve prescribed-time reach-avoid-stay tasks in dynamic environments. Each agent is assigned a social awareness index that quantifies its level of cooperation or self-interest, allowing heterogeneous social behaviors within the system. Building on the spatiotemporal tube (STT) framework, we propose a real-time STT framework that synthesizes tubes online for each agent while capturing its social interactions with others. A closed-form, approximation-free control law is derived to ensure that each agent remains within its evolving STT, thereby avoiding dynamic obstacles while also preventing inter-agent collisions in a socially aware manner, and reaching the target within a prescribed time. The proposed approach provides formal guarantees on safety and timing, and is computationally lightweight, model-free, and robust to unknown disturbances. The effectiveness and scalability of the framework are validated through simulation and hardware experiments on a 2D omnidirectional