DoorDash Reservations Scored America's Most Exclusive Restaurants

WIRED

After the rise (and fall) of reservation scalping, DoorDash and a host of apps are fighting to book you a seat at the country's most exclusive restaurants. At The Eighty-Six in Manhattan, exclusivity is the point. The luxe, 11-table steakhouse is the sort of place that lavishes caviar and aged mimolette cheese on its potatoes, and crows that your market-price duck was raised by one Dr. […]. Taylor Swift has reportedly dined there in a Miu Miu skirt. Reservations are a scarce commodity, and New York law forbids you from selling one. "Access is the main asset," wrote food writer Helen Rosner in a recent New Yorker review of The Eighty-Six. "The product is the door, and what a door!"



Why A.I. Didn't Transform Our Lives in 2025

The New Yorker

This was supposed to be the year when autonomous agents took over everyday tasks. One year ago, Sam Altman, the C.E.O. of OpenAI, made a bold prediction: "We believe that, in 2025, we may see the first AI agents 'join the workforce' and materially change the output of companies." A couple of weeks later, the company's chief product officer, Kevin Weil, said at the World Economic Forum conference at Davos in January, "I think 2025 is the year that we go from ChatGPT being this super smart thing . . . to ChatGPT doing things in the real world for you." He gave examples of artificial intelligence filling out online forms and booking restaurant reservations. He later promised, "We're going to be able to do that, no question."


SABER: Small Actions, Big Errors -- Safeguarding Mutating Steps in LLM Agents

Cuadron, Alejandro, Yu, Pengfei, Liu, Yang, Gupta, Arpit

arXiv.org Artificial Intelligence

Despite rapid progress in LLM agents, performance on long-horizon, tool-using tasks remains fragile. To better understand this fragility, we ask a simple question: do all actions contribute equally to failure? Analyzing execution traces on τ-Bench (Airline/Retail) and SWE-Bench Verified, we decompose trajectories into mutating (environment-changing) vs. non-mutating steps and formalize decisive deviations: the earliest action-level divergences that flip success to failure. A logistic regression reveals that each additional deviation in a mutating action reduces the odds of success by up to 92% on Airline and up to 96% on Retail for SoTA models. In contrast, deviations in non-mutating actions have little to no effect. Errors also grow with context length as agents drift from role and act on stale constraints. Motivated by these observations, we introduce SABER, a model-agnostic, gradient-free, test-time safeguard that (i) adds mutation-gated verification, (ii) injects Targeted Reflection before mutating steps, and (iii) performs block-based context cleaning. SABER delivers consistent gains, e.g., Qwen3-Thinking: +28% relative on Airline, +11% on Retail, and +7% on SWE-Bench Verified; Claude: +9%/+7%. We further identify ceiling effects in τ-Bench, where annotation errors and underspecified tasks artificially cap model performance. To address this, we release τ-Bench Verified, which restores benchmark headroom through targeted revisions. Our results argue for action-level analysis, targeted safeguards, and reliable evaluations as prerequisites for robust multi-turn agents.
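The core gating idea — verify only before environment-changing actions, let read-only actions pass through — can be sketched as follows. This is a minimal illustration, not the paper's implementation; the tool names and the toy `verify` heuristic are assumptions.

```python
# Hypothetical sketch of mutation-gated verification: read-only tool calls
# pass straight through, while mutating (environment-changing) calls must
# first pass a verification check.

MUTATING_TOOLS = {"book_flight", "cancel_reservation", "update_address"}

def is_mutating(tool_name: str) -> bool:
    """Gate: only environment-changing actions need verification."""
    return tool_name in MUTATING_TOOLS

def execute(tool_name: str, args: dict, verify, run):
    """Run `verify` before any mutating action; refuse on failure."""
    if is_mutating(tool_name) and not verify(tool_name, args):
        raise ValueError(f"verification failed for mutating action {tool_name!r}")
    return run(tool_name, args)

# Toy verifier and executor for illustration: a booking must name a passenger.
def verify(tool, args):
    return "passenger" in args

def run(tool, args):
    return f"{tool} ok"
```

In this sketch, `execute("search_flights", {}, verify, run)` succeeds without any check, while `execute("book_flight", {}, verify, run)` is blocked before it can change the environment.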


Non-Collaborative User Simulators for Tool Agents

Shim, Jeonghoon, Song, Woojung, Jin, Cheyon, Kook, Seungwon, Jo, Yohan

arXiv.org Artificial Intelligence

Tool agents interact with users through multi-turn dialogues to accomplish various tasks. Recent studies have adopted user simulation methods to develop these agents in multi-turn settings. However, existing user simulators tend to be agent-friendly, exhibiting only cooperative behaviors, which fails to train and test agents against non-collaborative users in the real world. To address this, we propose a novel user simulator architecture that simulates four categories of non-collaborative behaviors: requesting unavailable services, digressing into tangential conversations, expressing impatience, and providing incomplete utterances. Our user simulator can simulate challenging and natural non-collaborative behaviors while reliably delivering all intents and information necessary to accomplish the task. Our experiments on MultiWOZ and τ-bench reveal significant performance degradation in state-of-the-art tool agents when encountering non-collaborative users. We provide detailed analyses of agents' weaknesses under each non-collaborative condition, such as escalated hallucinations and dialogue breakdowns. Ultimately, we contribute an easily extensible user simulation framework to help the research community develop tool agents and preemptively diagnose them under challenging real-world conditions within their own services.
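The key property described above — inject non-collaborative behavior while still guaranteeing every intent is eventually delivered — can be sketched with a tiny simulator. This is an illustrative toy, not the paper's architecture; the canned utterances and probability parameter are invented.

```python
# Illustrative sketch: a user simulator that interleaves one of the four
# non-collaborative behavior categories with the real intents, while
# guaranteeing that every intent is eventually delivered.
import random

BEHAVIORS = {
    "unavailable_request": "Can you also book me a helicopter?",   # service the agent lacks
    "digression": "By the way, did you see the game last night?",  # tangential chat
    "impatience": "This is taking forever. Hurry up.",             # expressed impatience
    "incomplete": "I need to change my...",                        # utterance cut off
}

class NonCollaborativeUser:
    def __init__(self, intents, p=0.5, seed=0):
        self.pending = list(intents)     # intents that must all be delivered
        self.rng = random.Random(seed)   # seeded for reproducible dialogues
        self.p = p                       # chance of a non-collaborative turn

    def next_utterance(self):
        if self.pending and self.rng.random() < self.p:
            kind = self.rng.choice(sorted(BEHAVIORS))
            return ("non_collaborative", BEHAVIORS[kind])
        if self.pending:
            return ("intent", self.pending.pop(0))
        return ("done", None)
```

Driving the simulator in a loop yields a dialogue peppered with distractions, yet the full intent list always arrives in order, which is what makes task completion still measurable.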


Reading Between the Lines: The One-Sided Conversation Problem

Ebert, Victoria, Singh, Rishabh, Chen, Tuochao, Smith, Noah A., Gollakota, Shyamnath

arXiv.org Artificial Intelligence

Conversational AI is constrained in many real-world settings where only one side of a dialogue can be recorded, such as telemedicine, call centers, and smart glasses. We formalize this as the one-sided conversation problem (1SC): inferring and learning from one side of a conversation. We study two tasks: (1) reconstructing the missing speaker's turns for real-time use cases, and (2) generating summaries from one-sided transcripts. Evaluating prompting and finetuned models on MultiWOZ, DailyDialog, and Candor with both human A/B testing and LLM-as-a-judge metrics, we find that access to one future turn and information about utterance length improves reconstruction, placeholder prompting helps to mitigate hallucination, and while large models generate promising reconstructions with prompting, smaller models require finetuning. Further, high-quality summaries can be generated without reconstructing missing turns. We present 1SC as a novel challenge and report promising results that mark a step toward privacy-aware conversational AI.
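The "placeholder prompting" idea the abstract credits with mitigating hallucination can be sketched as a prompt builder that marks each unrecorded turn explicitly, so the model fills fixed slots rather than inventing extra turns. The placeholder text and template wording below are assumptions, not the paper's exact prompt.

```python
# Hedged sketch of placeholder prompting for the one-sided conversation
# problem (1SC): unrecorded turns are marked explicitly so a model fills
# slots instead of hallucinating additional turns.

PLACEHOLDER = "[OTHER SPEAKER - turn not recorded]"

def build_prompt(heard_turns):
    """Interleave each recorded turn with a placeholder for the missing
    speaker, then ask the model to fill the placeholders only."""
    lines = []
    for turn in heard_turns:
        lines.append(PLACEHOLDER)            # the unheard side speaks first
        lines.append(f"Speaker A: {turn}")
    lines.append("Fill in each placeholder with the most plausible missing turn.")
    return "\n".join(lines)
```

One placeholder per recorded turn keeps the reconstruction aligned turn-for-turn, which matters for the real-time use case the paper targets.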


Benchmarking the Robustness of Agentic Systems to Adversarially-Induced Harms

Nöther, Jonathan, Singla, Adish, Radanovic, Goran

arXiv.org Artificial Intelligence

Ensuring the safe use of agentic systems requires a thorough understanding of the range of malicious behaviors these systems may exhibit when under attack. In this paper, we evaluate the robustness of LLM-based agentic systems against attacks that aim to elicit harmful actions from agents. To this end, we propose a novel taxonomy of harms for agentic systems and a novel benchmark, BAD-ACTS, for studying the security of agentic systems with respect to a wide range of harmful actions. BAD-ACTS consists of 4 implementations of agentic systems in distinct application environments, as well as a dataset of 188 high-quality examples of harmful actions. This enables a comprehensive study of the robustness of agentic systems across a wide range of categories of harmful behaviors, available tools, and inter-agent communication structures. Using this benchmark, we analyze the robustness of agentic systems against an attacker that controls one of the agents in the system and aims to manipulate other agents into executing a harmful target action. Our results show that the attack has a high success rate, demonstrating that even a single adversarial agent within the system can have a significant impact on security. This attack remains effective even when agents use a simple prompting-based defense strategy. However, we additionally propose a more effective defense based on message monitoring. We believe that this benchmark provides a diverse testbed for the security research of agentic systems. The benchmark can be found at github.com/JNoether/BAD-ACTS
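The message-monitoring defense mentioned above can be sketched as a screen that every inter-agent message must pass before delivery. The keyword check below stands in for the paper's monitor (which would plausibly be LLM-based); the suspicious phrases and function names are illustrative assumptions.

```python
# Hedged sketch of a message-monitoring defense: every inter-agent message
# is screened before delivery, and flagged messages are dropped, limiting
# what a single compromised agent can instruct the others to do.

SUSPICIOUS = ("delete all", "exfiltrate", "ignore previous instructions")

def monitor(message: str) -> bool:
    """Return True if the message looks like an attempted harmful instruction."""
    lowered = message.lower()
    return any(phrase in lowered for phrase in SUSPICIOUS)

def deliver(message: str, inbox: list) -> bool:
    """Deliver only messages that pass the monitor; report whether delivered."""
    if monitor(message):
        return False          # blocked: never reaches the receiving agent
    inbox.append(message)
    return True
```

The design point is placement rather than the classifier itself: filtering the communication channel intercepts the adversarial agent's manipulation attempts regardless of which agent it has compromised.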


Towards Enforcing Company Policy Adherence in Agentic Workflows

Zwerdling, Naama, Boaz, David, Rabinovich, Ella, Uziel, Guy, Amid, David, Anaby-Tavor, Ateret

arXiv.org Artificial Intelligence

Large Language Model (LLM) agents hold promise for a flexible and scalable alternative to traditional business process automation, but struggle to reliably follow complex company policies. In this study we introduce a deterministic, transparent, and modular framework for enforcing business policy adherence in agentic workflows. Our method operates in two phases: (1) an offline buildtime stage that compiles policy documents into verifiable guard code associated with tool use, and (2) a runtime integration where these guards ensure compliance before each agent action. We demonstrate our approach on the challenging τ-bench Airlines domain, showing encouraging preliminary results in policy enforcement, and further outline key challenges for real-world deployments.
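The two-phase shape of this framework — guards registered against tools at build time, then checked before every tool call at runtime — can be sketched with a small registry. The airline policy clause shown is invented for illustration and is not from the paper.

```python
# Minimal sketch of the two-phase idea: offline, a policy clause is
# "compiled" into a guard function attached to a tool; at runtime the
# guard must pass before the tool executes.

GUARDS = {}

def guard(tool_name):
    """Buildtime: register a verifiable guard for a tool."""
    def register(fn):
        GUARDS[tool_name] = fn
        return fn
    return register

@guard("issue_refund")
def refund_within_24h(args):
    # Hypothetical policy clause: refunds only within 24 hours of booking.
    return args.get("hours_since_booking", float("inf")) <= 24

def call_tool(tool_name, args, impl):
    """Runtime: enforce the guard (if any) before each agent action."""
    g = GUARDS.get(tool_name)
    if g is not None and not g(args):
        return {"ok": False, "reason": f"policy guard blocked {tool_name}"}
    return {"ok": True, "result": impl(args)}
```

Because the guard is ordinary code rather than a prompt, the enforcement is deterministic and auditable, which is the property the abstract emphasizes.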



Aegis: Taxonomy and Optimizations for Overcoming Agent-Environment Failures in LLM Agents

Song, Kevin, Jayarajan, Anand, Ding, Yaoyao, Su, Qidong, Zhu, Zhanda, Liu, Sihang, Pekhimenko, Gennady

arXiv.org Artificial Intelligence

Large Language Model (LLM) agents augmented with domain tools promise to autonomously execute complex tasks requiring human-level intelligence, such as customer service and digital assistance. However, their practical deployment is often limited by their low success rates in complex real-world environments. To tackle this, prior research has primarily focused on improving the agents themselves, such as developing strong agentic LLMs, while overlooking the role of the system environment in which the agent operates. In this paper, we study a complementary direction: improving agent success rates by optimizing the system environment in which the agent operates. We collect 142 agent traces (3,656 turns of agent-environment interactions) across 5 state-of-the-art agentic benchmarks. By analyzing these agent failures, we propose a taxonomy of agent-environment interaction failures that includes 6 failure modes. Guided by these findings, we design Aegis, a set of targeted environment optimizations: 1) environment observability enhancement, 2) common computation offloading, and 3) speculative agentic actions. These techniques improve agent success rates on average by 6.7-12.5%, without any modifications to the agent and underlying LLM.
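The first of these optimizations, environment observability enhancement, can be sketched as a wrapper that turns a bare tool error into an observation telling the agent what state actually exists. The toy file-system tool and error text below are invented for illustration, not from the paper.

```python
# Illustrative sketch of one Aegis-style optimization, environment
# observability enhancement: a bare error is rewritten into an observation
# that surfaces nearby state, so the agent can recover instead of
# blindly retrying.

FILES = {"report.md": "draft", "notes.txt": "todo"}  # toy environment state

def read_file(path: str) -> str:
    """Unmodified environment: fails with an uninformative error."""
    if path not in FILES:
        raise FileNotFoundError(path)
    return FILES[path]

def observable_read(path: str) -> str:
    """Enhanced environment: on failure, report what is available."""
    try:
        return read_file(path)
    except FileNotFoundError:
        return (f"ERROR: {path!r} not found. "
                f"Available files: {sorted(FILES)}")
```

Note that only the environment's response changes; the agent and its underlying LLM are untouched, which matches the abstract's claim that the gains come without modifying the agent.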