reservation
DoorDash Reservations Scored America's Most Exclusive Restaurants
After the rise (and fall) of reservation scalping, DoorDash and a host of apps are fighting to book you a seat at the country's most exclusive restaurants. At The Eighty-Six in Manhattan, exclusivity is the point. The luxe, 11-table steakhouse is the sort of place that lavishes caviar and aged mimolette cheese on its potatoes, and crows that your market-price duck was raised by one Dr. Taylor Swift has reportedly dined there in a Miu Miu skirt. Reservations are a scarce commodity that the restaurant, and New York law forbids you from selling one. "Access is the main asset," wrote food writer Helen Rosner in a recent New Yorker review of The Eighty-Six. "The product is the door, and what a door!
- North America > United States > Pennsylvania (0.05)
- North America > United States > New York > New York County > New York City (0.05)
- North America > United States > Nevada > Clark County > Las Vegas (0.05)
- (9 more...)
- Information Technology > Services (0.96)
- Consumer Products & Services > Restaurants (0.89)
- Consumer Products & Services > Food, Beverage, Tobacco & Cannabis (0.71)
- Banking & Finance > Trading (0.68)
- Information Technology > Communications > Mobile (0.96)
- Information Technology > Artificial Intelligence (0.72)
- Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.05)
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
- North America > Canada (0.04)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.72)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.46)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.46)
Why A.I. Didn't Transform Our Lives in 2025
This was supposed to be the year when autonomous agents took over everyday tasks. One year ago, Sam Altman, the C.E.O. of OpenAI, made a bold prediction: "We believe that, in 2025, we may see the first AI agents'join the workforce' and materially change the output of companies." A couple of weeks later, the company's chief product officer, Kevin Weil, said at the World Economic Forum conference at Davos in January, "I think 2025 is the year that we go from ChatGPT being this super smart thing . . . to ChatGPT doing things in the real world for you." He gave examples of artificial intelligence filling out online forms and booking restaurant reservations. He later promised, "We're going to be able to do that, no question."
- North America > United States > California (0.15)
- North America > United States > New York (0.04)
- North America > Mexico (0.04)
- Atlantic Ocean > Gulf of Mexico (0.04)
- Leisure & Entertainment (0.95)
- Consumer Products & Services (0.95)
- Banking & Finance > Economy (0.34)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (0.90)
SABER: Small Actions, Big Errors -- Safeguarding Mutating Steps in LLM Agents
Cuadron, Alejandro, Yu, Pengfei, Liu, Yang, Gupta, Arpit
Despite rapid progress in LLM agents, performance on long-horizon, tool-using tasks remains fragile. To better understand this fragility, we ask a simple question: \emph{do all actions contribute equally to failure?} Analyzing execution traces on $τ$-Bench (Airline/Retail) and SWE-Bench Verified, we decompose trajectories into \emph{mutating} (environment-changing) vs.\ non-mutating steps and formalize \emph{decisive deviations}, earliest action, level divergences that flip success to failure. A logistic regression reveals that each additional deviation in a mutating action reduces the odds of success by upto $92\%$ on Airline and upto $96\%$ on Retail for SoTA models. In contrast, deviations in non-mutating actions have little to no effect. Errors also grow with context length as agents drift from role and act on stale constraints. Motivated by these observations, we introduce \cm{}, a model-agnostic, gradient-free, test-time safeguard that (i) adds mutation-gated verification, (ii) injects \emph{Targeted Reflection} before mutating steps, and (iii) performs block-based context cleaning. \cm{} delivers consistent gains, e.g., Qwen3-Thinking: +28\% \emph{relative} on Airline, +11\% on Retail, and +7\% on SWE-Bench Verified; Claude: +9\%/+7\%. We further identify ceiling effects in $τ$-Bench, where annotation errors and underspecified tasks artificially cap model performance. To address this, we release $τ$-Bench Verified, which restores benchmark headroom through targeted revisions. Our results argue for action-level analysis, targeted safeguards, and reliable evaluations as prerequisites for robust multi-turn agents.
- Europe > Slovenia > Drava > Municipality of Benedikt > Benedikt (0.04)
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.04)
- Research Report > New Finding (0.86)
- Research Report > Experimental Study (0.66)
- Transportation > Passenger (0.70)
- Transportation > Air (0.68)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (0.94)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.48)
Non-Collaborative User Simulators for Tool Agents
Shim, Jeonghoon, Song, Woojung, Jin, Cheyon, KooK, Seungwon, Jo, Yohan
Tool agents interact with users through multi-turn dialogues to accomplish various tasks. Recent studies have adopted user simulation methods to develop these agents in multi-turn settings. However, existing user simulators tend to be agent-friendly, exhibiting only cooperative behaviors, which fails to train and test agents against non-collaborative users in the real world. To address this, we propose a novel user simulator architecture that simulates four categories of non-collaborative behaviors: requesting unavailable services, digressing into tangential conversations, expressing impatience, and providing incomplete utterances. Our user simulator can simulate challenging and natural non-collaborative behaviors while reliably delivering all intents and information necessary to accomplish the task. Our experiments on MultiWOZ and $τ$-bench reveal significant performance degradation in state-of-the-art tool agents when encountering non-collaborative users. We provide detailed analyses of agents' weaknesses under each non-collaborative condition, such as escalated hallucinations and dialogue breakdowns. Ultimately, we contribute an easily extensible user simulation framework to help the research community develop tool agents and preemptively diagnose them under challenging real-world conditions within their own services.
- Asia > South Korea > Seoul > Seoul (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Europe > Middle East > Malta (0.04)
- Asia > India (0.04)
- Leisure & Entertainment > Sports (1.00)
- Health & Medicine (1.00)
- Consumer Products & Services > Travel (1.00)
- (3 more...)
Reading Between the Lines: The One-Sided Conversation Problem
Ebert, Victoria, Singh, Rishabh, Chen, Tuochao, Smith, Noah A., Gollakota, Shyamnath
Conversational AI is constrained in many real-world settings where only one side of a dialogue can be recorded, such as telemedicine, call centers, and smart glasses. We formalize this as the one-sided conversation problem (1SC): inferring and learning from one side of a conversation. We study two tasks: (1) reconstructing the missing speaker's turns for real-time use cases, and (2) generating summaries from one-sided transcripts. Evaluating prompting and finetuned models on MultiWOZ, DailyDialog, and Candor with both human A/B testing and LLM-as-a-judge metrics, we find that access to one future turn and information about utterance length improves reconstruction, placeholder prompting helps to mitigate hallucination, and while large models generate promising reconstructions with prompting, smaller models require finetuning. Further, high-quality summaries can be generated without reconstructing missing turns. We present 1SC as a novel challenge and report promising results that mark a step toward privacy-aware conversational AI.
- Europe > Austria > Vienna (0.14)
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- North America > United States > Virginia (0.04)
- (24 more...)
- Personal > Interview (0.67)
- Research Report > New Finding (0.45)
- Law (1.00)
- Information Technology > Security & Privacy (1.00)
- Government (1.00)
- (7 more...)
Benchmarking the Robustness of Agentic Systems to Adversarially-Induced Harms
Nöther, Jonathan, Singla, Adish, Radanovic, Goran
Ensuring the safe use of agentic systems requires a thorough understanding of the range of malicious behaviors these systems may exhibit when under attack. In this paper, we evaluate the robustness of LLM-based agentic systems against attacks that aim to elicit harmful actions from agents. To this end, we propose a novel taxonomy of harms for agentic systems and a novel benchmark, BAD-ACTS, for studying the security of agentic systems with respect to a wide range of harmful actions. BAD-ACTS consists of 4 implementations of agentic systems in distinct application environments, as well as a dataset of 188 high-quality examples of harmful actions. This enables a comprehensive study of the robustness of agentic systems across a wide range of categories of harmful behaviors, available tools, and inter-agent communication structures. Using this benchmark, we analyze the robustness of agentic systems against an attacker that controls one of the agents in the system and aims to manipulate other agents to execute a harmful target action. Our results show that the attack has a high success rate, demonstrating that even a single adversarial agent within the system can have a significant impact on the security. This attack remains effective even when agents use a simple prompting-based defense strategy. However, we additionally propose a more effective defense based on message monitoring. We believe that this benchmark provides a diverse testbed for the security research of agentic systems. The benchmark can be found at github.com/JNoether/BAD-ACTS
- Europe > Germany > Saarland > Saarbrücken (0.14)
- Europe > United Kingdom > England (0.04)
- Leisure & Entertainment > Games (1.00)
- Information Technology > Security & Privacy (1.00)
- Health & Medicine > Therapeutic Area (1.00)
- Banking & Finance > Trading (0.92)
- Information Technology > Security & Privacy (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
Towards Enforcing Company Policy Adherence in Agentic Workflows
Zwerdling, Naama, Boaz, David, Rabinovich, Ella, Uziel, Guy, Amid, David, Anaby-Tavor, Ateret
Large Language Model (LLM) agents hold promise for a flexible and scalable alternative to traditional business process automation, but struggle to reliably follow complex company policies. In this study we introduce a deterministic, transparent, and modular framework for enforcing business policy adherence in agentic workflows. Our method operates in two phases: (1) an offline buildtime stage that compiles policy documents into verifiable guard code associated with tool use, and (2) a runtime integration where these guards ensure compliance before each agent action. We demonstrate our approach on the challenging $τ$-bench Airlines domain, showing encouraging preliminary results in policy enforcement, and further outline key challenges for real-world deployments.
- Workflow (1.00)
- Research Report (1.00)
- Transportation > Passenger (0.47)
- Information Technology (0.47)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.95)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.95)
- Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.05)
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
- North America > Canada (0.04)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.72)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.46)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.46)
Aegis: Taxonomy and Optimizations for Overcoming Agent-Environment Failures in LLM Agents
Song, Kevin, Jayarajan, Anand, Ding, Yaoyao, Su, Qidong, Zhu, Zhanda, Liu, Sihang, Pekhimenko, Gennady
Large Language Models (LLMs) agents augmented with domain tools promise to autonomously execute complex tasks requiring human-level intelligence, such as customer service and digital assistance. However, their practical deployment is often limited by their low success rates under complex real-world environments. To tackle this, prior research has primarily focused on improving the agents themselves, such as developing strong agentic LLMs, while overlooking the role of the system environment in which the agent operates. In this paper, we study a complementary direction: improving agent success rates by optimizing the system environment in which the agent operates. We collect 142 agent traces (3,656 turns of agent-environment interactions) across 5 state-of-the-art agentic benchmarks. By analyzing these agent failures, we propose a taxonomy for agent-environment interaction failures that includes 6 failure modes. Guided by these findings, we design Aegis, a set of targeted environment optimizations: 1) environment observability enhancement, 2) common computation offloading, and 3) speculative agentic actions. These techniques improve agent success rates on average by 6.7-12.5%, without any modifications to the agent and underlying LLM.
- North America > Canada > Ontario > Toronto (0.88)
- North America > Canada > Ontario > Waterloo Region > Waterloo (0.04)
- Health & Medicine (0.68)
- Leisure & Entertainment (0.46)