$τ^2$-Bench: Evaluating Conversational Agents in a Dual-Control Environment
Barres, Victor, Dong, Honghua, Ray, Soham, Si, Xujie, Narasimhan, Karthik
arXiv.org Artificial Intelligence
Existing benchmarks for conversational AI agents simulate single-control environments, where only the AI agent can use tools to interact with the world, while the user remains a passive information provider. This differs from real-world scenarios like technical support, where users need to actively participate in modifying the state of the shared world. To address this gap, we introduce $τ^2$-bench, with four key contributions: 1) a novel Telecom dual-control domain modeled as a Dec-POMDP, where both agent and user use tools to act in a shared, dynamic environment that tests both agent coordination and communication; 2) a compositional task generator that programmatically creates diverse, verifiable tasks from atomic components, ensuring domain coverage and controlled complexity; 3) a reliable user simulator tightly coupled with the environment, whose behavior is constrained by tools and observable states, improving simulation fidelity; and 4) fine-grained analysis of agent performance through multiple ablations, including separating errors arising from reasoning from those arising from communication/coordination. In particular, our experiments show significant performance drops when agents shift from no-user to dual-control, highlighting the challenges of guiding users. Overall, $τ^2$-bench provides a controlled testbed for agents that must both reason effectively and guide user actions.
Jun-10-2025
- Country:
  - Asia > Middle East
    - Jordan (0.04)
  - Europe > United Kingdom
    - England > Cambridgeshire > Cambridge (0.04)
  - North America
    - Canada > Ontario
      - Toronto (0.14)
    - United States > California
      - Los Angeles County > Los Angeles (0.04)
      - San Diego County > San Diego (0.04)
- Genre:
  - Research Report (1.00)
  - Workflow (1.00)
- Industry:
  - Banking & Finance (0.93)
  - Consumer Products & Services > Travel (0.67)
  - Transportation
- Technology: