STARQA: A Question Answering Dataset for Complex Analytical Reasoning over Structured Databases

Maddela, Mounica, Xie, Lingjue, Preotiuc-Pietro, Daniel, Mausam

arXiv.org Artificial Intelligence

Semantic parsing methods for converting text to SQL queries enable question answering over structured data and can greatly benefit analysts who routinely perform complex analytics on vast data stored in specialized relational databases. Although several benchmarks measure text-to-SQL abilities, the complexity of their questions is inherently limited by the expressiveness of query languages, and none focuses explicitly on questions involving complex analytical reasoning that require operations such as calculations over aggregate analytics, time series analysis, or scenario understanding. In this paper, we introduce STARQA, the first public human-created dataset of complex analytical reasoning questions and answers on three specialized-domain databases. In addition to generating SQL directly using LLMs, we evaluate a novel approach (Text2SQLCode) that decomposes the task into a combination of SQL and Python: SQL is responsible for data fetching, and Python more naturally performs reasoning. Our results demonstrate that identifying and combining the abilities of SQL and Python is beneficial compared to using SQL alone, yet the dataset still remains quite challenging for existing state-of-the-art LLMs.
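The SQL-fetches, Python-reasons split described above can be illustrated with a minimal sketch. The table, columns, and question below are invented for illustration and are not drawn from STARQA; the point is only the division of labor between the two languages.

```python
import sqlite3

# Toy sales table standing in for a specialized relational database
# (schema and data are illustrative, not from STARQA).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, year INTEGER, revenue REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("east", 2022, 100.0), ("east", 2023, 130.0),
     ("west", 2022, 200.0), ("west", 2023, 190.0)],
)

# Step 1 (SQL): fetch only the per-region yearly aggregates.
rows = conn.execute(
    "SELECT region, year, SUM(revenue) FROM sales GROUP BY region, year"
).fetchall()

# Step 2 (Python): reason over the aggregates -- here, year-over-year
# growth per region, a calculation *over* aggregate analytics that is
# awkward to express in a single SQL query.
totals = {(region, year): value for region, year, value in rows}
growth = {
    region: (totals[(region, 2023)] - totals[(region, 2022)])
            / totals[(region, 2022)]
    for region in {region for region, _, _ in rows}
}
```

A generated program in this style keeps the database round-trip small (aggregates only) while leaving the arithmetic to ordinary Python.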


MemOrb: A Plug-and-Play Verbal-Reinforcement Memory Layer for E-Commerce Customer Service

Huang, Yizhe, Liu, Yang, Zhao, Ruiyu, Zhong, Xiaolong, Yue, Xingming, Jiang, Ling

arXiv.org Artificial Intelligence

Large Language Model-based agents (LLM-based agents) are increasingly deployed in customer service, yet they often forget across sessions, repeat errors, and lack mechanisms for continual self-improvement. This makes them unreliable in dynamic settings where stability and consistency are critical. To address the limitations of existing approaches, we propose MemOrb, a lightweight and plug-and-play verbal reinforcement memory layer that distills multi-turn interactions into compact strategy reflections. These reflections are stored in a shared memory bank and retrieved to guide decision-making, without requiring any fine-tuning. Experiments show that MemOrb significantly improves both success rate and stability, achieving up to a 63 percentage-point gain in multi-turn success rate and delivering more consistent performance across repeated trials. Our results demonstrate that structured reflection is a powerful mechanism for enhancing the long-term reliability of frozen LLM agents in customer service scenarios.

Large Language Model-based agents (LLM-based agents) are increasingly adopted in large-scale customer service systems, where they act as interactive assistants for diverse users (Brown et al., 2020). Despite their rapid deployment, these agents face persistent challenges: they often lose critical information across sessions, repeat errors without systematic correction, and struggle to adapt to rapidly changing product catalogs. Such limitations undermine their reliability in dynamic environments such as e-commerce. Existing memory solutions typically rely on short-term caching or user-specific profiles (Chhikara et al., 2025; Zhong et al., 2023). Consequently, purely per-user or short-horizon memories are insufficient for robust long-term improvement.
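The distill-store-retrieve cycle described in the abstract can be sketched as follows. This is a minimal illustration, not the MemOrb implementation: the real system distills reflections with an LLM, which is stubbed out here, and the class and field names are invented.

```python
from collections import defaultdict

class ReflectionBank:
    """Shared memory bank of strategy reflections (illustrative sketch;
    the actual MemOrb distillation step uses an LLM, stubbed out here)."""

    def __init__(self):
        self.bank = defaultdict(list)  # topic -> list of reflection strings

    def distill(self, topic, transcript):
        # Stub for LLM-based distillation: compress a multi-turn
        # interaction into one compact strategy reflection.
        lesson = f"When handling '{topic}': {transcript['lesson']}"
        self.bank[topic].append(lesson)

    def retrieve(self, topic, k=3):
        # Retrieved reflections are injected into the frozen agent's
        # prompt at decision time -- no fine-tuning involved.
        return self.bank[topic][-k:]

memory = ReflectionBank()
memory.distill("refund", {"lesson": "verify order status before promising a refund"})
guidance = memory.retrieve("refund")
```

Because the bank is shared across users and sessions, a lesson learned in one conversation can prevent a repeated error in another, which is the mechanism the paper credits for the stability gains.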


MUA-RL: Multi-turn User-interacting Agent Reinforcement Learning for agentic tool use

Zhao, Weikang, Wang, Xili, Ma, Chengdi, Kong, Lingbin, Yang, Zhaohua, Tuo, Mingxiang, Shi, Xiaowei, Zhai, Yitao, Cai, Xunliang

arXiv.org Artificial Intelligence

With the recent rapid advancement of Agentic Intelligence, agentic tool use in LLMs has become increasingly important. During multi-turn interactions between agents and users, the dynamic, uncertain, and stochastic nature of user demands poses significant challenges to the agent's tool invocation capabilities. Agents are no longer expected to simply call tools to deliver a result; rather, they must iteratively refine their understanding of user needs through communication while simultaneously invoking tools to resolve user queries. Existing reinforcement learning (RL) approaches for tool use lack the integration of genuinely dynamic users during the RL training process. To bridge this gap, we introduce MUA-RL (Multi-turn User-interacting Agent Reinforcement Learning for agentic tool use), a novel reinforcement learning framework that, for the first time in the field of agentic tool use, integrates LLM-simulated users into the reinforcement learning loop. MUA-RL aims to enable autonomous learning of models to communicate with users efficiently and use various tools to solve practical problems in dynamic multi-turn interactions. Evaluations are done on several multi-turn tool-using benchmarks (see Figure 1). Specifically, MUA-RL-32B achieves 67.3 on TAU2 Retail, 45.4 on TAU2 Airline, 28.3 on TAU2 Telecom, 28.4 on BFCL-V3 Multi Turn, and 82.5 on ACEBench Agent -- outperforming or matching the performance of larger open-source models such as DeepSeek-V3-0324 and Qwen3-235B-A22B in non-thinking settings.


How I'd set up a Roku for a 90-year-old

PCWorld

A couple weeks ago, a reader asked me about the best streaming TV setup for a 90-year-old neighbor who is not tech-savvy. My mind immediately jumped to Roku, whose smart TVs and streaming players have always emphasized simplicity. But I also know that Roku's streaming platform has become more complicated in recent years, and its once-basic menu system is not what it used to be. While I'd still recommend Roku to someone who's on the lower end of the tech learning curve, our neighbor in this scenario would benefit from some out-of-the-box settings tweaks. Whether you're setting up a Roku for yourself or someone else, here's how to make the streamer as easy to use as possible: Roku is now requiring new users to put a payment method on file during setup.


$\tau^2$-Bench: Evaluating Conversational Agents in a Dual-Control Environment

Barres, Victor, Dong, Honghua, Ray, Soham, Si, Xujie, Narasimhan, Karthik

arXiv.org Artificial Intelligence

Existing benchmarks for conversational AI agents simulate single-control environments, where only the AI agent can use tools to interact with the world, while the user remains a passive information provider. This differs from real-world scenarios like technical support, where users need to actively participate in modifying the state of the (shared) world. In order to address this gap, we introduce $\tau^2$-bench, with four key contributions: 1) A novel Telecom dual-control domain modeled as a Dec-POMDP, where both agent and user make use of tools to act in a shared, dynamic environment that tests both agent coordination and communication, 2) A compositional task generator that programmatically creates diverse, verifiable tasks from atomic components, ensuring domain coverage and controlled complexity, 3) A reliable user simulator tightly coupled with the environment, whose behavior is constrained by tools and observable states, improving simulation fidelity, 4) Fine-grained analysis of agent performance through multiple ablations including separating errors arising from reasoning vs communication/coordination. In particular, our experiments show significant performance drops when agents shift from no-user to dual-control, highlighting the challenges of guiding users. Overall, $\tau^2$-bench provides a controlled testbed for agents that must both reason effectively and guide user actions.


The Art of Tool Interface Design

Wu, Yunnan, Chen, Paul, Baranwal, Deshank, Zhou, Jinlong, Yuan, Jian

arXiv.org Artificial Intelligence

We present an agentic framework, Thinker, which achieves state-of-the-art performance on challenging reasoning tasks for realistic customer service scenarios that involve complex business logic and human interactions over long horizons. On the $\tau$-bench retail dataset, Thinker achieves an 82.6\% success rate with GPT-4o (version 2024-06-01) (baseline: 68.3\%), and an 81.9\% success rate with Llama-3.1 405B (baseline: 49.6\%), without any fine-tuning. Thinker effectively closes the gap in reasoning capabilities between the base models by introducing proper structure. The key features of the Thinker framework are: (1) State-Machine Augmented Generation (SMAG), which represents business logic as state machines that the LLM uses as tools. (2) Delegation of tasks from the main reasoning loop to LLM-powered tools. (3) Adaptive context management. Our prompting-only solution achieves significant gains, while still maintaining a standard agentic architecture with a ReAct-style reasoning loop. The key is to innovate on the tool interface design, as exemplified by SMAG and the LLM-powered tools.
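The SMAG idea of exposing business logic to the LLM as a state machine can be sketched in a few lines. The states, actions, and policy below are invented for illustration and are not taken from the Thinker paper; the point is that the agent queries the machine for legal transitions instead of reasoning about policy text.

```python
class ReturnPolicyMachine:
    """Business logic as a state machine exposed as a tool (sketch;
    states and transitions are hypothetical, not from Thinker)."""

    TRANSITIONS = {
        ("placed", "cancel"): "cancelled",
        ("placed", "ship"): "shipped",
        ("shipped", "return"): "return_requested",
        ("return_requested", "approve"): "refunded",
    }

    def __init__(self, state="placed"):
        self.state = state

    def allowed_actions(self):
        # Tool call the agent makes before proposing an action, so it
        # never suggests a transition the business policy forbids.
        return [action for (state, action) in self.TRANSITIONS
                if state == self.state]

    def apply(self, action):
        key = (self.state, action)
        if key not in self.TRANSITIONS:
            raise ValueError(f"'{action}' not allowed in state '{self.state}'")
        self.state = self.TRANSITIONS[key]
        return self.state

order = ReturnPolicyMachine()
order.apply("ship")
order.apply("return")
```

Encoding the policy this way moves rule-following from the model's prompt into a deterministic tool, which is one plausible reading of why the framework narrows the gap between base models.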


$\tau$-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains

Yao, Shunyu, Shinn, Noah, Razavi, Pedram, Narasimhan, Karthik

arXiv.org Artificial Intelligence

Existing benchmarks do not test language agents on their interaction with human users or ability to follow domain-specific rules, both of which are vital for deploying them in real world applications. We propose $\tau$-bench, a benchmark emulating dynamic conversations between a user (simulated by language models) and a language agent provided with domain-specific API tools and policy guidelines. We employ an efficient and faithful evaluation process that compares the database state at the end of a conversation with the annotated goal state. We also propose a new metric (pass^k) to evaluate the reliability of agent behavior over multiple trials. Our experiments show that even state-of-the-art function calling agents (like gpt-4o) succeed on <50% of the tasks, and are quite inconsistent (pass^8 <25% in retail). Our findings point to the need for methods that can improve the ability of agents to act consistently and follow rules reliably.
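The pass^k metric mentioned above asks how likely an agent is to succeed on all of k independent attempts at the same task, not just once. A standard unbiased estimator from n observed trials with c successes is C(c, k)/C(n, k), analogous to the pass@k estimator; the sketch below follows that form, hedged as my reading of the metric rather than the paper's exact code.

```python
from math import comb

def pass_hat_k(n, c, k):
    """Estimate pass^k -- the probability that k i.i.d. trials of the
    same task all succeed -- from n trials with c observed successes."""
    if c < k:
        return 0.0
    return comb(c, k) / comb(n, k)

# A task solved in 6 of 8 trials looks fine under a single-attempt
# metric but much weaker once consistency over several trials matters.
p1 = pass_hat_k(8, 6, 1)  # single-trial success estimate
p4 = pass_hat_k(8, 6, 4)  # all-of-4 consistency estimate
```

This is why the abstract can report pass^8 below 25% for agents whose single-attempt success rate is near 50%: requiring every trial to succeed compounds even moderate inconsistency.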


How To Create Your Own Auto-GPT AI Agent

#artificialintelligence

To get good output from ChatGPT or another LLM, you usually have to feed it several prompts. But what if you could just give your AI bot a set of fairly broad goals at the start of a session and then sit back while it generates its own set of tasks to fulfill those goals? That's the idea behind Auto-GPT, a new open-source tool that uses the OpenAI API (same LLM as ChatGPT) to prompt itself, based on your initial input. We've already seen a number of Twitter users talk about how they are using Auto-GPT for everything from creating marketing plans to analyzing market data for investments to preparing topics for a podcast. Based on our hands-on experience, we can't say that it always works well (we asked it to write a Windows 11 how-to and the result was awful), but it's early days and some tasks may work better than others.
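The goal-to-tasks loop described above can be sketched without any real API calls. The `fake_llm` stub below stands in for the OpenAI API request Auto-GPT would make; the prompts and canned responses are invented so that only the loop structure is shown.

```python
def fake_llm(prompt):
    """Stub standing in for an LLM API call; returns canned responses
    so the self-prompting loop structure is visible and testable."""
    if "break the goal" in prompt:
        return ["research competitors", "draft outline", "write summary"]
    return [f"done: {prompt}"]

def auto_agent(goal, max_steps=5):
    # Seed the task queue by asking the model to decompose the goal,
    # then work through the queue, feeding each task back as a prompt.
    # A real Auto-GPT-style loop would also let results enqueue new tasks.
    tasks = fake_llm(f"break the goal into tasks: {goal}")
    log = []
    while tasks and len(log) < max_steps:
        task = tasks.pop(0)
        log.append(fake_llm(task)[0])
    return log

results = auto_agent("create a marketing plan")
```

The `max_steps` cap matters in practice: since the model generates its own follow-up work, an unbounded loop can burn API quota on tasks of dubious value, which matches the article's mixed hands-on results.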


The "Smart Payments" Roadmap: Using AI to Transform Payments Processes

#artificialintelligence

Smart Payments are the future of financial transactions, and with the help of AI, businesses can easily create these efficient and cost-effective payments, revolutionizing the way we pay for goods and services. The term "smart payment" refers to any payment system that uses advanced technology to make transactions more efficient, secure, and convenient. These systems use a variety of technologies, including mobile devices, biometrics, and APIs, to facilitate payments in a way that is faster, more secure, and more convenient than traditional payment methods. If you're unsure of where to begin, here are seven steps to create Smart Payments using AI. The first step in creating Smart Payments is to identify the business need. This includes understanding the current payment process and identifying any pain points or inefficiencies.


"Unlocking the Potential of Machine Translation Through Dataset Training, Validation, and…

#artificialintelligence

The coronavirus pandemic has changed the way we live, work, and interact with each other. We've all had to make adjustments to the way we do things, including the way we shop. We're now seeing a shift towards contactless and digital payments, which has made it easier for us to stay safe and healthy while still being able to purchase the items we need. Contactless payments have become increasingly popular during the pandemic and offer a range of benefits. Not only are they faster, more convenient, and more secure than traditional payment methods, but they also provide an extra layer of protection from the virus.