Humanlike Multi-user Agent (HUMA): Designing a Deceptively Human AI Facilitator for Group Chats

Jacniacki, Mateusz, Serrat, Martí Carmona

arXiv.org Artificial Intelligence

Conversational agents built on large language models (LLMs) are becoming increasingly prevalent, yet most systems are designed for one-on-one, turn-based exchanges rather than natural, asynchronous group chats. As AI assistants become widespread across digital platforms, from virtual assistants to customer service, developing natural and humanlike interaction patterns seems crucial for maintaining user trust and engagement. We present the Humanlike Multi-user Agent (HUMA), an LLM-based facilitator that participates in multi-party conversations using human-like strategies and timing. HUMA extends prior multi-user chatbot work with an event-driven architecture that handles messages, replies, and reactions, and introduces realistic response-time simulation. HUMA comprises three components--Router, Action Agent, and Reflection--which together adapt LLMs to group conversation dynamics. We evaluate HUMA in a controlled study with 97 participants in four-person role-play chats, comparing AI and human community managers (CMs). Participants classified CMs as human at near-chance rates in both conditions, indicating they could not reliably distinguish HUMA agents from humans. Subjective experience was comparable across conditions: community-manager effectiveness, social presence, and engagement/satisfaction differed only modestly, with small effect sizes. Our results suggest that, in natural group chat settings, an AI facilitator can match human quality while remaining difficult to identify as nonhuman.
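The event-driven pipeline the abstract names can be sketched as follows. The component names (Router, Action Agent, Reflection) come from the abstract; their internals here, the mention/lull heuristics, and the typing-speed model are illustrative assumptions, not the authors' implementation (the real Action Agent would call an LLM).

```python
import random

def router(event, state):
    """Decide whether the facilitator should act on an incoming chat event."""
    if event["type"] == "reaction":
        return False                       # reactions rarely warrant a reply
    mentioned = "@cm" in event["text"]     # assumed mention convention
    lull = state["msgs_since_reply"] >= 3  # re-engage after a quiet spell
    return mentioned or lull

def action_agent(event, state):
    """Compose a reply (an LLM call in the real system; templated here)."""
    return f"Thanks {event['user']}, picking up on that point..."

def reflection(event, state, replied):
    """Update conversation state after each event."""
    state["msgs_since_reply"] = 0 if replied else state["msgs_since_reply"] + 1

def typing_delay(text, cps=15.0, jitter=0.3):
    """Simulate a human-like response time: typing speed plus noise."""
    base = len(text) / cps
    return base * (1.0 + random.uniform(-jitter, jitter))

def handle_event(event, state):
    replied = router(event, state)
    reply, delay = None, 0.0
    if replied:
        reply = action_agent(event, state)
        delay = typing_delay(reply)
    reflection(event, state, replied)
    return reply, delay
```

Separating the decision to respond (Router) from what to say (Action Agent) lets the agent stay silent for most events, which is itself a human-like behavior in group chats.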


Dynamic Safe Interruptibility for Decentralized Multi-Agent Reinforcement Learning

Neural Information Processing Systems

In reinforcement learning, agents learn by performing actions and observing their outcomes. Sometimes, it is desirable for a human operator to interrupt an agent in order to prevent dangerous situations from happening. Yet, as part of their learning process, agents may link these interruptions, which impact their reward, to specific states and deliberately avoid them. The situation is particularly challenging in a multi-agent context because agents might not only learn from their own past interruptions, but also from those of other agents. Orseau and Armstrong defined safe interruptibility for one learner, but their work does not naturally extend to multi-agent systems. This paper introduces dynamic safe interruptibility, an alternative definition more suited to decentralized learning problems, and studies this notion in two learning frameworks: joint action learners and independent learners. We give realistic sufficient conditions on the learning algorithm to enable dynamic safe interruptibility in the case of joint action learners, yet show that these conditions are not sufficient for independent learners. We show, however, that if agents can detect interruptions, it is possible to prune the observations to ensure dynamic safe interruptibility even for independent learners.
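The pruning idea for independent learners can be illustrated with a toy tabular Q-learning step: if an agent can detect that a transition was interrupted, it simply skips the update for that transition, so the interruption cannot bias what it learns. This is a minimal sketch under that reading of the abstract, not the paper's formal construction; the dictionary-based Q-table interface is an assumption.

```python
def q_update(Q, s, a, r, s_next, alpha=0.5, gamma=0.9, interrupted=False):
    """One tabular Q-learning step that drops interrupted observations."""
    if interrupted:
        return Q  # pruned: the interrupted experience is ignored entirely
    best_next = max(Q[s_next].values()) if Q.get(s_next) else 0.0
    Q.setdefault(s, {}).setdefault(a, 0.0)
    # standard Q-learning update on non-interrupted transitions only
    Q[s][a] += alpha * (r + gamma * best_next - Q[s][a])
    return Q
```

Because interrupted transitions never enter the update, states in which the operator intervenes accrue no artificial penalty, so the learner has no incentive to avoid them.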



D-GARA: A Dynamic Benchmarking Framework for GUI Agent Robustness in Real-World Anomalies

Chen, Sen, Zhao, Tong, Bin, Yi, Ma, Fei, Shao, Wenqi, Wang, Zheng

arXiv.org Artificial Intelligence

Developing intelligent agents capable of operating a wide range of Graphical User Interfaces (GUIs) with human-level proficiency is a key milestone on the path toward Artificial General Intelligence. However, most existing datasets and benchmarks for training and evaluating GUI agents are static and idealized, failing to reflect the complexity and unpredictability of real-world environments, particularly the presence of anomalies. To bridge this research gap, we propose D-GARA, a dynamic benchmarking framework for evaluating Android GUI agent robustness under real-world anomalies. D-GARA introduces a diverse set of real-world anomalies that GUI agents commonly face in practice, including interruptions such as permission dialogs, battery warnings, and update prompts. Based on the D-GARA framework, we construct and annotate a benchmark featuring commonly used Android applications with embedded anomalies to support broader community research. Comprehensive experiments demonstrate substantial performance degradation in state-of-the-art GUI agents when exposed to anomaly-rich environments, highlighting the need for robustness-aware learning. D-GARA is modular and extensible, supporting the seamless integration of new tasks, anomaly types, and interaction scenarios to meet specific evaluation goals.
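One way to picture the dynamic anomaly injection is as a wrapper around an environment's step function: with some probability an anomaly (permission dialog, battery warning, update prompt) appears before the agent's intended action, and the agent must dismiss it before its action takes effect. The environment interface, probability, and dismissal protocol below are assumptions for illustration, not D-GARA's actual API.

```python
import random

ANOMALIES = ["permission_dialog", "battery_warning", "update_prompt"]

class AnomalyInjector:
    def __init__(self, env_step, p=0.3, rng=None):
        self.env_step = env_step   # underlying GUI environment's step function
        self.p = p                 # chance an anomaly interrupts a step
        self.rng = rng or random.Random()
        self.pending = None        # anomaly currently blocking the screen

    def step(self, action):
        if self.pending:
            # an anomaly is on screen: only the matching dismiss clears it
            if action == ("dismiss", self.pending):
                self.pending = None
                return {"status": "anomaly_dismissed"}
            return {"status": "blocked_by_" + self.pending}
        if self.rng.random() < self.p:
            self.pending = self.rng.choice(ANOMALIES)
            return {"status": "anomaly_shown", "anomaly": self.pending}
        return self.env_step(action)
```

Keeping the injector separate from the environment is what makes the framework modular: new anomaly types are just new entries in the list, and any task environment can be wrapped unchanged.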


LLM Reinforcement in Context

Rivasseau, Thomas

arXiv.org Artificial Intelligence

LLM alignment techniques currently struggle to enforce desired characteristics and harmlessness of outputs over long conversational contexts and chains-of-thought. In this paper we present the scaling problem, a mathematical formulation of this difficulty, and propose interruptions as a means to achieve LLM alignment in scaling contexts. We call this reinforcement in context. The paper is structured as follows: Section 1 is this introduction and Section 2 presents the scaling problem. Section 3 describes interruptions as a means to solve the alignment scaling problem. Section 4 discusses consequences and limitations, and Section 5 highlights avenues for future research.



A Full-duplex Speech Dialogue Scheme Based On Large Language Model

Neural Information Processing Systems

We present a generative dialogue system capable of operating in a full-duplex manner, allowing for seamless interaction. It is based on a large language model (LLM) carefully aligned to be aware of a perception module, a motor function module, and the concept of a simple finite state machine (called neural FSM) with two states. The perception and motor function modules operate in tandem, allowing the system to speak and listen to the user simultaneously. The LLM generates textual tokens for inquiry responses and makes autonomous decisions to start responding to, wait for, or interrupt the user by emitting control tokens to the neural FSM. All these tasks of the LLM are carried out as next-token prediction on a serialized view of the dialogue in real time. In automatic quality evaluations simulating real-life interaction, the proposed system reduces the average conversation response latency more than threefold compared with LLM-based half-duplex dialogue systems, while responding within less than 500 milliseconds in more than 50% of evaluated interactions. Running an LLM with only 8 billion parameters, our system exhibits an 8% higher interruption precision rate than the best available commercial LLM for voice-based dialogue.
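The two-state neural FSM idea can be sketched as a tiny state machine driven by control tokens interleaved with text in the serialized dialogue stream. The token names (`<respond>`, `<wait>`, `<interrupt>`) and state labels here are placeholders, not the paper's actual vocabulary; the point is only that ordinary next-token prediction can double as a speak/listen controller.

```python
SPEAK, LISTEN = "SPEAK", "LISTEN"

# control tokens the LLM may emit, mapped to the FSM state they trigger
CONTROL = {"<respond>": SPEAK, "<wait>": LISTEN, "<interrupt>": SPEAK}

def run_fsm(tokens, state=LISTEN):
    """Consume a serialized token stream; collect text only while speaking."""
    spoken = []
    for tok in tokens:
        if tok in CONTROL:
            state = CONTROL[tok]      # control token: switch FSM state
        elif state == SPEAK:
            spoken.append(tok)        # text token: voiced only when speaking
    return state, " ".join(spoken)
```

Because the control tokens live in the same stream as the response text, no separate turn-taking module is needed; the model's own output decides when to speak, wait, or cut in.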


MTR-DuplexBench: Towards a Comprehensive Evaluation of Multi-Round Conversations for Full-Duplex Speech Language Models

Zhang, He, Cui, Wenqian, Xu, Haoning, Li, Xiaohui, Zhu, Lei, Ma, Shaohua, King, Irwin

arXiv.org Artificial Intelligence

Full-Duplex Speech Language Models (FD-SLMs) enable real-time, overlapping conversational interactions, offering a more dynamic user experience compared to traditional half-duplex models. However, existing benchmarks primarily focus on evaluating single-round interactions and conversational features, neglecting the complexities of multi-round communication and critical capabilities such as instruction following and safety. Evaluating FD-SLMs in multi-round settings poses significant challenges, including blurred turn boundaries in communication and context inconsistency during model inference. To address these gaps, we introduce MTR-DuplexBench, a novel benchmark that segments continuous full-duplex dialogues into discrete turns, enabling comprehensive, turn-by-turn evaluation of FD-SLMs across dialogue quality, conversational dynamics, instruction following, and safety. Experimental results reveal that current FD-SLMs face difficulties in maintaining consistent performance across multiple rounds and evaluation dimensions, highlighting the necessity and effectiveness of our proposed benchmark. The benchmark and code will be available in the future.
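The segmentation step described above, cutting a continuous full-duplex dialogue into discrete turns so each can be scored independently, can be sketched by grouping consecutive same-speaker utterances at speaker-change boundaries. The event format and the speaker-change heuristic are assumptions for illustration, not the benchmark's actual pipeline.

```python
def segment_turns(events):
    """Group consecutive same-speaker utterances into discrete turns.

    events: list of (speaker, text) pairs in temporal order.
    Returns a list of (speaker, joined_text) turns.
    """
    turns = []
    for speaker, text in events:
        if turns and turns[-1][0] == speaker:
            turns[-1][1].append(text)        # same speaker: extend the turn
        else:
            turns.append((speaker, [text]))  # speaker change: new turn
    return [(spk, " ".join(parts)) for spk, parts in turns]
```

Real full-duplex audio has overlapping speech, so boundaries are blurrier than this; that blurriness is exactly the evaluation challenge the abstract points out.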


VITA-E: Natural Embodied Interaction with Concurrent Seeing, Hearing, Speaking, and Acting

Liu, Xiaoyu, Fu, Chaoyou, Yan, Chi, Wu, Chu, Gao, Haihan, Zhang, Yi-Fan, Dong, Shaoqi, Qian, Cheng, Luo, Bin, Yang, Xiuyong, Li, Guanwu, Cai, Yusheng, Shen, Yunhang, Jiang, Deqiang, Cao, Haoyu, Sun, Xing, Shan, Caifeng, He, Ran

arXiv.org Artificial Intelligence

Current Vision-Language-Action (VLA) models are often constrained by a rigid, static interaction paradigm, which lacks the ability to see, hear, speak, and act concurrently as well as handle real-time user interruptions dynamically. This hinders seamless human-robot collaboration, resulting in an inflexible and unresponsive user experience. To address these limitations, we introduce VITA-E, a novel human-robot interaction framework designed for both behavioral concurrency and nearly real-time interruption. The core of our approach is a dual-model architecture where two parallel VLA instances operate as an "Active Model" and a "Standby Model", allowing the robot to observe its environment, listen to user speech, provide verbal responses, and execute actions, all concurrently and interruptibly, mimicking human-like multitasking capabilities. We further propose a "model-as-controller" paradigm, where we fine-tune the VLM to generate special tokens that serve as direct system-level commands, coupling the model's reasoning with the system's behavior. Experiments conducted on a physical humanoid robot demonstrate that VITA-E can reliably handle complex interactive scenarios. Our framework is compatible with various dual-system VLA models, achieving an extremely high success rate on emergency stops and speech interruptions while also successfully performing concurrent speech and action. This represents a significant step towards more natural and capable robotic assistants. Achieving this level of seamless multimodal coordination is the defining aspiration for our ideal general-purpose robot. However, the predominant focus of the field has been on improving the success rate of specific, static tasks, often overlooking a critical dimension of autonomy: the ability to engage in continuous, natural, and dynamic collaboration with a human user in complex scenarios (Abbo et al., 2025; Fong et al., 2003).
An ideal robotic assistant should not be a silent executor of commands but a collaborative partner, which encompasses maintaining continuous visual perception, processing auditory inputs, generating verbal responses, and executing physical actions in parallel (e.g., answering "Is the bookshelf tidied up?" while organizing a room) and dynamically adapting to new directives that reflect a changing environment (e.g., "Don't clean the bedroom yet--the baby is sleeping."). Such concurrent multitasking and dynamic response is fundamental to enabling natural human-robot collaboration. Please see our demo video at this YouTube link.
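The dual-model handover described above can be pictured with a toy class: an "Active Model" executes the current task while a "Standby Model" tracks context, and on a user interruption the roles swap so the robot responds without a cold start. The class interface and model labels are assumptions for illustration, not VITA-E's implementation.

```python
class DualVLA:
    """Toy Active/Standby pair: swap roles on user interruption."""

    def __init__(self):
        self.active, self.standby = "model_A", "model_B"
        self.log = []  # record of (model, behavior) for inspection

    def step(self, observation, user_speech=None):
        if user_speech is not None:
            # the standby model has been tracking context all along,
            # so promoting it handles the interruption with no handover lag
            self.active, self.standby = self.standby, self.active
            self.log.append((self.active, "respond:" + user_speech))
        else:
            self.log.append((self.active, "act:" + observation))
        return self.log[-1]
```

The design choice the sketch highlights is that responsiveness comes from redundancy: two always-warm instances trade places rather than one instance preempting itself mid-action.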


SHANKS: Simultaneous Hearing and Thinking for Spoken Language Models

Chiang, Cheng-Han, Wang, Xiaofei, Li, Linjie, Lin, Chung-Ching, Lin, Kevin, Liu, Shujie, Wang, Zhendong, Yang, Zhengyuan, Lee, Hung-yi, Wang, Lijuan

arXiv.org Artificial Intelligence

Current large language models (LLMs) and spoken language models (SLMs) begin thinking and taking actions only after the user has finished their turn. This prevents the model from interacting during the user's turn and can lead to high response latency while it waits to think. Consequently, thinking after receiving the full input is not suitable for speech-to-speech interaction, where real-time, low-latency exchange is important. We address this by noting that humans naturally "think while listening." In this paper, we propose SHANKS, a general inference framework that enables SLMs to generate unspoken chain-of-thought reasoning while listening to the user input. SHANKS streams the input speech in fixed-duration chunks and, as soon as a chunk is received, generates unspoken reasoning based on all previous speech and reasoning, while the user continues speaking. SHANKS uses this unspoken reasoning to decide whether to interrupt the user and to make tool calls to complete the task. We demonstrate that SHANKS enhances real-time user-SLM interaction in two scenarios: (1) when the user is presenting a step-by-step solution to a math problem, SHANKS can listen, reason, and interrupt when the user makes a mistake, achieving 37.1% higher interruption accuracy than a baseline that interrupts without thinking; and (2) in a tool-augmented dialogue, SHANKS can complete 56.9% of the tool calls before the user finishes their turn. Overall, SHANKS moves toward models that keep thinking throughout the conversation, not only after a turn ends. Animated illustrations of SHANKS can be found at https://d223302.github.io/SHANKS/
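The chunked think-while-listening loop can be sketched as follows. The `reason` function stands in for the SLM's unspoken chain-of-thought; checking that each heard step follows from the previous one is just an illustrative error detector for the math-tutoring scenario, not SHANKS's actual reasoning.

```python
def reason(history):
    """Unspoken check over everything heard so far (assumed logic):
    flag an interruption when the latest step doesn't follow."""
    if len(history) >= 2 and history[-1] != history[-2] + 1:
        return "interrupt"
    return "keep_listening"

def listen(chunks):
    """Stream fixed-duration chunks; after each one, reason over the
    full history and decide whether to cut in. Returns the chunk index
    and message on interruption, or (None, None) if the user finishes."""
    heard = []
    for i, chunk in enumerate(chunks):
        heard.append(chunk)
        if reason(heard) == "interrupt":
            return i, "I think there's a mistake at step %d" % (i + 1)
    return None, None
```

Because reasoning runs after every chunk rather than after the full turn, the decision to interrupt (or to fire a tool call) can land while the user is still speaking, which is the latency win the abstract reports.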