intent detection
Introducing Visual Scenes and Reasoning: A More Realistic Benchmark for Spoken Language Understanding
Wu, Di, Jiang, Liting, Fang, Ruiyu, Bianjing, null, Xie, Hongyan, Su, Haoxiang, Huang, Hao, He, Zhongjiang, Song, Shuangyong, Li, Xuelong
Spoken Language Understanding (SLU) consists of two sub-tasks: intent detection (ID) and slot filling (SF). Given its broad range of real-world applications, enhancing SLU for practical deployment is increasingly critical. Profile-based SLU addresses ambiguous user utterances by incorporating context awareness (CA), user profiles (UP), and knowledge graphs (KG) to support disambiguation, thereby advancing SLU research toward real-world applicability. However, existing SLU datasets still fall short in representing real-world scenarios. Specifically, (1) CA uses one-hot vectors for representation, which is overly idealized, and (2) models typically focus solely on predicting intents and slot labels, neglecting the reasoning process that could enhance performance and interpretability. To overcome these limitations, we introduce VRSLU, a novel SLU dataset that integrates both Visual images and explicit Reasoning. To replace the over-idealized CA, we use GPT-4o and FLUX.1-dev to generate images reflecting users' environments and statuses, followed by human verification to ensure quality. For reasoning, GPT-4o is employed to generate explanations for predicted labels, which are then refined by human annotators to ensure accuracy and coherence. Additionally, we propose an instructional template, LR-Instruct, which first predicts labels and then generates corresponding reasoning. This two-step approach helps mitigate the influence of reasoning bias on label prediction. Experimental results confirm the effectiveness of incorporating visual information and highlight the promise of explicit reasoning in advancing SLU.
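The label-then-reasoning ordering of LR-Instruct can be illustrated with a small prompt-building and response-parsing sketch. The field names ("Labels:", "Reasoning:") and answer format below are illustrative assumptions, not the paper's exact template.

```python
# Sketch of a labels-first, reasoning-second instruction template in the
# spirit of LR-Instruct. Field names and format are hypothetical.

def build_lr_prompt(utterance: str, intents: list) -> str:
    """Ask the model for labels first, then an explanation of those labels."""
    return (
        "Given the utterance and the candidate intents, first output the "
        "intent and slot labels, then explain your choice.\n"
        f"Utterance: {utterance}\n"
        f"Candidate intents: {', '.join(intents)}\n"
        "Answer format:\n"
        "Labels: <intent>; <slot=value, ...>\n"
        "Reasoning: <why>"
    )

def parse_lr_response(response: str):
    """Split a model response into its label line and its reasoning line."""
    labels, reasoning = "", ""
    for line in response.splitlines():
        if line.startswith("Labels:"):
            labels = line[len("Labels:"):].strip()
        elif line.startswith("Reasoning:"):
            reasoning = line[len("Reasoning:"):].strip()
    return labels, reasoning
```

Because labels are emitted before the explanation, a downstream parser can commit to the prediction even if the free-form reasoning drifts, which is the bias-mitigation point made above.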
ReactEMG: Stable, Low-Latency Intent Detection from sEMG via Masked Modeling
Wang, Runsheng, Zhu, Xinyue, Chen, Ava, Xu, Jingxi, Winterbottom, Lauren, Nilsen, Dawn M., Stein, Joel, Ciocarlie, Matei
Surface electromyography (sEMG) signals show promise for effective human-machine interfaces, particularly in rehabilitation and prosthetics. However, challenges remain in developing systems that respond quickly to user intent, produce stable flicker-free output suitable for device control, and work across different subjects without time-consuming calibration. In this work, we propose a framework for EMG-based intent detection that addresses these challenges. We cast intent detection as per-timestep segmentation of continuous sEMG streams, assigning labels as gestures unfold in real time. We introduce a masked modeling training strategy that aligns muscle activations with their corresponding user intents, enabling rapid onset detection and stable tracking of ongoing gestures. In evaluations against baseline methods, using metrics that capture accuracy, latency and stability for device control, our approach achieves state-of-the-art performance in zero-shot conditions. These results demonstrate its potential for wearable robotics and next-generation prosthetic systems. Our project website, video, code, and dataset are available at: https://reactemg.github.io/
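The two ingredients described above, span masking of the input stream and per-timestep labeling, can be sketched in miniature. The span length, mask rate, and threshold segmenter below are arbitrary stand-ins, not ReactEMG's actual architecture.

```python
import random

# Toy illustration of masked modeling over a continuous stream: hide random
# spans of an sEMG-like signal (for a reconstruction objective) while still
# producing a label at every timestep. All hyperparameters are illustrative.

MASK = None  # sentinel marking a masked-out sample

def mask_spans(signal, span_len=5, mask_rate=0.3, rng=None):
    """Replace random spans with MASK; return the masked copy and mask flags."""
    rng = rng or random.Random(0)
    masked = list(signal)
    is_masked = [False] * len(signal)
    t = 0
    while t < len(signal):
        if rng.random() < mask_rate:
            for i in range(t, min(t + span_len, len(signal))):
                masked[i] = MASK
                is_masked[i] = True
            t += span_len
        else:
            t += 1
    return masked, is_masked

def per_timestep_labels(signal, threshold=0.5):
    """Toy segmenter: emit label 1 wherever activation exceeds a threshold."""
    return [1 if (x is not MASK and x > threshold) else 0 for x in signal]
```

Framing intent detection as per-timestep segmentation, rather than per-window classification, is what allows labels to update as a gesture unfolds instead of waiting for a full window.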
REIC: RAG-Enhanced Intent Classification at Scale
Zhang, Ziji, Yang, Michael, Chen, Zhiyu, Zhuang, Yingying, Pi, Shu-Ting, Liu, Qun, Maragoud, Rajashekar, Nguyen, Vy, Beniwal, Anurag
Accurate intent classification is critical for efficient routing in customer service, ensuring customers are connected with the most suitable agents while reducing handling times and operational costs. However, as companies expand their product lines, intent classification faces scalability challenges due to the increasing number of intents and variations in taxonomy across different verticals. In this paper, we introduce REIC, a Retrieval-augmented generation Enhanced Intent Classification approach, which addresses these challenges effectively. REIC leverages retrieval-augmented generation (RAG) to dynamically incorporate relevant knowledge, enabling precise classification without the need for frequent retraining. Through extensive experiments on real-world datasets, we demonstrate that REIC outperforms traditional fine-tuning, zero-shot, and few-shot methods in large-scale customer service settings. Our results highlight its effectiveness in both in-domain and out-of-domain scenarios, demonstrating its potential for real-world deployment in adaptive and large-scale intent classification systems.
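The retrieval step of a RAG-style classifier can be sketched with bag-of-words cosine similarity over intent descriptions; a production system like REIC would use a learned embedding model, so the scorer and intent catalog below are simplified assumptions.

```python
import math
from collections import Counter

# Minimal sketch of RAG-style intent retrieval: score each intent's
# description against the query and surface the top-k matches, which can
# then be placed in the classification prompt. Bag-of-words cosine stands
# in for a real embedding model.

def cosine(a, b):
    """Cosine similarity between two Counter term vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, intent_docs, k=2):
    """Return the k intent names whose descriptions best match the query."""
    q = Counter(query.lower().split())
    ranked = sorted(
        intent_docs,
        key=lambda name: cosine(q, Counter(intent_docs[name].lower().split())),
        reverse=True,
    )
    return ranked[:k]
```

Because new intents only require adding a description to the retrieval index, this design avoids the frequent retraining that the abstract identifies as the scalability bottleneck.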
IDALC: A Semi-Supervised Framework for Intent Detection and Active Learning based Correction
Mullick, Ankan, Purkayastha, Sukannya, Sharma, Saransh, Goyal, Pawan, Ganguly, Niloy
Voice-controlled dialog systems have become immensely popular due to their ability to perform a wide range of actions in response to diverse user queries. These agents possess a predefined set of skills or intents to fulfill specific user tasks, but every system has its limitations: even for known intents, when a model exhibits low confidence it rejects the utterance, which then necessitates manual annotation. Additionally, as time progresses, there may be a need to retrain these agents with new intents drawn from the system-rejected queries to carry out additional tasks. Labeling all these emerging intents and rejected utterances over time is impractical, thus calling for an efficient mechanism to reduce annotation costs. In this paper, we introduce IDALC (Intent Detection and Active Learning based Correction), a semi-supervised framework designed to detect user intents and rectify system-rejected utterances while minimizing the need for human annotation. Empirical findings on various benchmark datasets demonstrate that our system surpasses baseline methods, achieving 5-10% higher accuracy and a 4-8% improvement in macro-F1. Remarkably, we maintain the overall annotation cost at just 6-10% of the unlabelled data available to the system. The overall framework of IDALC is shown in Fig. 1.
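The detect-reject-annotate loop described above can be sketched as two selection functions: one that partitions predictions by confidence, and one that spends a small annotation budget on the most uncertain rejects. The threshold and scoring rule are illustrative, not IDALC's actual criteria.

```python
# Toy sketch of a semi-supervised detect-and-correct loop: accept confident
# predictions automatically, and route only the most uncertain rejected
# utterances to a human annotator. Thresholds are illustrative.

def split_by_confidence(probs, threshold=0.7):
    """probs: {utterance: {intent: prob}}. Returns (accepted, rejected)."""
    accepted, rejected = {}, []
    for utt, dist in probs.items():
        intent, p = max(dist.items(), key=lambda kv: kv[1])
        if p >= threshold:
            accepted[utt] = intent
        else:
            rejected.append(utt)
    return accepted, rejected

def pick_for_annotation(probs, rejected, budget=1):
    """Choose the `budget` rejected utterances with the lowest top probability."""
    return sorted(rejected, key=lambda u: max(probs[u].values()))[:budget]
```

Annotating only the least confident rejects is what keeps the human labeling cost to a small fraction of the unlabeled pool, as the abstract reports.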
Text Takes Over: A Study of Modality Bias in Multimodal Intent Detection
Mullick, Ankan, Sharma, Saransh, Jana, Abhik, Goyal, Pawan
The rise of multimodal data, integrating text, audio, and visuals, has created new opportunities for studying multimodal tasks such as intent detection. This work investigates the effectiveness of Large Language Models (LLMs) and non-LLMs, including text-only and multi-modal models, in the multimodal intent detection task. Our study reveals that Mistral-7B, a text-only LLM, outperforms most competitive multimodal models by approximately 9% on MIntRec-1 and 4% on MIntRec2.0 datasets. This performance advantage comes from a strong textual bias in these datasets, where over 90% of the samples require textual input, either alone or in combination with other modalities, for correct classification. Human evaluation further confirms the modality bias of these datasets. Next, we propose a framework to debias the datasets, and upon debiasing, more than 70% of the samples in MIntRec-1 and more than 50% in MIntRec2.0 get removed, resulting in significant performance degradation across all models, with smaller multimodal fusion models being the most affected with an accuracy drop of over 50-60%. Further, we analyze the context-specific relevance of different modalities through empirical analysis. Our findings highlight the challenges posed by modality bias in multimodal intent datasets and emphasize the need for unbiased datasets to evaluate multimodal models effectively.
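The debiasing idea above reduces to a simple filter: drop every sample a text-only predictor already classifies correctly, keeping only those that genuinely need other modalities. The predictor and sample schema below are stand-ins for real models and datasets.

```python
# Sketch of the text-bias filter: remove samples whose gold label is
# recoverable from text alone. `text_only_predict` stands in for a trained
# text-only model; the sample schema ({"text", "label"}) is illustrative.

def debias(samples, text_only_predict):
    """Keep only samples that a text-only predictor gets wrong."""
    return [s for s in samples if text_only_predict(s["text"]) != s["label"]]
```

On the datasets studied above, a filter of this kind removes the majority of samples, which is what exposes how much reported multimodal performance was actually carried by text.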
DROID: Dual Representation for Out-of-Scope Intent Detection
Rashwan, Wael, Zawbaa, Hossam M., Dutta, Sourav, Assem, Haytham
Detecting out-of-scope (OOS) user utterances remains a key challenge in task-oriented dialogue systems and, more broadly, in open-set intent recognition. Existing approaches often depend on strong distributional assumptions or auxiliary calibration modules. We present DROID (Dual Representation for Out-of-Scope Intent Detection), a compact end-to-end framework that combines two complementary encoders: the Universal Sentence Encoder (USE) for broad semantic generalization and a domain-adapted Transformer-based Sequential Denoising Auto-Encoder (TSDAE) for domain-specific contextual distinctions. Their fused representations are processed by a lightweight branched classifier with a single calibrated threshold that separates in-domain and OOS intents without post-hoc scoring. To enhance boundary learning under limited supervision, DROID incorporates both synthetic and open-domain outlier augmentation. Despite using only 1.5M trainable parameters, DROID consistently outperforms recent state-of-the-art baselines across multiple intent benchmarks, achieving macro-F1 improvements of 6-15% for known and 8-20% for OOS intents, with the largest gains in low-resource settings. These results demonstrate that dual-encoder representations with simple calibration can yield robust, scalable, and reliable OOS detection for neural dialogue systems. Conversational AI systems are a primary interface for user assistance across sectors such as customer service, healthcare, and finance. A core requirement is intent classification: mapping utterances to predefined intents so downstream components can act appropriately [1].
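The decision rule at the end of the pipeline, fuse two representations and apply one calibrated threshold, can be sketched directly. The encoders are abstracted away as plain vectors; the fusion by concatenation and the threshold value are illustrative assumptions.

```python
# Schematic version of a dual-representation OOS decision rule: concatenate
# the general-purpose and domain-adapted embeddings, score the known
# intents, and declare OOS when the best score falls below one threshold.
# Vectors and scores here are toy stand-ins for real encoder outputs.

def fuse(general_vec, domain_vec):
    """Concatenate the two encoder representations."""
    return list(general_vec) + list(domain_vec)

def classify_with_threshold(scores, threshold):
    """scores: {intent: score}. Best in-domain intent, or 'OOS' if too low."""
    intent, best = max(scores.items(), key=lambda kv: kv[1])
    return intent if best >= threshold else "OOS"
```

A single threshold over calibrated scores is what removes the need for a separate post-hoc scoring module, which is the simplification the abstract emphasizes.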
Instance Relation Learning Network with Label Knowledge Propagation for Few-shot Multi-label Intent Detection
Zhao, Shiman, Li, Shangyuan, Chen, Wei, Wang, Tengjiao, Yao, Jiahui, Zheng, Jiabin, Wong, Kam Fai
Few-shot Multi-label Intent Detection (MID) is crucial for dialogue systems, aiming to detect multiple intents of utterances in low-resource dialogue domains. Previous studies focus on a two-stage pipeline. They first learn representations of utterances with multiple labels and then use a threshold-based strategy to identify multi-label results. However, these methods rely on representation classification and ignore instance relations, leading to error propagation. To solve the above issues, we propose a multi-label joint learning method for few-shot MID in an end-to-end manner, which constructs an instance relation learning network with label knowledge propagation to eliminate error propagation. Concretely, we learn the interaction relations between instances with class information to propagate label knowledge between a few labeled (support set) and unlabeled (query set) instances. With label knowledge propagation, the relation strength between instances directly indicates whether two utterances belong to the same intent for multi-label prediction. Besides, a dual relation-enhanced loss is developed to optimize support- and query-level relation strength to improve performance. Experiments show that we outperform strong baselines by an average of 9.54% AUC and 11.19% Macro-F1 in 1-shot scenarios.
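The core mechanism above, letting relation strength between instances directly decide multi-label membership, can be sketched with cosine similarity standing in for the learned relation network, and a fixed threshold standing in for its learned decision boundary.

```python
import math

# Toy rendering of label propagation over instance relations: a query
# inherits every intent of a support utterance whose relation strength
# (cosine similarity here, in place of a learned relation network) crosses
# a threshold, yielding multi-label predictions without a classifier head.

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def propagate(query_vec, support, threshold=0.8):
    """support: list of (vector, set_of_labels). Union labels over strong relations."""
    labels = set()
    for vec, labs in support:
        if cosine(query_vec, vec) >= threshold:
            labels |= labs
    return labels
```

Because the prediction comes from pairwise relations rather than an intermediate representation classifier, an error in one comparison does not propagate into all later decisions, which is the error-propagation argument the abstract makes against two-stage pipelines.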
Hybrid Dialogue State Tracking for Persian Chatbots: A Language Model-Based Approach
Aghabagher, Samin Mahdipour, Momtazi, Saeedeh
Dialogue State Tracking (DST) is an essential element of conversational AI, with the objective of deeply understanding the conversation context and steering it toward answering user requests. Given the high demand for open-domain, multi-turn chatbots, traditional rule-based DST is no longer sufficient, since it cannot provide the adaptability and coherence required for human-like experiences in complex conversations. This study proposes a hybrid DST model that combines rule-based methods with language models, including BERT for slot filling and intent detection, XGBoost for intent validation, GPT for DST, and online agents for real-time answer generation. The model is evaluated on a comprehensive Persian multi-turn dialogue dataset and demonstrates significantly improved accuracy and coherence over existing methods for Persian chatbots. The results show how effectively a hybrid approach can improve DST capabilities, paving the way for conversational AI systems that are more customized, adaptable, and human-like.
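The rule-plus-model routing in a hybrid pipeline like this reduces to: try deterministic rules first, and fall back to a learned predictor only when no rule fires. Both components below are illustrative stand-ins, not the paper's BERT/XGBoost/GPT stack.

```python
# Minimal sketch of hybrid routing: deterministic rules take precedence,
# a model-based predictor handles everything the rules do not cover.
# The rule predicates and fallback model are hypothetical stand-ins.

def hybrid_intent(utterance, rules, model_predict):
    """rules: list of (predicate, intent). First matching rule wins;
    otherwise defer to the model. Returns (intent, source)."""
    for matches, intent in rules:
        if matches(utterance):
            return intent, "rule"
    return model_predict(utterance), "model"
```

Keeping rules for the unambiguous cases preserves their speed and predictability, while the model fallback supplies the adaptability that pure rule-based DST lacks.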
Few-Shot Query Intent Detection via Relation-Aware Prompt Learning
Zhang, Liang, Li, Yuan, Zhang, Shijie, Zhang, Zheng, Li, Xitong
Intent detection is a crucial component of modern conversational systems, since accurately identifying user intent at the beginning of a conversation is essential for generating effective responses. Currently, most intent detectors can only work effectively under the assumption that high-quality labeled data is available (i.e., the collected data is labeled by domain experts). To ease this process, recent efforts have focused on studying this problem under a more challenging few-shot scenario. These approaches primarily leverage large-scale unlabeled dialogue text corpora to pretrain language models through various pretext tasks, followed by fine-tuning for intent detection with very limited annotations. Despite the improvements achieved, existing methods have predominantly focused on textual data, neglecting to effectively capture the crucial structural information inherent in conversational systems, such as the query-query relation and query-answer relation. Specifically, the query-query relation captures the semantic relevance between two queries within the same session, reflecting the user's refinement of her request, while the query-answer relation represents the conversational agent's clarification of and response to a user query. To address this gap, we propose SAID, a novel framework that, for the first time, integrates both textual and relational structure information in a unified manner for model pretraining. Firstly, we introduce a relation-aware prompt module, which employs learnable relation tokens as soft prompts, enabling the model to learn shared knowledge across multiple relations and become explicitly aware of how to interpret query text within the context of these relations.
Secondly, we reformulate the few-shot intent detection problem using prompt learning by creating a new intent-specific relation-aware prompt, which incorporates intent-specific relation tokens alongside the semantic information embedded in intent names, helping the pretrained model effectively transfer the pretrained knowledge acquired from related relational perspectives. Building on this framework, we further propose a novel mechanism, the query-adaptive attention network (QueryAdapt), which operates at the relation token level by generating intent-specific relation tokens from well-learned query-query and query-answer relations explicitly, enabling more fine-grained knowledge transfer. Extensive experimental results on two real-world datasets demonstrate that SAID significantly outperforms state-of-the-art methods, achieving improvements of up to 27% in the 3-shot setting. When equipped with the relation token-level QueryAdapt module, it yields additional performance gains of up to 21% in the same setting.
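The prompt construction can be pictured with placeholder marker strings standing in for the learnable relation tokens: the query-query context and the query-answer context are marked explicitly around the current query. The marker names ([QQ], [QA]) and layout are hypothetical, not SAID's actual token vocabulary or format.

```python
# Illustrative relation-aware prompt assembly: [QQ] marks the user's
# previous query (query-query relation) and [QA] marks the agent's reply
# (query-answer relation). In SAID these are learnable soft tokens; plain
# marker strings are used here only to show the structure.

def relation_prompt(prev_query, answer, query, intents):
    """Assemble a prompt that exposes relational context around the query."""
    parts = []
    if prev_query:
        parts.append(f"[QQ] {prev_query}")
    if answer:
        parts.append(f"[QA] {answer}")
    parts.append(f"Query: {query}")
    parts.append("Intent options: " + ", ".join(intents))
    return "\n".join(parts)
```

Marking the relations explicitly, rather than concatenating raw turns, is what lets a model learn relation-specific readings of the same query text.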
AFD-SLU: Adaptive Feature Distillation for Spoken Language Understanding
Xie, Yan, Cui, Yibo, Xie, Liang, Yin, Erwei
Spoken Language Understanding (SLU) is a core component of conversational systems, enabling machines to interpret user utterances. Despite its importance, developing effective SLU systems remains challenging due to the scarcity of labeled training data and the computational burden of deploying Large Language Models (LLMs) in real-world applications. To alleviate these issues, we propose an Adaptive Feature Distillation framework that transfers rich semantic representations from a General Text Embeddings (GTE)-based teacher model to a lightweight student model. Our method introduces a dynamic adapter equipped with a Residual Projection Neural Network (RPNN) to align heterogeneous feature spaces, and a Dynamic Distillation Coefficient (DDC) that adaptively modulates the distillation strength based on real-time feedback from intent and slot prediction performance. Experiments on the Chinese profile-based ProSLU benchmark demonstrate that AFD-SLU achieves state-of-the-art results, with 95.67% intent accuracy, 92.02% slot F1 score, and 85.50% overall accuracy.
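The feedback-modulated distillation objective can be sketched as a task loss plus a feature-matching term whose weight shrinks as task accuracy improves. The linear schedule below is an illustrative assumption; AFD-SLU's actual DDC formulation differs.

```python
# Sketch of a dynamically weighted distillation loss: the coefficient on
# the feature-matching term is driven by task feedback (intent/slot
# accuracy), growing when the student is weak and shrinking as it improves.
# The linear schedule is an assumption, not AFD-SLU's exact DDC.

def feature_distance(student_feats, teacher_feats):
    """Mean squared distance between aligned student and teacher features."""
    n = len(student_feats)
    return sum((s - t) ** 2 for s, t in zip(student_feats, teacher_feats)) / n

def dynamic_coefficient(task_accuracy, base=1.0):
    """Heavier distillation pressure while task accuracy is still low."""
    return base * (1.0 - task_accuracy)

def total_loss(task_loss, student_feats, teacher_feats, task_accuracy):
    """Task loss plus the feedback-weighted feature-matching term."""
    weight = dynamic_coefficient(task_accuracy)
    return task_loss + weight * feature_distance(student_feats, teacher_feats)
```

Tying the coefficient to live task feedback lets the student lean on the teacher early in training and optimize its own objective once the features are aligned.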