AITopics | disfluency

VocalBench-DF: A Benchmark for Evaluating Speech LLM Robustness to Disfluency

Liu, Hongcheng, Hou, Yixuan, Liu, Heyang, Wang, Yuhao, Wang, Yanfeng, Wang, Yu

arXiv.org Artificial Intelligence Oct-20-2025

While Speech Large Language Models (Speech-LLMs) show strong performance in many applications, their robustness is critically under-tested, especially to speech disfluency. Existing evaluations often rely on idealized inputs, overlooking common disfluencies, particularly those associated with conditions like Parkinson's disease. This work investigates whether current Speech-LLMs can maintain performance when interacting with users who have speech impairments. To facilitate this inquiry, we introduce VocalBench-DF, a framework for the systematic evaluation of disfluency across a multi-dimensional taxonomy. Our evaluation of 22 mainstream Speech-LLMs reveals substantial performance degradation, indicating that their real-world readiness is limited. Further analysis identifies phoneme-level processing and long-context modeling as primary bottlenecks responsible for these failures. Strengthening recognition and reasoning capability from components and pipelines can substantially improve robustness. These findings highlight the urgent need for new methods to improve disfluency handling and build truly inclusive Speech-LLMs.

artificial intelligence, large language model, natural language, (17 more...)

arXiv.org Artificial Intelligence

2510.15406

Country:

North America > United States (1.00)
Europe (1.00)
South America (0.93)
(3 more...)

Genre: Research Report (1.00)

Industry:

Information Technology (0.68)
Government > Regional Government > North America Government > United States Government (0.46)
Health & Medicine > Therapeutic Area > Neurology > Parkinson's Disease (0.34)
Health & Medicine > Therapeutic Area > Musculoskeletal (0.34)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

DRES: Benchmarking LLMs for Disfluency Removal

Teleki, Maria, Janjur, Sai, Liu, Haoran, Grabner, Oliver, Verma, Ketan, Docog, Thomas, Dong, Xiangjue, Shi, Lingfeng, Wang, Cong, Birkelbach, Stephanie, Kim, Jason, Zhang, Yin, Caverlee, James

arXiv.org Artificial Intelligence Sep-25-2025

Disfluencies -- such as "um," "uh," interjections, parentheticals, and edited statements -- remain a persistent challenge for speech-driven systems, degrading accuracy in command interpretation, summarization, and conversational agents. We introduce DRES (Disfluency Removal Evaluation Suite), a controlled text-level benchmark that establishes a reproducible semantic upper bound for this task. DRES builds on human-annotated Switchboard transcripts, isolating disfluency removal from ASR errors and acoustic variability. We systematically evaluate proprietary and open-source LLMs across scales, prompting strategies, and architectures. Our results reveal that (i) simple segmentation consistently improves performance, even for long-context models; (ii) reasoning-oriented models tend to over-delete fluent tokens; and (iii) fine-tuning achieves near state-of-the-art precision and recall but harms generalization abilities. We further present a set of LLM-specific error modes and offer nine practical recommendations (R1-R9) for deploying disfluency removal in speech-driven pipelines. DRES provides a reproducible, model-agnostic foundation for advancing robust spoken-language systems.
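One way to make the over-deletion finding concrete is to score disfluency removal as token-level deletion. The sketch below is illustrative only: the abstract does not say how DRES computes its scores, and the function name `deletion_prf`, the 0/1 labeling scheme, and the example utterance are all assumptions made for this example.

```python
from typing import Sequence, Tuple


def deletion_prf(gold: Sequence[int], pred: Sequence[int]) -> Tuple[float, float, float]:
    """Precision, recall, and F1 over tokens marked for deletion.

    Labels are hypothetical: 1 means "disfluent, delete", 0 means "fluent, keep".
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)  # fluent token deleted
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)  # disfluency kept
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1


# Made-up utterance: "i uh i want a flight to boston"
tokens = ["i", "uh", "i", "want", "a", "flight", "to", "boston"]
gold = [1, 1, 0, 0, 0, 0, 0, 0]  # the repeated "i" and the filler "uh" are disfluent
over = [1, 1, 0, 0, 1, 0, 0, 0]  # a prediction that also deletes the fluent token "a"

precision, recall, f1 = deletion_prf(gold, over)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In this toy case the over-deleting prediction keeps perfect recall but loses precision, because it removes one fluent token along with the two disfluent ones.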

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2509.20321

Genre: Research Report (0.70)

Technology:

Information Technology > Artificial Intelligence > Speech (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.95)

Why Synthetic Isn't Real Yet: A Diagnostic Framework for Contact Center Dialogue Generation

Devanathan, Rishikesh, Nathan, Varun, Kumar, Ayush

arXiv.org Artificial Intelligence Aug-26-2025

Synthetic transcript generation is critical in contact center domains, where privacy and data scarcity limit model training and evaluation. Unlike prior synthetic dialogue generation work on open-domain or medical dialogues, contact center conversations are goal-oriented, role-asymmetric, and behaviorally complex, featuring disfluencies, ASR noise, and compliance-driven agent actions. In deployments where transcripts are unavailable, standard pipelines still yield derived call attributes such as Intent Summaries, Topic Flow, and QA Evaluation Forms. We leverage these as supervision signals to guide generation. To assess the quality of such outputs, we introduce a diagnostic framework of 18 linguistically and behaviorally grounded metrics for comparing real and synthetic transcripts. We benchmark four language-agnostic generation strategies, from simple prompting to characteristic-aware multi-stage approaches, alongside reference-free baselines. Results reveal persistent challenges: no method excels across all traits, with notable deficits in disfluency, sentiment, and behavioral realism. Our diagnostic tool exposes these gaps, enabling fine-grained evaluation and stress testing of synthetic dialogue across languages.

large language model, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2508.1821

Country: North America > United States (0.92)

Genre: Research Report > New Finding (0.46)

Industry: Banking & Finance (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Discourse & Dialogue (0.88)

DRIVE: Disfluency-Rich Synthetic Dialog Data Generation Framework for Intelligent Vehicle Environments

Chavda, Anshul, Jagadeesh, M, Kullayappa, Chintalapalli Raja, Jayaprakash, B, Sruthi, Medchalimi, Bhattacharyya, Pushpak

arXiv.org Artificial Intelligence Jul-29-2025

In-car conversational AI is becoming increasingly critical as autonomous vehicles and smart assistants gain widespread adoption. Yet, existing datasets fail to capture the spontaneous disfluencies such as hesitations, false starts, repetitions, and self-corrections that characterize real driver-AI dialogs. To address this, we introduce DiscoDrive, a synthetic corpus of 3500 multi-turn dialogs across seven automotive domains, generated using a two-stage, prompt-driven pipeline that dynamically integrates disfluencies during synthesis. We show that DiscoDrive is effective both as a training resource, enabling DialoGPT-Medium and T5-Base to match or exceed KVRET-trained models on the MultiWOZ 2.2 and Schema-Guided Dialogue (SGD) relevant test sets (BLEU-4 improvements of 0.26 to 0.61; METEOR +2.10; ROUGE-L +3.48; BERTScore F1 improvements of 1.35 to 3.48), and as a data augmentation resource in low-resource scenarios, delivering additional gains of up to BLEU-4 +0.38, METEOR +1.95, ROUGE-L +2.87, and BERTScore F1 +4.00 when combined with 10 percent of KVRET. Human evaluations further confirm that dialogs sampled from DiscoDrive are rated higher than KVRET's human-collected dialogs in naturalness (3.8 vs 3.6) and coherence (4.1 vs 4.0), and are perceived as more context-appropriate than leading post-hoc methods (such as LARD), without compromising clarity. DiscoDrive fills a critical gap in existing resources and serves as a versatile corpus for both training and augmenting conversational AI, enabling robust handling of real-world, disfluent in-car interactions.

large language model, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

2507.19867

Country:

Europe (1.00)
Asia (0.93)
North America > United States > Colorado (0.28)
North America > United States > California (0.28)

Genre: Research Report (1.00)

Industry:

Information Technology > Security & Privacy (1.00)
Automobiles & Trucks (0.93)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
(2 more...)

Data Augmentation for Spoken Grammatical Error Correction

Karanasou, Penny, Qian, Mengjie, Bannò, Stefano, Gales, Mark J. F., Knill, Kate M.

arXiv.org Artificial Intelligence Jul-28-2025

While there exist strong benchmark datasets for grammatical error correction (GEC), high-quality annotated spoken datasets for Spoken GEC (SGEC) are still under-resourced. In this paper, we propose a fully automated method to generate audio-text pairs with grammatical errors and disfluencies. Moreover, we propose a series of objective metrics that can be used to evaluate the generated data and choose the more suitable dataset for SGEC. The goal is to generate an augmented dataset that maintains the textual and acoustic characteristics of the original data while providing new types of errors. This augmented dataset should augment and enrich the original corpus without altering the language assessment scores of the second language (L2) learners. We evaluate the use of the augmented corpus both for written GEC (the text part) and for SGEC (the audio-text pairs). Our experiments are conducted on the S&I Corpus, the first publicly available speech dataset with grammar error annotations.

artificial intelligence, data quality, natural language, (15 more...)

arXiv.org Artificial Intelligence

2507.19374

Country: Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.29)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (0.86)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.69)
Information Technology > Data Science > Data Quality > Data Cleaning (0.64)

Acoustically Precise Hesitation Tagging Is Essential for End-to-End Verbatim Transcription Systems

Lin, Jhen-Ke, Lu, Hao-Chien, Wang, Chung-Chun, Lin, Hong-Yun, Chen, Berlin

arXiv.org Artificial Intelligence Jul-28-2025

Verbatim transcription for automatic speaking assessment demands accurate capture of disfluencies, crucial for downstream tasks like error analysis and feedback. However, many ASR systems discard or generalize hesitations, losing important acoustic details. We fine-tune Whisper models on the Speak & Improve 2025 corpus using low-rank adaptation (LoRA), without recourse to external audio training data. We compare three annotation schemes: removing hesitations (Pure), generic tags (Rich), and acoustically precise fillers inferred by Gemini 2.0 Flash from existing audio-transcript pairs (Extra). Our challenge system achieved 6.47% WER (Pure) and 5.81% WER (Extra). Post-challenge experiments reveal that fine-tuning Whisper Large V3 Turbo with the "Extra" scheme yielded a 5.5% WER, an 11.3% relative improvement over the "Pure" scheme (6.2% WER). This demonstrates that explicit, realistic filled-pause labeling significantly enhances ASR accuracy for verbatim L2 speech transcription.
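The 11.3% relative improvement follows directly from the two post-challenge WERs quoted above (6.2% for "Pure", 5.5% for "Extra"). A minimal check, in which the helper name `relative_improvement` is hypothetical and not taken from the paper:

```python
def relative_improvement(baseline_wer: float, new_wer: float) -> float:
    """Relative WER reduction of new_wer against baseline_wer, in percent."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer * 100.0


# WER values as reported in the abstract for Whisper Large V3 Turbo.
pure_wer = 6.2
extra_wer = 5.5
print(round(relative_improvement(pure_wer, extra_wer), 1))  # 11.3
```

The absolute gap is only 0.7 WER points; the larger-looking 11.3% is that gap expressed as a fraction of the 6.2% baseline.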

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2506.04076

Genre: Research Report > New Finding (0.47)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.69)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.36)

Smooth Operators: LLMs Translating Imperfect Hints into Disfluency-Rich Transcripts

Altinok, Duygu

arXiv.org Artificial Intelligence Jun-24-2025

Accurate detection of disfluencies in spoken language is crucial for enhancing the performance of automatic speech and language processing systems, as well as fostering the development of more inclusive speech and language technologies. Leveraging the growing trend of large language models (LLMs) as versatile learners capable of processing both lexical and non-lexical inputs (e.g., audio and video), we propose a novel approach to transcribing disfluencies as explicit tokens with timestamps, enabling the generation of fully annotated disfluency-rich transcripts. Our method integrates acoustic representations extracted from an audio encoder with textual inputs of varying quality: clean transcriptions without disfluencies, time-aligned transcriptions from aligners, or outputs from phoneme-based ASR models -- all of which may contain imperfections. Importantly, our experiments demonstrate that textual inputs do not need to be flawless. As long as they include timestamp-related cues, LLMs can effectively smooth the input and produce fully disfluency-annotated transcripts, underscoring their robustness in handling imperfect hints.

artificial intelligence, large language model, natural language, (16 more...)

arXiv.org Artificial Intelligence

2506.1851

Country: Europe (0.28)

Genre: Research Report > Promising Solution (0.34)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Does the Appearance of Autonomous Conversational Robots Affect User Spoken Behaviors in Real-World Conference Interactions?

Pang, Zi Haur, Fu, Yahui, Lala, Divesh, Elmers, Mikey, Inoue, Koji, Kawahara, Tatsuya

arXiv.org Artificial Intelligence Mar-17-2025

We investigate the impact of robot appearance on users' spoken behavior during real-world interactions by comparing a human-like android, ERICA, with a less anthropomorphic humanoid, TELECO. Analyzing data from 42 participants at SIGDIAL 2024, we extracted linguistic features such as disfluencies and syntactic complexity from conversation transcripts. The results showed moderate effect sizes, suggesting that participants produced fewer disfluencies and employed more complex syntax when interacting with ERICA. Further analysis involving training classification models like Naïve Bayes, which achieved an F1-score of 71.60%, and conducting feature importance analysis, highlighted the significant role of disfluencies and syntactic complexity in interactions with robots of varying human-like appearances. Discussing these findings within the frameworks of cognitive load and Communication Accommodation Theory, we conclude that designing robots to elicit more structured and fluent user speech can enhance their communicative alignment with humans.

artificial intelligence, machine learning, natural language, (15 more...)

arXiv.org Artificial Intelligence

2503.13625

Country:

Asia > Japan > Honshū > Kansai > Kyoto Prefecture > Kyoto (0.07)
Asia > Japan > Honshū > Kantō > Kanagawa Prefecture > Yokohama (0.05)
Asia > Middle East > Jordan (0.04)
(3 more...)

Genre: Research Report > New Finding (1.00)

Industry: Health & Medicine (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Discourse & Dialogue (0.68)
Information Technology > Artificial Intelligence > Issues > Social & Ethical Issues (0.48)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis (0.48)
(2 more...)

Enhancing Naturalness in LLM-Generated Utterances through Disfluency Insertion

Hassan, Syed Zohaib, Lison, Pierre, Halvorsen, Pål

arXiv.org Artificial Intelligence Dec-17-2024

Disfluencies are a natural feature of spontaneous human speech but are typically absent from the outputs of Large Language Models (LLMs). This absence can diminish the perceived naturalness of synthesized speech, which is an important criterion when building conversational agents that aim to mimic human behaviours. We show how the insertion of disfluencies can alleviate this shortcoming. The proposed approach involves (1) fine-tuning an LLM with Low-Rank Adaptation (LoRA) to incorporate various types of disfluencies into LLM-generated utterances and (2) synthesizing those utterances using a text-to-speech model that supports the generation of speech phenomena such as disfluencies. We evaluated the quality of the generated speech across two metrics: intelligibility and perceived spontaneity. We demonstrate through a user study that the insertion of disfluencies significantly increases the perceived spontaneity of the generated speech. This increase came, however, along with a slight reduction in intelligibility.

large language model, machine learning, speaker 1, (22 more...)

arXiv.org Artificial Intelligence

2412.1271

Country:

Europe > Norway > Eastern Norway > Oslo (0.05)
North America > United States > Mississippi > Mississippi County > Mississippi State (0.04)
North America > Canada > Ontario > Toronto (0.04)

Genre:

Research Report > Experimental Study (0.94)
Questionnaire & Opinion Survey (0.90)

Industry: Education (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Analysis and Detection of Differences in Spoken User Behaviors between Autonomous and Wizard-of-Oz Systems

Elmers, Mikey, Inoue, Koji, Lala, Divesh, Ochi, Keiko, Kawahara, Tatsuya

arXiv.org Artificial Intelligence Oct-4-2024

This study examined users' behavioral differences in a large corpus of Japanese human-robot interactions, comparing interactions between a tele-operated robot and an autonomous dialogue system. We analyzed user spoken behaviors in both attentive listening and job interview dialogue scenarios. Results revealed significant differences in metrics such as speech length, speaking rate, fillers, backchannels, disfluencies, and laughter between operator-controlled and autonomous conditions. Furthermore, we developed predictive models to distinguish between operator and autonomous system conditions. Our models demonstrated higher accuracy and precision compared to the baseline model, with several models also achieving a higher F1 score than the baseline.

backchannel, job interview scenario, scenario, (13 more...)

arXiv.org Artificial Intelligence

2410.03147

Country:

North America > United States > Pennsylvania (0.04)
Asia > Singapore (0.04)
Asia > Japan > Honshū > Kansai > Kyoto Prefecture > Kyoto (0.04)

Genre: Research Report > Experimental Study (0.88)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
