AITopics | disfluency detection

Collaborating Authors

disfluency detection

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

An Encoder-Integrated PhoBERT with Graph Attention for Vietnamese Token-Level Classification

Nguyen, Ba-Quang

arXiv.org Artificial IntelligenceOct-14-2025

We propose a novel neural architecture named TextGraphFuseGAT, which integrates a pretrained transformer encoder (PhoBERT) with Graph Attention Networks for token-level classification tasks. The proposed model constructs a fully connected graph over the token embeddings produced by PhoBERT, enabling the GAT layer to capture rich inter-token dependencies beyond those modeled by sequential context alone. To further enhance contextualization, a Transformer-style self-attention layer is applied on top of the graph-enhanced embeddings. The final token representations are passed through a classification head to perform sequence labeling. We evaluate our approach on three Vietnamese benchmark datasets: PhoNER-COVID19 for named entity recognition in the COVID-19 domain, PhoDisfluency for speech disfluency detection, and VietMed-NER for medical-domain NER. VietMed-NER is the first Vietnamese medical spoken NER dataset, featuring 18 entity types collected from real-world medical speech transcripts and annotated with the BIO tagging scheme. Its specialized vocabulary and domain-specific expressions make it a challenging benchmark for token-level classification models. Experimental results show that our method consistently outperforms strong baselines, including transformer-only and hybrid neural models such as BiLSTM + CNN + CRF, confirming the effectiveness of combining pretrained semantic features with graph-based relational modeling for improved token classification across multiple domains.

artificial intelligence, machine learning, natural language, (16 more...)

arXiv.org Artificial Intelligence

2510.11537

Country: North America > United States (0.28)

Genre: Research Report > New Finding (1.00)

Industry: Health & Medicine (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)

Add feedback

DRES: Benchmarking LLMs for Disfluency Removal

Teleki, Maria, Janjur, Sai, Liu, Haoran, Grabner, Oliver, Verma, Ketan, Docog, Thomas, Dong, Xiangjue, Shi, Lingfeng, Wang, Cong, Birkelbach, Stephanie, Kim, Jason, Zhang, Yin, Caverlee, James

arXiv.org Artificial IntelligenceSep-25-2025

Disfluencies -- such as "um," "uh," interjections, parentheticals, and edited statements -- remain a persistent challenge for speech-driven systems, degrading accuracy in command interpretation, summarization, and conversational agents. We introduce DRES (Disfluency Removal Evaluation Suite), a controlled text-level benchmark that establishes a reproducible semantic upper bound for this task. DRES builds on human-annotated Switchboard transcripts, isolating disfluency removal from ASR errors and acoustic variability. We systematically evaluate proprietary and open-source LLMs across scales, prompting strategies, and architectures. Our results reveal that (i) simple segmentation consistently improves performance, even for long-context models; (ii) reasoning-oriented models tend to over-delete fluent tokens; and (iii) fine-tuning achieves near state-of-the-art precision and recall but harms generalization abilities. We further present a set of LLM-specific error modes and offer nine practical recommendations (R1-R9) for deploying disfluency removal in speech-driven pipelines. DRES provides a reproducible, model-agnostic foundation for advancing robust spoken-language systems.

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2509.20321

Genre: Research Report (0.70)

Technology:

Information Technology > Artificial Intelligence > Speech (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.95)

Add feedback

Z-Scores: A Metric for Linguistically Assessing Disfluency Removal

Teleki, Maria, Janjur, Sai, Liu, Haoran, Grabner, Oliver, Verma, Ketan, Docog, Thomas, Dong, Xiangjue, Shi, Lingfeng, Wang, Cong, Birkelbach, Stephanie, Kim, Jason, Zhang, Yin, Caverlee, James

arXiv.org Artificial IntelligenceSep-25-2025

Evaluating disfluency removal in speech requires more than aggregate token-level scores. Traditional word-based metrics such as precision, recall, and F1 (E-Scores) capture overall performance but cannot reveal why models succeed or fail. We introduce Z-Scores, a span-level linguistically-grounded evaluation metric that categorizes system behavior across distinct disfluency types (EDITED, INTJ, PRN). Our deterministic alignment module enables robust mapping between generated text and disfluent transcripts, allowing Z-Scores to expose systematic weaknesses that word-level metrics obscure. By providing category-specific diagnostics, Z-Scores enable researchers to identify model failure modes and design targeted interventions -- such as tailored prompts or data augmentation -- yielding measurable performance improvements. A case study with LLMs shows that Z-Scores uncover challenges with INTJ and PRN disfluencies hidden in aggregate F1, directly informing model refinement strategies.

large language model, machine learning, natural language, (16 more...)

arXiv.org Artificial Intelligence

2509.20319

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.46)

Add feedback

Smooth Operators: LLMs Translating Imperfect Hints into Disfluency-Rich Transcripts

Altinok, Duygu

arXiv.org Artificial IntelligenceJun-24-2025

Accurate detection of disfluencies in spoken language is crucial for enhancing the performance of automatic speech and language processing systems, as well as fostering the development of more inclusive speech and language technologies. Leveraging the growing trend of large language models (LLMs) as versatile learners capable of processing both lexical and non-lexical inputs (e.g., audio and video), we propose a novel approach to transcribing disfluencies as explicit tokens with timestamps, enabling the generation of fully annotated disfluency-rich transcripts. Our method integrates acoustic representations extracted from an audio encoder with textual inputs of varying quality: clean transcriptions without disfluencies, time-aligned transcriptions from aligners, or outputs from phoneme-based ASR models -- all of which may contain imperfections. Importantly, our experiments demonstrate that textual inputs do not need to be flawless. As long as they include timestamp-related cues, LLMs can effectively smooth the input and produce fully disfluency-annotated transcripts, underscoring their robustness in handling imperfect hints.

artificial intelligence, large language model, natural language, (16 more...)

arXiv.org Artificial Intelligence

2506.1851

Country: Europe (0.28)

Genre: Research Report > Promising Solution (0.34)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Add feedback

Augmenting Automatic Speech Recognition Models with Disfluency Detection

Amann, Robin, Li, Zhaolin, Bruno, Barbara, Niehues, Jan

arXiv.org Artificial IntelligenceSep-17-2024

Speech disfluency commonly occurs in conversational and spontaneous speech. However, standard Automatic Speech Recognition (ASR) models struggle to accurately recognize these disfluencies because they are typically trained on fluent transcripts. Current research mainly focuses on detecting disfluencies within transcripts, overlooking their exact location and duration in the speech. Additionally, previous work often requires model fine-tuning and addresses limited types of disfluencies. In this work, we present an inference-only approach to augment any ASR model with the ability to detect open-set disfluencies. We first demonstrate that ASR models have difficulty transcribing speech disfluencies. Next, this work proposes a modified Connectionist Temporal Classification(CTC)-based forced alignment algorithm from \cite{kurzinger2020ctc} to predict word-level timestamps while effectively capturing disfluent speech. Additionally, we develop a model to classify alignment gaps between timestamps as either containing disfluent speech or silence. This model achieves an accuracy of 81.62% and an F1-score of 80.07%. We test the augmentation pipeline of alignment gap detection and classification on a disfluent dataset. Our results show that we captured 74.13% of the words that were initially missed by the transcription, demonstrating the potential of this pipeline for downstream tasks.

alignment, disfluency, transcript, (15 more...)

arXiv.org Artificial Intelligence

2409.10177

Country:

North America > United States > New York > New York County > New York City (0.04)
Europe > Middle East > Malta > Eastern Region > Northern Harbour District > St. Julian's (0.04)
Europe > Germany > Baden-Württemberg > Karlsruhe Region > Karlsruhe (0.04)

Genre: Research Report > New Finding (0.68)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

Looks can be Deceptive: Distinguishing Repetition Disfluency from Reduplication

Ahmad, Arif, Khyathi, Mothika Gayathri, Bhattacharyya, Pushpak

arXiv.org Artificial IntelligenceJul-10-2024

Reduplication and repetition, though similar in form, serve distinct linguistic purposes. Reduplication is a deliberate morphological process used to express grammatical, semantic, or pragmatic nuances, while repetition is often unintentional and indicative of disfluency. This paper presents the first large-scale study of reduplication and repetition in speech using computational linguistics. We introduce IndicRedRep, a new publicly available dataset containing Hindi, Telugu, and Marathi text annotated with reduplication and repetition at the word level. We evaluate transformer-based models for multi-class reduplication and repetition token classification, utilizing the Reparandum-Interregnum-Repair structure to distinguish between the two phenomena. Our models achieve macro F1 scores of up to 85.62% in Hindi, 83.95% in Telugu, and 84.82% in Marathi for reduplication-repetition classification.

computational linguistic, reduplication, repetition, (13 more...)

arXiv.org Artificial Intelligence

2407.08147

Country:

Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
Asia > India > Jharkhand (0.04)
Europe > France > Provence-Alpes-Côte d'Azur > Bouches-du-Rhône > Marseille (0.04)
(17 more...)

Genre: Research Report > New Finding (1.00)

Industry: Health & Medicine (0.46)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.88)

Add feedback

Missingness-resilient Video-enhanced Multimodal Disfluency Detection

Mohapatra, Payal, Likhite, Shamika, Biswas, Subrata, Islam, Bashima, Zhu, Qi

arXiv.org Artificial IntelligenceJun-11-2024

Most existing speech disfluency detection techniques only rely upon acoustic data. In this work, we present a practical multimodal disfluency detection approach that leverages available video data together with audio. We curate an audiovisual dataset and propose a novel fusion technique with unified weight-sharing modality-agnostic encoders to learn the temporal and semantic context. Our resilient design accommodates real-world scenarios where the video modality may sometimes be missing during inference. We also present alternative fusion strategies when both modalities are assured to be complete. In experiments across five disfluency-detection tasks, our unified multimodal approach significantly outperforms Audio-only unimodal methods, yielding an average absolute improvement of 10% (i.e., 10 percentage point increase) when both video and audio modalities are always available, and 7% even when video modality is missing in half of the samples.

detection, disfluency detection, modality, (16 more...)

arXiv.org Artificial Intelligence

2406.06964

Country:

North America > United States (0.04)
Europe > Ireland > Leinster > County Dublin > Dublin (0.04)

Genre: Research Report (0.82)

Industry: Health & Medicine (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Add feedback

Boosting Disfluency Detection with Large Language Model as Disfluency Generator

Cheng, Zhenrong, Guo, Jiayan, Sun, Hao, Zhang, Yan

arXiv.org Artificial IntelligenceMar-13-2024

Current disfluency detection methods heavily rely on costly and scarce human-annotated data. To tackle this issue, some approaches employ heuristic or statistical features to generate disfluent sentences, partially improving detection performance. However, these sentences often deviate from real-life scenarios, constraining overall model enhancement. In this study, we propose a lightweight data augmentation approach for disfluency detection, utilizing the superior generative and semantic understanding capabilities of large language model (LLM) to generate disfluent sentences as augmentation data. We leverage LLM to generate diverse and more realistic sentences guided by specific prompts, without the need for fine-tuning the LLM. Subsequently, we apply an uncertainty-aware data filtering approach to improve the quality of the generated sentences, utilized in training a small detection model for improved performance. Experiments using enhanced data yielded state-of-the-art results. The results showed that using a small amount of LLM-generated enhanced data can significantly improve performance, thereby further enhancing cost-effectiveness.

detection, disfluency detection, disfluent sentence, (13 more...)

arXiv.org Artificial Intelligence

2403.08229

Country:

North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
Europe > France > Provence-Alpes-Côte d'Azur > Bouches-du-Rhône > Marseille (0.04)

Genre: Research Report > New Finding (0.88)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Add feedback

Adapting the NICT-JLE Corpus for Disfluency Detection Models

Skidmore, Lucy, Moore, Roger K.

arXiv.org Artificial IntelligenceAug-4-2023

The detection of disfluencies such as hesitations, repetitions and false starts commonly found in speech is a widely studied area of research. With a standardised process for evaluation using the Switchboard Corpus, model performance can be easily compared across approaches. This is not the case for disfluency detection research on learner speech, however, where such datasets have restricted access policies, making comparison and subsequent development of improved models more challenging. To address this issue, this paper describes the adaptation of the NICT-JLE corpus, containing approximately 300 hours of English learners' oral proficiency tests, to a format that is suitable for disfluency detection model training and evaluation. Points of difference between the NICT-JLE and Switchboard corpora are explored, followed by a detailed overview of adaptations to the tag set and meta-features of the NICT-JLE corpus. The result of this work provides a standardised train, heldout and test set for use in future research on disfluency detection for learner speech.

corpus, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2308.02482

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > United Kingdom > England > South Yorkshire > Sheffield (0.05)
North America > United States > Washington > King County > Seattle (0.04)
(4 more...)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (0.48)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.47)

Add feedback

Adversarial Training For Low-Resource Disfluency Correction

Bhat, Vineet, Jyothi, Preethi, Bhattacharyya, Pushpak

arXiv.org Artificial IntelligenceJun-10-2023

Disfluencies commonly occur in conversational speech. Speech with disfluencies can result in noisy Automatic Speech Recognition (ASR) transcripts, which affects downstream tasks like machine translation. In this paper, we propose an adversarially-trained sequence-tagging model for Disfluency Correction (DC) that utilizes a small amount of labeled real disfluent data in conjunction with a large amount of unlabeled data. We show the benefit of our proposed technique, which crucially depends on synthetically generated disfluent data, by evaluating it for DC in three Indian languages- Bengali, Hindi, and Marathi (all from the Indo-Aryan family). Our technique also performs well in removing stuttering disfluencies in ASR transcripts introduced by speech impairments. We achieve an average 6.15 points improvement in F1-score over competitive baselines across all three languages mentioned. To the best of our knowledge, we are the first to utilize adversarial training for DC and use it to correct stuttering disfluencies in English, establishing a new benchmark for this task.

computational linguistic, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2306.06384

Country:

Asia > India (0.14)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > United States > Washington > King County > Seattle (0.04)
(8 more...)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.89)

Add feedback