Switchboard


DRES: Benchmarking LLMs for Disfluency Removal

Teleki, Maria, Janjur, Sai, Liu, Haoran, Grabner, Oliver, Verma, Ketan, Docog, Thomas, Dong, Xiangjue, Shi, Lingfeng, Wang, Cong, Birkelbach, Stephanie, Kim, Jason, Zhang, Yin, Caverlee, James

arXiv.org Artificial Intelligence

Disfluencies -- such as "um," "uh," interjections, parentheticals, and edited statements -- remain a persistent challenge for speech-driven systems, degrading accuracy in command interpretation, summarization, and conversational agents. We introduce DRES (Disfluency Removal Evaluation Suite), a controlled text-level benchmark that establishes a reproducible semantic upper bound for this task. DRES builds on human-annotated Switchboard transcripts, isolating disfluency removal from ASR errors and acoustic variability. We systematically evaluate proprietary and open-source LLMs across scales, prompting strategies, and architectures. Our results reveal that (i) simple segmentation consistently improves performance, even for long-context models; (ii) reasoning-oriented models tend to over-delete fluent tokens; and (iii) fine-tuning achieves near state-of-the-art precision and recall but harms generalization abilities. We further present a set of LLM-specific error modes and offer nine practical recommendations (R1-R9) for deploying disfluency removal in speech-driven pipelines. DRES provides a reproducible, model-agnostic foundation for advancing robust spoken-language systems.
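
Since DRES scores models on precision and recall over removed tokens, here is a minimal sketch of how such token-level scoring can work; the alignment heuristic, example sentences, and model output are hypothetical illustrations, not DRES's actual protocol.

```python
# Toy token-level precision/recall for disfluency removal (hypothetical data).
from difflib import SequenceMatcher

def removed_indices(disfluent, cleaned):
    """Indices of tokens in `disfluent` that do not survive into `cleaned`."""
    sm = SequenceMatcher(a=disfluent, b=cleaned, autojunk=False)
    kept = set()
    for block in sm.get_matching_blocks():
        kept.update(range(block.a, block.a + block.size))
    return {i for i in range(len(disfluent)) if i not in kept}

disfluent = "i um i want uh want to book a flight".split()
gold      = "i want to book a flight".split()        # human-annotated fluent side
model_out = "i want want to book a flight".split()   # hypothetical model output

gold_removed  = removed_indices(disfluent, gold)
model_removed = removed_indices(disfluent, model_out)

tp = len(gold_removed & model_removed)
precision = tp / len(model_removed) if model_removed else 1.0
recall    = tp / len(gold_removed) if gold_removed else 1.0
print(f"precision={precision:.2f} recall={recall:.2f}")
```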


Chain-of-Thought Training for Open E2E Spoken Dialogue Systems

Arora, Siddhant, Tian, Jinchuan, Futami, Hayato, Jung, Jee-weon, Shi, Jiatong, Kashiwagi, Yosuke, Tsunoo, Emiru, Watanabe, Shinji

arXiv.org Artificial Intelligence

Unlike traditional cascaded pipelines, end-to-end (E2E) spoken dialogue systems preserve full differentiability and capture non-phonemic information, making them well-suited for modeling spoken interactions. However, existing E2E approaches often require large-scale training data and generate responses that lack semantic coherence. We propose a simple yet effective strategy leveraging a chain-of-thought (CoT) formulation, ensuring that training on conversational data remains closely aligned with the multimodal language model (LM)'s pre-training on speech recognition (ASR), text-to-speech synthesis (TTS), and text LM tasks. Our method achieves an improvement of over 1.5 ROUGE-1 over the baseline, successfully training spoken dialogue systems on publicly available human-human conversation datasets while being compute-efficient enough to learn from just 300 hours of such data (e.g., Switchboard). We will publicly release our models and training code.
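
The paper reports its gain in ROUGE-1, which measures unigram overlap between a generated response and a reference. A back-of-the-envelope version on made-up strings is sketched below; real evaluations use a full ROUGE package with proper tokenization and stemming.

```python
# Minimal ROUGE-1 F1: clipped unigram overlap between candidate and reference.
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    cand, ref = Counter(candidate.lower().split()), Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if not cand or not ref or overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

print(rouge1_f1("yeah that sounds good to me", "that sounds good"))  # ~0.667
```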


LearnerVoice: A Dataset of Non-Native English Learners' Spontaneous Speech

Kim, Haechan, Myung, Junho, Kim, Seoyoung, Lee, Sungpah, Kang, Dongyeop, Kim, Juho

arXiv.org Artificial Intelligence

Prevalent ungrammatical expressions and disfluencies in spontaneous speech from second language (L2) learners pose unique challenges to Automatic Speech Recognition (ASR) systems. However, few datasets are tailored to L2 learner speech. We publicly release LearnerVoice, a dataset consisting of 50.04 hours of audio and transcriptions of L2 learners' spontaneous speech. Our linguistic analysis reveals that transcriptions in our dataset contain L2S (L2 learner's Spontaneous speech) features, consisting of ungrammatical expressions and disfluencies (e.g., filler words, word repetitions, self-repairs, false starts), significantly more than native speech datasets. Fine-tuning whisper-small.en with LearnerVoice achieves a WER of 10.26%, 44.2% lower than vanilla whisper-small.en. Furthermore, our qualitative analysis indicates that 54.2% of errors from the vanilla model on LearnerVoice are attributable to L2S features, with 48.1% of them being reduced in the fine-tuned model.
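
Word error rate (WER), the metric reported above, is word-level edit distance divided by reference length. A compact sketch on a hypothetical reference/hypothesis pair:

```python
# WER via standard dynamic-programming edit distance over words.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits to turn the first i ref words into the first j hyp words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[-1][-1] / len(ref)

# One insertion ("i") plus one deletion ("the") over 6 reference words: 2/6.
print(wer("i went to the market yesterday", "i i went to market yesterday"))
```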


Joint Learning of Context and Feedback Embeddings in Spoken Dialogue

Qian, Livia, Skantze, Gabriel

arXiv.org Artificial Intelligence

Short feedback responses, such as backchannels, play an important role in spoken dialogue. So far, most of the modeling of feedback responses has focused on their timing, often neglecting how their lexical and prosodic form influence their contextual appropriateness and conversational function. In this paper, we investigate the possibility of embedding short dialogue contexts and feedback responses in the same representation space using a contrastive learning objective. In our evaluation, we primarily focus on how such embeddings can be used as a context-feedback appropriateness metric and thus for feedback response ranking in U.S. English dialogues. Our results show that the model outperforms humans given the same ranking task and that the learned embeddings carry information about the conversational function of feedback responses.
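
A minimal sketch of the kind of contrastive objective described: contexts and feedback responses are embedded in a shared space, and matched pairs are trained to score higher than in-batch mismatches (an InfoNCE-style loss). The encoders, dimensions, and temperature below are illustrative assumptions, not the paper's setup.

```python
# InfoNCE-style contrastive loss over aligned (context, feedback) pairs.
import torch
import torch.nn.functional as F

def contrastive_loss(ctx_emb, fb_emb, temperature=0.07):
    """ctx_emb, fb_emb: (batch, dim) embeddings of aligned context/feedback pairs."""
    ctx = F.normalize(ctx_emb, dim=-1)
    fb = F.normalize(fb_emb, dim=-1)
    logits = ctx @ fb.t() / temperature  # (batch, batch) cosine-similarity matrix
    targets = torch.arange(ctx.size(0))  # diagonal entries are the true pairs
    # Symmetric loss: rank feedbacks given context and contexts given feedback.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

ctx_emb = torch.randn(8, 256)  # stand-ins for encoder outputs
fb_emb = torch.randn(8, 256)
print(contrastive_loss(ctx_emb, fb_emb).item())
```

At inference time, ranking candidate feedback responses for a context then reduces to sorting them by cosine similarity to the context embedding.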


Attribution and Alignment: Effects of Local Context Repetition on Utterance Production and Comprehension in Dialogue

Molnar, Aron, Jumelet, Jaap, Giulianelli, Mario, Sinclair, Arabella

arXiv.org Artificial Intelligence

Human production in dialogue is influenced by many factors within the recent conversational history, leading speakers to repeat recently used lexical and structural elements of their own and their partners' language. These factors can involve conceptual pacts speakers make in order to establish common ground (Brennan and Clark, 1996) and priming of lexical or syntactic cues which influences their subsequent re-use (Bock, 1986). While excessive levels of repetition, designed to mimic alignment, can hinder naturalness (Isard et al., 2006; Foster et al., 2009), humans generally prefer generated dialogue that contains higher levels of alignment (Lopes et al., 2015; Hu et al., 2016), which also lead to more successful communication in human-human dialogue (Xi et al., 2021; Isard et al., 2006). Moreover, elements of alignment have been successfully incorporated in chat bots (Hoegen et al., 2019; Gao et al., 2019).
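
As a toy illustration of the sort of quantity at stake, one crude proxy for local lexical repetition is the share of an utterance's words that already occurred in the recent history. The window size and metric here are illustrative choices only, not the paper's measures.

```python
# Fraction of utterance words already present in the last `window` turns.
def lexical_repetition(history, utterance, window=3):
    context = {w for turn in history[-window:] for w in turn.lower().split()}
    words = utterance.lower().split()
    return sum(w in context for w in words) / len(words) if words else 0.0

history = ["shall we take the red sofa", "the red one is nicer"]
print(lexical_repetition(history, "yes the red sofa works"))  # 0.6
```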


Artificial Disfluency Detection, Uh No, Disfluency Generation for the Masses

Passali, T., Mavropoulos, T., Tsoumakas, G., Meditskos, G., Vrochidis, S.

arXiv.org Artificial Intelligence

Existing approaches for disfluency detection typically require large annotated datasets. However, current datasets for this task are limited, suffer from class imbalance, and lack some types of disfluencies encountered in real-world scenarios. This work proposes LARD, a method for automatically generating artificial disfluencies from fluent text. LARD can simulate all the different types of disfluencies (repetitions, replacements, and restarts) based on the reparandum/interregnum annotation scheme. In addition, it incorporates contextual embeddings into the disfluency generation to produce realistic, context-aware artificial disfluencies. Since the proposed method requires only fluent text, it can be used directly for training, bypassing the need for annotated disfluent data. Our empirical evaluation demonstrates that LARD can be used effectively when little or no annotated data is available. Furthermore, our detailed analysis suggests that the proposed method generates realistic disfluencies and increases the accuracy of existing disfluency detectors.
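
As a toy re-creation of the simplest of these operations, the sketch below injects a repetition disfluency into fluent text. LARD itself also generates replacements and restarts and conditions on contextual embeddings, none of which is reproduced here; the inputs are hypothetical.

```python
# Inject a repetition disfluency: repeat a short span (the reparandum) verbatim.
import random

def add_repetition(sentence, max_span=2, seed=None):
    rng = random.Random(seed)
    words = sentence.split()
    span = rng.randint(1, min(max_span, len(words)))     # reparandum length
    start = rng.randrange(len(words) - span + 1)         # where the repeat begins
    repeated = words[start:start + span]
    return " ".join(words[:start + span] + repeated + words[start + span:])

print(add_repetition("i would like to book a table for two", seed=0))
```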


Why Computers Will Never Write Good Novels - Issue 95: Escape

Nautilus

The hoax seems harmless enough. A few thousand AI researchers have claimed that computers can read and write literature. They've alleged that algorithms can unearth the secret formulas of fiction and film. That Bayesian software can map the plots of memoirs and comic books. That digital brains can pen primitive lyrics and short stories--wooden and weird, to be sure, yet evidence that computers are capable of more. But the hoax is not harmless. If it were possible to build a digital novelist or poetry analyst, then computers would be far more powerful than they are now. They would in fact be the most powerful beings in the history of Earth. Their power would be the power of literature, which although it seems now, in today's glittering silicon age, to be a rather unimpressive old thing, springs from the same neural root that enables human brains to create, to imagine, to dream up tomorrows.


Speech Is More Than Spoken Text

#artificialintelligence

Beyond these virtual assistants, voice technology and conversational AI have increased in popularity over the last decade and are used in many applications. One use of Natural Language Processing (NLP) technology is to analyse and gain insight from the written transcripts of audio-- whether from voice assistants or from other scenarios like meetings, interviews, call centres, lectures or TV shows. Yet when we speak, things are more complicated than a simple text transcription suggests. This post talks about some of the differences between written and spoken language, especially in the context of conversation. To understand conversation, we need data.


CAT: A CTC-CRF based ASR Toolkit Bridging the Hybrid and the End-to-end Approaches towards Data Efficiency and Low Latency

An, Keyu, Xiang, Hongyu, Ou, Zhijian

arXiv.org Machine Learning

In this paper, we present a new open source toolkit for speech recognition, named CAT (CTC-CRF based ASR Toolkit). CAT inherits the data-efficiency of the hybrid approach and the simplicity of the E2E approach, providing a full-fledged implementation of CTC-CRFs and complete training and testing scripts for a number of English and Chinese benchmarks. Experiments show CAT obtains state-of-the-art results, which are comparable to the fine-tuned hybrid models in Kaldi but with a much simpler training pipeline. Compared to existing non-modularized E2E models, CAT performs better on limited-scale datasets, demonstrating its data efficiency. Furthermore, we propose a new method called contextualized soft forgetting, which enables CAT to do streaming ASR without accuracy degradation. We hope CAT, especially the CTC-CRF based framework and software, will be of broad interest to the community, and can be further explored and improved.
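
CAT's objective is a CRF built on CTC topology. As a rough point of reference only, plain CTC with PyTorch's built-in loss looks like the sketch below (random tensors; CAT's CTC-CRF adds a denominator graph that is not reproduced here).

```python
# Plain CTC loss with torch.nn.CTCLoss on random inputs.
import torch

T, N, C, S = 50, 4, 30, 10              # time steps, batch, classes (0 = blank), target length
log_probs = torch.randn(T, N, C).log_softmax(dim=-1)
targets = torch.randint(1, C, (N, S))   # labels; class 0 is reserved for blank
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), S, dtype=torch.long)

ctc = torch.nn.CTCLoss(blank=0)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
print(loss.item())
```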


The Vast of Night Is Like a UFO Movie Directed by a Very Talented Alien

Slate

Orson Welles, who knew a thing or two about making movies, reportedly remarked after touring the RKO lot that it was "the biggest electric train set any boy ever had." And yet it is rare to see a feature film that communicates any of that delight, any of the sheer fun of playing around with the possibilities the medium offers. The Vast of Night, the debut feature from director Andrew Patterson and screenwriters James Montague and Craig W. Sanger, arriving on Amazon Prime on May 29, is one of the exceptions: Every scene has been staged and shot with intelligence, intent, inventiveness, and a sense of play. To watch it is to get excited about the billions of different ways you can combine sound and moving images to tell a story. That is not to say that you'll necessarily be astounded by the story The Vast of Night is telling.