Switchboard


DRES: Benchmarking LLMs for Disfluency Removal

Teleki, Maria, Janjur, Sai, Liu, Haoran, Grabner, Oliver, Verma, Ketan, Docog, Thomas, Dong, Xiangjue, Shi, Lingfeng, Wang, Cong, Birkelbach, Stephanie, Kim, Jason, Zhang, Yin, Caverlee, James

arXiv.org Artificial Intelligence

Disfluencies -- such as "um," "uh," interjections, parentheticals, and edited statements -- remain a persistent challenge for speech-driven systems, degrading accuracy in command interpretation, summarization, and conversational agents. We introduce DRES (Disfluency Removal Evaluation Suite), a controlled text-level benchmark that establishes a reproducible semantic upper bound for this task. DRES builds on human-annotated Switchboard transcripts, isolating disfluency removal from ASR errors and acoustic variability. We systematically evaluate proprietary and open-source LLMs across scales, prompting strategies, and architectures. Our results reveal that (i) simple segmentation consistently improves performance, even for long-context models; (ii) reasoning-oriented models tend to over-delete fluent tokens; and (iii) fine-tuning achieves near state-of-the-art precision and recall but harms generalization abilities. We further present a set of LLM-specific error modes and offer nine practical recommendations (R1-R9) for deploying disfluency removal in speech-driven pipelines. DRES provides a reproducible, model-agnostic foundation for advancing robust spoken-language systems.
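
Since DRES scores models on precision and recall over removed tokens, here is a minimal sketch of how such token-level scoring can work; the alignment heuristic, example sentences, and model output are hypothetical illustrations, not DRES's actual protocol.

```python
# Toy token-level precision/recall for disfluency removal (hypothetical data).
from difflib import SequenceMatcher

def removed_indices(disfluent, cleaned):
    """Indices of tokens in `disfluent` that do not survive into `cleaned`."""
    sm = SequenceMatcher(a=disfluent, b=cleaned, autojunk=False)
    kept = set()
    for block in sm.get_matching_blocks():
        kept.update(range(block.a, block.a + block.size))
    return {i for i in range(len(disfluent)) if i not in kept}

disfluent = "i um i want uh want to book a flight".split()
gold      = "i want to book a flight".split()        # human-annotated fluent side
model_out = "i want want to book a flight".split()   # hypothetical model output

gold_removed  = removed_indices(disfluent, gold)
model_removed = removed_indices(disfluent, model_out)

tp = len(gold_removed & model_removed)
precision = tp / len(model_removed) if model_removed else 1.0
recall    = tp / len(gold_removed) if gold_removed else 1.0
print(f"precision={precision:.2f} recall={recall:.2f}")
```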


Chain-of-Thought Training for Open E2E Spoken Dialogue Systems

Arora, Siddhant, Tian, Jinchuan, Futami, Hayato, Jung, Jee-weon, Shi, Jiatong, Kashiwagi, Yosuke, Tsunoo, Emiru, Watanabe, Shinji

arXiv.org Artificial Intelligence

Unlike traditional cascaded pipelines, end-to-end (E2E) spoken dialogue systems preserve full differentiability and capture non-phonemic information, making them well-suited for modeling spoken interactions. However, existing E2E approaches often require large-scale training data and generate responses that lack semantic coherence. We propose a simple yet effective strategy leveraging a chain-of-thought (CoT) formulation, ensuring that training on conversational data remains closely aligned with the multimodal language model (LM)'s pre-training on speech recognition (ASR), text-to-speech synthesis (TTS), and text LM tasks. Our method achieves an improvement of over 1.5 ROUGE-1 over the baseline, successfully training spoken dialogue systems on publicly available human-human conversation datasets while being compute-efficient enough to learn from just 300 hours of such data (e.g., Switchboard). We will publicly release our models and training code.
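
The paper reports its gain in ROUGE-1, which measures unigram overlap between a generated response and a reference. A back-of-the-envelope version on made-up strings is sketched below; real evaluations use a full ROUGE package with proper tokenization and stemming.

```python
# Minimal ROUGE-1 F1: clipped unigram overlap between candidate and reference.
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    cand, ref = Counter(candidate.lower().split()), Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if not cand or not ref or overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

print(rouge1_f1("yeah that sounds good to me", "that sounds good"))  # ~0.667
```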


LearnerVoice: A Dataset of Non-Native English Learners' Spontaneous Speech

Kim, Haechan, Myung, Junho, Kim, Seoyoung, Lee, Sungpah, Kang, Dongyeop, Kim, Juho

arXiv.org Artificial Intelligence

Prevalent ungrammatical expressions and disfluencies in spontaneous speech from second language (L2) learners pose unique challenges to Automatic Speech Recognition (ASR) systems. However, few datasets are tailored to L2 learner speech. We publicly release LearnerVoice, a dataset consisting of 50.04 hours of audio and transcriptions of L2 learners' spontaneous speech. Our linguistic analysis reveals that transcriptions in our dataset contain L2S (L2 learner's Spontaneous speech) features, consisting of ungrammatical expressions and disfluencies (e.g., filler words, word repetitions, self-repairs, false starts), significantly more than native speech datasets. Fine-tuning whisper-small.en with LearnerVoice achieves a WER of 10.26%, 44.2% lower than vanilla whisper-small.en. Furthermore, our qualitative analysis indicates that 54.2% of errors from the vanilla model on LearnerVoice are attributable to L2S features, with 48.1% of them being reduced in the fine-tuned model.
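
Word error rate (WER), the metric reported above, is word-level edit distance divided by reference length. A compact sketch on a hypothetical reference/hypothesis pair:

```python
# WER via standard dynamic-programming edit distance over words.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits to turn the first i ref words into the first j hyp words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[-1][-1] / len(ref)

# One insertion ("i") plus one deletion ("the") over 6 reference words: 2/6.
print(wer("i went to the market yesterday", "i i went to market yesterday"))
```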


Joint Learning of Context and Feedback Embeddings in Spoken Dialogue

Qian, Livia, Skantze, Gabriel

arXiv.org Artificial Intelligence

Short feedback responses, such as backchannels, play an important role in spoken dialogue. So far, most of the modeling of feedback responses has focused on their timing, often neglecting how their lexical and prosodic form influence their contextual appropriateness and conversational function. In this paper, we investigate the possibility of embedding short dialogue contexts and feedback responses in the same representation space using a contrastive learning objective. In our evaluation, we primarily focus on how such embeddings can be used as a context-feedback appropriateness metric and thus for feedback response ranking in U.S. English dialogues. Our results show that the model outperforms humans given the same ranking task and that the learned embeddings carry information about the conversational function of feedback responses.
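
A minimal sketch of the kind of contrastive objective described: contexts and feedback responses are embedded in a shared space, and matched pairs are trained to score higher than in-batch mismatches (an InfoNCE-style loss). The encoders, dimensions, and temperature below are illustrative assumptions, not the paper's setup.

```python
# InfoNCE-style contrastive loss over aligned (context, feedback) pairs.
import torch
import torch.nn.functional as F

def contrastive_loss(ctx_emb, fb_emb, temperature=0.07):
    """ctx_emb, fb_emb: (batch, dim) embeddings of aligned context/feedback pairs."""
    ctx = F.normalize(ctx_emb, dim=-1)
    fb = F.normalize(fb_emb, dim=-1)
    logits = ctx @ fb.t() / temperature  # (batch, batch) cosine-similarity matrix
    targets = torch.arange(ctx.size(0))  # diagonal entries are the true pairs
    # Symmetric loss: rank feedbacks given context and contexts given feedback.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

ctx_emb = torch.randn(8, 256)  # stand-ins for encoder outputs
fb_emb = torch.randn(8, 256)
print(contrastive_loss(ctx_emb, fb_emb).item())
```

At inference time, ranking candidate feedback responses for a context then reduces to sorting them by cosine similarity to the context embedding.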


Attribution and Alignment: Effects of Local Context Repetition on Utterance Production and Comprehension in Dialogue

Molnar, Aron, Jumelet, Jaap, Giulianelli, Mario, Sinclair, Arabella

arXiv.org Artificial Intelligence

Human production in dialogue is influenced by many factors within the recent conversational history, leading speakers to repeat recently used lexical and structural elements of their own and their partners' language. These factors can involve conceptual pacts speakers make in order to establish common ground (Brennan and Clark, 1996) and priming of lexical or syntactic cues which influences their subsequent re-use (Bock, 1986). While excessive levels of repetition, designed to mimic alignment, can hinder naturalness (Isard et al., 2006; Foster et al., 2009), humans generally prefer generated dialogue that contains higher levels of alignment (Lopes et al., 2015; Hu et al., 2016), which also lead to more successful communication in human-human dialogue (Xi et al., 2021; Isard et al., 2006). Moreover, elements of alignment have been successfully incorporated in chat bots (Hoegen et al., 2019; Gao et al., 2019).
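
As a toy illustration of the sort of quantity at stake, one crude proxy for local lexical repetition is the share of an utterance's words that already occurred in the recent history. The window size and metric here are illustrative choices only, not the paper's measures.

```python
# Fraction of utterance words already present in the last `window` turns.
def lexical_repetition(history, utterance, window=3):
    context = {w for turn in history[-window:] for w in turn.lower().split()}
    words = utterance.lower().split()
    return sum(w in context for w in words) / len(words) if words else 0.0

history = ["shall we take the red sofa", "the red one is nicer"]
print(lexical_repetition(history, "yes the red sofa works"))  # 0.6
```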


Artificial Disfluency Detection, Uh No, Disfluency Generation for the Masses

Passali, T., Mavropoulos, T., Tsoumakas, G., Meditskos, G., Vrochidis, S.

arXiv.org Artificial Intelligence

Existing approaches for disfluency detection typically require large annotated datasets. However, current datasets for this task are limited, suffer from class imbalance, and lack some types of disfluencies encountered in real-world scenarios. This work proposes LARD, a method for automatically generating artificial disfluencies from fluent text. LARD can simulate all the different types of disfluencies (repetitions, replacements, and restarts) based on the reparandum/interregnum annotation scheme. In addition, it incorporates contextual embeddings into the disfluency generation to produce realistic, context-aware artificial disfluencies. Since the proposed method requires only fluent text, it can be used directly for training, bypassing the need for annotated disfluent data. Our empirical evaluation demonstrates that LARD can be used effectively when little or no annotated data is available. Furthermore, our detailed analysis suggests that the proposed method generates realistic disfluencies and increases the accuracy of existing disfluency detectors.
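
As a toy re-creation of the simplest of these operations, the sketch below injects a repetition disfluency into fluent text. LARD itself also generates replacements and restarts and conditions on contextual embeddings, none of which is reproduced here; the inputs are hypothetical.

```python
# Inject a repetition disfluency: repeat a short span (the reparandum) verbatim.
import random

def add_repetition(sentence, max_span=2, seed=None):
    rng = random.Random(seed)
    words = sentence.split()
    span = rng.randint(1, min(max_span, len(words)))     # reparandum length
    start = rng.randrange(len(words) - span + 1)         # where the repeat begins
    repeated = words[start:start + span]
    return " ".join(words[:start + span] + repeated + words[start + span:])

print(add_repetition("i would like to book a table for two", seed=0))
```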


Why Computers Will Never Write Good Novels - Issue 95: Escape

Nautilus

The hoax seems harmless enough. A few thousand AI researchers have claimed that computers can read and write literature. They've alleged that algorithms can unearth the secret formulas of fiction and film. That Bayesian software can map the plots of memoirs and comic books. That digital brains can pen primitive lyrics and short stories--wooden and weird, to be sure, yet evidence that computers are capable of more. But the hoax is not harmless. If it were possible to build a digital novelist or poetry analyst, then computers would be far more powerful than they are now. They would in fact be the most powerful beings in the history of Earth. Their power would be the power of literature, which although it seems now, in today's glittering silicon age, to be a rather unimpressive old thing, springs from the same neural root that enables human brains to create, to imagine, to dream up tomorrows.


Speech Is More Than Spoken Text

#artificialintelligence

Beyond these virtual assistants, voice technology and conversational AI have increased in popularity over the last decade and are used in many applications. One use of Natural Language Processing (NLP) technology is to analyse and gain insight from the written transcripts of audio-- whether from voice assistants or from other scenarios like meetings, interviews, call centres, lectures or TV shows. Yet when we speak, things are more complicated than a simple text transcription suggests. This post talks about some of the differences between written and spoken language, especially in the context of conversation. To understand conversation, we need data.


CAT: A CTC-CRF based ASR Toolkit Bridging the Hybrid and the End-to-end Approaches towards Data Efficiency and Low Latency

An, Keyu, Xiang, Hongyu, Ou, Zhijian

arXiv.org Machine Learning

In this paper, we present a new open source toolkit for speech recognition, named CAT (CTC-CRF based ASR Toolkit). CAT inherits the data-efficiency of the hybrid approach and the simplicity of the E2E approach, providing a full-fledged implementation of CTC-CRFs and complete training and testing scripts for a number of English and Chinese benchmarks. Experiments show CAT obtains state-of-the-art results, which are comparable to the fine-tuned hybrid models in Kaldi but with a much simpler training pipeline. Compared to existing non-modularized E2E models, CAT performs better on limited-scale datasets, demonstrating its data efficiency. Furthermore, we propose a new method called contextualized soft forgetting, which enables CAT to do streaming ASR without accuracy degradation. We hope CAT, especially the CTC-CRF based framework and software, will be of broad interest to the community, and can be further explored and improved.
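
CAT's objective is a CRF built on CTC topology. As a rough point of reference only, plain CTC with PyTorch's built-in loss looks like the sketch below (random tensors; CAT's CTC-CRF adds a denominator graph that is not reproduced here).

```python
# Plain CTC loss with torch.nn.CTCLoss on random inputs.
import torch

T, N, C, S = 50, 4, 30, 10              # time steps, batch, classes (0 = blank), target length
log_probs = torch.randn(T, N, C).log_softmax(dim=-1)
targets = torch.randint(1, C, (N, S))   # labels; class 0 is reserved for blank
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), S, dtype=torch.long)

ctc = torch.nn.CTCLoss(blank=0)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
print(loss.item())
```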


The Vast of Night Is Like a UFO Movie Directed by a Very Talented Alien

Slate

Orson Welles, who knew a thing or two about making movies, reportedly remarked after touring the RKO lot that it was "the biggest electric train set any boy ever had." And yet it is rare to see a feature film that communicates any of that delight, any of the sheer fun of playing around with the possibilities the medium offers. The Vast of Night, the debut feature from director Andrew Patterson and screenwriters James Montague and Craig W. Sanger, arriving on Amazon Prime on May 29, is one of the exceptions: Every scene has been staged and shot with intelligence, intent, inventiveness, and a sense of play. To watch it is to get excited about the billions of different ways you can combine sound and moving images to tell a story. That is not to say that you'll necessarily be astounded by the story The Vast of Night is telling.