talker
CoVoMix: Advancing Zero-Shot Speech Generation for Human-like Multi-talker Conversations
Recent advancements in zero-shot text-to-speech (TTS) modeling have led to significant strides in generating high-fidelity and diverse speech. However, dialogue generation, along with achieving human-like naturalness in speech, continues to be a challenge. In this paper, we introduce CoVoMix: Conversational Voice Mixture Generation, a novel model for zero-shot, human-like, multi-speaker, multi-round dialogue speech generation. CoVoMix first converts dialogue text into multiple streams of discrete tokens, each stream representing the semantic information of an individual talker. These token streams are then fed into a flow-matching-based acoustic model to generate mixed mel-spectrograms. Finally, the speech waveforms are produced using a HiFi-GAN model.
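The abstract describes a three-stage cascade (text to semantic tokens, flow-matching acoustic model, HiFi-GAN vocoder). Below is a minimal structural sketch of that data flow; every class here is a hypothetical placeholder standing in for a trained model, not the authors' released code.

```python
# Structural sketch of the CoVoMix cascade described in the abstract.
# All classes are hypothetical stand-ins; shapes/values are illustrative.
import numpy as np

class TextToSemanticTokens:
    """Stage 1: dialogue text -> one discrete token stream per talker."""
    def __call__(self, dialogue: str, n_talkers: int) -> list[np.ndarray]:
        # Placeholder: the real model predicts semantic tokens from text.
        return [np.zeros(100, dtype=np.int64) for _ in range(n_talkers)]

class FlowMatchingAcousticModel:
    """Stage 2: multi-stream tokens -> a single mixed mel-spectrogram."""
    def __call__(self, token_streams: list[np.ndarray]) -> np.ndarray:
        return np.zeros((80, 400), dtype=np.float32)  # (n_mels, frames)

class HiFiGANVocoder:
    """Stage 3: mel-spectrogram -> waveform."""
    def __call__(self, mel: np.ndarray) -> np.ndarray:
        return np.zeros(mel.shape[1] * 256, dtype=np.float32)  # hop 256 assumed

def covomix_generate(dialogue: str, n_talkers: int = 2) -> np.ndarray:
    tokens = TextToSemanticTokens()(dialogue, n_talkers)
    mel = FlowMatchingAcousticModel()(tokens)
    return HiFiGANVocoder()(mel)
```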
Speech Separation for Hearing-Impaired Children in the Classroom
Olalere, Feyisayo, van der Heijden, Kiki, Stronks, H. Christiaan, Briaire, Jeroen, Frijns, Johan H. M., Güçlütürk, Yagmur
[Figure 1: The pipeline includes simulating room and listener acoustic properties (A), modeling talkers' movement trajectories (B), and synthesizing classroom speech mixtures (C). The numbers (1)-(5) correspond to the steps itemized in Section II-B.]

[...] more challenging and reflective of classroom acoustics. The separation model is trained to output time-domain waveforms for each speaker with no interference from the other speaker or background noise. This setup enables the model not only to separate overlapping speech but also to preserve the spatial distinctions associated with each moving source.

B. Simulation of Overlapping Speech for Classroom Conditions

To capture the reverberant and spatial characteristics typical of classroom environments, we developed a spatialization pipeline for generating training and evaluation data (see Fig. 1). This pipeline consists of five main components, explained in detail below:
1) Simulation of room impulse responses (RIRs)
2) Application of head-related impulse responses (HRIRs)
3) Generation of binaural room impulse responses (BRIRs)
4) Modeling of talkers' movement trajectories
5) Synthesis of the classroom speech data

1) Room Impulse Responses: To simulate naturalistic reverberant classroom acoustics, we generated RIRs that capture direct sound, early reflections, and reverberation. These RIRs were used to spatialize source signals in simulated classroom environments with varying geometry, reverberation, and source-listener distances. We used the Pyroomacoustics Python package [35], which implements the image source method to model sound propagation in rectangular (shoebox) rooms. A total of 30 classrooms were simulated, with dimensions randomly sampled from 8.5 × 8.5 × 3 m to 10 × 10 × 3.5 m (length × width × height), reflecting typical U.S. classroom sizes [36], [37].
- Europe > Netherlands > South Holland > Leiden (0.04)
- Europe > Netherlands > South Holland > Delft (0.04)
- North America > United States (0.04)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (0.68)
- Education > Educational Setting (1.00)
- Health & Medicine > Therapeutic Area > Otolaryngology (0.47)
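Step (1) of the pipeline above maps directly onto the Pyroomacoustics API. A minimal sketch of generating one random shoebox-classroom RIR within the quoted dimension range; the RT60 target, wall margins, and sampling rate are assumptions, not the paper's exact settings.

```python
# Generate one random shoebox-classroom RIR with Pyroomacoustics.
import numpy as np
import pyroomacoustics as pra

rng = np.random.default_rng(0)

# Sample room dimensions within the quoted range (length, width, height in m).
dims = rng.uniform([8.5, 8.5, 3.0], [10.0, 10.0, 3.5])

# Derive wall absorption and image-source order from a target RT60.
rt60 = 0.6  # seconds; a plausible classroom value, assumed, not from the paper
e_absorption, max_order = pra.inverse_sabine(rt60, dims)

room = pra.ShoeBox(dims, fs=16000,
                   materials=pra.Material(e_absorption),
                   max_order=max_order)

# Random talker and listener positions (0.5 m wall margin assumed).
src = rng.uniform([0.5, 0.5, 1.0], dims - [0.5, 0.5, 1.5])
mic = rng.uniform([0.5, 0.5, 1.0], dims - [0.5, 0.5, 1.5])
room.add_source(src)
room.add_microphone(mic)

room.compute_rir()
rir = room.rir[0][0]  # impulse response: mic 0, source 0
```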
Mixture-of-Experts Framework for Field-of-View Enhanced Signal-Dependent Binauralization of Moving Talkers
Mittal, Manan, Deppisch, Thomas, Forrer, Joseph, Le Sueur, Chris, Ben-Hur, Zamir, Alon, David Lou, Wong, Daniel D. E.
We propose a novel mixture-of-experts framework for field-of-view enhancement in binaural signal matching. Our approach enables dynamic spatial audio rendering that adapts to continuous talker motion, allowing users to emphasize or suppress sounds from selected directions while preserving natural binaural cues. Unlike traditional methods that rely on explicit direction-of-arrival estimation or operate in the Ambisonics domain, our signal-dependent framework combines multiple binaural filters in an online manner using implicit localization. This allows for real-time tracking and enhancement of moving sound sources, supporting applications such as speech focus, noise reduction, and world-locked audio in augmented and virtual reality. The method is agnostic to array geometry, offering a flexible solution for spatial audio capture and personalized playback in next-generation consumer audio devices.
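As a rough illustration of signal-dependent expert mixing (not the authors' method), the sketch below combines the outputs of several direction-aimed binaural filters with per-block softmax weights derived from expert output energy, a crude stand-in for implicit localization, biased toward a chosen field of view.

```python
# Combine binaural "expert" filter outputs with signal-dependent weights.
import numpy as np

def mix_experts(expert_outputs: np.ndarray,  # (n_experts, 2, n_samples)
                fov_gain: np.ndarray,        # (n_experts,) per-direction emphasis
                block: int = 1024,
                temperature: float = 1.0) -> np.ndarray:
    n_exp, _, n = expert_outputs.shape
    out = np.zeros((2, n))
    for start in range(0, n, block):
        seg = expert_outputs[:, :, start:start + block]
        energy = (seg ** 2).mean(axis=(1, 2))      # implicit localization cue
        logits = np.log(energy + 1e-12) / temperature + np.log(fov_gain + 1e-12)
        w = np.exp(logits - logits.max())
        w /= w.sum()                               # softmax over experts
        out[:, start:start + block] = np.einsum('e,ecn->cn', w, seg)
    return out
```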
Serialized Output Prompting for Large Language Model-based Multi-Talker Speech Recognition
Shi, Hao, Fujita, Yusuke, Mizumoto, Tomoya, Liu, Lianbo, Kojima, Atsushi, Sudo, Yui
Prompts are crucial for task definition and for improving the performance of large language model (LLM)-based systems. However, existing LLM-based multi-talker (MT) automatic speech recognition (ASR) systems either omit prompts or rely on simple task-definition prompts, and no prior work has explored prompt design for enhancing performance. In this paper, we propose extracting serialized output prompts (SOP) and explicitly guiding the LLM with structured prompts to improve system performance (SOP-MT-ASR). Separator and serialized Connectionist Temporal Classification (CTC) layers are inserted after the speech encoder to separate and extract MT content from the mixed speech encoding in a first-speaking-first-out manner. Subsequently, the SOP, which serves as a prompt for the LLM, is obtained by decoding the serialized CTC outputs using greedy search. To train the model effectively, we design a three-stage training strategy, consisting of serialized output training (SOT) fine-tuning, serialized speech information extraction, and SOP-based adaptation. Experimental results on the LibriMix dataset show that, although the LLM-based SOT model performs well in the two-talker scenario, it fails to fully leverage LLMs under more complex conditions, such as the three-talker scenario. The proposed SOP approach significantly improves performance under both two- and three-talker conditions.
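The greedy-search step that turns serialized CTC outputs into SOP text follows the standard CTC rule: take the best label per frame, collapse repeats, drop blanks. A minimal sketch, with the blank id and token inventory assumed.

```python
# Standard greedy CTC decoding (collapse repeats, then remove blanks).
import numpy as np

def ctc_greedy_decode(log_probs: np.ndarray, blank: int = 0) -> list[int]:
    """log_probs: (frames, vocab) frame-level CTC posteriors."""
    best = log_probs.argmax(axis=-1)     # best label per frame
    tokens, prev = [], blank
    for t in best:
        if t != prev and t != blank:     # collapse repeats, drop blanks
            tokens.append(int(t))
        prev = t
    return tokens

# In SOP-MT-ASR, one such decode per serialized CTC stream would yield the
# first-speaking-first-out transcripts that are assembled into the prompt.
```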
AAD-LLM: Neural Attention-Driven Auditory Scene Understanding
Jiang, Xilin, Dindar, Sukru Samet, Choudhari, Vishal, Bickel, Stephan, Mehta, Ashesh, McKhann, Guy M, Friedman, Daniel, Flinker, Adeen, Mesgarani, Nima
However, human auditory perception is inherently selective: listeners focus on specific speakers while ignoring others in complex auditory scenes. Existing models do not incorporate this selectivity, limiting their ability to generate perception-aligned responses. To address this, we introduce Intention-Informed Auditory Scene Understanding (II-ASU) and present Auditory Attention-Driven LLM (AAD-LLM), a prototype system that integrates brain signals to infer listener attention. AAD-LLM extends an auditory LLM by incorporating intracranial electroencephalography (iEEG) recordings to decode which speaker a listener is attending to and to refine responses accordingly. The model first predicts the attended speaker from neural activity, then conditions response generation on this inferred attentional state. We evaluate AAD-LLM on speaker description, speech transcription and extraction, and question answering in multitalker scenarios, with both objective and subjective ratings showing improved alignment with listener intention. By taking a first step toward intention-aware auditory AI, this [...]

[Figure 1: AAD-LLM is a brain-computer interface (BCI) for auditory scene understanding. It decodes neural signals to identify the attended speaker and integrates this information into a language model, generating responses that align with the listener's perceptual focus.]
- North America > United States > New York (0.04)
- North America > United States > Florida > Miami-Dade County > Miami (0.04)
- North America > United States > Virginia (0.04)
- North America > United States > Michigan (0.04)
- Research Report > Experimental Study (1.00)
- Research Report > New Finding (0.93)
- Health & Medicine > Therapeutic Area > Neurology (1.00)
- Health & Medicine > Health Care Technology (0.87)
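A hedged sketch of the two-step idea described above: decode the attended speaker from neural features, then condition the language model's prompt on that prediction. The classifier choice and prompt format are illustrative assumptions, not the authors' implementation.

```python
# Step 1: predict attended speaker from iEEG features.
# Step 2: condition the LLM prompt on the prediction.
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_attention_decoder(X: np.ndarray, y: np.ndarray) -> LogisticRegression:
    """X: (trials, iEEG features); y: attended-speaker labels (0 or 1)."""
    return LogisticRegression(max_iter=1000).fit(X, y)

def build_prompt(decoder: LogisticRegression, x: np.ndarray,
                 transcripts: list[str], question: str) -> str:
    attended = int(decoder.predict(x[None, :])[0])  # inferred attentional state
    return (f"The listener is attending to speaker {attended}.\n"
            f"Attended speech: {transcripts[attended]}\n"
            f"Question: {question}")
```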
Leveraging Spatial Cues from Cochlear Implant Microphones to Efficiently Enhance Speech Separation in Real-World Listening Scenes
Olalere, Feyisayo, van der Heijden, Kiki, Stronks, Christiaan H., Briaire, Jeroen, Frijns, Johan H. M., van Gerven, Marcel
Speech separation approaches for single-channel, dry speech mixtures have significantly improved. However, real-world spatial and reverberant acoustic environments remain challenging, limiting the effectiveness of these approaches for assistive hearing devices like cochlear implants (CIs). To address this, we quantify the impact of real-world acoustic scenes on speech separation and explore how spatial cues can enhance separation quality efficiently. We analyze performance based on implicit spatial cues (inherent in the acoustic input and learned by the model) and explicit spatial cues (manually calculated spatial features added as auxiliary inputs). Our findings show that spatial cues (both implicit and explicit) improve separation for mixtures with spatially separated and nearby talkers. Furthermore, spatial cues enhance separation when spectral cues are ambiguous, such as when voices are similar. Explicit spatial cues are particularly beneficial when implicit spatial cues are weak. For instance, single CI microphone recordings provide weaker implicit spatial cues than bilateral CIs, but even single CIs benefit from explicit cues. These results emphasize the importance of training models on real-world data to improve generalizability in everyday listening scenarios. Additionally, our statistical analyses offer insights into how data properties influence model performance, supporting the development of efficient speech separation approaches for CIs and other assistive devices in real-world settings.
- Europe > Netherlands > South Holland > Leiden (0.04)
- North America > United States (0.04)
- Europe > Netherlands > South Holland > Delft (0.04)
- Health & Medicine > Consumer Health (0.72)
- Health & Medicine > Therapeutic Area > Otolaryngology (0.35)
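Explicit spatial cues of the kind described are commonly computed as interaural level and phase differences between the two channels. A minimal sketch follows; the paper's exact feature set may differ.

```python
# Interaural level difference (ILD) and interaural phase difference (IPD)
# from two microphone channels, as auxiliary inputs for a separation model.
import numpy as np
from scipy.signal import stft

def spatial_cues(left: np.ndarray, right: np.ndarray, fs: int = 16000):
    _, _, L = stft(left, fs=fs, nperseg=512)   # complex STFT: (freq, frames)
    _, _, R = stft(right, fs=fs, nperseg=512)
    eps = 1e-8
    ild = 20 * np.log10((np.abs(L) + eps) / (np.abs(R) + eps))  # level ratio, dB
    ipd = np.angle(L * np.conj(R))                              # phase difference
    return ild, ipd  # stacked with spectral features as model inputs
```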
Agents Thinking Fast and Slow: A Talker-Reasoner Architecture
Christakopoulou, Konstantina, Mourad, Shibl, Matarić, Maja
Large language models have enabled agents of all kinds to interact with users through natural conversation. Consequently, agents now have two jobs: conversing and planning/reasoning. Their conversational responses must be informed by all available information, and their actions must help to achieve goals. This dichotomy between conversing with the user and doing multi-step reasoning and planning can be seen as analogous to the human systems of "thinking fast and slow" introduced by Kahneman [14]. Our approach comprises a "Talker" agent (System 1) that is fast and intuitive and tasked with synthesizing the conversational response, and a "Reasoner" agent (System 2) that is slower, more deliberative, and more logical, tasked with multi-step reasoning and planning, calling tools, performing actions in the world, and thereby producing the new agent state. We describe the new Talker-Reasoner architecture and discuss its advantages, including modularity and decreased latency. We ground the discussion in the context of a sleep coaching agent in order to demonstrate real-world relevance.
- Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Cognitive Science (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
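A minimal sketch of the Talker-Reasoner loop described above: the Talker replies immediately from the current belief state while the Reasoner updates that state for the next turn. The LLM callable, state schema, and coaching strings are illustrative assumptions.

```python
# Talker (System 1) answers fast; Reasoner (System 2) deliberates and
# updates the shared agent state. `llm` is any prompt -> text callable.
import json

def talker(llm, user_msg: str, state: dict) -> str:
    """System 1: fast, produces the conversational reply from the latest state."""
    return llm(f"State: {json.dumps(state)}\nUser: {user_msg}\nReply briefly:")

def reasoner(llm, user_msg: str, state: dict) -> dict:
    """System 2: slow, multi-step; updates beliefs/plans (and may call tools)."""
    plan = llm(f"State: {json.dumps(state)}\nUser: {user_msg}\n"
               "Update the coaching plan as JSON:")
    state["plan"] = plan
    return state

def step(llm, user_msg: str, state: dict) -> tuple[str, dict]:
    reply = talker(llm, user_msg, state)    # answer immediately (low latency)
    state = reasoner(llm, user_msg, state)  # refresh state for the next turn
    return reply, state
```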
Audio-Driven Reinforcement Learning for Head-Orientation in Naturalistic Environments
Ledder, Wessel, Qin, Yuzhen, van der Heijden, Kiki
Although deep reinforcement learning (DRL) approaches to audio signal processing have seen substantial progress in recent years, audio-driven DRL for tasks such as navigation, gaze control, and head-orientation control in the context of human-robot interaction has received little attention. Here, we propose an audio-driven DRL framework in which we utilise deep Q-learning to develop an autonomous agent that orients towards a talker in the acoustic environment based on stereo speech recordings. Our results show that the agent learned to perform the task at a near-perfect level when trained on speech segments in anechoic environments (that is, without reverberation). The presence of reverberation in naturalistic acoustic environments affected the agent's performance, although the agent still substantially outperformed a baseline, randomly acting agent. Finally, we quantified the degree of generalization of the proposed DRL approach across naturalistic acoustic environments. Our experiments revealed that policies learned by agents trained on medium or high reverb environments generalized to low reverb environments, but policies learned by agents trained on anechoic or low reverb environments did not generalize to medium or high reverb environments. Taken together, this study demonstrates the potential of audio-driven DRL for tasks such as head-orientation control and highlights the need for training strategies that enable robust generalization across environments for real-world audio-driven DRL applications.
- Europe > Netherlands > Gelderland > Nijmegen (0.05)
- North America > United States > New York > New York County > New York City (0.04)
- Asia > Middle East > Jordan (0.04)
- Asia > Japan > Honshū > Kantō > Ibaraki Prefecture > Tsukuba (0.04)
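The core of deep Q-learning for such an agent is a Bellman backup over discrete orientation actions. A hedged sketch with an assumed three-action space (rotate left, stay, rotate right) and an assumed 128-dimensional stereo-audio feature vector; the paper's network and reward design may differ.

```python
# One DQN update step for a head-orientation agent.
import torch
import torch.nn as nn

ACTIONS = 3  # rotate left, stay, rotate right (assumed action space)
q_net = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, ACTIONS))
target_net = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, ACTIONS))
target_net.load_state_dict(q_net.state_dict())
opt = torch.optim.Adam(q_net.parameters(), lr=1e-4)

def dqn_update(s, a, r, s_next, done, gamma=0.99):
    """One Bellman backup. s, s_next: (batch, 128) audio features;
    a: (batch,) long; r, done: (batch,) float."""
    q = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)  # Q(s, a)
    with torch.no_grad():                              # fixed target network
        target = r + gamma * target_net(s_next).max(dim=1).values * (1 - done)
    loss = nn.functional.smooth_l1_loss(q, target)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```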
Large Language Model Can Transcribe Speech in Multi-Talker Scenarios with Versatile Instructions
Meng, Lingwei, Hu, Shujie, Kang, Jiawen, Li, Zhaoqing, Wang, Yuejiao, Wu, Wenxuan, Wu, Xixin, Liu, Xunying, Meng, Helen
Recent advancements in large language models (LLMs) have revolutionized various domains, bringing significant progress and new opportunities. Despite progress in speech-related tasks, LLMs have not been sufficiently explored in multi-talker scenarios. In this work, we present a pioneering effort to investigate the capability of LLMs in transcribing speech in multi-talker environments, following versatile instructions related to multi-talker automatic speech recognition (ASR), target-talker ASR, and ASR based on specific talker attributes such as sex, occurrence order, language, and keyword spoken. Our approach utilizes WavLM and Whisper encoders to extract multi-faceted speech representations that are sensitive to speaker characteristics and semantic context. These representations are then fed into an LLM fine-tuned using LoRA, enabling speech comprehension and transcription capabilities. Comprehensive experiments reveal the promising performance of our proposed system, MT-LLM, in cocktail party scenarios, highlighting the potential of LLMs to handle speech-related tasks based on user instructions in such complex settings.
- Asia > China > Hong Kong (0.04)
- Europe > Romania > Sud - Muntenia Development Region > Giurgiu County > Giurgiu (0.04)
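The LoRA adaptation step can be sketched with Hugging Face PEFT. The base model name, rank, and target modules below are assumptions, and the WavLM/Whisper encoders and their projection into the LLM are omitted.

```python
# Attach LoRA adapters to a decoder LLM; only the adapters are trained.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")  # assumed
lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"])  # assumed hyperparameters
model = get_peft_model(base, lora)
model.print_trainable_parameters()  # adapters only; base weights stay frozen
```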
Survey on biomarkers in human vocalizations
Härmä, Aki, Brinker, Bert den, Grossekathofer, Ulf, Ouweltjes, Okke, Nallanthighal, Srikanth, Abrol, Sidharth, Sharma, Vibhu
Recent years have witnessed an increase in technologies that use speech to sense the health of the talker. This survey paper proposes a general taxonomy of the technologies and a broad overview of current progress and challenges. Vocal biomarkers are often secondary measures that approximate the signal of another sensor or identify an underlying mental, cognitive, or physiological state. Their measurement involves disturbances and uncertainties that may be considered noise sources, and the biomarkers are coarsely qualified in terms of the various sources of noise involved in their determination. While the error levels of some proposed biomarkers seem high, others are expected to have low errors and are thus more likely to qualify as candidates for adoption in healthcare applications.
- North America > United States > New York > New York County > New York City (0.04)
- North America > Canada > Quebec > Montreal (0.04)
- North America > United States > Wisconsin (0.04)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
- Overview (1.00)