talker
CoVoMix: Advancing Zero-Shot Speech Generation for Human-like Multi-talker Conversations
Recent advancements in zero-shot text-to-speech (TTS) modeling have led to significant strides in generating high-fidelity and diverse speech. However, dialogue generation, along with achieving human-like naturalness in speech, continues to be a challenge. In this paper, we introduce CoVoMix: Conversational Voice Mixture Generation, a novel model for zero-shot, human-like, multi-speaker, multi-round dialogue speech generation. CoVoMix first converts dialogue text into multiple streams of discrete tokens, each stream representing the semantic information of an individual talker. These token streams are then fed into a flow-matching-based acoustic model to generate mixed mel-spectrograms. Finally, the speech waveforms are produced using a HiFi-GAN model.
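The abstract describes a three-stage cascade (text to semantic tokens, flow-matching acoustic model, HiFi-GAN vocoder). Below is a minimal structural sketch of that data flow; every class here is a hypothetical placeholder standing in for a trained model, not the authors' released code.

```python
# Structural sketch of the CoVoMix cascade described in the abstract.
# All classes are hypothetical stand-ins; shapes/values are illustrative.
import numpy as np

class TextToSemanticTokens:
    """Stage 1: dialogue text -> one discrete token stream per talker."""
    def __call__(self, dialogue: str, n_talkers: int) -> list[np.ndarray]:
        # Placeholder: the real model predicts semantic tokens from text.
        return [np.zeros(100, dtype=np.int64) for _ in range(n_talkers)]

class FlowMatchingAcousticModel:
    """Stage 2: multi-stream tokens -> a single mixed mel-spectrogram."""
    def __call__(self, token_streams: list[np.ndarray]) -> np.ndarray:
        return np.zeros((80, 400), dtype=np.float32)  # (n_mels, frames)

class HiFiGANVocoder:
    """Stage 3: mel-spectrogram -> waveform."""
    def __call__(self, mel: np.ndarray) -> np.ndarray:
        return np.zeros(mel.shape[1] * 256, dtype=np.float32)  # hop 256 assumed

def covomix_generate(dialogue: str, n_talkers: int = 2) -> np.ndarray:
    tokens = TextToSemanticTokens()(dialogue, n_talkers)
    mel = FlowMatchingAcousticModel()(tokens)
    return HiFiGANVocoder()(mel)
```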
Speech Separation for Hearing-Impaired Children in the Classroom
Olalere, Feyisayo, van der Heijden, Kiki, Stronks, H. Christiaan, Briaire, Jeroen, Frijns, Johan H. M., Güçlütürk, Yagmur
[Figure 1: The pipeline includes simulating room and listener acoustic properties (A), modeling talkers' movement trajectories (B), and synthesizing classroom speech mixtures (C). The numbers (1)-(5) correspond to the steps itemized in Section II-B.]

[...] more challenging and reflective of classroom acoustics. The separation model is trained to output time-domain waveforms for each speaker with no interference from the other speaker or background noise. This setup enables the model not only to separate overlapping speech but also to preserve the spatial distinctions associated with each moving source.

B. Simulation of Overlapping Speech for Classroom Conditions

To capture the reverberant and spatial characteristics typical of classroom environments, we developed a spatialization pipeline for generating training and evaluation data (see Fig. 1). This pipeline consists of five main components, explained in detail below:
1) Simulation of room impulse responses (RIRs)
2) Application of head-related impulse responses (HRIRs)
3) Generation of binaural room impulse responses (BRIRs)
4) Modeling of talkers' movement trajectories
5) Synthesis of the classroom speech data

1) Room Impulse Responses: To simulate naturalistic reverberant classroom acoustics, we generated RIRs that capture direct sound, early reflections, and reverberation. These RIRs were used to spatialize source signals in simulated classroom environments with varying geometry, reverberation, and source-listener distances. We used the Pyroomacoustics Python package [35], which implements the image source method to model sound propagation in rectangular (shoebox) rooms. A total of 30 classrooms were simulated, with dimensions randomly sampled from 8.5 × 8.5 × 3 m to 10 × 10 × 3.5 m (length × width × height), reflecting typical U.S. classroom sizes [36], [37].
- Europe > Netherlands > South Holland > Leiden (0.04)
- Europe > Netherlands > South Holland > Delft (0.04)
- North America > United States (0.04)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (0.68)
- Education > Educational Setting (1.00)
- Health & Medicine > Therapeutic Area > Otolaryngology (0.47)
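Step (1) of the pipeline above maps directly onto the Pyroomacoustics API. A minimal sketch of generating one random shoebox-classroom RIR within the quoted dimension range; the RT60 target, wall margins, and sampling rate are assumptions, not the paper's exact settings.

```python
# Generate one random shoebox-classroom RIR with Pyroomacoustics.
import numpy as np
import pyroomacoustics as pra

rng = np.random.default_rng(0)

# Sample room dimensions within the quoted range (length, width, height in m).
dims = rng.uniform([8.5, 8.5, 3.0], [10.0, 10.0, 3.5])

# Derive wall absorption and image-source order from a target RT60.
rt60 = 0.6  # seconds; a plausible classroom value, assumed, not from the paper
e_absorption, max_order = pra.inverse_sabine(rt60, dims)

room = pra.ShoeBox(dims, fs=16000,
                   materials=pra.Material(e_absorption),
                   max_order=max_order)

# Random talker and listener positions (0.5 m wall margin assumed).
src = rng.uniform([0.5, 0.5, 1.0], dims - [0.5, 0.5, 1.5])
mic = rng.uniform([0.5, 0.5, 1.0], dims - [0.5, 0.5, 1.5])
room.add_source(src)
room.add_microphone(mic)

room.compute_rir()
rir = room.rir[0][0]  # impulse response: mic 0, source 0
```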
Mixture-of-Experts Framework for Field-of-View Enhanced Signal-Dependent Binauralization of Moving Talkers
Mittal, Manan, Deppisch, Thomas, Forrer, Joseph, Le Sueur, Chris, Ben-Hur, Zamir, Alon, David Lou, Wong, Daniel D. E.
We propose a novel mixture-of-experts framework for field-of-view enhancement in binaural signal matching. Our approach enables dynamic spatial audio rendering that adapts to continuous talker motion, allowing users to emphasize or suppress sounds from selected directions while preserving natural binaural cues. Unlike traditional methods that rely on explicit direction-of-arrival estimation or operate in the Ambisonics domain, our signal-dependent framework combines multiple binaural filters in an online manner using implicit localization. This allows for real-time tracking and enhancement of moving sound sources, supporting applications such as speech focus, noise reduction, and world-locked audio in augmented and virtual reality. The method is agnostic to array geometry, offering a flexible solution for spatial audio capture and personalized playback in next-generation consumer audio devices.
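As a rough illustration of signal-dependent expert mixing (not the authors' method), the sketch below combines the outputs of several direction-aimed binaural filters with per-block softmax weights derived from expert output energy, a crude stand-in for implicit localization, biased toward a chosen field of view.

```python
# Combine binaural "expert" filter outputs with signal-dependent weights.
import numpy as np

def mix_experts(expert_outputs: np.ndarray,  # (n_experts, 2, n_samples)
                fov_gain: np.ndarray,        # (n_experts,) per-direction emphasis
                block: int = 1024,
                temperature: float = 1.0) -> np.ndarray:
    n_exp, _, n = expert_outputs.shape
    out = np.zeros((2, n))
    for start in range(0, n, block):
        seg = expert_outputs[:, :, start:start + block]
        energy = (seg ** 2).mean(axis=(1, 2))      # implicit localization cue
        logits = np.log(energy + 1e-12) / temperature + np.log(fov_gain + 1e-12)
        w = np.exp(logits - logits.max())
        w /= w.sum()                               # softmax over experts
        out[:, start:start + block] = np.einsum('e,ecn->cn', w, seg)
    return out
```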
Serialized Output Prompting for Large Language Model-based Multi-Talker Speech Recognition
Shi, Hao, Fujita, Yusuke, Mizumoto, Tomoya, Liu, Lianbo, Kojima, Atsushi, Sudo, Yui
Prompts are crucial for task definition and for improving the performance of large language model (LLM)-based systems. However, existing LLM-based multi-talker (MT) automatic speech recognition (ASR) systems either omit prompts or rely on simple task-definition prompts, and no prior work has explored prompt design for enhancing performance. In this paper, we propose extracting serialized output prompts (SOP) and explicitly guiding the LLM with structured prompts to improve system performance (SOP-MT-ASR). Separator and serialized Connectionist Temporal Classification (CTC) layers are inserted after the speech encoder to separate and extract MT content from the mixed speech encoding in a first-speaking-first-out manner. Subsequently, the SOP, which serves as a prompt for the LLM, is obtained by decoding the serialized CTC outputs using greedy search. To train the model effectively, we design a three-stage training strategy, consisting of serialized output training (SOT) fine-tuning, serialized speech information extraction, and SOP-based adaptation. Experimental results on the LibriMix dataset show that, although the LLM-based SOT model performs well in the two-talker scenario, it fails to fully leverage LLMs under more complex conditions, such as the three-talker scenario. The proposed SOP approach significantly improves performance under both two- and three-talker conditions.
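The greedy-search step that turns serialized CTC outputs into SOP text follows the standard CTC rule: take the best label per frame, collapse repeats, drop blanks. A minimal sketch, with the blank id and token inventory assumed.

```python
# Standard greedy CTC decoding (collapse repeats, then remove blanks).
import numpy as np

def ctc_greedy_decode(log_probs: np.ndarray, blank: int = 0) -> list[int]:
    """log_probs: (frames, vocab) frame-level CTC posteriors."""
    best = log_probs.argmax(axis=-1)     # best label per frame
    tokens, prev = [], blank
    for t in best:
        if t != prev and t != blank:     # collapse repeats, drop blanks
            tokens.append(int(t))
        prev = t
    return tokens

# In SOP-MT-ASR, one such decode per serialized CTC stream would yield the
# first-speaking-first-out transcripts that are assembled into the prompt.
```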
AAD-LLM: Neural Attention-Driven Auditory Scene Understanding
Jiang, Xilin, Dindar, Sukru Samet, Choudhari, Vishal, Bickel, Stephan, Mehta, Ashesh, McKhann, Guy M, Friedman, Daniel, Flinker, Adeen, Mesgarani, Nima
However, human auditory perception is inherently selective: listeners focus on specific speakers while ignoring others in complex auditory scenes. Existing models do not incorporate this selectivity, limiting their ability to generate perception-aligned responses. To address this, we introduce Intention-Informed Auditory Scene Understanding (II-ASU) and present Auditory Attention-Driven LLM (AAD-LLM), a prototype system that integrates brain signals to infer listener attention. AAD-LLM extends an auditory LLM by incorporating intracranial electroencephalography (iEEG) recordings to decode which speaker a listener is attending to and to refine responses accordingly. The model first predicts the attended speaker from neural activity, then conditions response generation on this inferred attentional state. We evaluate AAD-LLM on speaker description, speech transcription and extraction, and question answering in multitalker scenarios, with both objective and subjective ratings showing improved alignment with listener intention. By taking a first step toward intention-aware auditory AI, this [...]

[Figure 1: AAD-LLM is a brain-computer interface (BCI) for auditory scene understanding. It decodes neural signals to identify the attended speaker and integrates this information into a language model, generating responses that align with the listener's perceptual focus.]
- North America > United States > New York (0.04)
- North America > United States > Florida > Miami-Dade County > Miami (0.04)
- North America > United States > Virginia (0.04)
- North America > United States > Michigan (0.04)
- Research Report > Experimental Study (1.00)
- Research Report > New Finding (0.93)
- Health & Medicine > Therapeutic Area > Neurology (1.00)
- Health & Medicine > Health Care Technology (0.87)
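A hedged sketch of the two-step idea described above: decode the attended speaker from neural features, then condition the language model's prompt on that prediction. The classifier choice and prompt format are illustrative assumptions, not the authors' implementation.

```python
# Step 1: predict attended speaker from iEEG features.
# Step 2: condition the LLM prompt on the prediction.
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_attention_decoder(X: np.ndarray, y: np.ndarray) -> LogisticRegression:
    """X: (trials, iEEG features); y: attended-speaker labels (0 or 1)."""
    return LogisticRegression(max_iter=1000).fit(X, y)

def build_prompt(decoder: LogisticRegression, x: np.ndarray,
                 transcripts: list[str], question: str) -> str:
    attended = int(decoder.predict(x[None, :])[0])  # inferred attentional state
    return (f"The listener is attending to speaker {attended}.\n"
            f"Attended speech: {transcripts[attended]}\n"
            f"Question: {question}")
```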
Leveraging Spatial Cues from Cochlear Implant Microphones to Efficiently Enhance Speech Separation in Real-World Listening Scenes
Olalere, Feyisayo, van der Heijden, Kiki, Stronks, Christiaan H., Briaire, Jeroen, Frijns, Johan H. M., van Gerven, Marcel
Speech separation approaches for single-channel, dry speech mixtures have significantly improved. However, real-world spatial and reverberant acoustic environments remain challenging, limiting the effectiveness of these approaches for assistive hearing devices like cochlear implants (CIs). To address this, we quantify the impact of real-world acoustic scenes on speech separation and explore how spatial cues can enhance separation quality efficiently. We analyze performance based on implicit spatial cues (inherent in the acoustic input and learned by the model) and explicit spatial cues (manually calculated spatial features added as auxiliary inputs). Our findings show that spatial cues (both implicit and explicit) improve separation for mixtures with spatially separated and nearby talkers. Furthermore, spatial cues enhance separation when spectral cues are ambiguous, such as when voices are similar. Explicit spatial cues are particularly beneficial when implicit spatial cues are weak. For instance, single CI microphone recordings provide weaker implicit spatial cues than bilateral CIs, but even single CIs benefit from explicit cues. These results emphasize the importance of training models on real-world data to improve generalizability in everyday listening scenarios. Additionally, our statistical analyses offer insights into how data properties influence model performance, supporting the development of efficient speech separation approaches for CIs and other assistive devices in real-world settings.
- Europe > Netherlands > South Holland > Leiden (0.04)
- North America > United States (0.04)
- Europe > Netherlands > South Holland > Delft (0.04)
- Health & Medicine > Consumer Health (0.72)
- Health & Medicine > Therapeutic Area > Otolaryngology (0.35)
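Explicit spatial cues of the kind described are commonly computed as interaural level and phase differences between the two channels. A minimal sketch follows; the paper's exact feature set may differ.

```python
# Interaural level difference (ILD) and interaural phase difference (IPD)
# from two microphone channels, as auxiliary inputs for a separation model.
import numpy as np
from scipy.signal import stft

def spatial_cues(left: np.ndarray, right: np.ndarray, fs: int = 16000):
    _, _, L = stft(left, fs=fs, nperseg=512)   # complex STFT: (freq, frames)
    _, _, R = stft(right, fs=fs, nperseg=512)
    eps = 1e-8
    ild = 20 * np.log10((np.abs(L) + eps) / (np.abs(R) + eps))  # level ratio, dB
    ipd = np.angle(L * np.conj(R))                              # phase difference
    return ild, ipd  # stacked with spectral features as model inputs
```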
Agents Thinking Fast and Slow: A Talker-Reasoner Architecture
Christakopoulou, Konstantina, Mourad, Shibl, Matarić, Maja
Large language models have enabled agents of all kinds to interact with users through natural conversation. Consequently, agents now have two jobs: conversing and planning/reasoning. Their conversational responses must be informed by all available information, and their actions must help to achieve goals. This dichotomy between conversing with the user and doing multi-step reasoning and planning can be seen as analogous to the human systems of "thinking fast and slow" introduced by Kahneman [14]. Our approach comprises a "Talker" agent (System 1) that is fast and intuitive and tasked with synthesizing the conversational response, and a "Reasoner" agent (System 2) that is slower, more deliberative, and more logical, tasked with multi-step reasoning and planning, calling tools, performing actions in the world, and thereby producing the new agent state. We describe the new Talker-Reasoner architecture and discuss its advantages, including modularity and decreased latency. We ground the discussion in the context of a sleep coaching agent in order to demonstrate real-world relevance.
- Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Cognitive Science (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
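A minimal sketch of the Talker-Reasoner loop described above: the Talker replies immediately from the current belief state while the Reasoner updates that state for the next turn. The LLM callable, state schema, and coaching strings are illustrative assumptions.

```python
# Talker (System 1) answers fast; Reasoner (System 2) deliberates and
# updates the shared agent state. `llm` is any prompt -> text callable.
import json

def talker(llm, user_msg: str, state: dict) -> str:
    """System 1: fast, produces the conversational reply from the latest state."""
    return llm(f"State: {json.dumps(state)}\nUser: {user_msg}\nReply briefly:")

def reasoner(llm, user_msg: str, state: dict) -> dict:
    """System 2: slow, multi-step; updates beliefs/plans (and may call tools)."""
    plan = llm(f"State: {json.dumps(state)}\nUser: {user_msg}\n"
               "Update the coaching plan as JSON:")
    state["plan"] = plan
    return state

def step(llm, user_msg: str, state: dict) -> tuple[str, dict]:
    reply = talker(llm, user_msg, state)    # answer immediately (low latency)
    state = reasoner(llm, user_msg, state)  # refresh state for the next turn
    return reply, state
```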
Audio-Driven Reinforcement Learning for Head-Orientation in Naturalistic Environments
Ledder, Wessel, Qin, Yuzhen, van der Heijden, Kiki
Although deep reinforcement learning (DRL) approaches to audio signal processing have seen substantial progress in recent years, audio-driven DRL for tasks such as navigation, gaze control, and head-orientation control in the context of human-robot interaction has received little attention. Here, we propose an audio-driven DRL framework in which we utilise deep Q-learning to develop an autonomous agent that orients towards a talker in the acoustic environment based on stereo speech recordings. Our results show that the agent learned to perform the task at a near-perfect level when trained on speech segments in anechoic environments (that is, without reverberation). The presence of reverberation in naturalistic acoustic environments affected the agent's performance, although the agent still substantially outperformed a baseline, randomly acting agent. Finally, we quantified the degree of generalization of the proposed DRL approach across naturalistic acoustic environments. Our experiments revealed that policies learned by agents trained on medium or high reverb environments generalized to low reverb environments, but policies learned by agents trained on anechoic or low reverb environments did not generalize to medium or high reverb environments. Taken together, this study demonstrates the potential of audio-driven DRL for tasks such as head-orientation control and highlights the need for training strategies that enable robust generalization across environments for real-world audio-driven DRL applications.
- Europe > Netherlands > Gelderland > Nijmegen (0.05)
- North America > United States > New York > New York County > New York City (0.04)
- Asia > Middle East > Jordan (0.04)
- Asia > Japan > Honshū > Kantō > Ibaraki Prefecture > Tsukuba (0.04)
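The core of deep Q-learning for such an agent is a Bellman backup over discrete orientation actions. A hedged sketch with an assumed three-action space (rotate left, stay, rotate right) and an assumed 128-dimensional stereo-audio feature vector; the paper's network and reward design may differ.

```python
# One DQN update step for a head-orientation agent.
import torch
import torch.nn as nn

ACTIONS = 3  # rotate left, stay, rotate right (assumed action space)
q_net = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, ACTIONS))
target_net = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, ACTIONS))
target_net.load_state_dict(q_net.state_dict())
opt = torch.optim.Adam(q_net.parameters(), lr=1e-4)

def dqn_update(s, a, r, s_next, done, gamma=0.99):
    """One Bellman backup. s, s_next: (batch, 128) audio features;
    a: (batch,) long; r, done: (batch,) float."""
    q = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)  # Q(s, a)
    with torch.no_grad():                              # fixed target network
        target = r + gamma * target_net(s_next).max(dim=1).values * (1 - done)
    loss = nn.functional.smooth_l1_loss(q, target)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```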
Large Language Model Can Transcribe Speech in Multi-Talker Scenarios with Versatile Instructions
Meng, Lingwei, Hu, Shujie, Kang, Jiawen, Li, Zhaoqing, Wang, Yuejiao, Wu, Wenxuan, Wu, Xixin, Liu, Xunying, Meng, Helen
Recent advancements in large language models (LLMs) have revolutionized various domains, bringing significant progress and new opportunities. Despite progress in speech-related tasks, LLMs have not been sufficiently explored in multi-talker scenarios. In this work, we present a pioneering effort to investigate the capability of LLMs in transcribing speech in multi-talker environments, following versatile instructions related to multi-talker automatic speech recognition (ASR), target-talker ASR, and ASR based on specific talker attributes such as sex, occurrence order, language, and keyword spoken. Our approach utilizes WavLM and Whisper encoders to extract multi-faceted speech representations that are sensitive to speaker characteristics and semantic context. These representations are then fed into an LLM fine-tuned using LoRA, enabling speech comprehension and transcription capabilities. Comprehensive experiments reveal the promising performance of our proposed system, MT-LLM, in cocktail party scenarios, highlighting the potential of LLMs to handle speech-related tasks based on user instructions in such complex settings.
- Asia > China > Hong Kong (0.04)
- Europe > Romania > Sud - Muntenia Development Region > Giurgiu County > Giurgiu (0.04)
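The LoRA adaptation step can be sketched with Hugging Face PEFT. The base model name, rank, and target modules below are assumptions, and the WavLM/Whisper encoders and their projection into the LLM are omitted.

```python
# Attach LoRA adapters to a decoder LLM; only the adapters are trained.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")  # assumed
lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"])  # assumed hyperparameters
model = get_peft_model(base, lora)
model.print_trainable_parameters()  # adapters only; base weights stay frozen
```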
Survey on biomarkers in human vocalizations
Härmä, Aki, Brinker, Bert den, Grossekathofer, Ulf, Ouweltjes, Okke, Nallanthighal, Srikanth, Abrol, Sidharth, Sharma, Vibhu
Recent years have witnessed an increase in technologies that use speech to sense the health of the talker. This survey paper proposes a general taxonomy of the technologies and a broad overview of current progress and challenges. Vocal biomarkers are often secondary measures that approximate the signal of another sensor or identify an underlying mental, cognitive, or physiological state. Their measurement involves disturbances and uncertainties that may be considered noise sources, and the biomarkers are coarsely qualified in terms of the various sources of noise involved in their determination. While the error levels of some proposed biomarkers seem high, others are expected to have low errors and are thus more likely to qualify as candidates for adoption in healthcare applications.
- North America > United States > New York > New York County > New York City (0.04)
- North America > Canada > Quebec > Montreal (0.04)
- North America > United States > Wisconsin (0.04)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
- Overview (1.00)