Speaker Extraction


UniSE: A Unified Framework for Decoder-only Autoregressive LM-based Speech Enhancement

Yan, Haoyin, Liu, Chengwei, Xue, Shaofei, Liang, Xiaotao, Xue, Zheng

arXiv.org Artificial Intelligence

The development of neural audio codecs (NACs) has greatly promoted the application of language models (LMs) to speech processing and understanding. However, the effectiveness of autoregressive (AR) LM-based models in unifying the different sub-tasks of speech enhancement (SE) has not yet been verified. In this work, we propose UniSE, a unified decoder-only LM-based framework that handles different SE tasks, including speech restoration, target speaker extraction and speech separation. It takes input speech features as conditions and generates discrete tokens of the target speech using AR modeling, which reconciles the distinct learning patterns of the individual tasks. Experiments on several benchmarks indicate that the proposed UniSE achieves competitive performance compared to discriminative and generative baselines, showing the capacity of LMs to unify SE tasks. The demo page is available here: https://github.com/hyyan2k/UniSE.
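
A minimal sketch of the conditioning pattern the abstract describes: a decoder-only transformer takes projected input-speech features as a prefix and autoregressively predicts discrete codec tokens of the target speech. Module sizes, the prefix scheme, and all names are illustrative assumptions, not the authors' implementation.

```python
# Sketch only: decoder-only AR generation of codec tokens conditioned on input speech
# features used as a prefix. All dimensions and design choices are assumptions.
import torch
import torch.nn as nn

class TinyARSE(nn.Module):
    def __init__(self, n_tokens=1024, d_model=256, n_layers=4, feat_dim=80):
        super().__init__()
        self.cond_proj = nn.Linear(feat_dim, d_model)         # project degraded-speech features
        self.tok_emb = nn.Embedding(n_tokens + 1, d_model)    # +1 for a BOS token
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)  # run with a causal mask
        self.head = nn.Linear(d_model, n_tokens)

    def forward(self, cond_feats, target_tokens):
        # cond_feats: (B, Tc, feat_dim) features of the degraded/mixed input speech
        # target_tokens: (B, Tt) discrete codec tokens of the clean target speech
        bos = torch.full_like(target_tokens[:, :1], self.tok_emb.num_embeddings - 1)
        tok_in = self.tok_emb(torch.cat([bos, target_tokens[:, :-1]], dim=1))
        x = torch.cat([self.cond_proj(cond_feats), tok_in], dim=1)
        L = x.size(1)
        causal = torch.triu(torch.full((L, L), float("-inf"), device=x.device), diagonal=1)
        h = self.backbone(x, mask=causal)
        return self.head(h[:, cond_feats.size(1):])            # logits per token position
```

Training would minimize cross-entropy between these logits and the target codec tokens; at inference, tokens are sampled one at a time and decoded back to a waveform by the NAC decoder.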


TFGA-Net: Temporal-Frequency Graph Attention Network for Brain-Controlled Speaker Extraction

Si, Youhao, Liao, Yuan, Han, Qiushi, Yang, Yuhang, Dai, Rui, Huang, Liya

arXiv.org Artificial Intelligence

The rapid development of auditory attention decoding (AAD) based on electroencephalography (EEG) signals opens up the possibility of EEG-driven target speaker extraction. However, how to effectively exploit the target-speaker information shared between EEG and speech remains an unresolved problem. In this paper, we propose a model for brain-controlled speaker extraction that uses the EEG recorded from the listener to extract the target speech. To extract information from the EEG signals effectively, we derive multi-scale time-frequency features and further incorporate cortical topological structures that are selectively engaged during the task. Moreover, to exploit the non-Euclidean structure of EEG signals and capture their global features, graph convolutional networks and a self-attention mechanism are used in the EEG encoder. In addition, to make full use of the fused EEG and speech features while preserving global context and capturing speech rhythm and prosody, we adopt MossFormer2, which combines MossFormer with an RNN-free recurrent module, as the separator. Experimental results on both the public Cocktail Party and KUL datasets show that our TFGA-Net model significantly outperforms state-of-the-art methods on certain objective evaluation metrics. The source code is available at: https://github.com/LaoDa-X/TFGA-NET.
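
A minimal sketch of the EEG-encoder idea described above: one graph convolution over the electrode dimension followed by self-attention over time. The learnable adjacency, feature sizes, and depth are illustrative assumptions, not the TFGA-Net architecture.

```python
# Sketch: graph convolution over EEG electrodes, then self-attention over time.
# The adjacency matrix, feature sizes, and depth are assumptions for illustration.
import torch
import torch.nn as nn

class EEGGraphAttentionEncoder(nn.Module):
    def __init__(self, n_channels=64, feat_dim=32, d_model=128):
        super().__init__()
        # Learnable adjacency standing in for cortical connectivity between electrodes.
        self.adj = nn.Parameter(torch.eye(n_channels) + 0.01 * torch.randn(n_channels, n_channels))
        self.gcn = nn.Linear(feat_dim, feat_dim)              # one GCN layer: A @ X @ W
        self.proj = nn.Linear(n_channels * feat_dim, d_model)
        self.attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)

    def forward(self, x):
        # x: (B, T, n_channels, feat_dim) multi-scale time-frequency features per electrode
        a = torch.softmax(self.adj, dim=-1)                    # row-normalized adjacency
        h = torch.relu(torch.einsum('cd,btdf->btcf', a, self.gcn(x)))
        h = self.proj(h.flatten(2))                            # (B, T, d_model)
        out, _ = self.attn(h, h, h)                            # global self-attention over time
        return out
```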


Target speaker anonymization in multi-speaker recordings

Tomashenko, Natalia, Yamagishi, Junichi, Wang, Xin, Liu, Yun, Vincent, Emmanuel

arXiv.org Artificial Intelligence

Most existing speaker anonymization research has focused on single-speaker audio, leading to techniques and evaluation metrics optimized for that condition. This study addresses the significant challenge of speaker anonymization in multi-speaker conversational audio, specifically when only a single target speaker needs to be anonymized. This scenario is highly relevant in contexts such as call centers, where customer privacy requires anonymizing only the customer's voice in interactions with operators. Conventional anonymization methods are often not suitable for this task. Moreover, current evaluation methodology does not allow privacy protection and utility to be assessed accurately in this complex multi-speaker scenario. This work aims to bridge these gaps by exploring effective strategies for targeted speaker anonymization in conversational audio, highlighting potential problems in their development and proposing correspondingly improved evaluation methodologies.


Self-Steering Deep Non-Linear Spatially Selective Filters for Efficient Extraction of Moving Speakers under Weak Guidance

Kienegger, Jakob, Mannanova, Alina, Fang, Huajian, Gerkmann, Timo

arXiv.org Artificial Intelligence

Recent works on deep non-linear spatially selective filters demonstrate exceptional enhancement performance with computationally lightweight architectures for stationary speakers of known directions. However, to maintain this performance in dynamic scenarios, resource-intensive data-driven tracking algorithms become necessary to provide precise spatial guidance conditioned on the initial direction of a target speaker. As this additional computational overhead hinders application in resource-constrained scenarios such as real-time speech enhancement, we present a novel strategy that instead uses a low-complexity tracking algorithm in the form of a particle filter. Assuming a causal, sequential processing style, we introduce temporal feedback that leverages the enhanced speech signal of the spatially selective filter to compensate for the limited modeling capabilities of the particle filter. Evaluation on a synthetic dataset illustrates how the autoregressive interplay between the two algorithms drastically improves tracking accuracy and leads to strong enhancement performance. A listening test with real-world recordings complements these findings by indicating a clear preference for our proposed self-steering pipeline over comparable methods.
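
A minimal sketch of one step of the kind of low-complexity DOA particle filter described above, with the observation likelihood supplied externally (e.g. derived from the enhanced output, closing the feedback loop). The motion model, resampling criterion, and constants are illustrative assumptions.

```python
# Sketch: one predict/update/resample step of a DOA particle filter. The likelihood
# function, motion noise, and thresholds are assumptions, not the paper's settings.
import numpy as np

def particle_filter_step(particles, weights, likelihood_fn, motion_std=2.0):
    # particles: (N,) candidate DOAs in degrees; weights: (N,) normalized importance weights
    particles = (particles + np.random.normal(0.0, motion_std, particles.shape)) % 360.0
    weights = weights * likelihood_fn(particles)        # e.g. a spatial spectrum evaluated
    weights = weights / (weights.sum() + 1e-12)         # at each particle's DOA
    # Resample when the effective sample size collapses.
    if 1.0 / np.sum(weights ** 2) < 0.5 * len(particles):
        idx = np.random.choice(len(particles), len(particles), p=weights)
        particles, weights = particles[idx], np.full(len(particles), 1.0 / len(particles))
    doa_estimate = particles[np.argmax(weights)]        # or a circular weighted mean
    return particles, weights, doa_estimate
```

In the self-steering setting, `likelihood_fn` would be recomputed every frame from evidence that involves the spatially selective filter's enhanced target estimate, which is what makes the interplay autoregressive.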


Plug-and-Play Co-Occurring Face Attention for Robust Audio-Visual Speaker Extraction

Pan, Zexu, Zhao, Shengkui, Wang, Tingting, Zhou, Kun, Ma, Yukun, Zhang, Chong, Ma, Bin

arXiv.org Artificial Intelligence

Audio-visual speaker extraction isolates a target speaker's speech from a mixture signal conditioned on a visual cue, typically a recording of the target speaker's face. However, in real-world scenarios, other co-occurring faces are often present on-screen, providing valuable speaker activity cues. In this work, we introduce a plug-and-play inter-speaker attention module that processes a flexible number of co-occurring faces, allowing for more accurate speaker extraction in complex multi-person environments. We integrate our module into two prominent models: AV-DPRNN and the state-of-the-art AV-TFGridNet. Extensive experiments on diverse datasets, including the highly overlapped VoxCeleb2 and the sparsely overlapped MISP, demonstrate that our approach consistently outperforms the baselines. Furthermore, cross-dataset evaluations on LRS2 and LRS3 confirm the robustness and generalizability of our method.
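
A minimal sketch of an inter-speaker attention block of the kind described above: the target speaker's visual stream attends, frame by frame, to a set of co-occurring face streams. Dimensions and the residual fusion are illustrative assumptions, not the paper's module.

```python
# Sketch: target visual features attend over K co-occurring face streams per frame.
# Sizes and the residual/LayerNorm fusion are assumptions for illustration.
import torch
import torch.nn as nn

class InterSpeakerAttention(nn.Module):
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, target_vis, other_vis):
        # target_vis: (B, T, D) visual features of the target face
        # other_vis:  (B, K, T, D) features of K co-occurring faces (K would need padding
        #             if it varies within a batch)
        B, K, T, D = other_vis.shape
        kv = other_vis.permute(0, 2, 1, 3).reshape(B * T, K, D)   # per-frame set of K faces
        q = target_vis.reshape(B * T, 1, D)
        ctx, _ = self.attn(q, kv, kv)                             # attend over co-occurring faces
        return self.norm(target_vis + ctx.reshape(B, T, D))       # residual fusion
```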


Steering Deep Non-Linear Spatially Selective Filters for Weakly Guided Extraction of Moving Speakers in Dynamic Scenarios

Kienegger, Jakob, Gerkmann, Timo

arXiv.org Artificial Intelligence

Recent speaker extraction methods using deep non-linear spatial filtering perform exceptionally well when the target direction is known and stationary. However, spatially dynamic scenarios are considerably more challenging due to time-varying spatial features and arising ambiguities, e.g. when moving speakers cross. While in a static scenario it may be easy for a user to point to the target's direction, manually tracking a moving speaker is impractical. Instead of relying on accurate time-dependent directional cues, which we refer to as strong guidance, we propose a weakly guided extraction method that depends solely on the target's initial position to cope with spatially dynamic scenarios. By incorporating our own deep tracking algorithm and developing a joint training strategy on a synthetic dataset, we demonstrate the proficiency of our approach in resolving spatial ambiguities and even outperform a mismatched but strongly guided extraction method.


End-to-End Multi-Microphone Speaker Extraction Using Relative Transfer Functions

Eisenberg, Aviad, Gannot, Sharon, Chazan, Shlomo E.

arXiv.org Artificial Intelligence

This paper introduces a multi-microphone method for extracting a desired speaker from a mixture of multiple speakers and directional noise in a reverberant environment. We propose leveraging the instantaneous relative transfer function (RTF), estimated from a reference utterance recorded in the same position as the desired source. The effectiveness of the RTF-based spatial cue is compared with a direction-of-arrival (DOA)-based spatial cue and a conventional spectral embedding. Experimental results in challenging acoustic scenarios demonstrate that spatial cues yield better performance than the spectral cue and that the instantaneous RTF outperforms the DOA-based spatial cue.
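
A minimal sketch of estimating an RTF from a multichannel reference utterance via cross-power spectra against a reference microphone; this is a generic covariance-based estimate, and the STFT settings and reference-channel choice are assumptions rather than the paper's exact estimator.

```python
# Sketch: RTF estimation from the STFT of a reference utterance, using cross-PSDs
# relative to a reference microphone. Settings are assumptions for illustration.
import numpy as np

def estimate_rtf(ref_stft, ref_mic=0, eps=1e-8):
    # ref_stft: (M, F, T) complex STFT of the reference utterance at M microphones
    ref = ref_stft[ref_mic]                                 # (F, T) reference channel
    cross_psd = np.mean(ref_stft * np.conj(ref), axis=-1)   # (M, F) cross-PSD per mic/bin
    auto_psd = np.mean(np.abs(ref) ** 2, axis=-1)           # (F,) reference auto-PSD
    rtf = cross_psd / (auto_psd + eps)                      # (M, F); equals 1 at ref_mic
    return rtf
```

The resulting per-frequency vector can then serve as the spatial conditioning cue fed to the extraction network alongside the mixture STFT.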


NeuroSpex: Neuro-Guided Speaker Extraction with Cross-Modal Attention

De Silva, Dashanka, Cai, Siqi, Pahuja, Saurav, Schultz, Tanja, Li, Haizhou

arXiv.org Artificial Intelligence

In the study of auditory attention, it has been shown that a robust correlation exists between attended speech and the elicited neural responses, measurable through electroencephalography (EEG). It is therefore possible to use the attention information available in EEG signals to computationally guide the extraction of the target speaker in a cocktail party scenario. In this paper, we present a neuro-guided speaker extraction model, NeuroSpex, which uses the EEG response of the listener as the sole auxiliary reference cue to extract attended speech from monaural speech mixtures. We propose a novel EEG signal encoder that captures the attention information. Additionally, we propose a cross-attention (CA) mechanism to enhance the speech feature representations and generate a speaker extraction mask. Experimental results on a publicly available dataset demonstrate that our proposed model outperforms two baseline models across various evaluation metrics.
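
A minimal sketch of the cross-attention fusion pattern the abstract describes, with speech features as queries and EEG features as keys and values, followed by a sigmoid mask head. Dimensions and the mask head are illustrative assumptions, not the NeuroSpex architecture.

```python
# Sketch: speech features attend to EEG features; the fused representation yields an
# extraction mask. Sizes and the mask head are assumptions for illustration.
import torch
import torch.nn as nn

class EEGSpeechCrossAttention(nn.Module):
    def __init__(self, d_speech=256, d_eeg=128, n_heads=4):
        super().__init__()
        self.eeg_proj = nn.Linear(d_eeg, d_speech)
        self.cross_attn = nn.MultiheadAttention(d_speech, n_heads, batch_first=True)
        self.mask_head = nn.Sequential(nn.Linear(d_speech, d_speech), nn.Sigmoid())

    def forward(self, speech_feats, eeg_feats):
        # speech_feats: (B, T_s, d_speech) encoding of the speech mixture
        # eeg_feats:    (B, T_e, d_eeg) encoding of the listener's EEG (rates may differ)
        kv = self.eeg_proj(eeg_feats)
        attended, _ = self.cross_attn(speech_feats, kv, kv)    # speech queries EEG
        mask = self.mask_head(speech_feats + attended)          # per-frame extraction mask
        return mask * speech_feats                              # masked target representation
```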


Typing to Listen at the Cocktail Party: Text-Guided Target Speaker Extraction

Hao, Xiang, Wu, Jibin, Yu, Jianwei, Xu, Chenglin, Tan, Kay Chen

arXiv.org Artificial Intelligence

Humans possess an extraordinary ability to selectively focus on the sound source of interest amidst complex acoustic environments, commonly referred to as cocktail party scenarios. In an attempt to replicate this remarkable auditory attention capability in machines, target speaker extraction (TSE) models have been developed. These models leverage pre-registered cues of the target speaker to extract the sound source of interest. However, the effectiveness of these models is hindered in real-world scenarios due to the unreliability or even absence of pre-registered cues. To address this limitation, this study investigates the integration of natural language descriptions to enhance the feasibility, controllability, and performance of existing TSE models. Specifically, we propose a model named LLM-TSE, wherein a large language model (LLM) extracts useful semantic cues from the user's typed text input. These cues can serve as independent extraction cues, as task selectors to control the TSE process, or as complements to the pre-registered cues. Our experimental results demonstrate competitive performance when only text-based cues are presented, the effectiveness of using input text as a task selector, and a new state of the art when combining text-based cues with pre-registered cues. To our knowledge, this is the first study to successfully incorporate LLMs to guide target speaker extraction, which can be a cornerstone for cocktail party problem research. Demos are provided at https://github.com/haoxiangsnr/llm-tse.
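
A minimal sketch of how a text-derived cue could be fused with an optional pre-registered speaker embedding into a single conditioning vector for a TSE model. The text encoder, fusion network, and dimensions are illustrative assumptions, not the LLM-TSE implementation.

```python
# Sketch: fuse a pooled text-cue embedding with an optional speaker embedding into
# one conditioning vector. All components and sizes are assumptions for illustration.
import torch
import torch.nn as nn

class CueFusion(nn.Module):
    def __init__(self, d_text=768, d_spk=192, d_cond=256):
        super().__init__()
        self.text_proj = nn.Linear(d_text, d_cond)
        self.spk_proj = nn.Linear(d_spk, d_cond)
        self.fuse = nn.Sequential(nn.Linear(2 * d_cond, d_cond), nn.ReLU(),
                                  nn.Linear(d_cond, d_cond))

    def forward(self, text_emb, spk_emb=None):
        # text_emb: (B, d_text) pooled embedding of the user's typed description
        # spk_emb:  (B, d_spk) pre-registered speaker embedding; None in text-only mode
        t = self.text_proj(text_emb)
        s = self.spk_proj(spk_emb) if spk_emb is not None else torch.zeros_like(t)
        return self.fuse(torch.cat([t, s], dim=-1))   # conditioning vector for the extractor
```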


Conditional Diffusion Model for Target Speaker Extraction

Nguyen, Theodor, Sun, Guangzhi, Zheng, Xianrui, Zhang, Chao, Woodland, Philip C

arXiv.org Artificial Intelligence

We propose DiffSpEx, a generative target speaker extraction method based on score-based generative modelling through stochastic differential equations. DiffSpEx deploys a continuous-time stochastic diffusion process in the complex short-time Fourier transform domain, starting from the target speaker source and converging to a Gaussian distribution centred on the mixture of sources. For the reverse-time process, a parametrised score function is conditioned on a target speaker embedding to extract the target speaker from the mixture of sources. We utilise ECAPA-TDNN target speaker embeddings and condition the score function alternately on the SDE time embedding and the target speaker embedding. The potential of DiffSpEx is demonstrated with the WSJ0-2mix dataset, achieving an SI-SDR of 12.9 dB and a NISQA score of 3.56. Moreover, we show that fine-tuning a pre-trained DiffSpEx model to a specific speaker further improves performance, enabling personalisation in target speaker extraction.
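
A minimal sketch of one reverse-time Euler-Maruyama sampling step for a speaker-conditioned score SDE, in the spirit of the method described above. For brevity this assumes a variance-exploding SDE without the drift toward the mixture that the paper's formulation includes; the schedule and score-network interface are likewise assumptions.

```python
# Sketch: one reverse-time Euler-Maruyama step of a conditional score-based SDE.
# The VE parameterization and score_net interface are assumptions for illustration.
import torch

def reverse_sde_step(x_t, t, dt, mixture, score_net, spk_emb, sigma_fn):
    # x_t:     state in the complex-STFT domain (stored as a real tensor) at time t
    # mixture: STFT of the source mixture used as conditioning
    # score_net(x, y, spk, t) approximates the conditional score grad_x log p_t(x | y, spk)
    sigma, dsigma = sigma_fn(t)                    # noise level and its time derivative
    g2 = 2.0 * sigma * dsigma                      # squared diffusion coefficient g(t)^2
    score = score_net(x_t, mixture, spk_emb, t)    # conditioned on mixture and speaker embedding
    noise = torch.randn_like(x_t)
    # Step backwards in time by dt > 0 toward t = 0 (the clean target speaker estimate).
    return x_t + g2 * score * dt + (g2 * dt) ** 0.5 * noise
```

Iterating this step from the terminal distribution centred on the mixture down to t = 0 yields the extracted target speaker signal after an inverse STFT.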