Speech Extraction


SoloSpeech: Enhancing Intelligibility and Quality in Target Speech Extraction through a Cascaded Generative Pipeline

Wang, Helin, Hai, Jiarui, Yang, Dongchao, Chen, Chen, Li, Kai, Peng, Junyi, Thebaud, Thomas, Moro-Velazquez, Laureano, Villalba, Jesus, Dehak, Najim

arXiv.org Artificial Intelligence

Target Speech Extraction (TSE) aims to isolate a target speaker's voice from a mixture of multiple speakers by leveraging speaker-specific cues, typically provided as auxiliary audio (a.k.a. cue audio). Although recent advancements in TSE have primarily employed discriminative models that offer high perceptual quality, these models often introduce unwanted artifacts, reduce naturalness, and are sensitive to discrepancies between training and testing environments. On the other hand, generative models for TSE lag in perceptual quality and intelligibility. To address these challenges, we present SoloSpeech, a novel cascaded generative pipeline that integrates compression, extraction, reconstruction, and correction processes. SoloSpeech features a speaker-embedding-free target extractor that utilizes conditional information from the cue audio's latent space, aligning it with the mixture audio's latent space to prevent mismatches. Evaluated on the widely used Libri2Mix dataset, SoloSpeech achieves new state-of-the-art intelligibility and quality in target speech extraction while demonstrating exceptional generalization on out-of-domain data and in real-world scenarios.
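To make the cascaded design concrete, the following is a minimal sketch of how a compress-extract-reconstruct-correct pipeline could be wired together, with the cue audio encoded by the same compressor as the mixture so that conditioning happens in a shared latent space. All module names, signatures, and shapes are illustrative placeholders, not SoloSpeech's actual components.

```python
# Illustrative sketch of a cascaded TSE pipeline (compress -> extract -> reconstruct -> correct).
# All modules are hypothetical placeholders, not the SoloSpeech implementation.
import torch
import torch.nn as nn


class CascadedTSE(nn.Module):
    def __init__(self, compressor: nn.Module, extractor: nn.Module,
                 decoder: nn.Module, corrector: nn.Module):
        super().__init__()
        self.compressor = compressor   # waveform -> latent (shared for mixture and cue)
        self.extractor = extractor     # latent-space target extraction, conditioned on the cue latent
        self.decoder = decoder         # latent -> waveform reconstruction
        self.corrector = corrector     # waveform-level correction of residual artifacts

    def forward(self, mixture: torch.Tensor, cue: torch.Tensor) -> torch.Tensor:
        # Encode both signals with the same compressor so the cue conditioning
        # lives in the same latent space as the mixture (no speaker embedding needed).
        z_mix = self.compressor(mixture)           # (B, T', D)
        z_cue = self.compressor(cue)               # (B, T'', D)
        z_tgt = self.extractor(z_mix, cond=z_cue)  # latent estimate of the target speech
        wav = self.decoder(z_tgt)                  # reconstruct the waveform
        return self.corrector(wav)                 # final correction pass
```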


Neural Speech Extraction with Human Feedback

Itani, Malek, Graves, Ashton, Eskimez, Sefik Emre, Gollakota, Shyamnath

arXiv.org Artificial Intelligence

We present the first neural target speech extraction (TSE) system that uses human feedback for iterative refinement. Our approach allows users to mark specific segments of the TSE output, generating an edit mask. The refinement system then improves the marked sections while preserving unmarked regions. Since large-scale datasets of human-marked errors are difficult to collect, we generate synthetic datasets using various automated masking functions and train models on each. Evaluations show that models trained with noise power-based masking (in dBFS) and probabilistic thresholding perform best, aligning with human annotations. In a study with 22 participants, users preferred the refined outputs over the baseline TSE output. Our findings demonstrate that human-in-the-loop refinement is a promising approach for improving the performance of neural speech extraction.
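As an illustration of the kind of automated masking function described above, the sketch below marks frames of a TSE output whose residual-error power (in dBFS) exceeds a randomly jittered threshold. The frame length, threshold, and jitter are assumed values for illustration, not the paper's settings.

```python
# Hedged sketch: deriving a synthetic "edit mask" by thresholding frame-level
# residual-noise power in dBFS with a probabilistic (jittered) cutoff.
import numpy as np


def edit_mask_from_noise_power(tse_output: np.ndarray, clean_target: np.ndarray,
                               frame_len: int = 1600, thresh_dbfs: float = -35.0,
                               jitter_db: float = 5.0, seed: int = 0) -> np.ndarray:
    """Return a per-frame boolean mask marking frames whose residual error is loud."""
    rng = np.random.default_rng(seed)
    residual = tse_output - clean_target
    n_frames = len(residual) // frame_len
    mask = np.zeros(n_frames, dtype=bool)
    for i in range(n_frames):
        frame = residual[i * frame_len:(i + 1) * frame_len]
        power = np.mean(frame ** 2) + 1e-12
        level_dbfs = 10.0 * np.log10(power)   # power relative to full scale (signals in [-1, 1])
        # Probabilistic threshold: jitter the cutoff so borderline frames are
        # sometimes marked and sometimes not.
        cutoff = thresh_dbfs + rng.uniform(-jitter_db, jitter_db)
        mask[i] = level_dbfs > cutoff
    return mask
```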


Contextual Speech Extraction: Leveraging Textual History as an Implicit Cue for Target Speech Extraction

Kim, Minsu, Mira, Rodrigo, Chen, Honglie, Petridis, Stavros, Pantic, Maja

arXiv.org Artificial Intelligence

In this paper, we investigate a novel approach for Target Speech Extraction (TSE) that relies solely on textual context to extract the target speech. We refer to this task as Contextual Speech Extraction (CSE). Unlike traditional TSE methods that rely on pre-recorded enrollment utterances, video of the target speaker's face, spatial information, or other explicit cues to identify the target stream, our proposed method requires only a few turns of previous dialogue (or monologue) history. This approach is naturally feasible in mobile messaging environments where voice recordings are typically preceded by textual dialogue that can be leveraged implicitly. We present three CSE models and analyze their performance on three datasets. Through our experiments, we demonstrate that even when the model relies purely on dialogue history, it can achieve over 90% accuracy in identifying the correct target stream with only two previous dialogue turns. Furthermore, we show that by leveraging both textual context and enrollment utterances as cues during training, we further enhance our model's flexibility and effectiveness, allowing us to use either cue during inference, or to combine both for improved performance. Samples and code are available at https://miraodasilva.github.io/cse-project-page .
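The stream-selection step can be pictured with the sketch below: embed the textual dialogue history and each candidate speech stream into a shared space, then pick the stream closest to the context. The embedding functions are assumed to be supplied by the caller, and the cosine-similarity scoring is an illustrative stand-in rather than the models proposed in the paper.

```python
# Hedged sketch of context-based target-stream selection via embedding similarity.
from typing import Callable, Sequence
import numpy as np


def select_target_stream(history_text: str,
                         candidate_streams: Sequence[np.ndarray],
                         embed_text: Callable[[str], np.ndarray],
                         embed_speech: Callable[[np.ndarray], np.ndarray]) -> int:
    """Return the index of the stream whose embedding best matches the dialogue history."""
    ctx = embed_text(history_text)
    ctx = ctx / (np.linalg.norm(ctx) + 1e-12)
    scores = []
    for stream in candidate_streams:
        emb = embed_speech(stream)
        emb = emb / (np.linalg.norm(emb) + 1e-12)
        scores.append(float(np.dot(ctx, emb)))   # cosine similarity to the textual context
    return int(np.argmax(scores))
```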


Analysis of impact of emotions on target speech extraction and speech separation

Švec, Ján, Žmolíková, Kateřina, Kocour, Martin, Delcroix, Marc, Ochiai, Tsubasa, Mošner, Ladislav, Černocký, Jan

arXiv.org Artificial Intelligence

Recently, the performance of blind speech separation (BSS) and target speech extraction (TSE) has greatly progressed. Most works, however, focus on relatively well-controlled conditions using, e.g., read speech. The performance may degrade in more realistic situations. One of the factors causing such degradation may be intrinsic speaker variability, such as emotions, occurring commonly in realistic speech. In this paper, we investigate the influence of emotions on TSE and BSS. We create a new test dataset of emotional mixtures for the evaluation of TSE and BSS. This dataset combines LibriSpeech and the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS). Through controlled experiments, we analyze the impact of different emotions on the performance of BSS and TSE. We observe that BSS is relatively robust to emotions, while TSE, which requires identifying and extracting the speech of a target speaker, is much more sensitive to emotions. In comparative speaker verification experiments, we show that identifying the target speaker may be particularly challenging when dealing with emotional speech. Using our findings, we outline potential future directions that could improve the robustness of BSS and TSE systems to emotional speech.
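For readers who want to reproduce the general setup, the sketch below mixes a target utterance with an interfering one at a chosen signal-to-interference ratio, as one might do when pairing LibriSpeech and RAVDESS utterances. The padding policy and ratio handling are illustrative choices, not the dataset's exact recipe.

```python
# Hedged sketch: mix two utterances at a target SNR (in dB) to build a two-speaker mixture.
import numpy as np


def mix_at_snr(target: np.ndarray, interferer: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale the interferer so the target-to-interferer power ratio equals snr_db, then sum."""
    n = max(len(target), len(interferer))
    t = np.pad(target, (0, n - len(target)))
    i = np.pad(interferer, (0, n - len(interferer)))
    p_t = np.mean(t ** 2) + 1e-12
    p_i = np.mean(i ** 2) + 1e-12
    gain = np.sqrt(p_t / (p_i * 10.0 ** (snr_db / 10.0)))
    return t + gain * i
```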


ConceptBeam: Concept Driven Target Speech Extraction

Ohishi, Yasunori, Delcroix, Marc, Ochiai, Tsubasa, Araki, Shoko, Takeuchi, Daiki, Niizumi, Daisuke, Kimura, Akisato, Harada, Noboru, Kashino, Kunio

arXiv.org Artificial Intelligence

We propose a novel framework for target speech extraction based on semantic information, called ConceptBeam. Target speech extraction means extracting the speech of a target speaker in a mixture. Typical approaches exploit properties of audio signals, such as harmonic structure and direction of arrival. In contrast, ConceptBeam tackles the problem with semantic clues. Specifically, we extract the speech of speakers speaking about a concept, i.e., a topic of interest, using a concept specifier such as an image or speech. Solving this novel problem would open the door to innovative applications such as listening systems that focus on a particular topic discussed in a conversation. Unlike keywords, concepts are abstract notions, making it challenging to directly represent a target concept. In our scheme, a concept is encoded as a semantic embedding by mapping the concept specifier to a shared embedding space. This modality-independent space can be built by means of deep metric learning using paired data consisting of images and their spoken captions. We use it to bridge modality-dependent information, i.e., the speech segments in the mixture, and the specified, modality-independent concept. As a proof of concept, we performed experiments using a set of images associated with spoken captions. That is, we generated speech mixtures from these spoken captions and used the images or speech signals as the concept specifiers. We then extracted the target speech using the acoustic characteristics of the identified segments. We compared ConceptBeam with two baseline methods: one based on keywords obtained from recognition systems and another based on sound source separation. We show that ConceptBeam clearly outperforms the baseline methods and effectively extracts speech based on the semantic representation.
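The shared embedding space mentioned above could, for instance, be trained with a metric-learning objective over paired images and spoken captions. The sketch below uses a triplet-style hinge loss over a batch of matched pairs; the encoders, margin, and batch construction are assumptions for illustration, not ConceptBeam's actual training procedure.

```python
# Hedged sketch: metric-learning loss that pulls matched image/spoken-caption
# embeddings together and pushes mismatched pairs apart in a shared space.
import torch
import torch.nn.functional as F


def paired_triplet_loss(image_emb: torch.Tensor, speech_emb: torch.Tensor,
                        margin: float = 0.2) -> torch.Tensor:
    """image_emb, speech_emb: (B, D) embeddings of matching image/caption pairs."""
    image_emb = F.normalize(image_emb, dim=-1)
    speech_emb = F.normalize(speech_emb, dim=-1)
    sim = image_emb @ speech_emb.t()             # (B, B) cosine similarities
    pos = sim.diag().unsqueeze(1)                # matched pairs sit on the diagonal
    # Every off-diagonal entry is a negative: a caption paired with the wrong image.
    neg_mask = ~torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    loss = F.relu(margin - pos + sim)[neg_mask]  # hinge on (margin - s_pos + s_neg)
    return loss.mean()
```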