SoloSpeech: Enhancing Intelligibility and Quality in Target Speech Extraction through a Cascaded Generative Pipeline
Wang, Helin, Hai, Jiarui, Yang, Dongchao, Chen, Chen, Li, Kai, Peng, Junyi, Thebaud, Thomas, Velazquez, Laureano Moro, Villalba, Jesus, Dehak, Najim
arXiv.org Artificial Intelligence
Target Speech Extraction (TSE) aims to isolate a target speaker's voice from a mixture of multiple speakers by leveraging speaker-specific cues, typically provided as auxiliary audio (a.k.a. cue audio). Although recent TSE work has primarily employed discriminative models that offer high perceptual quality, these models often introduce unwanted artifacts, reduce naturalness, and are sensitive to discrepancies between training and testing environments. Generative models for TSE, on the other hand, lag in perceptual quality and intelligibility. To address these challenges, we present SoloSpeech, a novel cascaded generative pipeline that integrates compression, extraction, reconstruction, and correction processes. SoloSpeech features a speaker-embedding-free target extractor that utilizes conditional information from the cue audio's latent space, aligning it with the mixture audio's latent space to prevent mismatches. Evaluated on the widely used Libri2Mix dataset, SoloSpeech achieves new state-of-the-art intelligibility and quality in target speech extraction while demonstrating exceptional generalization on out-of-domain data and real-world scenarios.
Sep-9-2025
- Country:
  - Africa > Central African Republic
    - Ombella-M'Poko > Bimbo (0.04)
  - Asia
  - Europe
  - North America
    - Canada > Alberta
    - United States
      - California > San Francisco County
        - San Francisco (0.14)
      - Hawaii > Honolulu County
        - Honolulu (0.04)
      - Louisiana > Orleans Parish
        - New Orleans (0.04)
      - Rhode Island (0.04)
      - Utah > Salt Lake County
        - Salt Lake City (0.04)
  - Oceania > Australia
    - Queensland > Brisbane (0.04)
- Genre:
  - Research Report (1.00)
- Technology: