Robust Target Speaker Diarization and Separation via Augmented Speaker Embedding Sampling
Md Asif Jalal, Luca Remaggi, Vasileios Moschopoulos, Thanasis Kotsiopoulos, Vandana Rajan, Karthikeyan Saravanan, Anastasis Drosou, Junho Heo, Hyuk Oh, Seokyeong Jeong
arXiv.org Artificial Intelligence
Traditional speech separation and speaker diarization approaches rely on prior knowledge of the target speakers or a predetermined number of participants in the audio signal. To address these limitations, recent work has focused on enrollment-free methods that identify targets without explicit speaker labeling. This work introduces a new approach that trains simultaneous speech separation and diarization by automatically identifying target speaker embeddings within mixtures. Our proposed model employs a dual-stage training pipeline designed to learn robust speaker representations that are resilient to background noise interference. Furthermore, we present an overlapping spectral loss function specifically tailored to improve diarization accuracy on overlapped speech frames. Experimental results show significant gains over the current SOTA baseline: a 71% relative improvement in DER and a 69% relative improvement in cpWER.
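As a quick illustration of what a relative (rather than absolute) improvement in an error rate means, here is a minimal sketch; the error-rate values below are hypothetical and are not reported in the paper, only the 71% relative figure is:

```python
def relative_improvement(baseline: float, new: float) -> float:
    """Relative error-rate reduction: (baseline - new) / baseline."""
    return (baseline - new) / baseline

# Hypothetical example (numbers not from the paper): a baseline DER of
# 20% reduced to 5.8% corresponds to a 71% relative improvement.
print(round(100 * relative_improvement(0.20, 0.058)))  # 71
```

The same formula applies to the reported cpWER figure: a 69% relative improvement means the new system's cpWER is 31% of the baseline's, whatever the absolute values are.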
Aug-11-2025