Exploring Speaker Diarization with Mixture of Experts
Yang, Gaobin, He, Maokui, Niu, Shutong, Wang, Ruoyu, Chen, Hang, Du, Jun
arXiv.org Artificial Intelligence
In this paper, we propose a novel neural speaker diarization system using memory-aware multi-speaker embedding with sequence-to-sequence architecture (NSD-MS2S), which integrates a memory-aware multi-speaker embedding module with a sequence-to-sequence (Seq2Seq) architecture. The system leverages a memory module to enhance speaker embeddings and employs a Seq2Seq framework to efficiently map acoustic features to speaker labels. Additionally, we explore the application of mixture of experts in speaker diarization and introduce a Shared and Soft Mixture of Experts (SS-MoE) module to further mitigate model bias and enhance performance. Incorporating SS-MoE leads to the extended model NSD-MS2S-SSMoE. Experiments on multiple complex acoustic datasets, including the CHiME-6, DiPCo, Mixer 6, and DIHARD-III evaluation sets, demonstrate meaningful improvements in robustness and generalization. The proposed methods achieve state-of-the-art results, showcasing their effectiveness in challenging real-world scenarios.

Speaker diarization, which aims to determine the temporal boundaries of individual speakers within an audio stream and assign appropriate speaker identities, addresses the fundamental question of "who spoke when" [1]. It serves as a foundational component in numerous downstream speech-related tasks, including automatic meeting summarization, conversational analysis, and dialogue transcription [2].
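To make the Shared and Soft Mixture of Experts idea from the abstract more concrete, the following is a minimal sketch rather than the authors' implementation. It assumes the "soft" part follows the generic soft mixture-of-experts scheme (each expert slot receives a softmax-weighted average of all frames, and each frame receives a softmax-weighted average of all slot outputs), and it assumes the "shared" part is a single expert applied to every frame and added to the routed output. The names `shared_soft_moe`, `linear_expert`, and `phi`, along with all sizes, are hypothetical and chosen only for illustration.

```python
import math
import random

def matmul(a, b):
    # Plain-Python matrix product of two row-major nested lists.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def softmax(v):
    # Numerically stable softmax over a 1-D list.
    peak = max(v)
    exps = [math.exp(x - peak) for x in v]
    total = sum(exps)
    return [e / total for e in exps]

def rand_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

def linear_expert(weight):
    # Hypothetical stand-in expert: one linear map followed by ReLU.
    return lambda rows: [[max(0.0, v) for v in r] for r in matmul(rows, weight)]

def shared_soft_moe(x, phi, experts, shared_expert, slots_per_expert):
    """x: (frames, dim); phi: (dim, n_experts * slots_per_expert)."""
    logits = matmul(x, phi)                                # (frames, slots)
    # Dispatch weights: softmax over frames, one distribution per slot.
    dispatch_t = [softmax(col) for col in transpose(logits)]  # (slots, frames)
    # Combine weights: softmax over slots, one distribution per frame.
    combine = [softmax(row) for row in logits]             # (frames, slots)
    slot_inputs = matmul(dispatch_t, x)                    # (slots, dim)
    slot_outputs = []
    for i, expert in enumerate(experts):
        chunk = slot_inputs[i * slots_per_expert:(i + 1) * slots_per_expert]
        slot_outputs.extend(expert(chunk))                 # each expert sees only its slots
    routed = matmul(combine, slot_outputs)                 # (frames, dim)
    shared = shared_expert(x)                              # assumption: shared expert sees all frames
    y = [[r + s for r, s in zip(rr, sr)] for rr, sr in zip(routed, shared)]
    return y, dispatch_t, combine

# Toy usage with made-up sizes: 5 frames, 4-dim features, 3 experts, 2 slots each.
rng = random.Random(0)
frames, dim, n_experts, slots_per_expert = 5, 4, 3, 2
x = rand_matrix(frames, dim, rng)
phi = rand_matrix(dim, n_experts * slots_per_expert, rng)
experts = [linear_expert(rand_matrix(dim, dim, rng)) for _ in range(n_experts)]
shared = linear_expert(rand_matrix(dim, dim, rng))
y, dispatch_t, combine = shared_soft_moe(x, phi, experts, shared, slots_per_expert)
```

Because every frame contributes to every slot with a nonzero weight, this kind of routing has no hard top-k selection and stays fully differentiable. How the paper balances the shared and routed paths is not stated in the abstract, so the plain sum used here is only a placeholder.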
Jun-18-2025
- Country:
- North America > United States (0.14)
- Genre:
- Research Report (1.00)
- Industry:
- Media (0.34)
- Leisure & Entertainment (0.34)
- Technology: