Overlap-aware End-to-End Supervised Hierarchical Graph Clustering for Speaker Diarization

Jan-23-2024–arXiv.org Artificial Intelligence

Speaker diarization, the task of segmenting an audio recording based on speaker identity, is an important speech pre-processing step for several downstream applications. The conventional approach to diarization involves separate steps of embedding extraction and clustering, which are often optimized in isolation. While end-to-end diarization systems attempt to learn a single model for the task, they are often cumbersome to train and require large supervised datasets. In this paper, we propose an end-to-end supervised hierarchical clustering algorithm based on graph neural networks (GNNs), called End-to-end Supervised HierARchical Clustering (E-SHARC). The E-SHARC approach takes front-end mel-filterbank features as input and jointly learns an embedding extractor and the GNN clustering module, performing representation learning, metric learning, and clustering with end-to-end optimization. Further, with additional inputs from an external overlap detector, the E-SHARC approach can predict the speakers in overlapping speech regions. Experimental evaluation on several benchmark datasets, including AMI, VoxConverse, and DISPLACE, shows that the proposed E-SHARC framework improves significantly over state-of-the-art diarization systems.
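To make the idea of hierarchical clustering over a graph of segment embeddings more concrete, the following is a minimal sketch and not the E-SHARC implementation. It assumes the embeddings are already extracted, and it swaps the learned GNN edge scores for a fixed cosine-similarity threshold. The function names, the `threshold` and `max_levels` parameters, and the toy data are hypothetical, and overlap handling is not modeled.

```python
import numpy as np


def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarity between the rows of `vectors`."""
    normed = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    return normed @ normed.T


def hierarchical_graph_clustering(embeddings, threshold=0.5, max_levels=10):
    """Merge segments level by level over a nearest-neighbour graph.

    Assumption: a fixed cosine threshold stands in for the edge scores
    that a trained GNN would predict in a supervised system.
    """
    labels = np.arange(len(embeddings))
    for _ in range(max_levels):
        cluster_ids = np.unique(labels)
        if len(cluster_ids) == 1:
            break
        # One graph node per current cluster, represented by its centroid.
        centroids = np.stack(
            [embeddings[labels == c].mean(axis=0) for c in cluster_ids]
        )
        sim = cosine_similarity_matrix(centroids)
        np.fill_diagonal(sim, -np.inf)
        # Each node proposes an edge to its most similar neighbour.
        nearest = sim.argmax(axis=1)
        linked = sim[np.arange(len(cluster_ids)), nearest] >= threshold
        if not linked.any():
            break
        # Connected components of the accepted edges (union-find).
        parent = list(range(len(cluster_ids)))

        def find(i):
            while parent[i] != i:
                parent[i] = parent[parent[i]]
                i = parent[i]
            return i

        for i in np.flatnonzero(linked):
            root_i, root_j = find(i), find(nearest[i])
            if root_i != root_j:
                parent[root_i] = root_j
        roots = np.array([find(i) for i in range(len(cluster_ids))])
        # Propagate the merged component id back to every segment.
        index_of = {c: k for k, c in enumerate(cluster_ids)}
        labels = np.array([roots[index_of[c]] for c in labels])
    # Relabel clusters as consecutive integers starting at 0.
    _, labels = np.unique(labels, return_inverse=True)
    return labels


# Toy usage: two synthetic "speakers" with five segments each.
rng = np.random.default_rng(0)
speaker_a = np.array([1.0, 0.0, 0.0]) + 0.01 * rng.normal(size=(5, 3))
speaker_b = np.array([0.0, 1.0, 0.0]) + 0.01 * rng.normal(size=(5, 3))
labels = hierarchical_graph_clustering(np.vstack([speaker_a, speaker_b]))
print(labels)
```

In this sketch, merging stops when no nearest-neighbour edge clears the threshold, so the number of speakers is not fixed in advance. A learned system would replace both the similarity function and the stopping rule with model predictions.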

dataset, diarization, speaker diarization

Country:
- North America > United States
  - Maryland > Montgomery County > Bethesda (0.04)
- Europe
  - Sweden > Stockholm
    - Stockholm (0.04)
  - Spain > Catalonia
    - Barcelona Province > Barcelona (0.04)
- Asia
  - Middle East > Jordan (0.04)
  - India > Karnataka
    - Bengaluru (0.04)

Genre:
- Research Report (0.50)

Industry:
- Media (0.48)

Technology:
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)