Enhancing Speech Emotion Recognition Leveraging Aligning Timestamps of ASR Transcripts and Speaker Diarization
Wang, Hsuan-Yu, Lee, Pei-Ying, Chen, Berlin
arXiv.org Artificial Intelligence
In this paper, we investigate the impact of incorporating timestamp-based alignment between Automatic Speech Recognition (ASR) transcripts and Speaker Diarization (SD) outputs on Speech Emotion Recognition (SER) accuracy. Misalignment between these two modalities often reduces the reliability of multimodal emotion recognition systems, particularly in conversational contexts. To address this issue, we introduce an alignment pipeline utilizing pre-trained ASR and speaker diarization models, systematically synchronizing timestamps to generate accurately labeled speaker segments. Our multimodal approach combines textual embeddings extracted via RoBERTa with audio embeddings from Wav2Vec, leveraging cross-attention fusion enhanced by a gating mechanism. Experimental evaluations on the IEMOCAP benchmark dataset demonstrate that precise timestamp alignment improves SER accuracy, outperforming baseline methods that lack synchronization.

Speech Emotion Recognition (SER) has gained substantial research attention, particularly for its applications in human-computer interaction.
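The alignment step described above can be sketched as follows. This is a minimal illustration, not the paper's pipeline: it assumes the ASR model emits word-level timestamps and assigns each word the speaker label of the diarization segment with the greatest temporal overlap. All function and variable names are hypothetical.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Duration of temporal overlap between two intervals (seconds)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def align(asr_words, diar_segments):
    """Attach a speaker label to every ASR word.

    asr_words:     [(word, start, end), ...]
    diar_segments: [(speaker, start, end), ...]
    Each word takes the label of the segment it overlaps most.
    """
    labeled = []
    for word, w_start, w_end in asr_words:
        best = max(
            diar_segments,
            key=lambda seg: overlap(w_start, w_end, seg[1], seg[2]),
        )
        labeled.append((word, best[0], w_start, w_end))
    return labeled

# Toy example: two diarization segments, three recognized words.
words = [("hello", 0.0, 0.4), ("there", 0.5, 0.9), ("hi", 1.2, 1.5)]
segs = [("spk_A", 0.0, 1.0), ("spk_B", 1.0, 2.0)]
print(align(words, segs))
# → [('hello', 'spk_A', 0.0, 0.4), ('there', 'spk_A', 0.5, 0.9), ('hi', 'spk_B', 1.2, 1.5)]
```

A real pipeline would additionally have to handle words straddling a speaker change and gaps where diarization reports no active speaker; the maximum-overlap rule here is one common, simple heuristic.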
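The cross-attention fusion with gating mentioned in the abstract can be illustrated with toy arithmetic. In the sketch below, a single text vector attends over a few audio frame vectors, and a scalar sigmoid gate blends the attended audio with the text representation. The real model would use learned projection matrices over RoBERTa and Wav2Vec embeddings; the fixed vectors and the gate's input here are purely illustrative assumptions.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(xs):
    m = max(xs)  # shift for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def cross_attention(query, keys, values):
    """Single-query attention: the text embedding attends over audio frames."""
    scores = softmax([dot(query, k) for k in keys])
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(scores, values)) for i in range(dim)]

def gated_fusion(text_vec, audio_frames):
    attended = cross_attention(text_vec, audio_frames, audio_frames)
    # Scalar gate g in (0, 1); here derived from modality agreement
    # (a toy choice -- the paper's gate is a learned function).
    g = 1.0 / (1.0 + math.exp(-dot(text_vec, attended)))
    return [g * a + (1.0 - g) * t for a, t in zip(attended, text_vec)]

text = [0.2, 0.8]                  # stand-in for a text embedding
audio = [[0.1, 0.9], [0.7, 0.3]]   # stand-ins for audio frame embeddings
print(gated_fusion(text, audio))
```

The gate lets the model weight the acoustic evidence against the lexical content per utterance, which is the intuition behind gated fusion in multimodal SER.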
Jul-28-2025
- Country:
- Asia
- Macao (0.04)
- Taiwan > Taiwan Province
- Taipei (0.05)
- Genre:
- Research Report (1.00)
- Technology:
- Information Technology > Artificial Intelligence
- Cognitive Science > Emotion (1.00)
- Machine Learning (1.00)
- Natural Language (1.00)
- Speech > Speech Recognition (0.88)