Bimodal Connection Attention Fusion for Speech Emotion Recognition
Jiachen Luo, Huy Phan, Lin Wang, Joshua D. Reiss
Multi-modal emotion recognition is challenging due to the difficulty of extracting features that capture subtle emotional differences. Understanding multi-modal interactions and connections is key to building effective bimodal speech emotion recognition systems. In this work, we propose the Bimodal Connection Attention Fusion (BCAF) method, which comprises three main modules: an interactive connection network, a bimodal attention network, and a correlative attention network. The interactive connection network uses an encoder-decoder architecture to model modality connections between audio and text while leveraging modality-specific features. The bimodal attention network enhances semantic complementation and exploits intra- and inter-modal interactions. The correlative attention network reduces cross-modal noise and captures correlations between audio and text. Experiments on the MELD and IEMOCAP datasets demonstrate that the proposed BCAF method outperforms existing state-of-the-art baselines.
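The abstract names inter-modal attention between audio and text as the core fusion mechanism. As a rough illustration of that general idea only, the sketch below implements a generic symmetric cross-modal attention block in PyTorch; the module names, feature dimensions, pooling strategy, and classification head are all assumptions for demonstration, not the paper's actual BCAF implementation.

```python
# Illustrative sketch only: generic cross-modal attention fusion in PyTorch.
# All names, dimensions, and the fusion/classifier design are assumptions;
# this is not the paper's BCAF architecture.
import torch
import torch.nn as nn


class CrossModalAttention(nn.Module):
    """One modality (query) attends to the other (key/value)."""

    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, query: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # query:   (batch, T_q, dim), e.g. text token features
        # context: (batch, T_c, dim), e.g. audio frame features
        attended, _ = self.attn(query, context, context)
        return self.norm(query + attended)  # residual connection


class BimodalFusion(nn.Module):
    """Symmetric inter-modal attention, then mean-pool, concatenate, classify."""

    def __init__(self, dim: int = 256, num_heads: int = 4, num_classes: int = 7):
        # num_classes=7 matches MELD's seven emotion categories (assumption
        # for this example; IEMOCAP setups typically use fewer classes).
        super().__init__()
        self.audio_to_text = CrossModalAttention(dim, num_heads)
        self.text_to_audio = CrossModalAttention(dim, num_heads)
        self.classifier = nn.Linear(2 * dim, num_classes)

    def forward(self, audio: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        text_enriched = self.audio_to_text(text, audio)   # text attends to audio
        audio_enriched = self.text_to_audio(audio, text)  # audio attends to text
        fused = torch.cat(
            [text_enriched.mean(dim=1), audio_enriched.mean(dim=1)], dim=-1
        )
        return self.classifier(fused)


# Example with hypothetical feature shapes: batch=2, 100 audio frames,
# 20 text tokens, shared feature dim 256.
audio_feats = torch.randn(2, 100, 256)
text_feats = torch.randn(2, 20, 256)
logits = BimodalFusion()(audio_feats, text_feats)
print(logits.shape)  # torch.Size([2, 7])
```

In this sketch each modality attends to the other and the results are pooled and concatenated; BCAF's encoder-decoder connection modeling and correlative noise-reduction modules are not reproduced here.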
arXiv.org Artificial Intelligence
Mar-12-2025