GEmo-CLAP: Gender-Attribute-Enhanced Contrastive Language-Audio Pretraining for Accurate Speech Emotion Recognition

Pan, Yu, Hu, Yanni, Yang, Yuguang, Fei, Wen, Yao, Jixun, Lu, Heng, Ma, Lei, Zhao, Jianjun

Dec-4-2023–arXiv.org Artificial Intelligence

Contrastive cross-modality pretraining has recently exhibited impressive success in diverse fields, whereas there is limited research on their merits in speech emotion recognition (SER). In this paper, we propose GEmo-CLAP, a kind of gender-attribute-enhanced contrastive language-audio pretraining (CLAP) method for SER. Specifically, we first construct an effective emotion CLAP (Emo-CLAP) for SER, using pre-trained text and audio encoders. Second, given the significance of gender information in SER, two novel multi-task learning based GEmo-CLAP (ML-GEmo-CLAP) and soft label based GEmo-CLAP (SL-GEmo-CLAP) models are further proposed to incorporate gender information of speech signals, forming more reasonable objectives. Experiments on IEMOCAP indicate that our proposed two GEmo-CLAPs consistently outperform Emo-CLAP with different pre-trained models. Remarkably, the proposed WavLM-based SL-GEmo-CLAP obtains the best WAR of 83.16\%, which performs better than state-of-the-art SER methods.

emo-clap, emotion recognition, recognition, (16 more...)

arXiv.org Artificial Intelligence

Dec-4-2023

arXiv.org PDF

Add feedback

Country:
- North America > Canada
  - Alberta (0.14)
- Asia
  - Japan
    - Kyūshū & Okinawa > Kyūshū (0.05)
    - Honshū > Kantō
      - Tokyo Metropolis Prefecture > Tokyo (0.14)
  - China > Shanghai
    - Shanghai (0.05)

Genre:
- Research Report (0.50)

Technology:
- Information Technology > Artificial Intelligence
  - Speech (1.00)
  - Cognitive Science > Emotion (0.65)
  - Machine Learning > Neural Networks
    - Deep Learning (0.47)