EmotionRankCLAP: Bridging Natural Language Speaking Styles and Ordinal Speech Emotion via Rank-N-Contrast

Chandra, Shreeram Suresh, Goncalves, Lucas, Lu, Junchen, Busso, Carlos, Sisman, Berrak

May-30-2025–arXiv.org Artificial Intelligence

Current emotion-based contrastive language-audio pretraining (CLAP) methods typically learn by na ıvely aligning audio samples with corresponding text prompts. Consequently, this approach fails to capture the ordinal nature of emotions, hindering inter-emotion understanding and often resulting in a wide modality gap between the audio and text embeddings due to insufficient alignment. To handle these drawbacks, we introduce EmotionRankCLAP, a supervised contrastive learning approach that uses dimensional attributes of emotional speech and natural language prompts to jointly capture fine-grained emotion variations and improve cross-modal alignment. Our approach utilizes a Rank-N-Contrast objective to learn ordered relationships by contrasting samples based on their rankings in the valence-arousal space. EmotionRankCLAP outperforms existing emotion-CLAP methods in modeling emotion ordinality across modalities, measured via a cross-modal retrieval task.

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

May-30-2025

arXiv.org PDF

Add feedback

Country:
- North America > United States (0.47)

Genre:
- Research Report (0.83)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Large Language Model (0.95)
  - Machine Learning > Neural Networks
    - Deep Learning (0.68)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found