MEDUSA: A Multimodal Deep Fusion Multi-Stage Training Framework for Speech Emotion Recognition in Naturalistic Conditions

Chatzichristodoulou, Georgios, Kosmopoulou, Despoina, Kritikos, Antonios, Poulopoulou, Anastasia, Georgiou, Efthymios, Katsamanis, Athanasios, Katsouros, Vassilis, Potamianos, Alexandros

Sep-5-2025–arXiv.org Artificial Intelligence

Speech emotion recognition (SER) is a challenging task due to the subjective nature of human emotions and their uneven representation under naturalistic conditions. We propose MEDUSA, a multimodal framework with a four-stage training pipeline that effectively handles class imbalance and emotion ambiguity. The first two stages train an ensemble of classifiers built on DeepSER, a novel extension of a deep cross-modal transformer fusion mechanism that fuses pretrained self-supervised acoustic and linguistic representations. Manifold MixUp is employed for further regularization. The last two stages optimize a trainable meta-classifier that combines the ensemble predictions. Our training approach incorporates human annotation scores as soft targets, coupled with balanced data sampling and multitask learning. MEDUSA ranked 1st in Task 1 (Categorical Emotion Recognition) of the Interspeech 2025 Speech Emotion Recognition in Naturalistic Conditions Challenge.
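To make two of the ingredients named in the abstract concrete, the sketch below shows how annotator votes could be turned into soft targets and how a MixUp-style interpolation could be applied to hidden representations. It is an illustration under stated assumptions, not the authors' implementation: the function names, the toy vectors, and the Beta parameter are hypothetical, and the abstract does not specify which layer is mixed or how annotation scores are normalized.

```python
import random


def soft_targets(votes, num_classes):
    """Turn per-annotator class votes into a probability distribution."""
    counts = [0.0] * num_classes
    for v in votes:
        counts[v] += 1.0
    total = sum(counts)
    return [c / total for c in counts]


def mixup(h_a, y_a, h_b, y_b, lam):
    """Convexly combine two hidden vectors and their soft targets."""
    h = [lam * a + (1.0 - lam) * b for a, b in zip(h_a, h_b)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return h, y


# Hypothetical toy data: 3 emotion classes, 4-dimensional hidden states.
y1 = soft_targets([0, 0, 2], num_classes=3)  # two votes for class 0, one for class 2
y2 = soft_targets([1, 1, 1], num_classes=3)  # unanimous vote for class 1
h1 = [1.0, 0.0, 2.0, 4.0]
h2 = [3.0, 2.0, 0.0, 0.0]

# Mixing coefficient drawn from Beta(alpha, alpha); alpha = 2.0 is an assumed value.
rng = random.Random(0)
lam = rng.betavariate(2.0, 2.0)

# In Manifold MixUp the interpolation is applied to intermediate-layer activations
# rather than to raw inputs; h1 and h2 stand in for such activations here.
h_mix, y_mix = mixup(h1, y1, h2, y2, lam)
```

Because both soft targets are probability distributions and the combination is convex, the mixed target `y_mix` is again a valid distribution, so it can be used directly with a cross-entropy loss over soft labels.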

artificial intelligence, machine learning, natural language

Country:
- Europe (1.00)
- North America > United States (0.98)

Genre:
- Research Report (0.50)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language (1.00)
  - Cognitive Science > Emotion (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (0.94)