Enhancing Speech Emotion Recognition with Graph-Based Multimodal Fusion and Prosodic Features for the Speech Emotion Recognition in Naturalistic Conditions Challenge at Interspeech 2025
Ferreira, Alef Iury Siqueira, Gris, Lucas Rafael, Filho, Alexandre Ferro, Ólives, Lucas, Ribeiro, Daniel, Fernando, Luiz, Lustosa, Fernanda, Tanaka, Rodrigo, de Oliveira, Frederico Santos, Filho, Arlindo Galvão
arXiv.org Artificial Intelligence
Training SER models on natural, spontaneous speech is especially challenging due to the subtle expression of emotions and the unpredictable nature of real-world audio. In this paper, we present a robust system for the INTERSPEECH 2025 Speech Emotion Recognition in Naturalistic Conditions Challenge, focusing on categorical emotion recognition. Our method combines state-of-the-art audio models with text features enriched by prosodic and spectral cues. In particular, we investigate the effectiveness of Fundamental Frequency (F0) quantization and the use of a pretrained audio tagging model. We also employ an ensemble model to improve robustness. On the official test set, our system achieved a Macro F1-score of 39.79% (42.20% on validation). Our results underscore the potential of these methods, and our analysis of fusion techniques confirmed the effectiveness of Graph Attention Networks. Our source code is publicly available.
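The F0 quantization mentioned in the abstract can be sketched as follows. This is a minimal, hypothetical illustration, not the paper's actual scheme: it assumes pitch values have already been extracted (e.g. by a pitch tracker), reserves token 0 for unvoiced frames, and maps voiced frames to log-spaced bins between assumed bounds of 50–500 Hz. The function name, bin count, and frequency range are all illustrative choices.

```python
import numpy as np

def quantize_f0(f0_hz, n_bins=32, f_min=50.0, f_max=500.0):
    """Map continuous F0 values (Hz) to discrete token indices.

    Unvoiced frames (f0 <= 0) receive the reserved index 0; voiced
    frames are binned log-uniformly between f_min and f_max, yielding
    indices 1..n_bins. All parameters here are illustrative assumptions,
    not the challenge system's actual configuration.
    """
    f0_hz = np.asarray(f0_hz, dtype=float)
    # Log-spaced bin edges, since pitch perception is roughly logarithmic.
    edges = np.logspace(np.log10(f_min), np.log10(f_max), n_bins + 1)
    voiced = f0_hz > 0
    idx = np.zeros_like(f0_hz, dtype=int)
    clipped = np.clip(f0_hz[voiced], f_min, f_max)
    idx[voiced] = np.clip(np.digitize(clipped, edges), 1, n_bins)
    return idx

# Example: one unvoiced frame followed by rising pitch.
f0 = [0.0, 100.0, 220.0, 440.0]
tokens = quantize_f0(f0)
```

Discretizing F0 this way lets prosody be fed to a model as a token sequence alongside text, which is one common motivation for quantizing pitch rather than using raw Hz values.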
Jun-4-2025
- Country:
  - Asia > China (0.04)
  - North America > United States
    - Massachusetts > Middlesex County > Cambridge (0.04)
  - South America > Brazil
    - Mato Grosso (0.04)
    - Rio Grande do Norte (0.04)
- Genre:
  - Research Report > New Finding (1.00)
- Technology:
  - Information Technology > Artificial Intelligence
    - Cognitive Science > Emotion (1.00)
    - Machine Learning > Neural Networks
      - Deep Learning (0.47)
    - Natural Language (1.00)
      - Speech > Speech Recognition (0.68)