Speech Emotion Recognition via Entropy-Aware Score Selection

Chua, ChenYi, Wong, JunKai, Chen, Chengxin, Miao, Xiaoxiao

arXiv.org Artificial Intelligence 

In this paper, we propose a multimodal framework for speech emotion recognition that leverages entropy-aware score selection to combine speech and textual predictions. The proposed method integrates a primary pipeline consisting of an acoustic model based on wav2vec2.0. To overcome the confidence constraints of the primary pipeline's predictions, we propose a late score fusion approach based on entropy and varentropy thresholds.

Speech Emotion Recognition (SER), which aims to recognise emotions directly from voice inputs as discrete emotion classes [1], has become a crucial area of study in human-computer interaction, enhancing the emotional intelligence of virtual assistants, interactive robots, and mental health monitoring systems [2]. The rapid development of deep SER models, such as Convolutional Neural Networks (CNNs) [3], Recurrent Neural Networks (RNNs) [4], and Transformer-based architectures [5], [6], [7], has substantially improved recognition accuracy by capturing complex temporal and contextual patterns in speech.
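To make the entropy-aware selection idea concrete, the following is a minimal sketch of how such a decision rule could look. It computes the Shannon entropy and varentropy (the variance of the surprisal) of the primary pipeline's class-probability output and, when either exceeds a threshold, falls back to a simple late fusion (here, averaging) with the textual pipeline's scores. The threshold values and the averaging rule are illustrative assumptions, not the paper's exact method:

```python
import math

def entropy_varentropy(probs):
    """Shannon entropy and varentropy of a discrete distribution.

    Varentropy is the variance of the surprisal -log p(x) under p.
    """
    ent = -sum(p * math.log(p) for p in probs if p > 0)
    varent = sum(p * (-math.log(p) - ent) ** 2 for p in probs if p > 0)
    return ent, varent

def select_scores(speech_probs, text_probs,
                  ent_thresh=1.0, varent_thresh=1.0):
    """Entropy-aware score selection (illustrative thresholds).

    Trust the primary (speech) prediction when it is confident;
    otherwise fuse it with the textual prediction by averaging.
    """
    ent, varent = entropy_varentropy(speech_probs)
    if ent > ent_thresh or varent > varent_thresh:
        # High uncertainty: late score fusion of the two pipelines.
        return [(a + b) / 2 for a, b in zip(speech_probs, text_probs)]
    # Low uncertainty: keep the primary pipeline's scores.
    return speech_probs
```

A uniform distribution over four classes has entropy log 4 ≈ 1.386 and varentropy 0, so it would trigger fusion under these thresholds, whereas a sharply peaked distribution would pass through unchanged.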