speech emotion recognition
Multi-Loss Learning for Speech Emotion Recognition with Energy-Adaptive Mixup and Frame-Level Attention
Wang, Cong, Geng, Yizhong, Wen, Yuhua, Li, Qifei, Gao, Yingming, Wang, Ruimin, Wang, Chunfeng, Li, Hao, Li, Ya, Chen, Wei
Speech emotion recognition (SER) is an important technology in human-computer interaction. However, achieving high performance is challenging due to emotional complexity and scarce annotated data. To tackle these challenges, we propose a multi-loss learning (MLL) framework integrating an energy-adaptive mixup (EAM) method and a frame-level attention module (FLAM). The EAM method leverages SNR-based augmentation to generate diverse speech samples capturing subtle emotional variations. FLAM enhances frame-level feature extraction for multi-frame emotional cues. Our MLL strategy combines Kullback-Leibler divergence, focal, center, and supervised contrastive losses to optimize learning, address class imbalance, and improve feature separability. We evaluate our method on four widely used SER datasets: IEMOCAP, MSP-IMPROV, RAVDESS, and SAVEE. The results demonstrate that our method achieves state-of-the-art performance, suggesting its effectiveness and robustness.
- Asia > China > Beijing > Beijing (0.05)
- Europe > United Kingdom > England > Surrey > Guildford (0.04)
- Europe > Netherlands > North Holland > Amsterdam (0.04)
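As an illustration of the multi-loss idea described in the abstract above, the sketch below combines a KL-divergence term against soft labels with focal and center losses in PyTorch. The loss weights, the four-class setup, and the omission of the supervised contrastive term are simplifying assumptions for illustration, not the authors' exact formulation.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0):
    """Focal loss: down-weights easy examples to counter class imbalance."""
    log_p = F.log_softmax(logits, dim=-1)
    log_pt = log_p.gather(1, targets.unsqueeze(1)).squeeze(1)   # log-prob of true class
    return (-((1.0 - log_pt.exp()) ** gamma) * log_pt).mean()

class CenterLoss(torch.nn.Module):
    """Center loss: pulls embeddings toward a learnable per-class centre."""
    def __init__(self, num_classes, feat_dim):
        super().__init__()
        self.centers = torch.nn.Parameter(torch.randn(num_classes, feat_dim))

    def forward(self, feats, targets):
        return ((feats - self.centers[targets]) ** 2).sum(dim=1).mean()

def multi_loss(logits, feats, targets, soft_targets, center_loss, w=(1.0, 1.0, 0.1)):
    """Weighted sum of KL (to soft labels), focal, and center terms (weights assumed)."""
    kl = F.kl_div(F.log_softmax(logits, dim=-1), soft_targets, reduction="batchmean")
    return w[0] * kl + w[1] * focal_loss(logits, targets) + w[2] * center_loss(feats, targets)

# Toy usage with 4 emotion classes and 128-dim embeddings (both assumed):
center = CenterLoss(num_classes=4, feat_dim=128)
logits, feats = torch.randn(8, 4), torch.randn(8, 128)
targets = torch.randint(0, 4, (8,))
soft = torch.softmax(torch.randn(8, 4), dim=-1)
loss = multi_loss(logits, feats, targets, soft, center)
```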
EM2LDL: A Multilingual Speech Corpus for Mixed Emotion Recognition through Label Distribution Learning
Li, Xingfeng, Shi, Xiaohan, Li, Junjie, Li, Yongwei, Unoki, Masashi, Toda, Tomoki, Akagi, Masato
This study introduces EM2LDL, a novel multilingual speech corpus designed to advance mixed emotion recognition through label distribution learning. Addressing the limitations of predominantly monolingual and single-label emotion corpora that restrict linguistic diversity, are unable to model mixed emotions, and lack ecological validity, EM2LDL comprises expressive utterances in English, Mandarin, and Cantonese, capturing the intra-utterance code-switching prevalent in multilingual regions like Hong Kong and Macao. The corpus integrates spontaneous emotional expressions from online platforms, annotated with fine-grained emotion distributions across 32 categories. Experimental baselines using self-supervised learning models demonstrate robust performance in speaker-independent gender-, age-, and personality-based evaluations, with HuBERT-large-EN achieving optimal results. By incorporating linguistic diversity and ecological validity, EM2LDL enables the exploration of complex emotional dynamics in multilingual settings. This work provides a versatile testbed for developing adaptive, empathetic systems for applications in affective computing, including mental health monitoring and cross-cultural communication. The dataset, annotations, and baseline codes are publicly available at https://github.com/xingfengli/EM2LDL.
- Asia > Macao (0.34)
- Asia > China > Hong Kong (0.25)
- Oceania > Australia > Australian Capital Territory > Canberra (0.05)
- (8 more...)
- Overview (0.93)
- Research Report > New Finding (0.93)
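A minimal sketch of the label-distribution-learning setup this corpus targets: a linear head over pooled self-supervised embeddings is trained with a KL objective against annotator-derived emotion distributions. The 1024-dimensional embedding and the linear head are illustrative assumptions, not the baseline code from the repository.

```python
import torch
import torch.nn.functional as F

NUM_EMOTIONS = 32          # EM2LDL annotates 32 emotion categories
EMBED_DIM = 1024           # e.g. a pooled HuBERT-large utterance embedding (assumption)

head = torch.nn.Linear(EMBED_DIM, NUM_EMOTIONS)

def ldl_loss(embeddings, label_dists):
    """KL divergence between predicted and annotated emotion distributions."""
    log_pred = F.log_softmax(head(embeddings), dim=-1)
    return F.kl_div(log_pred, label_dists, reduction="batchmean")

# Toy batch: 4 pooled utterance embeddings and their annotated distributions.
emb = torch.randn(4, EMBED_DIM)
dist = torch.softmax(torch.randn(4, NUM_EMOTIONS), dim=-1)
loss = ldl_loss(emb, dist)
loss.backward()
```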
Beyond saliency: enhancing explanation of speech emotion recognition with expert-referenced acoustic cues
Nasr, Seham, Ren, Zhao, Johnson, David
Explainable AI (XAI) for Speech Emotion Recognition (SER) is critical for building transparent, trustworthy models. Current saliency-based methods, adapted from vision, highlight spectrogram regions but fail to show whether these regions correspond to meaningful acoustic markers of emotion, limiting faithfulness and interpretability. We propose a framework that overcomes these limitations by quantifying the magnitudes of cues within salient regions. This clarifies "what" is highlighted and connects it to "why" it matters, linking saliency to expert-referenced acoustic cues of speech emotions. Experiments on benchmark SER datasets show that our approach improves explanation quality by explicitly linking salient regions to theory-driven, expert-referenced acoustic cues. Compared to standard saliency methods, it provides more understandable and plausible explanations of SER models, offering a foundational step towards trustworthy speech-based affective computing.
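To make the cue-quantification step concrete, the sketch below (hypothetical, not the authors' code) takes a saliency map aligned with a log-mel spectrogram, thresholds it into salient frames, and compares simple acoustic cues (frame energy, spectral centroid) inside versus outside the salient region. The cue set is a simplified stand-in for the paper's expert-referenced cues.

```python
import numpy as np

def cues_in_salient_region(log_mel, saliency, mel_freqs, top_pct=20):
    """Compare mean energy and spectral centroid inside vs. outside salient frames.

    log_mel, saliency: (n_mels, n_frames) arrays; mel_freqs: (n_mels,) centre frequencies.
    """
    frame_saliency = saliency.mean(axis=0)                      # average over mel bins
    thresh = np.percentile(frame_saliency, 100 - top_pct)
    salient = frame_saliency >= thresh                          # boolean frame mask

    power = np.exp(log_mel)                                     # back to linear power (natural-log assumed)
    energy = power.sum(axis=0)                                  # per-frame energy
    centroid = (mel_freqs[:, None] * power).sum(axis=0) / (power.sum(axis=0) + 1e-8)

    return {
        "energy_salient": float(energy[salient].mean()),
        "energy_rest": float(energy[~salient].mean()),
        "centroid_salient_hz": float(centroid[salient].mean()),
        "centroid_rest_hz": float(centroid[~salient].mean()),
    }
```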
CO-VADA: A Confidence-Oriented Voice Augmentation Debiasing Approach for Fair Speech Emotion Recognition
Tsai, Yun-Shao, Lin, Yi-Cheng, Chou, Huang-Cheng, Lee, Hung-yi
Bias in speech emotion recognition (SER) systems often stems from spurious correlations between speaker characteristics and emotional labels, leading to unfair predictions across demographic groups. Many existing debiasing methods require model-specific changes or demographic annotations, limiting their practical use. We present CO-VADA, a Confidence-Oriented Voice Augmentation Debiasing Approach that mitigates bias without modifying model architecture or relying on demographic information. CO-VADA identifies training samples that reflect bias patterns present in the training data and then applies voice conversion to alter irrelevant attributes and generate samples. These augmented samples introduce speaker variations that differ from dominant patterns in the data, guiding the model to focus more on emotion-relevant features. Our framework is compatible with various SER models and voice conversion tools, making it a scalable and practical solution for improving fairness in SER systems.
- North America > United States > California > Los Angeles County > Los Angeles (0.28)
- Asia > Taiwan > Taiwan Province > Taipei (0.04)
- Asia > Singapore (0.04)
- (6 more...)
- Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.93)
- Information Technology > Artificial Intelligence > Cognitive Science > Emotion (0.87)
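A rough sketch of how such a confidence-oriented augmentation loop might look: samples the current model classifies with unusually high confidence are treated as candidates carrying spurious speaker cues and are re-synthesised with a different speaker through voice conversion. The selection rule and the `convert_voice` placeholder are assumptions; CO-VADA's exact sampling criterion is not reproduced here.

```python
import random
import torch

@torch.no_grad()
def select_for_augmentation(model, dataset, top_k=100):
    """Return indices of the top-k most confidently classified training samples."""
    confidences = []
    for idx, (wav, label) in enumerate(dataset):          # label: int class index
        probs = torch.softmax(model(wav.unsqueeze(0)), dim=-1).squeeze(0)
        confidences.append((probs[label].item(), idx))
    confidences.sort(reverse=True)
    return [idx for _, idx in confidences[:top_k]]

def augment(dataset, indices, speakers, convert_voice):
    """Create VC-augmented copies: same emotion label, different (random) speaker."""
    augmented = []
    for idx in indices:
        wav, label = dataset[idx]
        target_spk = random.choice(speakers)              # speaker differing from dominant patterns
        augmented.append((convert_voice(wav, target_spk), label))
    return augmented
```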
Emotion Detection in Speech Using Lightweight and Transformer-Based Models: A Comparative and Ablation Study
Onyekwelu-Udoka, Lucky, Islam, Md Shafiqul, Hasan, Md Shahedul
Emotion recognition from speech plays a vital role in the development of empathetic human-computer interaction systems. This paper presents a comparative analysis of lightweight transformer-based models, DistilHuBERT and PaSST, by classifying six core emotions from the CREMA-D dataset. DistilHuBERT demonstrates superior accuracy (70.64%) and F1 score (70.36%) while maintaining an exceptionally small model size (0.02 MB), outperforming both PaSST and the baseline. Furthermore, we conducted an ablation study on three PaSST classification-head variants (Linear, MLP, and Attentive Pooling) to understand the effect of classification head architecture on model performance. Our results indicate that PaSST with an MLP head yields the best performance among its variants but still falls short of DistilHuBERT. Among the emotion classes, angry is consistently the most accurately detected, while disgust remains the most challenging. These findings suggest that lightweight transformers like DistilHuBERT offer a compelling solution for real-time speech emotion recognition on edge devices.
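The head variants compared in this ablation are straightforward to sketch; below is a hypothetical PyTorch version of the Linear, MLP, and attentive-pooling heads over frame-level encoder outputs. The 768-dimensional encoder width, hidden size, and dropout rate are assumptions.

```python
import torch
import torch.nn as nn

D, N_CLASSES = 768, 6   # assumed encoder width; CREMA-D's six core emotions

linear_head = nn.Linear(D, N_CLASSES)

mlp_head = nn.Sequential(nn.Linear(D, 256), nn.ReLU(), nn.Dropout(0.1),
                         nn.Linear(256, N_CLASSES))

class AttentivePoolingHead(nn.Module):
    """Learns per-frame weights, pools over time, then classifies the pooled vector."""
    def __init__(self, dim, n_classes):
        super().__init__()
        self.score = nn.Linear(dim, 1)
        self.out = nn.Linear(dim, n_classes)

    def forward(self, frames):                     # frames: (batch, time, dim)
        w = torch.softmax(self.score(frames), dim=1)
        pooled = (w * frames).sum(dim=1)
        return self.out(pooled)

# Linear/MLP heads expect a pooled vector, e.g. mean over time:
frames = torch.randn(2, 100, D)
logits_mlp = mlp_head(frames.mean(dim=1))
logits_att = AttentivePoolingHead(D, N_CLASSES)(frames)
```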
Cross-Corpus Validation of Speech Emotion Recognition in Urdu using Domain-Knowledge Acoustic Features
Talpur, Unzela, Syed, Zafi Sherhan, Syed, Muhammad Shehram Shah, Syed, Abbas Shah
Speech Emotion Recognition (SER) is a key affective computing technology that enables emotionally intelligent artificial intelligence. While SER is challenging in general, it is particularly difficult for low-resource languages such as Urdu. This study investigates Urdu SER in a cross-corpus setting, an area that has remained largely unexplored. We employ a cross-corpus evaluation framework across three different Urdu emotional speech datasets to test model generalization. Two standard domain-knowledge based acoustic feature sets, eGeMAPS and ComParE, are used to represent speech signals as feature vectors which are then passed to Logistic Regression and Multilayer Perceptron classifiers. Classification performance is assessed using unweighted average recall (UAR) whilst considering class-label imbalance. Results show that self-corpus validation often overestimates performance, with UAR exceeding cross-corpus evaluation by up to 13%, underscoring that cross-corpus evaluation offers a more realistic measure of model robustness. Overall, this work emphasizes the importance of cross-corpus validation for Urdu SER and contributes to advancing affective computing research for underrepresented language communities.
- Europe > United Kingdom > England > Greater London > London (0.04)
- Europe > Italy > Tuscany > Florence (0.04)
- Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)
- (2 more...)
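A minimal sketch of the cross-corpus protocol described above: fit a classifier on acoustic functionals from one corpus, then score unweighted average recall (macro-averaged recall) on another. The scikit-learn pipeline and the 88-dimensional stand-in for eGeMAPS functionals (as extracted by openSMILE) are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def cross_corpus_uar(X_train, y_train, X_test, y_test):
    """Fit on the source corpus, report UAR (macro-averaged recall) on the target."""
    clf = make_pipeline(StandardScaler(),
                        LogisticRegression(max_iter=2000, class_weight="balanced"))
    clf.fit(X_train, y_train)
    return recall_score(y_test, clf.predict(X_test), average="macro")

# Toy usage with random stand-in features (88-dim, as in eGeMAPS functionals):
rng = np.random.default_rng(0)
uar = cross_corpus_uar(rng.normal(size=(200, 88)), rng.integers(0, 4, 200),
                       rng.normal(size=(80, 88)), rng.integers(0, 4, 80))
```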
Transformer Redesign for Late Fusion of Audio-Text Features on Ultra-Low-Power Edge Hardware
Mitsis, Stavros, Hadjikyriakos, Ermos, Ibrahim, Humaid, Neofytou, Savvas, Raman, Shashwat, Myles, James, Kanjo, Eiman
Deploying emotion recognition systems in real-world environments where devices must be small, low-power, and private remains a significant challenge. This is especially relevant for applications such as tension monitoring, conflict de-escalation, and responsive wearables, where cloud-based solutions are impractical. Multimodal emotion recognition has advanced through deep learning, but most systems remain unsuitable for deployment on ultra-constrained edge devices. Prior work typically relies on powerful hardware, lacks real-time performance, or uses unimodal input. This paper addresses that gap by presenting a hardware-aware emotion recognition system that combines acoustic and linguistic features using a late-fusion architecture optimised for Edge TPU. The design integrates a quantised transformer-based acoustic model with frozen keyword embeddings from a DSResNet-SE network, enabling real-time inference within a 1.8 MB memory budget and 21-23 ms latency. The pipeline ensures spectrogram alignment between training and deployment using MicroFrontend and MLTK. Evaluation on re-recorded, segmented IEMOCAP samples captured through the Coral Dev Board Micro microphone shows a 6.3% macro F1 improvement over unimodal baselines. This work demonstrates that accurate, real-time multimodal emotion inference is achievable on microcontroller-class edge platforms through task-specific fusion and hardware-guided model design.
- North America > United States > California (0.14)
- North America > Canada > Quebec > Montreal (0.04)
- Europe > United Kingdom > England > Nottinghamshire > Nottingham (0.04)
- Europe > United Kingdom > England > Greater London > London (0.04)
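A simplified decision-level fusion sketch in the spirit of the pipeline above: class probabilities from an acoustic model and a keyword/text model are combined with a fixed weight. The two sub-models, the fusion weight, and the four-class setup are placeholders; the paper's quantised Edge TPU models are not reproduced here.

```python
import torch
import torch.nn as nn

class LateFusionSER(nn.Module):
    def __init__(self, acoustic_model, text_model, alpha=0.6):
        super().__init__()
        self.acoustic = acoustic_model    # e.g. spectrogram transformer -> logits
        self.text = text_model            # e.g. frozen keyword-embedding net -> logits
        self.alpha = alpha                # fusion weight (assumed; tune on a dev set)

    def forward(self, spectrogram, keyword_feats):
        p_audio = torch.softmax(self.acoustic(spectrogram), dim=-1)
        p_text = torch.softmax(self.text(keyword_feats), dim=-1)
        return self.alpha * p_audio + (1.0 - self.alpha) * p_text

# Toy usage with linear stand-ins for the two sub-models (4 emotion classes):
fusion = LateFusionSER(nn.Linear(128, 4), nn.Linear(64, 4))
probs = fusion(torch.randn(1, 128), torch.randn(1, 64))
```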
EmoSphere-SER: Enhancing Speech Emotion Recognition Through Spherical Representation with Auxiliary Classification
Cho, Deok-Hyeon, Oh, Hyung-Seok, Kim, Seung-Bin, Lee, Seong-Whan
Speech emotion recognition predicts a speaker's emotional state from speech signals using discrete labels or continuous dimensions such as arousal, valence, and dominance (VAD). We propose EmoSphere-SER, a joint model that integrates spherical VAD region classification to guide VAD regression for improved emotion prediction. In our framework, VAD values are transformed into spherical coordinates that are divided into multiple spherical regions, and an auxiliary classification task predicts which spherical region each point belongs to, guiding the regression process. Additionally, we incorporate a dynamic weighting scheme and a style pooling layer with multi-head self-attention to capture spectral and temporal dynamics, further boosting performance. This combined training strategy reinforces structured learning and improves prediction consistency. Experimental results show that our approach exceeds baseline methods, confirming the validity of the proposed framework.
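The spherical view of VAD can be sketched directly: centred VAD values are mapped to radius/azimuth/elevation, and the two angles are binned into coarse regions that serve as auxiliary classification targets alongside the regression head. The centre point, value range, and bin counts below are assumptions, not the paper's exact partition.

```python
import torch

def vad_to_spherical(vad, center=0.5):
    """vad: (batch, 3) with columns valence, arousal, dominance assumed in [0, 1]."""
    x, y, z = (vad - center).unbind(dim=-1)
    r = torch.sqrt(x**2 + y**2 + z**2)
    azimuth = torch.atan2(y, x)                               # angle in the V-A plane
    elevation = torch.atan2(z, torch.sqrt(x**2 + y**2))
    return r, azimuth, elevation

def spherical_region_id(vad, n_azimuth=4, n_elevation=2):
    """Discretise the two angles into n_azimuth * n_elevation region labels."""
    _, az, el = vad_to_spherical(vad)
    az_bin = torch.clamp(((az + torch.pi) / (2 * torch.pi) * n_azimuth).long(),
                         max=n_azimuth - 1)
    el_bin = torch.clamp(((el + torch.pi / 2) / torch.pi * n_elevation).long(),
                         max=n_elevation - 1)
    return az_bin * n_elevation + el_bin

# Auxiliary classification target for a toy batch of VAD annotations:
region = spherical_region_id(torch.rand(8, 3))
```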
MATER: Multi-level Acoustic and Textual Emotion Representation for Interpretable Speech Emotion Recognition
Jon, Hyo Jin, Jin, Longbin, Jung, Hyuntaek, Kim, Hyunseo, Min, Donghun, Kim, Eun Yi
This paper presents our contributions to the Speech Emotion Recognition in Naturalistic Conditions (SERNC) Challenge, where we address categorical emotion recognition and emotional attribute prediction. To handle the complexities of natural speech, including intra- and inter-subject variability, we propose Multi-level Acoustic-Textual Emotion Representation (MATER), a novel hierarchical framework that integrates acoustic and textual features at the word, utterance, and embedding levels. By fusing low-level lexical and acoustic cues with high-level contextualized representations, MATER effectively captures both fine-grained prosodic variations and semantic nuances. Additionally, we introduce an uncertainty-aware ensemble strategy to mitigate annotator inconsistencies, improving robustness in ambiguous emotional expressions. MATER ranks fourth in both tasks with a Macro-F1 of 41.01% and an average CCC of 0.5928, securing second place in valence prediction with an impressive CCC of 0.6941.
- Asia > South Korea (0.05)
- Europe > United Kingdom > England > Surrey > Guildford (0.04)
- Europe > Netherlands > South Holland > Rotterdam (0.04)
- Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)
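One plausible reading of the uncertainty-aware ensemble is sketched below: each member's class probabilities are weighted by the inverse of its predictive entropy before averaging, so more confident members contribute more. This weighting is an assumption for illustration; MATER's exact scheme is not specified in the abstract.

```python
import torch

def uncertainty_weighted_ensemble(member_probs, eps=1e-8):
    """member_probs: (n_members, batch, n_classes) class probabilities."""
    entropy = -(member_probs * (member_probs + eps).log()).sum(dim=-1)   # (n_members, batch)
    weights = 1.0 / (entropy + eps)
    weights = weights / weights.sum(dim=0, keepdim=True)                 # normalise over members
    return (weights.unsqueeze(-1) * member_probs).sum(dim=0)             # (batch, n_classes)

# Toy usage: fuse three members' predictions over 8 emotion classes.
fused = uncertainty_weighted_ensemble(torch.softmax(torch.randn(3, 5, 8), dim=-1))
```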
Emotion-Disentangled Embedding Alignment for Noise-Robust and Cross-Corpus Speech Emotion Recognition
Tiwari, Upasana, Chakraborty, Rupayan, Kopparapu, Sunil Kumar
The effectiveness of speech emotion recognition in real-world scenarios is often hindered by noisy environments and variability across datasets. This paper introduces a two-step approach to enhance the robustness and generalization of speech emotion recognition models through improved representation learning. First, our model employs EDRL (Emotion-Disentangled Representation Learning) to extract class-specific discriminative features while preserving shared similarities across emotion categories. Next, MEA (Multiblock Embedding Alignment) refines these representations by projecting them into a joint discriminative latent subspace that maximizes covariance with the original speech input. The learned EDRL-MEA embeddings are subsequently used to train an emotion classifier using clean samples from publicly available datasets, and are evaluated on unseen noisy and cross-corpus speech samples. Improved performance under these challenging conditions demonstrates the effectiveness of the proposed method.
- North America > United States > New York > New York County > New York City (0.04)
- Asia > Indonesia > Bali (0.04)
- Asia > India > Karnataka > Bengaluru (0.04)
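The covariance-maximising alignment step can be illustrated with partial least squares: latent directions of the learned embeddings are chosen to maximise covariance with the original speech features. The PLS choice and the dimensions below are assumptions, not the paper's exact MEA formulation.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(500, 256))    # stand-in for learned EDRL emotion embeddings
speech_feats = rng.normal(size=(500, 88))   # stand-in for original speech input features

# Latent directions of the embeddings chosen to maximise covariance with the input.
pls = PLSRegression(n_components=32)
pls.fit(embeddings, speech_feats)
aligned = pls.transform(embeddings)         # (500, 32) covariance-aligned embeddings
```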