EmoAugNet: A Signal-Augmented Hybrid CNN-LSTM Framework for Speech Emotion Recognition
Paul, Durjoy Chandra, Saha, Gaurob, Hossain, Md Amjad
–arXiv.org Artificial Intelligence
Recognizing emotional signals in speech has a significant impact on enhancing the effectiveness of human-computer interaction (HCI). This study introduces EmoAugNet, a hybrid deep learning framework, that incorporates Long Short-Term Memory (LSTM) layers with one-dimensional Convolutional Neural Networks (1D-CNN) to enable reliable Speech Emotion Recognition (SER). The quality and variety of the features that are taken from speech signals have a significant impact on how well SER systems perform. A comprehensive speech data augmentation strategy was used to combine both traditional methods, such as noise addition, pitch shifting, and time stretching, with a novel combination-based augmentation pipeline to enhance generalization and reduce overfitting. Each audio sample was transformed into a high-dimensional feature vector using root mean square energy (RMSE), Mel-frequency Cepstral Coefficient (MFCC), and zero-crossing rate (ZCR). Our model with ReLU activation has a weighted accuracy of 95.78\% and unweighted accuracy of 92.52\% on the IEMOCAP dataset and, with ELU activation, has a weighted accuracy of 96.75\% and unweighted accuracy of 91.28\%. On the RAVDESS dataset, we get a weighted accuracy of 94.53\% and 94.98\% unweighted accuracy for ReLU activation and 93.72\% weighted accuracy and 94.64\% unweighted accuracy for ELU activation. These results highlight EmoAugNet's effectiveness in improving the robustness and performance of SER systems through integated data augmentation and hybrid modeling.
arXiv.org Artificial Intelligence
Aug-11-2025
- Country:
- Asia > Bangladesh
- Sylhet Division > Sylhet District > Sylhet (0.04)
- Europe > Italy
- Calabria > Catanzaro Province > Catanzaro (0.04)
- North America > United States (0.04)
- Asia > Bangladesh
- Genre:
- Research Report (0.50)
- Technology: