Focal Loss based Residual Convolutional Neural Network for Speech Emotion Recognition

Suraj Tripathi, Abhay Kumar, Abhiram Ramesh, Chirag Singh, Promod Yenigalla

arXiv.org Machine Learning 

This paper proposes a Residual Convolutional Neural Network (ResNet) based on speech features and trained under Focal Loss to recognize emotion in speech. Speech features such as the Spectrogram and Mel-frequency Cepstral Coefficients (MFCCs) have shown the ability to characterize emotion better than plain text alone. Further, Focal Loss, first used in one-stage object detectors, has shown the ability to focus the training process more on hard examples and down-weight the loss assigned to well-classified examples, thus preventing the model from being overwhelmed by easily classifiable examples. After experimenting with several Deep Neural Network (DNN) architectures, we propose a ResNet that takes a Spectrogram or MFCCs as input and is supervised by Focal Loss, making it well suited to speech inputs with a large class imbalance. Maintaining continuity with previous work in this area, we use the Improvised Topics portion of the University of Southern California's Interactive Emotional Dyadic Motion Capture (USC-IEMOCAP) database. This dataset is well suited to our work, as it exhibits a significant class imbalance among the various emotions. Our best model achieved a 3.4% improvement in overall accuracy and a 2.8% improvement in class accuracy compared to existing state-of-the-art methods.
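As a concrete illustration of the down-weighting mechanism the abstract describes, the sketch below implements the standard Focal Loss of Lin et al. (2017) in PyTorch. This is a minimal sketch under stated assumptions: the function name, the gamma default of 2.0, and the toy batch shapes are illustrative choices, not details taken from the paper.

    import torch
    import torch.nn.functional as F

    def focal_loss(logits, targets, gamma=2.0):
        """Focal Loss (Lin et al., 2017): cross-entropy scaled by (1 - p_t)^gamma.

        Well-classified examples (p_t near 1) are scaled toward zero, so
        training focuses on hard, misclassified examples -- the property
        exploited here for imbalanced emotion classes.
        """
        ce = F.cross_entropy(logits, targets, reduction="none")  # per-example -log(p_t)
        p_t = torch.exp(-ce)                                     # recover p_t from the loss
        return ((1.0 - p_t) ** gamma * ce).mean()

    # Toy usage: a batch of 8 utterance-level predictions over 4 emotion classes.
    logits = torch.randn(8, 4)
    targets = torch.randint(0, 4, (8,))
    loss = focal_loss(logits, targets)

Setting gamma to 0 recovers ordinary cross-entropy, which makes the exponent a direct knob on how strongly easy examples are suppressed.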
