LanSER: Language-Model Supported Speech Emotion Recognition
Gong, Taesik, Belanich, Josh, Somandepalli, Krishna, Nagrani, Arsha, Eoff, Brian, Jou, Brendan
arXiv.org Artificial Intelligence
Speech emotion recognition (SER) models typically rely on costly human-labeled data for training, which makes it difficult to scale them to large speech datasets and nuanced emotion taxonomies. We present LanSER, a method that enables the use of unlabeled data by inferring weak emotion labels with pre-trained large language models and training on those labels through weakly-supervised learning. To infer weak labels constrained to a taxonomy, we use a textual entailment approach that selects the emotion label with the highest entailment score for a speech transcript extracted via automatic speech recognition. Our experimental results show that models pre-trained on large datasets with this weak supervision outperform other baseline models on standard SER datasets when fine-tuned, and show improved label efficiency. Although the models are pre-trained on labels derived only from text, we show that the resulting representations appear to model the prosodic content of speech.
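The label-selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the hypothesis template, the function names, and the keyword-based `toy_entail_score` are all assumptions made for illustration. In practice the scorer would be a pre-trained entailment model applied to an ASR transcript.

```python
from typing import Callable, Dict, Sequence, Tuple

# Hypothetical hypothesis template; the wording used in the paper is not given here.
HYPOTHESIS_TEMPLATE = "The speaker is feeling {}."


def infer_weak_label(
    transcript: str,
    taxonomy: Sequence[str],
    entail_score: Callable[[str, str], float],
    template: str = HYPOTHESIS_TEMPLATE,
) -> Tuple[str, Dict[str, float]]:
    """Pick the taxonomy label whose hypothesis is most entailed by the transcript."""
    if not taxonomy:
        raise ValueError("taxonomy must contain at least one emotion label")
    # Score one hypothesis per candidate label, with the transcript as the premise.
    scores = {
        label: entail_score(transcript, template.format(label))
        for label in taxonomy
    }
    # The weak label is the candidate with the highest entailment score.
    best = max(scores, key=scores.get)
    return best, scores


# Toy stand-in for an entailment model (assumption, for demonstration only):
# it counts cue words instead of running a language model.
_CUES = {
    "happy": {"great", "love", "wonderful"},
    "sad": {"miss", "lonely", "cry"},
    "angry": {"hate", "furious", "unfair"},
}


def toy_entail_score(premise: str, hypothesis: str) -> float:
    """Return the fraction of a label's cue words that appear in the premise."""
    cleaned = premise.lower()
    for ch in ",.!?":
        cleaned = cleaned.replace(ch, " ")
    words = set(cleaned.split())
    for label, cues in _CUES.items():
        if label in hypothesis:
            return len(words & cues) / len(cues)
    return 0.0


if __name__ == "__main__":
    label, scores = infer_weak_label(
        "I love this, it is wonderful!",
        ["happy", "sad", "angry"],
        toy_entail_score,
    )
    print(label, scores)
```

Because the scorer is passed in as a callable, the toy function can be swapped for any model that returns an entailment score for a (premise, hypothesis) pair without changing the selection logic.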
Sep-7-2023
- Genre:
- Research Report (0.69)
- Technology:
- Information Technology > Artificial Intelligence
- Cognitive Science > Emotion (0.60)
- Machine Learning (1.00)
- Natural Language (1.00)
- Speech > Speech Recognition (0.53)