Fusing ASR Outputs in Joint Training for Speech Emotion Recognition
Li, Yuanchao, Bell, Peter, Lai, Catherine
arXiv.org Artificial Intelligence
Alongside acoustic information, linguistic features based on speech transcripts have been proven useful in Speech Emotion Recognition (SER). However, due to the scarcity of emotion labelled data and the difficulty of recognizing emotional speech, it is hard to obtain reliable linguistic features and models in this research area. In this paper, we propose to fuse Automatic Speech Recognition (ASR) outputs into the pipeline for joint training SER. The relationship between ASR and SER is understudied, and it is unclear what and how ASR features benefit SER. By examining various ASR outputs and fusion methods, our experiments show that in ...

SER models built on such limited sized corpora don't generalize well to out-of-domain speech. Second, while previous studies proposed using ASR to generate transcripts for SER [5], ASR on emotional speech can often result in relatively high error rates. Previous research has shown that emotion in speech degrades ASR performance, with emotional speech assumed to be a distortion of neutral speech [6]. However, with the advancement of deep learning technologies, transfer learning for SER from ASR and joint training of ASR and SER have recently emerged [7, 8]. Nevertheless, the relationship between ASR and SER is still poorly studied, particularly what and how ASR features can benefit SER.
Mar-17-2022