Multi-Modality Multi-Loss Fusion Network

Zehui Wu, Ziwei Gong, Jaywon Koo, Julia Hirschberg

arXiv.org Artificial Intelligence 

The multimodal affective computing field has seen significant advances in feature extraction and multimodal fusion methodologies in recent years. By combining audio, text and visual signals, these models offer a more comprehensive, nuanced understanding of human emotions. However, there are still limitations: hand-crafted feature extraction algorithms often lack flexibility and generalization across diverse tasks. To overcome these limitations, recent studies have proposed fully end-to-end models that optimize both feature extraction and learning processes jointly (Dai et al., 2021). Our work extracts feature representations from pre-trained models for different modalities and combines them in an end-to-end manner, which provides a comprehensive and adaptable solution for multimodal […].

We compare different methods for extracting audio features as well as different fusion network methods to combine audio and text signals to identify the best-performing procedures. We find that the addition of audio signals consistently improves performance and also that our transformer fusion network further enhances results for most metrics and achieves state-of-the-art results across all datasets, indicating its efficacy in enhancing cross-modality modeling and its potential for multimodal emotion detection. From multi-loss training, we also observe that 1) using distinct labels for each modality in multi-loss training significantly benefits the models' performance, and 2) training on multimodal features improves not only the overall model performance but also the model's accuracy on the single-modality subnet.
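As a rough illustration of the ideas summarized above (not the authors' released implementation), the sketch below assumes pre-extracted text and audio feature sequences, a small transformer encoder for fusion, and a multi-loss objective in which each single-modality subnet is supervised with its own label. All module names, dimensions, and the MSE losses are illustrative assumptions.

```python
# Hypothetical sketch of a multi-loss fusion network: transformer fusion over
# pre-extracted text and audio features, with separate text, audio, and fused
# prediction heads, each supervised by its own label.
import torch
import torch.nn as nn

class MultiLossFusionNet(nn.Module):
    def __init__(self, text_dim=768, audio_dim=512, d_model=256, n_heads=4, n_layers=2):
        super().__init__()
        # Project each modality's pre-trained features into a shared space.
        self.text_proj = nn.Linear(text_dim, d_model)
        self.audio_proj = nn.Linear(audio_dim, d_model)
        # Learned type embeddings mark which tokens come from which modality.
        self.type_emb = nn.Embedding(2, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=n_layers)
        # One head per subnet: text-only, audio-only, and fused.
        self.text_head = nn.Linear(d_model, 1)
        self.audio_head = nn.Linear(d_model, 1)
        self.fusion_head = nn.Linear(d_model, 1)

    def forward(self, text_feats, audio_feats):
        # text_feats: (B, Lt, text_dim); audio_feats: (B, La, audio_dim)
        t = self.text_proj(text_feats) + self.type_emb.weight[0]
        a = self.audio_proj(audio_feats) + self.type_emb.weight[1]
        fused = self.fusion(torch.cat([t, a], dim=1)).mean(dim=1)
        return {
            "text": self.text_head(t.mean(dim=1)).squeeze(-1),
            "audio": self.audio_head(a.mean(dim=1)).squeeze(-1),
            "fusion": self.fusion_head(fused).squeeze(-1),
        }

def multi_loss(preds, y_text, y_audio, y_multi, weights=(1.0, 1.0, 1.0)):
    # Multi-loss objective: each subnet gets its own (distinct) label.
    mse = nn.functional.mse_loss
    return (weights[0] * mse(preds["fusion"], y_multi)
            + weights[1] * mse(preds["text"], y_text)
            + weights[2] * mse(preds["audio"], y_audio))

if __name__ == "__main__":
    model = MultiLossFusionNet()
    text = torch.randn(8, 20, 768)   # e.g. token-level text embeddings
    audio = torch.randn(8, 50, 512)  # e.g. frame-level acoustic embeddings
    preds = model(text, audio)
    loss = multi_loss(preds, torch.randn(8), torch.randn(8), torch.randn(8))
    loss.backward()
    print(loss.item())
```

Keeping separate heads over shared fused representations is one plausible way to realize the paper's second observation: gradients from the multimodal loss also update the projections used by the single-modality subnets, so multimodal training can improve subnet accuracy as well.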
