Enhancing Sentiment Analysis through Multimodal Fusion: A BERT-DINOv2 Approach
Taoxu Zhao, Meisi Li, Kehao Chen, Liye Wang, Xucheng Zhou, Kunal Chaturvedi, Mukesh Prasad, Ali Anaissi, Ali Braytee
arXiv.org Artificial Intelligence
Multimodal sentiment analysis enhances conventional sentiment analysis, which traditionally relies solely on text, by incorporating information from different modalities such as images, text, and audio. This paper proposes a novel multimodal sentiment analysis architecture that integrates text and image data to provide a more comprehensive understanding of sentiments. For text feature extraction, we utilize BERT, a natural language processing model. For image feature extraction, we employ DINOv2, a vision-transformer-based model. The textual and visual latent features are integrated using the proposed fusion techniques, namely the Basic Fusion Model, Self-Attention Fusion Model, and Dual-Attention Fusion Model. Experiments on three datasets (Memotion 7k, MVSA-single, and MVSA-multi) demonstrate the viability and practicality of the proposed multimodal architecture.
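To illustrate the overall architecture, the following is a minimal sketch of the simplest variant, the Basic Fusion Model: BERT encodes the text, DINOv2 encodes the image, and the two pooled embeddings are concatenated and passed to a classifier. The checkpoint names, pooling choices, hidden size, and classifier head are assumptions for illustration and are not taken from the paper.

```python
# Minimal sketch of a BERT + DINOv2 basic-fusion classifier.
# Checkpoints, pooling, and head dimensions are illustrative assumptions.
import torch
import torch.nn as nn
from transformers import AutoModel


class BasicFusionModel(nn.Module):
    def __init__(self, num_classes=3, hidden_dim=512):
        super().__init__()
        # Text encoder: BERT (768-dim hidden states)
        self.text_encoder = AutoModel.from_pretrained("bert-base-uncased")
        # Image encoder: DINOv2 ViT (768-dim hidden states for the base model)
        self.image_encoder = AutoModel.from_pretrained("facebook/dinov2-base")
        # Basic fusion: concatenate the two embeddings, then classify
        self.classifier = nn.Sequential(
            nn.Linear(768 + 768, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, input_ids, attention_mask, pixel_values):
        # [CLS] token embedding as the text representation
        text_out = self.text_encoder(input_ids=input_ids,
                                     attention_mask=attention_mask)
        text_feat = text_out.last_hidden_state[:, 0]
        # DINOv2 class token embedding as the image representation
        image_out = self.image_encoder(pixel_values=pixel_values)
        image_feat = image_out.last_hidden_state[:, 0]
        # Late fusion by concatenation
        fused = torch.cat([text_feat, image_feat], dim=-1)
        return self.classifier(fused)
```

The attention-based variants would replace the plain concatenation with self-attention or dual (cross-modal) attention over the two feature streams before classification; those details follow the paper's fusion modules rather than this sketch.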
Mar-10-2025