Unraveling Cross-Modality Knowledge Conflicts in Large Vision-Language Models
Tinghui Zhu, Qin Liu, Fei Wang, Zhengzhong Tu, Muhao Chen
–arXiv.org Artificial Intelligence
Large Vision-Language Models (LVLMs) have demonstrated impressive capabilities for capturing and reasoning over multimodal inputs. However, these models are prone to parametric knowledge conflicts, which arise from inconsistencies in the knowledge represented by their vision and language components. In this paper, we formally define the problem of cross-modality parametric knowledge conflict and present a systematic approach to detect, interpret, and mitigate such conflicts. We introduce a pipeline that identifies conflicts between visual and textual answers, showing a persistently high conflict rate across modalities in recent LVLMs regardless of model size. We further investigate how these conflicts interfere with the inference process and propose a contrastive metric to distinguish conflicting samples from the others. Building on these insights, we develop a novel dynamic contrastive decoding method that removes undesirable logits inferred from the less confident modality component based on answer confidence. For models that do not provide logits, we also introduce two prompt-based strategies to mitigate the conflicts. Our methods achieve promising improvements in accuracy on both the ViQuAE and InfoSeek datasets; with LLaVA-34B, the proposed dynamic contrastive decoding improves average accuracy by 2.24%.

Large Vision-Language Models (LVLMs; OpenAI 2023; Anil et al. 2023; Liu et al. 2024) have demonstrated potent capabilities for perceiving and understanding information across different modalities. These models typically consist of a visual encoder and a large language model (LLM), aligned by a projection layer (Li et al., 2022a; Alayrac et al., 2022; Liu et al., 2024).
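To illustrate the general idea behind confidence-weighted contrastive decoding (a minimal sketch, not the authors' exact formulation: the function names, the use of top answer probability as the confidence proxy, and the confidence-ratio weighting are all assumptions), the snippet below combines answer logits obtained from the visual and textual views and subtracts a scaled share of the less confident modality's logits.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()
    p = np.exp(z)
    return p / p.sum()

def dynamic_contrastive_decoding(visual_logits, textual_logits):
    """Hypothetical sketch: fuse answer logits from the two modality-specific
    predictions, down-weighting the less confident one.
    Both arguments are 1-D arrays over the answer vocabulary."""
    p_vis = softmax(visual_logits)
    p_txt = softmax(textual_logits)
    # Use the top answer probability as a simple confidence proxy (assumption).
    conf_vis, conf_txt = p_vis.max(), p_txt.max()
    if conf_vis >= conf_txt:
        strong, weak, alpha = visual_logits, textual_logits, conf_txt / conf_vis
    else:
        strong, weak, alpha = textual_logits, visual_logits, conf_vis / conf_txt
    # Subtract a confidence-scaled share of the weaker modality's logits.
    return strong - alpha * weak

# Usage: pick the answer with the highest fused logit.
fused = dynamic_contrastive_decoding(np.array([2.0, 0.5, 0.1]),
                                     np.array([0.3, 1.8, 0.2]))
print(int(fused.argmax()))
```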
Oct-11-2024
- Country:
- Asia (0.68)
- Europe (0.93)
- North America > United States > California (0.68)
- Genre:
- Research Report > New Finding (1.00)
- Industry:
- Education (0.68)