Understanding Multimodal LLMs: the Mechanistic Interpretability of Llava in Visual Question Answering

Zeping Yu, Sophia Ananiadou

arXiv.org, Artificial Intelligence

Understanding the mechanisms behind Large Language Models (LLMs) is crucial for designing improved models and strategies. While recent studies have yielded valuable insights into the mechanisms of textual LLMs, the mechanisms of Multimodal Large Language Models (MLLMs) remain underexplored. In this paper, we apply mechanistic interpretability methods to analyze the visual question answering (VQA) mechanisms in Llava. We compare the mechanisms of VQA and textual QA (TQA) on color-answering tasks and find that: a) VQA exhibits a mechanism similar to the in-context learning mechanism observed in TQA; b) the visual features are highly interpretable when the visual embeddings are projected into the textual embedding space; and c) Llava enhances the existing capabilities of its underlying textual LLM, Vicuna, during visual instruction tuning. Based on these findings, we develop an interpretability tool that helps users and researchers identify the visual locations most important for the final prediction, aiding the understanding of visual hallucination. Our method is faster and more effective than existing interpretability approaches.

Large Language Models (LLMs) (Brown, 2020; Ouyang et al., 2022; Touvron et al., 2023) have achieved remarkable results in numerous downstream tasks (Xiao et al., 2023; Tan et al., 2023; Deng et al., 2023). However, their underlying mechanisms are not yet well understood. This lack of clarity poses a significant challenge for researchers attempting to address issues such as hallucination (Yao et al., 2023), toxicity (Gehman et al., 2020), and bias (Kotek et al., 2023) in LLMs. Understanding the mechanisms of LLMs has therefore become an increasingly important area of research.
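To make finding (b) above concrete, the projection of visual embeddings into the embedding space can be read as a logit-lens-style check: each image-patch embedding (after the multimodal projector) is multiplied by the language model's unembedding matrix, and the top-scoring vocabulary tokens are inspected. The sketch below is a minimal, hypothetical illustration in PyTorch, not the paper's code; the tensor shapes, the `tokenizer` argument, and the function name are assumptions.

```python
import torch

def project_visual_to_vocab(visual_embeds: torch.Tensor,
                            unembedding: torch.Tensor,
                            tokenizer,
                            top_k: int = 5):
    """Logit-lens-style readout of visual embeddings (illustrative sketch).

    visual_embeds: (num_patches, hidden_dim) image-patch embeddings after the
                   multimodal projector (assumed shape).
    unembedding:   (vocab_size, hidden_dim) weight of the language model's
                   output embedding (lm_head).
    tokenizer:     any HuggingFace-style tokenizer with a `decode` method
                   (assumption).
    Returns, for each patch, the top_k vocabulary tokens whose unembedding
    vectors align most strongly with that patch embedding.
    """
    logits = visual_embeds @ unembedding.T        # (num_patches, vocab_size)
    top_ids = logits.topk(top_k, dim=-1).indices  # (num_patches, top_k)
    return [[tokenizer.decode([int(i)]) for i in row] for row in top_ids]
```

For Llava, `visual_embeds` would be the projector outputs for one image and `unembedding` the lm_head weight of the underlying Vicuna model; attribute names differ across implementations, so both are left as inputs here. Patches whose nearest vocabulary tokens relate to the depicted content (e.g. color words over a colored region) are what finding (b) refers to as interpretable visual features.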