Understanding Multimodal Hallucination with Parameter-Free Representation Alignment

Yueqian Wang, Jianxin Liang, Yuxuan Wang, Huishuai Zhang, Dongyan Zhao

Wangxuan Institute of Computer Technology, Peking University
Beijing Institute for General Artificial Intelligence
National Key Laboratory of General Artificial Intelligence
wangyuxuan1@bigai.ai

Abstract

Hallucination is a common issue in Multimodal Large Language Models (MLLMs), yet the underlying principles remain poorly understood. In this paper, we investigate which components of MLLMs contribute to object hallucinations. To analyze image representations while completely avoiding the influence of all factors other than the image representation itself, we propose a parameter-free representation alignment metric (Pfram) that can measure the similarity between any two representation systems without requiring additional training parameters. Notably, Pfram can also assess the alignment of a neural representation system with the human representation system, represented by ground-truth annotations of images. By evaluating alignment with object annotations, we demonstrate that this metric shows strong and consistent correlations with object hallucination across a wide range of state-of-the-art MLLMs, spanning various model architectures and sizes. Furthermore, using this metric, we explore other key issues related to image representations in MLLMs, such as the role of different modules, the impact of textual instructions, and potential improvements, including the use of alternative visual encoders.

1 Introduction

Multimodal Large Language Models (MLLMs) have been advancing rapidly in recent years (Dai et al., 2023; Liu et al., 2023c;b; Zhang et al., 2023; Dong et al., 2024; Bai et al., 2023).
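To make the idea of a parameter-free alignment metric concrete, below is a minimal Python sketch of one common parameter-free approach: comparing the k-nearest-neighbor structure that two representation systems induce over the same set of images. This is an illustrative instantiation under our own assumptions (the function names, the cosine-similarity choice, and the value of k are ours), not necessarily the paper's exact definition of Pfram.

import numpy as np

def knn_indices(features: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k nearest neighbors of each row under cosine
    similarity, excluding the row itself."""
    normed = features / np.linalg.norm(features, axis=1, keepdims=True)
    sims = normed @ normed.T
    np.fill_diagonal(sims, -np.inf)  # exclude self-matches
    return np.argsort(-sims, axis=1)[:, :k]

def knn_overlap_alignment(feats_a: np.ndarray, feats_b: np.ndarray,
                          k: int = 10) -> float:
    """Mean fraction of shared k-nearest neighbors between two
    representation systems evaluated on the same images.
    1.0 means identical neighborhood structure; roughly k/N is chance."""
    nn_a = knn_indices(feats_a, k)
    nn_b = knn_indices(feats_b, k)
    overlaps = [len(set(a) & set(b)) / k for a, b in zip(nn_a, nn_b)]
    return float(np.mean(overlaps))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical stand-ins: MLLM image embeddings vs. multi-hot
    # ground-truth object annotations for the same 100 images.
    img_feats = rng.normal(size=(100, 512))
    obj_annots = (rng.random(size=(100, 64)) < 0.1).astype(float)
    print(knn_overlap_alignment(img_feats, obj_annots, k=10))

Because the metric only compares neighborhood structure, it needs no learned projection between the two spaces, so any measured alignment reflects the representations themselves rather than an auxiliary trained mapping. This is also what lets the second system be non-neural, such as the ground-truth object annotations standing in for the human representation system.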