MIA-DPO: Multi-Image Augmented Direct Preference Optimization For Large Vision-Language Models
Ziyu Liu, Yuhang Zang, Xiaoyi Dong, Pan Zhang, Yuhang Cao, Haodong Duan, Conghui He, Yuanjun Xiong, Dahua Lin, Jiaqi Wang
Visual preference alignment involves training Large Vision-Language Models (LVLMs) to predict human preferences between visual inputs. This is typically achieved with labeled datasets of chosen/rejected pairs and optimization algorithms such as direct preference optimization (DPO). Existing visual alignment methods, primarily designed for single-image scenarios, struggle to handle the complexity of multi-image tasks due to the scarcity of diverse training data and the high cost of annotating chosen/rejected pairs. We present Multi-Image Augmented Direct Preference Optimization (MIA-DPO), a visual preference alignment approach that effectively handles multi-image inputs. MIA-DPO mitigates the scarcity of diverse multi-image training data by extending single-image data with unrelated images arranged in grid collages or pic-in-pic formats, significantly reducing the cost of multi-image data annotation. We observe that the attention values of LVLMs vary considerably across different images, and we use these attention values to identify and filter out rejected responses in which the model mistakenly focused on the wrong image. Our attention-aware selection constructs the chosen/rejected pairs without relying on (i) human annotation, (ii) extra data, or (iii) external models or APIs. MIA-DPO is compatible with various architectures and outperforms existing methods on five multi-image benchmarks, achieving an average performance boost of 3.0% on LLaVA-v1.5 and 4.3% on the recent InternLM-XC2.5. Moreover, MIA-DPO has a minimal effect on the model's ability to understand single images.

Recent progress in Large Vision-Language Models (LVLMs) marks a significant breakthrough in AI research. While proprietary models (e.g., GPT-4o (OpenAI, 2024)) excel at handling multi-image contexts, current open-source LVLMs (Liu et al., 2024b;a) yield promising results but remain primarily focused on single-image visual question answering. In real-world settings, such as digital documents and web pages, multiple figures and texts are interleaved to convey complex information, so the ability to understand multi-image contexts is a crucial direction for the future development of LVLMs. LVLM training typically comprises three stages: (1) pre-training, (2) supervised fine-tuning (SFT), and (3) preference alignment (i.e., Reinforcement Learning from Human Feedback (RLHF) (Ouyang et al., 2022) or from AI Feedback (RLAIF) (Bai et al., 2022)). Pre-training and SFT on multi-image data can enhance the model's ability to handle multiple images to some extent. Nevertheless, as in single-image scenarios, hallucinations remain an inevitable issue.
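The abstract describes two mechanisms: constructing multi-image training samples by combining an annotated single image with unrelated images (grid collage or pic-in-pic), and using per-image attention values to select rejected responses for DPO. The sketch below illustrates these ideas; it is not the authors' implementation, and the function names, the 2x2 grid layout, and the attention threshold are illustrative assumptions.

```python
# Minimal sketch (not the authors' code) of the ideas described above:
# (1) extend single-image samples into multi-image ones via grid collage or
#     pic-in-pic composition, and (2) use per-image attention mass to flag
#     responses that attended to the wrong image as "rejected" candidates,
#     then train with the standard DPO objective.
# Function names, the 2x2 layout, and the 0.5 threshold are assumptions.

from PIL import Image
import torch.nn.functional as F

def make_grid_collage(target: Image.Image, distractors, cell: int = 336) -> Image.Image:
    """Tile the annotated image with unrelated images in a 2x2 grid collage."""
    tiles = [target] + list(distractors)[:3]
    canvas = Image.new("RGB", (2 * cell, 2 * cell))
    for i, img in enumerate(tiles):
        row, col = divmod(i, 2)
        canvas.paste(img.resize((cell, cell)), (col * cell, row * cell))
    return canvas

def make_pic_in_pic(background: Image.Image, target: Image.Image, scale: float = 0.35) -> Image.Image:
    """Overlay a shrunken copy of the annotated image onto an unrelated background."""
    bg = background.copy()
    w, h = bg.size
    small = target.resize((max(1, int(w * scale)), max(1, int(h * scale))))
    bg.paste(small, (w - small.width - 10, 10))
    return bg

def is_misattended(attn_per_image, target_idx: int, threshold: float = 0.5) -> bool:
    """Attention-aware filter: flag a response whose share of image attention
    on the correct image falls below a threshold (illustrative value)."""
    total = sum(attn_per_image)
    return attn_per_image[target_idx] / total < threshold

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta: float = 0.1):
    """Standard DPO objective applied to the constructed chosen/rejected pairs."""
    logits = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    return -F.logsigmoid(logits).mean()
```

In this sketch, responses flagged by is_misattended would serve as "rejected" answers and the original correct answers as "chosen" ones, after which the standard DPO objective (dpo_loss) is applied; in practice the per-image attention statistics would come from the LVLM's image-token attention maps.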
arXiv.org Artificial Intelligence
Oct-23-2024