OViP: Online Vision-Language Preference Learning for VLM Hallucination

Liu, Shujun; Wang, Siyuan; Li, Zejun; Wang, Jianxiang; Zeng, Cheng; Wei, Zhongyu

arXiv.org Artificial Intelligence 

Large vision-language models (LVLMs) remain vulnerable to hallucination, often generating content misaligned with visual inputs. Although recent training-based approaches aim to mitigate hallucination, they typically rely on predefined or randomly edited negative samples that do not reflect actual model errors, which limits training efficacy. In this work, we propose an Online Vision-language Preference Learning (OViP) framework that dynamically constructs contrastive training data based on the model's own hallucinated outputs. By identifying semantic differences between sampled response pairs and synthesizing negative images using a diffusion model, OViP generates more relevant supervision signals in real time. This failure-driven training enables adaptive alignment of both textual and visual preferences. Moreover, we refine existing evaluation protocols to better capture the trade-off between hallucination suppression and expressiveness. Experiments on hallucination and general benchmarks demonstrate that OViP not only reduces hallucinations while preserving core multi-modal capabilities, but also substantially improves training efficiency.

However, LVLMs continue to struggle with persistent hallucination issues (Li et al., 2023b; Bai et al., 2024), often exhibiting incorrect references to visual content (Liu et al., 2024a; Zhou et al., 2023; Bai et al., 2024). These errors manifest as misattributing object properties, describing nonexistent entities, or fabricating spatial relationships that do not align with the image. Such inconsistencies undermine the model's faithfulness to the input and hinder further reasoning, significantly limiting the reliability of LVLMs in real-world applications. The recent success of Direct Preference Optimization (DPO) (Rafailov et al., 2023) in LLM alignment has inspired the exploration of multi-modal DPO to mitigate hallucination in LVLMs (Yu et al., 2024a;b; Xie et al., 2024; Sarkar et al., 2024). However, early efforts directly extend the original DPO design from LLMs to LVLMs by constructing preference pairs solely from textual responses to the same image input. They therefore focus primarily on response-side preference optimization and show limited effectiveness.
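
For readers unfamiliar with the response-side setup described above, the following is a minimal sketch of the standard DPO loss from Rafailov et al. (2023) for a single preference pair. It is not OViP's objective. It assumes the image is part of the shared conditioning input, so only the two textual responses differ. The function name `dpo_loss`, its parameter names, and the default `beta=0.1` are illustrative assumptions, not taken from the paper.

```python
import math


def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) response pair.

    Each argument is a summed sequence log-probability of a response,
    conditioned on the same prompt (here assumed to include the same image),
    under the trained policy or the frozen reference model.
    All names and the beta default are hypothetical choices for illustration.
    """
    # Implicit rewards: scaled log-ratio of policy to reference model.
    chosen_reward = beta * (logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (logp_rejected - ref_logp_rejected)
    margin = chosen_reward - rejected_reward

    # Loss is -log(sigmoid(margin)), computed in a numerically stable way.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Illustrative values only: the policy prefers the chosen response more
# strongly than the reference model does, so the loss falls below log(2).
example = dpo_loss(logp_chosen=-12.0, logp_rejected=-15.0,
                   ref_logp_chosen=-13.0, ref_logp_rejected=-14.0)
print(example)
```

When the policy and the reference model assign identical log-probabilities, the margin is zero and the loss equals log 2. Because both responses share one image, the gradient in this formulation only distinguishes between texts, which is the response-side limitation the passage points to.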
