MRFD: Multi-Region Fusion Decoding with Self-Consistency for Mitigating Hallucinations in LVLMs
Ge, Haonan, Wang, Yiwei, Yang, Ming-Hsuan, Cai, Yujun
–arXiv.org Artificial Intelligence
Large Vision-Language Models (LVLMs) have shown strong performance across multimodal tasks. However, they often produce hallucinations -- text that is inconsistent with visual input, due to the limited ability to verify information in different regions of the image. To address this, we propose Multi-Region Fusion Decoding (MRFD), a training-free decoding method that improves factual grounding by modeling inter-region consistency. MRFD identifies salient regions using cross-attention, generates initial responses for each, and computes reliability weights based on Jensen-Shannon Divergence (JSD) among the responses. These weights guide a consistency-aware fusion of per-region predictions, using region-aware prompts inspired by Chain-of-Thought reasoning. Experiments across multiple LVLMs and benchmarks show that MRFD significantly reduces hallucinations and improves response factuality without requiring model updates.
arXiv.org Artificial Intelligence
Oct-14-2025
- Country:
- North America > United States (1.00)
- Genre:
- Research Report (0.82)
- Technology:
- Information Technology > Artificial Intelligence
- Vision (1.00)
- Machine Learning (1.00)
- Natural Language
- Large Language Model (0.46)
- Chatbot (0.46)
- Information Technology > Artificial Intelligence