MRFD: Multi-Region Fusion Decoding with Self-Consistency for Mitigating Hallucinations in LVLMs
Haonan Ge, Yiwei Wang, Ming-Hsuan Yang, Yujun Cai
arXiv.org Artificial Intelligence
Large Vision-Language Models (LVLMs) have shown strong performance across multimodal tasks. However, they often produce hallucinations: text that is inconsistent with the visual input, stemming from a limited ability to verify information across different regions of the image. To address this, we propose Multi-Region Fusion Decoding (MRFD), a training-free decoding method that improves factual grounding by modeling inter-region consistency. MRFD identifies salient regions via cross-attention, generates an initial response for each, and computes reliability weights based on the Jensen-Shannon Divergence (JSD) among the responses. These weights then guide a consistency-aware fusion of per-region predictions, using region-aware prompts inspired by Chain-of-Thought reasoning. Experiments across multiple LVLMs and benchmarks show that MRFD significantly reduces hallucinations and improves response factuality without requiring any model updates.
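The consistency-weighted fusion described above can be sketched roughly as follows. This is a minimal illustration, not the paper's exact formulation: the function names, the exponential weighting, and the temperature `tau` are assumptions, and per-region "predictions" are simplified to next-token probability distributions.

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """KL divergence KL(p || q) with clipping for numerical safety."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return float(np.sum(p * np.log(p / q)))

def jsd(p, q):
    """Jensen-Shannon Divergence between two distributions."""
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def fuse_regions(dists, tau=0.1):
    """Hypothetical simplification of MRFD's fusion step:
    regions whose distributions agree with the others get
    higher reliability weights; the fused distribution is
    their weighted average."""
    n = len(dists)
    # mean pairwise JSD of each region's distribution to the rest
    div = np.array([
        np.mean([jsd(dists[i], dists[j]) for j in range(n) if j != i])
        for i in range(n)
    ])
    # lower divergence -> higher weight (tau is an assumed temperature)
    w = np.exp(-div / tau)
    w = w / w.sum()
    fused = np.sum(w[:, None] * np.stack(dists), axis=0)
    return fused / fused.sum(), w
```

With three regions where two agree and one is an outlier, the outlier receives the smallest weight, so the fused distribution follows the consistent majority.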
Oct-14-2025