VisRAG 2.0: Evidence-Guided Multi-Image Reasoning in Visual Retrieval-Augmented Generation

Yubo Sun, Chunyi Peng, Yukun Yan, Shi Yu, Zhenghao Liu, Chi Chen, Zhiyuan Liu, Maosong Sun

arXiv.org Artificial Intelligence 

Visual retrieval-augmented generation (VRAG) augments vision-language models (VLMs) with external visual knowledge to ground reasoning and reduce hallucinations. Yet current VRAG systems often fail to reliably perceive and integrate evidence across multiple images, leading to weak grounding and erroneous conclusions. In this paper, we propose EVisRAG, an end-to-end framework that learns to reason with evidence across multiple images to address this issue. The model first observes the retrieved images and records per-image evidence, then derives the final answer from the aggregated evidence. To train EVisRAG effectively, we introduce Reward-Scoped Group Relative Policy Optimization (RS-GRPO), which binds fine-grained rewards to scope-specific tokens to jointly optimize the visual perception and reasoning abilities of VLMs. Experimental results on multiple visual question answering benchmarks demonstrate that EVisRAG delivers substantial end-to-end gains over the backbone VLM, with a 27% improvement on average. Further analysis shows that, powered by RS-GRPO, EVisRAG improves answer accuracy by precisely perceiving and localizing question-relevant evidence across multiple images and deriving the final answer from that evidence, much like a real detective. All code is available at https://github.com/OpenBMB/VisRAG.

Retrieval-Augmented Generation (RAG) equips Large Language Models (LLMs) with a knowledge retriever that accesses a curated external knowledge base, supplying task-relevant context at generation time and mitigating hallucinations arising from insufficient parametric knowledge (Lewis et al., 2020; Asai et al., 2024). However, ineffective use of retrieved information limits practical adoption in domain-specific tasks.
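The abstract's key mechanism, RS-GRPO, "binds fine-grained rewards to scope-specific tokens." A minimal sketch of what such reward scoping could look like follows; the function names, the two-scope split (an "evidence" span and an "answer" span), and the exact normalization are assumptions for illustration, not the paper's actual formulation. The idea sketched here: each rollout receives separate rewards for its perception (evidence) and reasoning (answer) spans, each reward is normalized group-relatively as in GRPO, and the resulting advantage is applied only to the tokens inside its own scope.

```python
# Hypothetical sketch of reward-scoped, group-relative advantage assignment.
# All names and the scoping rule are illustrative assumptions.
from statistics import mean, pstdev

def group_relative(rewards, eps=1e-6):
    """Normalize a group of scalar rewards to zero mean, unit std (GRPO-style)."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def scoped_advantages(rollouts):
    """rollouts: list of dicts with per-scope rewards and a per-token scope label.
    Returns, for each rollout, a list of token-level advantages in which each
    scope's normalized reward touches only that scope's tokens."""
    ev_adv = group_relative([r["evidence_reward"] for r in rollouts])
    ans_adv = group_relative([r["answer_reward"] for r in rollouts])
    all_advs = []
    for i, r in enumerate(rollouts):
        adv = []
        for scope in r["token_scopes"]:  # one label per generated token
            if scope == "evidence":
                adv.append(ev_adv[i])
            elif scope == "answer":
                adv.append(ans_adv[i])
            else:
                adv.append(0.0)  # tokens outside both scopes get no signal
        all_advs.append(adv)
    return all_advs

# Toy group of two rollouts over a four-token completion: the first rollout
# perceives well but answers poorly, the second the reverse.
rollouts = [
    {"evidence_reward": 1.0, "answer_reward": 0.0,
     "token_scopes": ["evidence", "evidence", "answer", "answer"]},
    {"evidence_reward": 0.0, "answer_reward": 1.0,
     "token_scopes": ["evidence", "evidence", "answer", "answer"]},
]
advs = scoped_advantages(rollouts)
```

Under this sketch, the first rollout's evidence tokens receive a positive advantage while its answer tokens receive a negative one, so perception and reasoning are credited independently rather than through a single sequence-level reward.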
