VisRAG 2.0: Evidence-Guided Multi-Image Reasoning in Visual Retrieval-Augmented Generation

Yubo Sun, Chunyi Peng, Yukun Yan, Shi Yu, Zhenghao Liu, Chi Chen, Zhiyuan Liu, Maosong Sun

arXiv.org Artificial Intelligence 

Visual retrieval-augmented generation (VRAG) augments vision-language models (VLMs) with external visual knowledge to ground reasoning and reduce hallucinations. Yet current VRAG systems often fail to reliably perceive and integrate evidence across multiple images, leading to weak grounding and erroneous conclusions. In this paper, we propose EVisRAG, an end-to-end framework that learns to reason with evidence across multiple images to address this issue. The model first observes the retrieved images and records per-image evidence, then derives the final answer from the aggregated evidence. To train EVisRAG effectively, we introduce Reward-Scoped Group Relative Policy Optimization (RS-GRPO), which binds fine-grained rewards to scope-specific tokens to jointly optimize the visual perception and reasoning abilities of VLMs. Experimental results on multiple visual question answering benchmarks demonstrate that EVisRAG delivers substantial end-to-end gains over the backbone VLM, with a 27% improvement on average. Further analysis shows that, powered by RS-GRPO, EVisRAG improves answer accuracy by precisely perceiving and localizing question-relevant evidence across multiple images and deriving the final answer from that evidence, much like a real detective. All code is available at https://github.com/OpenBMB/VisRAG.

Retrieval-Augmented Generation (RAG) equips Large Language Models (LLMs) with a knowledge retriever that accesses a curated external knowledge base, supplying task-relevant context at generation time and mitigating hallucinations arising from insufficient parametric knowledge (Lewis et al., 2020; Asai et al., 2024). However, ineffective use of retrieved information limits practical adoption in domain-specific tasks.
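The abstract's key mechanism, RS-GRPO, "binds fine-grained rewards to scope-specific tokens." A minimal sketch of what such reward scoping could look like follows; the function names, the two-scope split (an "evidence" span and an "answer" span), and the exact normalization are assumptions for illustration, not the paper's actual formulation. The idea sketched here: each rollout receives separate rewards for its perception (evidence) and reasoning (answer) spans, each reward is normalized group-relatively as in GRPO, and the resulting advantage is applied only to the tokens inside its own scope.

```python
# Hypothetical sketch of reward-scoped, group-relative advantage assignment.
# All names and the scoping rule are illustrative assumptions.
from statistics import mean, pstdev

def group_relative(rewards, eps=1e-6):
    """Normalize a group of scalar rewards to zero mean, unit std (GRPO-style)."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def scoped_advantages(rollouts):
    """rollouts: list of dicts with per-scope rewards and a per-token scope label.
    Returns, for each rollout, a list of token-level advantages in which each
    scope's normalized reward touches only that scope's tokens."""
    ev_adv = group_relative([r["evidence_reward"] for r in rollouts])
    ans_adv = group_relative([r["answer_reward"] for r in rollouts])
    all_advs = []
    for i, r in enumerate(rollouts):
        adv = []
        for scope in r["token_scopes"]:  # one label per generated token
            if scope == "evidence":
                adv.append(ev_adv[i])
            elif scope == "answer":
                adv.append(ans_adv[i])
            else:
                adv.append(0.0)  # tokens outside both scopes get no signal
        all_advs.append(adv)
    return all_advs

# Toy group of two rollouts over a four-token completion: the first rollout
# perceives well but answers poorly, the second the reverse.
rollouts = [
    {"evidence_reward": 1.0, "answer_reward": 0.0,
     "token_scopes": ["evidence", "evidence", "answer", "answer"]},
    {"evidence_reward": 0.0, "answer_reward": 1.0,
     "token_scopes": ["evidence", "evidence", "answer", "answer"]},
]
advs = scoped_advantages(rollouts)
```

Under this sketch, the first rollout's evidence tokens receive a positive advantage while its answer tokens receive a negative one, so perception and reasoning are credited independently rather than through a single sequence-level reward.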
