Multimodal Iterative RAG for Knowledge-Intensive Visual Question Answering
Choi, Changin, Lee, Wonseok, Ko, Jungmin, Rhee, Wonjong
–arXiv.org Artificial Intelligence
Recent advances in Multimodal Large Language Models~(MLLMs) have significantly enhanced the ability of these models in multimodal understanding and reasoning. However, the performance of MLLMs for knowledge-intensive visual questions, which require external knowledge beyond the visual content of an image, still remains limited. While Retrieval-Augmented Generation (RAG) has become a promising solution to provide models with external knowledge, its conventional single-pass framework often fails to gather sufficient knowledge. To overcome this limitation, we propose MI-RAG, a Multimodal Iterative RAG framework that leverages reasoning to enhance retrieval and incorporates knowledge synthesis to refine its understanding. At each iteration, the model formulates a reasoning-guided multi-query to explore multiple facets of knowledge. Subsequently, these queries drive a joint search across heterogeneous knowledge bases, retrieving diverse knowledge. This retrieved knowledge is then synthesized to enrich the reasoning record, progressively deepening the model's understanding. Experiments on challenging benchmarks, including Encyclopedic VQA, InfoSeek, and OK-VQA, show that MI-RAG significantly improves both retrieval recall and answer accuracy, establishing a scalable approach for compositional reasoning in knowledge-intensive VQA.
arXiv.org Artificial Intelligence
Sep-30-2025
- Country:
- Asia
- Indonesia > Bali (0.04)
- Middle East > UAE
- Abu Dhabi Emirate > Abu Dhabi (0.04)
- Singapore (0.04)
- South Korea
- Gyeonggi-do > Suwon (0.04)
- Seoul > Seoul (0.04)
- Thailand > Bangkok
- Bangkok (0.04)
- Europe
- Italy > Calabria
- Catanzaro Province > Catanzaro (0.04)
- Russia > Central Federal District
- Moscow Oblast > Moscow (0.04)
- Italy > Calabria
- North America
- Dominican Republic (0.04)
- United States > Florida
- Miami-Dade County > Miami (0.04)
- Asia
- Genre:
- Research Report > New Finding (0.67)
- Technology: