Knowledge-based Visual Question Answer with Multimodal Processing, Retrieval and Filtering
Neural Information Processing Systems
Knowledge-based visual question answering (KB-VQA) requires visual language models (VLMs) to integrate visual understanding with external knowledge retrieval. Although retrieval-augmented generation (RAG) has advanced this task significantly by incorporating knowledge-base querying, it still struggles with the quality of multimodal queries and the relevance of retrieved results. To overcome these challenges, we propose a novel three-stage method, termed Wiki-PRF, consisting of Processing, Retrieval, and Filtering stages.
Jun-21-2026, 17:15:21 GMT
- Country:
- Europe (1.00)
- North America
- United States (0.68)
- Canada (0.46)
- Genre:
- Research Report
- New Finding (1.00)
- Experimental Study (1.00)
- Research Report
- Technology: