Fine-grained Late-interaction Multi-modal Retrieval for Retrieval Augmented Visual Question Answering

Dec-24-2025, 23:42:38 GMT–Neural Information Processing Systems

Knowledge-based Visual Question Answering (KB-VQA) requires VQA systems to utilize knowledge from external knowledge bases to answer visually-grounded questions. Retrieval-Augmented Visual Question Answering (RA-VQA), a strong framework for tackling KB-VQA, first retrieves related documents with Dense Passage Retrieval (DPR) and then uses them to answer questions. This paper proposes Fine-grained Late-interaction Multi-modal Retrieval (FLMR), which significantly improves knowledge retrieval in RA-VQA. FLMR addresses two major limitations of RA-VQA's retriever: (1) the image representations obtained via image-to-text transforms can be incomplete and inaccurate, and (2) similarity scores between queries and documents are computed with one-dimensional embeddings, which can be insensitive to finer-grained similarities. FLMR overcomes these limitations by obtaining image representations that complement those from the image-to-text transform, using a vision model aligned with an existing text-based retriever through a simple alignment network. FLMR also encodes images and questions using multi-dimensional embeddings to capture finer-grained similarities between queries and documents.
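To make the contrast between the two scoring schemes concrete, the following is a minimal sketch, not the paper's implementation. It assumes a ColBERT-style late-interaction score (for each query embedding, take the maximum dot product over the document's embeddings, then sum), which the abstract does not spell out. The function names, the mean-pooling step used to stand in for a single-vector retriever, and the toy vectors are all hypothetical. In this setting the query's token list would hold both question-token embeddings and image-derived embeddings.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def mean_pool(tokens: Sequence[Vector]) -> List[float]:
    """Collapse a list of token embeddings into one vector (assumed pooling)."""
    n = len(tokens)
    return [sum(column) / n for column in zip(*tokens)]


def single_vector_score(query_vec: Vector, doc_vec: Vector) -> float:
    """DPR-style relevance: one embedding per query, one per document."""
    return dot(query_vec, doc_vec)


def late_interaction_score(
    query_tokens: Sequence[Vector], doc_tokens: Sequence[Vector]
) -> float:
    """Assumed ColBERT-style score: sum over query tokens of the best match."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


# Toy data (hypothetical): two query embeddings, two candidate documents.
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0]]  # matches each query token exactly
doc_b = [[0.5, 0.5], [0.5, 0.5]]  # only matches "on average"

pooled_a = single_vector_score(mean_pool(query), mean_pool(doc_a))
pooled_b = single_vector_score(mean_pool(query), mean_pool(doc_b))
late_a = late_interaction_score(query, doc_a)
late_b = late_interaction_score(query, doc_b)

print(pooled_a, pooled_b)  # the pooled scores are identical
print(late_a, late_b)      # the token-level scores differ
```

In this toy case the pooled single-vector scores cannot separate the two documents, while the token-level score ranks `doc_a` above `doc_b`. That is the kind of finer-grained similarity the abstract says one-dimensional embeddings can miss.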

Technology:
- Information Technology
  - Sensing and Signal Processing > Image Processing (0.82)
  - Knowledge Management > Knowledge Engineering (0.59)
  - Artificial Intelligence
    - Natural Language (0.65)
    - Representation & Reasoning > Expert Systems (0.59)