Benchmarking Retrieval-Augmented Multimodal Generation for Document Question Answering

Jun-10-2026, 20:52:47 GMT–Neural Information Processing Systems

Document Visual Question Answering (DocVQA) faces dual challenges in processing lengthy multimodal documents (text, images, tables) and performing cross-modal reasoning. Current document retrieval-augmented generation (DocRAG) methods remain limited by their text-centric approaches, frequently missing critical visual information. The field also lacks robust benchmarks for assessing multimodal evidence selection and integration. We introduce MMDocRAG, a comprehensive benchmark featuring 4,055 expert-annotated QA pairs with multi-page, cross-modal evidence chains. Our framework introduces innovative metrics for evaluating multimodal quote selection and enables answers that interleave text with relevant visual elements.

large language model, natural language, question answering

Technology:
- Information Technology > Artificial Intelligence > Natural Language
  - Question Answering (0.63)
  - Large Language Model (0.43)