Hierarchical Vision-Language Reasoning for Multimodal Multiple-Choice Question Answering
Ao Zhou, Zebo Gu, Tenghao Sun, Jiawen Chen, Mingsheng Tu, Zifeng Cheng, Yafeng Yin, Zhiwei Jiang, Qing Gu
arXiv.org Artificial Intelligence
Multimodal Large Language Models (MLLMs) have demonstrated remarkable multimodal understanding capabilities in Visual Question Answering (VQA) tasks by integrating visual and textual features. However, under the challenging ten-choice question evaluation paradigm, existing methods still exhibit significant limitations when processing PDF documents with complex layouts and lengthy content. Notably, current mainstream models suffer from a strong bias toward English training data, resulting in suboptimal performance in Japanese and other non-English scenarios. To address these challenges, this paper proposes a novel Japanese PDF document understanding framework that combines a multimodal hierarchical reasoning mechanism with ColQwen-optimized retrieval, and introduces a semantic verification strategy based on sub-question decomposition. Experimental results demonstrate that the framework not only significantly enhances the model's deep semantic parsing of complex documents but also exhibits superior robustness in practical application scenarios.
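The pipeline the abstract describes (retrieve relevant pages, decompose the question into sub-questions, verify a candidate answer against the retrieved evidence) can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: the word-overlap scorer stands in for ColQwen page embeddings, and the string-splitting `decompose` stands in for MLLM-driven sub-question generation.

```python
def score(query: str, page: str) -> float:
    """Toy relevance score via word overlap (stand-in for ColQwen embeddings)."""
    q, p = set(query.lower().split()), set(page.lower().split())
    return len(q & p) / max(len(q), 1)

def retrieve(query: str, pages: list[str], k: int = 2) -> list[str]:
    """Return the top-k pages ranked by the toy relevance score."""
    return sorted(pages, key=lambda pg: score(query, pg), reverse=True)[:k]

def decompose(question: str) -> list[str]:
    """Placeholder sub-question decomposition (an MLLM would do this step)."""
    return [part.strip() for part in question.split(" and ") if part.strip()]

def verify(candidate: str, sub_questions: list[str], pages: list[str]) -> bool:
    """Semantic verification sketch: accept the candidate answer only if every
    sub-question retrieves a page that also mentions the candidate."""
    for sq in sub_questions:
        hits = retrieve(sq + " " + candidate, pages, k=1)
        if not hits or score(candidate, hits[0]) == 0.0:
            return False
    return True
```

For example, with pages `["the report was published in 2023 by the ministry", "sales grew in osaka"]` and the question "when was the report published and who published it", the candidate "ministry" passes verification while "osaka" is rejected, since no sub-question's top page supports it.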
Aug-25-2025