VLMT: Vision-Language Multimodal Transformer for Multimodal Multi-hop Question Answering

Lim, Qi Zhi, Lee, Chin Poo, Lim, Kian Ming, Anbananthen, Kalaiarasi Sonai Muthu

arXiv.org Artificial Intelligence 

The increasing availability of multimodal data across text, tables, and images presents new challenges for developing models capable of complex cross-modal reasoning. Existing methods for Multimodal Multi-hop Question Answering (MMQA) often suffer from limited reasoning capabilities, reliance on modality conversion (e.g., image-to-text), and inadequate alignment between visual and textual representations. To address these limitations, this paper introduces Vision-Language Multimodal Transformer (VLMT), a unified architecture that integrates a transformer-based vision encoder with a sequence-to-sequence language model. VLMT employs a direct token-level injection mechanism to fuse visual and textual inputs within a shared embedding space, eliminating the need for intermediate projection layers. To enhance cross-modal alignment and reasoning, a three-stage pretraining strategy is proposed to progressively align vision-language representations and improve the model's capacity for multimodal understanding. Based on the pretrained backbone, two task-specific modules are instantiated to form a two-stage MMQA framework: a multimodal reranker that predicts document relevance scores and utilizes a relative threshold with top-k strategy for context retrieval, and a multimodal question answering model that generates contextually grounded answers based on the retrieved evidence. Comprehensive experiments on two benchmark datasets demonstrate the effectiveness of the proposed approach. These results highlight VLMT's strong capabilities in multimodal reasoning and its potential to advance real-world information retrieval and question answering systems.

The exponential growth of information in today's digital ecosystem has led to the proliferation of multimodal data (text, tables, and images) across a wide range of platforms.
Qi Zhi Lim is with the Faculty of Information Science and Technology, Multimedia University, Jalan Ayer Keroh Lama, 75450 Melaka, Malaysia (e-mail: 1181103589@student.mmu.edu.my). Chin Poo Lee is with the School of Computer Science, University of Nottingham Ningbo China, 199 Taikang East Road, Yinzhou District, Ningbo, Zhejiang Province, 315100, China (e-mail: leechinpoo@outlook.com). Kian Ming Lim is with the School of Computer Science, University of Nottingham Ningbo China, 199 Taikang East Road, Yinzhou District, Ningbo, Zhejiang Province, 315100, China (e-mail: Kian-Ming.Lim@nottingham.edu.cn).

Multimodal Multi-hop Question Answering (MMQA) [1], [2] has emerged as a representative task in this domain, reflecting real-world information-seeking behavior where relevant evidence is scattered across multiple sources and modalities. MMQA requires models to perform two interdependent operations: retrieving relevant multimodal context and reasoning over the retrieved information to produce accurate and coherent answers. Early solutions to MMQA have largely followed modular paradigms.
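The retrieval operation described above, where a reranker scores candidate documents and a relative threshold is combined with a top-k cap, can be illustrated with a minimal sketch. The text does not define the threshold rule, so the code assumes one plausible reading: keep documents whose score is at least a fraction `ratio` of the best score, then keep at most `k` of them. The function name `select_context`, the parameters `ratio` and `k`, and their default values are hypothetical and are not taken from the paper.

```python
from typing import List


def select_context(scores: List[float], ratio: float = 0.5, k: int = 3) -> List[int]:
    """Return indices of documents kept as context, best first.

    Assumption (not from the paper): scores are non-negative relevance
    scores, a document passes the relative threshold when its score is
    at least `ratio` times the highest score, and at most `k` documents
    are kept.
    """
    if not scores:
        return []

    # Rank document indices by descending relevance score.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

    # Relative threshold: a fraction of the top score, not a fixed cutoff.
    cutoff = ratio * scores[ranked[0]]
    kept = [i for i in ranked if scores[i] >= cutoff]

    # Top-k cap applied after thresholding.
    return kept[:k]


if __name__ == "__main__":
    # Hypothetical reranker scores for four candidate documents.
    print(select_context([0.9, 0.2, 0.5, 0.8], ratio=0.5, k=2))
```

Under this reading, the relative threshold adapts to how confident the reranker is on each question, while the top-k cap bounds the amount of context passed to the question answering model.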
