MMGraphRAG: Bridging Vision and Language with Interpretable Multimodal Knowledge Graphs
arXiv.org Artificial Intelligence
Retrieval-Augmented Generation (RAG) enhances language model generation by retrieving relevant information from external knowledge bases. However, conventional RAG methods face the issue of missing multimodal information. Multimodal RAG methods address this by fusing images and text through mapping them into a shared embedding space, but they fail to capture the structure of knowledge and the logical chains between modalities. Moreover, they also require large-scale training for specific tasks, resulting in limited generalization ability. To address these limitations, we propose MMGraphRAG, which refines visual content through scene graphs and constructs a multimodal knowledge graph (MMKG) in conjunction with a text-based KG. It employs spectral clustering to achieve cross-modal entity linking and retrieves context along reasoning paths to guide the generative process. Experimental results show that MMGraphRAG achieves state-of-the-art performance on the DocBench and MMLongBench datasets, demonstrating strong domain adaptability and clear reasoning paths.
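The cross-modal entity linking step described above can be sketched with plain NumPy. This is a minimal illustration, not the paper's implementation: the toy entities, embeddings, and cosine-similarity affinity are assumptions. The idea is that text-KG entities and scene-graph (visual) entities are embedded in a shared space, a similarity graph is built over all of them, and spectral clustering groups cross-modal mentions of the same entity so they can be linked in the MMKG.

```python
import numpy as np

# Hypothetical example: two entities ("car", "road"), each appearing
# once in the text KG and once in the image scene graph. Embeddings
# are toy values chosen for illustration only.
entities = ["car (text)", "road (text)", "car (image)", "road (image)"]
emb = np.array([
    [1.0, 0.1],   # "car" from the text KG
    [0.1, 1.0],   # "road" from the text KG
    [0.9, 0.2],   # "car" region from the scene graph
    [0.2, 0.9],   # "road" region from the scene graph
])

# Cosine-similarity affinity matrix across both modalities.
norm = emb / np.linalg.norm(emb, axis=1, keepdims=True)
W = norm @ norm.T
np.fill_diagonal(W, 0.0)

# Symmetrically normalized graph Laplacian: L = I - D^{-1/2} W D^{-1/2}.
d = W.sum(axis=1)
D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
L = np.eye(len(entities)) - D_inv_sqrt @ W @ D_inv_sqrt

# The eigenvector for the second-smallest eigenvalue (Fiedler vector)
# partitions the graph into two clusters; entities that land in the
# same cluster are treated as the same entity and linked cross-modally.
eigvals, eigvecs = np.linalg.eigh(L)
fiedler = eigvecs[:, 1]
labels = (fiedler > 0).astype(int)

for name, lab in zip(entities, labels):
    print(f"{name} -> cluster {lab}")
```

With these toy embeddings, the text and image mentions of "car" fall into one cluster and the two "road" mentions into the other, which is the linking behavior the spectral step is meant to provide. A real pipeline would cluster many entities into k groups (k-means on several eigenvectors) rather than a single two-way split.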
Jul-29-2025