MMGraphRAG: Bridging Vision and Language with Interpretable Multimodal Knowledge Graphs

Xueyao Wan, Hang Yu

arXiv.org Artificial Intelligence 

Retrieval-Augmented Generation (RAG) enhances language model generation by retrieving relevant information from external knowledge bases. However, conventional RAG methods cannot access multimodal information. Multimodal RAG methods address this by mapping images and text into a shared embedding space, but they fail to capture the structure of knowledge and the logical chains between modalities. Moreover, they require large-scale training for specific tasks, which limits their generalization ability. To address these limitations, we propose MMGraphRAG, which refines visual content through scene graphs and constructs a multimodal knowledge graph (MMKG) in conjunction with a text-based KG. It employs spectral clustering to achieve cross-modal entity linking and retrieves context along reasoning paths to guide the generative process. Experimental results show that MMGraphRAG achieves state-of-the-art performance on the DocBench and MMLongBench datasets, demonstrating strong domain adaptability and clear reasoning paths.
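The abstract names spectral clustering as the mechanism for cross-modal entity linking. The following is a minimal sketch of that idea, not the paper's implementation: it assumes textual and visual entities have already been embedded into vectors, uses a cosine-similarity affinity matrix, and takes the cluster count as a given parameter; the function name link_entities and all configuration choices are illustrative assumptions.

```python
# Hedged sketch: cross-modal entity linking by spectral clustering over a
# similarity graph of entity embeddings. Embedding model, affinity measure,
# and cluster count are assumptions, not the paper's exact configuration.
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.metrics.pairwise import cosine_similarity


def link_entities(text_embs: np.ndarray, image_embs: np.ndarray, n_clusters: int):
    """Group textual and visual entity embeddings into shared clusters.

    Entities (from either modality) that land in the same cluster become
    candidate cross-modal links for merging the text KG and the scene-graph
    entities into one multimodal knowledge graph.
    """
    all_embs = np.vstack([text_embs, image_embs])
    # Non-negative affinity matrix from cosine similarity.
    affinity = np.clip(cosine_similarity(all_embs), 0.0, None)
    labels = SpectralClustering(
        n_clusters=n_clusters,
        affinity="precomputed",
        assign_labels="kmeans",
        random_state=0,
    ).fit_predict(affinity)
    # Split labels back into the text and image portions.
    return labels[: len(text_embs)], labels[len(text_embs):]
```

In this reading, the cluster assignments only propose which textual and visual entities refer to the same real-world object; how MMGraphRAG validates and merges those candidates into MMKG nodes is described in the paper itself.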