Structured and Abstractive Reasoning on Multi-modal Relational Knowledge Images

Zhang, Yichi, Chen, Zhuo, Guo, Lingbing, Liang, Lei, Zhang, Wen, Chen, Huajun

Oct-28-2025–arXiv.org Artificial Intelligence

Effectively reasoning about abstractive image inputs poses an elevated challenge for MLLMs, as it demands not only basic object recognition but also a deeper understanding and interpretation of the complex information encapsulated within these human-defined abstractive visual forms. Among the diverse array of abstractive images, an important area remains underexplored: STructured and Abstractive Reasoning (STAR) on images with Multi-Modal Relational Knowledge (MMRK). As illustrated in Figure 1, MMRK consists of multiple multi-modal entities and concepts that are interconnected by abstract relational edges, representing well-organized and structured factual knowledge. Unlike natural or other abstractive images, MMRK offers a flexible and structured format for encoding complex semantic relations, with broad application potential (An et al., 2025). The relational links act as higher-order human-defined abstractions that model intricate connections among entities, and thus place greater demands on MLLMs' reasoning capabilities. To perform STAR accurately, MLLMs must understand both the entities and the underlying relational structure. However, STAR remains largely unaddressed: only a few studies (Zhang et al., 2024a; 2025d) have briefly investigated this capability, and two critical challenges remain: (i) Lack of a large-scale data synthesis method for STAR. From the data perspective, there is a shortage of high-quality MMRK images and corresponding multi-modal instruction data. Automated pipelines for generating diverse and scalable MMRK datasets are missing, along with the reliable chain-of-thought (CoT) reasoning annotations needed to improve MLLMs' complex thinking and generalization ability.
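To make the MMRK description above concrete, the following is a minimal sketch of how entities linked by relational edges might be held in memory before being rendered as an image. It is not the authors' data format: the class names, the `image` field, the relation labels, and the sample entities are all illustrative assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Entity:
    """A node in the graph (hypothetical structure, not from the paper)."""
    name: str
    image: Optional[str] = None  # assumed image path; None for text-only concepts


@dataclass
class RelationalGraph:
    """Entities joined by labelled, directed relational edges."""
    triples: List[Tuple[Entity, str, Entity]] = field(default_factory=list)

    def add(self, head: Entity, relation: str, tail: Entity) -> None:
        """Store one (head, relation, tail) fact."""
        self.triples.append((head, relation, tail))

    def tails(self, head_name: str, relation: str) -> List[str]:
        """One-hop query: names of entities reached from head_name via relation."""
        return [t.name for h, r, t in self.triples
                if h.name == head_name and r == relation]


# Illustrative data only; file names and relation labels are made up.
graph = RelationalGraph()
film = Entity("Inception", image="inception_poster.png")
director = Entity("Christopher Nolan", image="nolan.jpg")
genre = Entity("Science fiction")  # a text-only concept
graph.add(film, "directed_by", director)
graph.add(film, "has_genre", genre)

print(graph.tails("Inception", "directed_by"))
```

A question such as "who directed the film shown?" then requires recognizing the entity from its image and following the labelled edge, which are the two abilities the abstract says STAR demands.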

large language model, machine learning, natural language

Country:
- Asia (0.28)

Genre:
- Research Report > New Finding (0.93)

Industry:
- Information Technology (0.46)

Technology:
- Information Technology > Artificial Intelligence
  - Representation & Reasoning (1.00)
  - Vision (0.87)
  - Natural Language
    - Large Language Model (1.00)
    - Chatbot (0.68)
  - Machine Learning > Neural Networks
    - Deep Learning (0.68)
