Structured and Abstractive Reasoning on Multi-modal Relational Knowledge Images

Zhang, Yichi, Chen, Zhuo, Guo, Lingbing, Liang, Lei, Zhang, Wen, Chen, Huajun

arXiv.org Artificial Intelligence 

Reasoning effectively about abstractive image inputs is especially challenging for MLLMs, as it demands not only basic object recognition but also a deeper understanding and interpretation of the complex information encapsulated within these human-defined abstractive visual forms. Among the diverse array of abstractive images, an important area remains underexplored: STructured and Abstractive Reasoning (STAR) on images with Multi-Modal Relational Knowledge (MMRK). As illustrated in Figure 1, MMRK consists of multiple multi-modal entities and concepts that are interconnected by abstract relational edges, representing well-organized and structured factual knowledge. Unlike natural or other abstractive images, MMRK offers a flexible and structured format for encoding complex semantic relations, with broad application potential (An et al., 2025). The relational links act as higher-order human-defined abstractions, modeling intricate connections among entities, and thus place greater demands on MLLMs' reasoning capabilities. To perform STAR accurately, MLLMs must understand both the entities and the underlying relational structure. However, STAR remains largely unaddressed. Only a few studies (Zhang et al., 2024a; 2025d) have briefly investigated this capability, and two critical challenges persist: (i) Lack of a large-scale data synthesis method for STAR. From the data perspective, there is a shortage of high-quality MMRK images and corresponding multi-modal instruction data. Automated pipelines for generating diverse and scalable MMRK datasets are missing, as are the reliable chain-of-thought (CoT) reasoning annotations needed to improve MLLMs' complex thinking and generalization ability.