DeepMEL: A Multi-Agent Collaboration Framework for Multimodal Entity Linking

Fang Wang, Tianwei Yan, Zonghao Yang, Minghao Hu, Jun Zhang, Zhunchen Luo, Xiaoying Bai

arXiv.org Artificial Intelligence 

Entity linking is a fundamental task in knowledge graph (KG) construction Hofer et al. (2024), aiming to link mentions to their corresponding entities in a target knowledge base (KB). It underpins many downstream natural language processing (NLP) tasks, such as question answering systems Sequeda et al. (2024) and intelligent recommendation systems Chaudhari et al. (2017).

Recently, the explosive growth of multimodal data on the Internet has raised new challenges: the quality of online information is often inconsistent, many mentions are ambiguous, and contextual information is frequently incomplete. Under such conditions, relying on a single modality (such as pure text) is often insufficient to resolve reference ambiguity Gan et al. (2021), whereas integrating textual and visual modalities can significantly improve the precision and efficiency of disambiguation Gella et al. (2017). Consequently, multimodal entity linking, which combines textual and visual information to link real-world mentions to their corresponding entities in a multimodal knowledge graph (MMKG), has become a critical research task. For example, as shown in Figure 1, the mention "Apple" is difficult to disambiguate from text alone, as it could refer to several entities, such as Apple Inc. or the apple (fruit). By considering both textual and visual information, however, the mention "Apple" can be accurately linked to the entity "apple (fruit of the apple tree)."

Current multimodal entity linking models are primarily built on deep learning frameworks, employing cross-attention mechanisms Lu and Elhamifar (2024) and visual feature encoding techniques Mokssit et al. (2023) to fuse textual mentions with visual information.
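To make the multimodal disambiguation idea concrete, the following is a minimal sketch (not the paper's method) of scoring candidate entities by a weighted combination of text-embedding and image-embedding similarity. All embeddings and the `alpha` weight are hypothetical toy values chosen so that the text evidence alone is ambiguous but the image evidence resolves it.

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def link_mention(text_emb, image_emb, candidates, alpha=0.5):
    """Score each candidate entity by a weighted sum of text and image
    similarity and return the best-scoring entity name.
    `candidates` maps entity name -> (text embedding, image embedding)."""
    best, best_score = None, float("-inf")
    for name, (cand_text, cand_image) in candidates.items():
        score = (alpha * cosine(text_emb, cand_text)
                 + (1 - alpha) * cosine(image_emb, cand_image))
        if score > best_score:
            best, best_score = name, score
    return best

# Toy 3-d embeddings (hypothetical). Both candidates match the text
# "Apple" equally well; the attached photo clearly shows a fruit.
candidates = {
    "Apple Inc.":    ([1.0, 0.0, 0.0], [0.9, 0.1, 0.0]),
    "apple (fruit)": ([0.8, 0.6, 0.0], [0.0, 0.2, 1.0]),
}
mention_text = [0.9, 0.3, 0.0]   # ambiguous textual mention
mention_image = [0.0, 0.1, 1.0]  # visual context: a fruit

print(link_mention(mention_text, mention_image, candidates))
# → apple (fruit)
```

Real systems replace the toy vectors with learned encoders and the fixed weighted sum with cross-attention fusion, but the core decision, ranking KB candidates by joint text-image compatibility, is the same.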