MERLIN: A Testbed for Multilingual Multimodal Entity Recognition and Linking
Ramamoorthy, Sathyanarayanan, Shah, Vishwa, Khanuja, Simran, Sheikh, Zaid, Jie, Shan, Chia, Ann, Chua, Shearman, Neubig, Graham
This paper introduces MERLIN, a novel testbed system for the task of Multilingual Multimodal Entity Linking. The created dataset includes BBC news article titles, paired with corresponding images, in five languages: Hindi, Japanese, Indonesian, Vietnamese, and Tamil, featuring over 7,000 named entity mentions linked to 2,500 unique Wikidata entities. We also include several benchmarks using multilingual and multimodal entity linking methods, exploring different language models such as LLaMa-2 and Aya-23. Our findings indicate that incorporating visual data improves the accuracy of entity linking, especially for entities where the textual context is ambiguous or insufficient, and particularly for models that do not have strong multilingual abilities. The dataset and methods are available at https://github.com/rsathya4802/merlin
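As a minimal sketch of the core idea these benchmarks evaluate (not MERLIN's actual implementation), linking can be framed as scoring each candidate entity by a weighted combination of text and image similarity. All names and the weighting scheme below are illustrative assumptions.

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def link_mention(mention_text_emb, mention_img_emb, entities, alpha=0.5):
    """Score each candidate entity by a weighted sum of text-side and
    image-side cosine similarity; return the highest-scoring entity id.
    `entities` maps entity id -> (text_emb, img_emb)."""
    best_id, best_score = None, -np.inf
    for eid, (text_emb, img_emb) in entities.items():
        score = (alpha * cosine(mention_text_emb, text_emb)
                 + (1 - alpha) * cosine(mention_img_emb, img_emb))
        if score > best_score:
            best_id, best_score = eid, score
    return best_id, best_score
```

With `alpha < 1`, the image channel can break ties when the textual context alone is ambiguous, which is the effect the paper reports.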
PGMEL: Policy Gradient-based Generative Adversarial Network for Multimodal Entity Linking
Pooja, KM, Long, Cheng, Sun, Aixin
The task of entity linking, which involves associating mentions with their respective entities in a knowledge graph, has received significant attention due to its numerous potential applications. Recently, various multimodal entity linking (MEL) techniques have been proposed, targeted to learn comprehensive embeddings by leveraging both text and vision modalities. The selection of high-quality negative samples can potentially play a crucial role in metric/representation learning. However, to the best of our knowledge, this possibility remains unexplored in existing literature within the framework of MEL. To fill this gap, we address the multimodal entity linking problem in a generative adversarial setting, where the generator is responsible for generating high-quality negative samples and the discriminator is assigned the metric learning task. Since the generator is involved in generating samples, which is a discrete process, we optimize it using policy gradient techniques and propose a policy gradient-based generative adversarial network for multimodal entity linking (PGMEL). Experimental results on the Wiki-MEL, Richpedia-MEL and WikiDiverse datasets demonstrate that PGMEL learns meaningful representations by selecting challenging negative samples and outperforms state-of-the-art methods.

The last few decades have seen unprecedented growth in data availability. However, the increasing data availability quickly becomes a liability rather than an asset due to the increased gap between data and information. Thus, information extraction (IE) techniques to retrieve knowledge/information from a large amount of data have received considerable attention recently. A knowledge graph (KG) is a structured information database that allows storing extracted information from a large amount of data for retrieval or reasoning at a later stage.
Furthermore, the recent developments in IE techniques allow the automatic creation of large KGs with millions of entries from knowledge bases such as Wikipedia, DBpedia, Freebase, and YAGO [1]. Automated KG construction is a complex task that involves various intricate subtasks, including (i) named entity recognition to identify and categorize named entities, like a person or geographic location, in text, (ii) co-reference resolution to group references of the same entity, (iii) relation extraction to establish relationships between the entities, and (iv) entity linking [2], [3].

KM Pooja is with the Department of Information Technology, Indian Institute of Information Technology, Allahabad, India 211012. Cheng Long and Aixin Sun are with the School of Computer Science and Engineering, Nanyang Technological University, 50 Nanyang Ave, Singapore 639798.
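Because sampling a negative is discrete, PGMEL's generator cannot be trained by ordinary backpropagation; the abstract's policy-gradient choice can be sketched with a REINFORCE update over a categorical policy. This is a toy illustration under assumed names, not the paper's training loop.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def reinforce_step(logits, reward_fn, lr=0.5):
    """One REINFORCE update: sample a negative-sample index from the
    generator's categorical policy, observe a reward (e.g. how 'hard'
    the sampled negative is for the discriminator), and shift probability
    mass toward high-reward samples."""
    probs = softmax(logits)
    idx = rng.choice(len(logits), p=probs)
    reward = reward_fn(idx)
    grad = -probs
    grad[idx] += 1.0          # gradient of log p(idx) w.r.t. the logits
    return logits + lr * reward * grad, idx, reward
```

Repeating this step concentrates the policy on whichever candidate the reward function marks as the hardest negative, which is exactly the adversarial pressure the discriminator needs.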
DeepMEL: A Multi-Agent Collaboration Framework for Multimodal Entity Linking
Wang, Fang, Yan, Tianwei, Yang, Zonghao, Hu, Minghao, Zhang, Jun, Luo, Zhunchen, Bai, Xiaoying
Entity linking is a fundamental task in knowledge graph (KG) construction Hofer et al. (2024), aiming to link mentions to their corresponding entities in a target knowledge base (KB). It is widely applied in downstream natural language processing (NLP) tasks, such as Question & Answering Systems Sequeda et al. (2024) and intelligent recommendation systems Chaudhari et al. (2017). Recently, the explosive growth of multimodal data on the Internet has raised challenges, as the quality of online information is often inconsistent, many mentions are ambiguous, and contextual information is frequently incomplete. Under such conditions, relying solely on a single modality (such as pure text) is often insufficient to accurately resolve reference ambiguity Gan et al. (2021). Integrating textual and visual modalities can significantly improve the precision and efficiency of disambiguation Gella et al. (2017). Consequently, multimodal entity linking, which involves combining textual and visual information to link real-world mentions to corresponding entities in a multimodal knowledge graph (MMKG), has become a critical research task. For example, as shown in Figure 1, the mention of "Apple" may be difficult to disambiguate, as it could refer to various entities, such as Apple Inc. or the apple (fruit). However, by considering both textual and visual information, it becomes easier and clearer to accurately link the mention of "Apple" to the entity "apple (fruit of the apple tree)." Currently, multimodal entity linking models are primarily based on deep learning frameworks, utilizing cross-attention mechanisms Lu and Elhamifar (2024) and visual feature encoding techniques Mokssit et al. (2023) to achieve the fusion of textual mentions and visual information.
Multi-level Mixture of Experts for Multimodal Entity Linking
Hu, Zhiwei, Gutiérrez-Basulto, Víctor, Xiang, Zhiliang, Li, Ru, Pan, Jeff Z.
Multimodal Entity Linking (MEL) aims to link ambiguous mentions within multimodal contexts to associated entities in a multimodal knowledge base. Existing approaches to MEL introduce multimodal interaction and fusion mechanisms to bridge the modality gap and enable multi-grained semantic matching. However, they do not address two important problems: (i) mention ambiguity, i.e., the lack of semantic content caused by the brevity and omission of key information in the mention's textual context; (ii) dynamic selection of modal content, i.e., to dynamically distinguish the importance of different parts of modal information. To mitigate these issues, we propose a Multi-level Mixture of Experts (MMoE) model for MEL. MMoE has four components: (i) the description-aware mention enhancement module leverages large language models to identify the WikiData descriptions that best match a mention, considering the mention's textual context; (ii) the multimodal feature extraction module adopts multimodal feature encoders to obtain textual and visual embeddings for both mentions and entities; (iii)-(iv) the intra-level mixture of experts and inter-level mixture of experts modules apply a switch mixture of experts mechanism to dynamically and adaptively select features from relevant regions of information. Extensive experiments demonstrate the outstanding performance of MMoE compared to the state-of-the-art. MMoE's code is available at: https://github.com/zhiweihu1103/MEL-MMoE.
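The "switch mixture of experts" mechanism in modules (iii)-(iv) can be illustrated with top-1 routing: a gating network picks one expert per input and scales its output by the gate probability. Names and shapes here are illustrative assumptions, not MMoE's code.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

def switch_moe(x, gate_w, experts):
    """Top-1 'switch' routing: the gate scores every expert per input row,
    only the argmax expert runs, and its output is scaled by the gate
    probability (so routing stays differentiable through the gate)."""
    gate_probs = softmax(x @ gate_w)          # (n_inputs, n_experts)
    choice = gate_probs.argmax(axis=-1)       # one expert per input
    out = np.empty_like(x)
    for t, e in enumerate(choice):
        out[t] = gate_probs[t, e] * experts[e](x[t])
    return out, choice
```

Running only one expert per input is what makes the dynamic selection cheap: capacity grows with the number of experts while per-input compute stays constant.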
KGMEL: Knowledge Graph-Enhanced Multimodal Entity Linking
Kim, Juyeon, Lee, Geon, Kim, Taeuk, Shin, Kijung
Entity linking (EL) aligns textual mentions with their corresponding entities in a knowledge base, facilitating various applications such as semantic search and question answering. Recent advances in multimodal entity linking (MEL) have shown that combining text and images can reduce ambiguity and improve alignment accuracy. However, most existing MEL methods overlook the rich structural information available in the form of knowledge-graph (KG) triples. In this paper, we propose KGMEL, a novel framework that leverages KG triples to enhance MEL. Specifically, it operates in three stages: (1) Generation: Produces high-quality triples for each mention by employing vision-language models based on its text and images. (2) Retrieval: Learns joint mention-entity representations, via contrastive learning, that integrate text, images, and (generated or KG) triples to retrieve candidate entities for each mention. (3) Reranking: Refines the KG triples of the candidate entities and employs large language models to identify the best-matching entity for the mention. Extensive experiments on benchmark datasets demonstrate that KGMEL outperforms existing methods. Our code and datasets are available at: https://github.com/juyeonnn/KGMEL.
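KGMEL's retrieval stage learns joint mention-entity representations via contrastive learning; the standard in-batch InfoNCE objective behind such training can be sketched as follows (a generic formulation, not KGMEL's exact loss).

```python
import numpy as np

def info_nce(mention_embs, entity_embs, tau=0.1):
    """In-batch InfoNCE: mention i's positive is entity i, and every other
    entity in the batch serves as a negative. Lower loss means matched
    pairs are closer than mismatched ones."""
    m = mention_embs / np.linalg.norm(mention_embs, axis=1, keepdims=True)
    e = entity_embs / np.linalg.norm(entity_embs, axis=1, keepdims=True)
    logits = m @ e.T / tau                        # (batch, batch) similarities
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))           # positives on the diagonal
```

In KGMEL the embeddings being contrasted would fuse text, image, and (generated or KG) triple features; here they are plain vectors for illustration.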
Enhancing Multimodal Entity Linking with Jaccard Distance-based Conditional Contrastive Learning and Contextual Visual Augmentation
Nguyen, Cong-Duy, Wu, Xiaobao, Nguyen, Thong, Zhao, Shuai, Le, Khoi, Nguyen, Viet-Anh, Yichao, Feng, Luu, Anh Tuan
Previous research on multimodal entity linking (MEL) has primarily employed contrastive learning as the main objective. However, by using the rest of the batch as negative samples without careful consideration, these studies risk exploiting easy features and potentially overlooking the essential details that make entities unique. In this work, we propose JD-CCL (Jaccard Distance-based Conditional Contrastive Learning), a novel approach designed to enhance the matching ability of multimodal entity linking models. JD-CCL leverages meta-information to select negative samples with similar attributes, making the linking task more challenging and robust. Additionally, to address the limitations caused by variations within the visual modality among mentions and entities, we introduce a novel method, CVaCPT (Contextual Visual-aid Controllable Patch Transform). It enhances visual representations by incorporating multi-view synthetic images and contextual textual representations to scale and shift patch representations. Experimental results on benchmark MEL datasets demonstrate the strong effectiveness of our approach.
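The Jaccard-based negative selection idea can be sketched directly: among candidate entities, prefer as negatives those whose attribute sets are most similar to the anchor's (smallest Jaccard distance), so the contrastive task cannot be solved from easy surface features. Function and attribute names below are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity between two attribute sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def hard_negatives(anchor_attrs, candidates, k=2):
    """Pick the k candidates whose meta-attributes are MOST similar to the
    anchor's (i.e. smallest Jaccard distance): the hardest negatives.
    `candidates` maps entity id -> attribute set."""
    ranked = sorted(candidates.items(),
                    key=lambda kv: 1.0 - jaccard(anchor_attrs, kv[1]))
    return [eid for eid, _ in ranked[:k]]
```

A near-duplicate entity (same type, same country, same profession) then serves as the negative, forcing the model to attend to the fine-grained details that actually distinguish entities.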
VP-MEL: Visual Prompts Guided Multimodal Entity Linking
Mi, Hongze, Li, Jinyuan, Zhang, Xuying, Cheng, Haoran, Wang, Jiahao, Sun, Di, Pan, Gang
Multimodal entity linking (MEL), a task aimed at linking mentions within multimodal contexts to their corresponding entities in a knowledge base (KB), has attracted much attention due to its wide applications in recent years. However, existing MEL methods often rely heavily on mention words as retrieval cues, which limits their ability to effectively utilize information from both images and text. This reliance poses significant challenges in scenarios where mention words are absent, as current MEL approaches struggle to leverage image-text pairs for accurate entity linking. To solve these issues, we introduce a Visual Prompts guided Multimodal Entity Linking (VP-MEL) task. Given a text-image pair, VP-MEL aims to link a marked region (i.e., visual prompt) in an image to its corresponding entities in the knowledge base. To facilitate this task, we present a new dataset, VPWiki, specifically designed for VP-MEL. Furthermore, we propose a framework named FBMEL, which enhances visual feature extraction using visual prompts and leverages the pretrained Detective-VLM model to capture latent information. Experimental results on the VPWiki dataset demonstrate that FBMEL outperforms baseline methods across multiple benchmarks for the VP-MEL task.
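The VP-MEL setting (link a marked image region rather than a mention word) can be sketched as: crop the visual-prompt box, featurize only that region, and match it against entity features. This is a toy stand-in for FBMEL's learned pipeline; the featurizer and entity features are assumptions.

```python
import numpy as np

def crop(image, box):
    """Cut the visual-prompt region (x0, y0, x1, y1) out of an H x W image."""
    x0, y0, x1, y1 = box
    return image[y0:y1, x0:x1]

def link_region(image, box, entity_feats, featurize):
    """Featurize only the prompted region and link it to the entity with
    the nearest feature vector. `entity_feats` maps entity id -> vector."""
    f = featurize(crop(image, box))
    return min(entity_feats,
               key=lambda e: np.linalg.norm(f - entity_feats[e]))
```

The key point the task formulation makes is that no mention word enters this pipeline at all: the visual prompt alone selects what is to be linked.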
Multi-level Matching Network for Multimodal Entity Linking
Hu, Zhiwei, Gutiérrez-Basulto, Víctor, Li, Ru, Pan, Jeff Z.
Multimodal entity linking (MEL) aims to link ambiguous mentions within multimodal contexts to corresponding entities in a multimodal knowledge base. Most existing approaches to MEL are based on representation learning or vision-and-language pre-training mechanisms for exploring the complementary effect among multiple modalities. However, these methods suffer from two limitations. On the one hand, they overlook the possibility of considering negative samples from the same modality. On the other hand, they lack mechanisms to capture bidirectional cross-modal interaction. To address these issues, we propose a Multi-level Matching network for Multimodal Entity Linking (M3EL). Specifically, M3EL is composed of three different modules: (i) a Multimodal Feature Extraction module, which extracts modality-specific representations with a multimodal encoder and introduces an intra-modal contrastive learning sub-module to obtain better discriminative embeddings based on uni-modal differences; (ii) an Intra-modal Matching Network module, which contains two levels of matching granularity: Coarse-grained Global-to-Global and Fine-grained Global-to-Local, to achieve local and global level intra-modal interaction; (iii) a Cross-modal Matching Network module, which applies bidirectional strategies, Textual-to-Visual and Visual-to-Textual matching, to implement bidirectional cross-modal interaction. Extensive experiments conducted on WikiMEL, RichpediaMEL, and WikiDiverse datasets demonstrate the outstanding performance of M3EL when compared to the state-of-the-art baselines.
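The bidirectional cross-modal interaction in module (iii) can be illustrated as scoring in both directions: mention text against entity image (Textual-to-Visual) and mention image against entity text (Visual-to-Textual), then combining the two. A minimal sketch, not M3EL's network.

```python
import numpy as np

def cos(a, b):
    # Cosine similarity between two vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def bidirectional_score(m_text, m_img, e_text, e_img):
    """Cross-modal matching in both directions: T->V (mention text vs.
    entity image) and V->T (mention image vs. entity text), averaged."""
    return 0.5 * (cos(m_text, e_img) + cos(m_img, e_text))
```

A one-directional model would only ever compare, say, mention text to entity image; averaging both directions is what makes the interaction bidirectional.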
UniMEL: A Unified Framework for Multimodal Entity Linking with Large Language Models
Liu, Qi, He, Yongyi, Lian, Defu, Zheng, Zhi, Xu, Tong, Liu, Che, Chen, Enhong
Multimodal Entity Linking (MEL) is a crucial task that aims at linking ambiguous mentions within multimodal contexts to the referent entities in a multimodal knowledge base, such as Wikipedia. Existing methods focus heavily on using complex mechanisms and extensive model tuning to capture multimodal interaction on specific datasets. However, these methods overcomplicate the MEL task and overlook the visual semantic information, which makes them costly and hard to scale. Moreover, they cannot solve issues such as textual ambiguity, redundancy, and noisy images, which severely degrade their performance. Fortunately, the advent of Large Language Models (LLMs) with robust capabilities in text understanding and reasoning, particularly Multimodal Large Language Models (MLLMs) that can process multimodal inputs, provides new insights into addressing this challenge. However, how to design a universally applicable LLM-based MEL approach remains a pressing challenge. To this end, we propose UniMEL, a unified framework which establishes a new paradigm for processing multimodal entity linking tasks using LLMs. In this framework, we employ LLMs to augment the representations of mentions and entities individually by integrating textual and visual information and refining textual information. Subsequently, we employ an embedding-based method for retrieving and re-ranking candidate entities. Then, with only ~0.26% of the model parameters fine-tuned, LLMs can make the final selection from the candidate entities. Extensive experiments on three public benchmark datasets demonstrate that our solution achieves state-of-the-art performance, and ablation studies verify the effectiveness of all modules. Our code is available at https://github.com/Javkonline/UniMEL.
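UniMEL's two-stage pipeline (embedding-based retrieval, then selection among candidates) can be sketched generically; the reranker below is a stub standing in for the lightly fine-tuned LLM, and all names are illustrative.

```python
import numpy as np

def retrieve(query_emb, entity_embs, k=3):
    """Stage 1: embedding-based retrieval of the top-k candidate entities
    by cosine similarity. Returns candidate indices and their scores."""
    q = query_emb / np.linalg.norm(query_emb)
    e = entity_embs / np.linalg.norm(entity_embs, axis=1, keepdims=True)
    sims = e @ q
    top = np.argsort(-sims)[:k]
    return top.tolist(), sims[top].tolist()

def rerank(candidates, score_fn):
    """Stage 2: a scorer (in UniMEL, this role is played by an LLM making
    the final selection) picks the single best candidate."""
    return max(candidates, key=score_fn)
```

The division of labor is the point: cheap dense retrieval narrows millions of entities to a handful, so the expensive model only ever judges k candidates.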
DIM: Dynamic Integration of Multimodal Entity Linking with Large Language Model
Song, Shezheng, Li, Shasha, Yu, Jie, Zhao, Shan, Li, Xiaopeng, Ma, Jun, Liu, Xiaodong, Li, Zhuo, Mao, Xiaoguang
Our study delves into Multimodal Entity Linking, which aligns mentions in multimodal information with entities in a knowledge base. Existing methods still face challenges such as ambiguous entity representations and limited utilization of image information. Thus, we propose dynamic entity extraction using ChatGPT, which dynamically extracts entities and enhances datasets. We also propose a method, Dynamically Integrate Multimodal information with knowledge base (DIM), employing the capability of a Large Language Model (LLM) for visual understanding. The LLM, such as BLIP-2, extracts information relevant to entities in the image, which facilitates improved extraction of entity features and links them with the dynamic entity representations provided by ChatGPT. The experiments demonstrate that our proposed DIM method outperforms the majority of existing methods on the three original datasets, and achieves state-of-the-art (SOTA) results on the dynamically enhanced datasets (Wiki+, Rich+, Diverse+).
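The DIM idea of enriching a mention with image-derived information before matching can be sketched with a toy pipeline: an image caption (standing in for the BLIP-2 output) is appended to the textual context, and a naive token-overlap matcher (standing in for the learned similarity) picks the entity. Every function here is a hypothetical stand-in.

```python
def enrich_mention(text_ctx, image_caption):
    """Augment the mention's textual context with entity-relevant
    information extracted from the image (here: a caption string)."""
    return f"{text_ctx} [IMG] {image_caption}"

def best_entity(mention_repr, kb):
    """Naive token-overlap matcher standing in for a learned similarity.
    `kb` maps entity id -> textual description."""
    toks = set(mention_repr.lower().split())
    return max(kb, key=lambda e: len(toks & set(kb[e].lower().split())))
```

The image information is what tips the match: without the caption, an ambiguous mention like "Apple" overlaps both senses equally.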