From Generator to Embedder: Harnessing Innate Abilities of Multimodal LLMs via Building Zero-Shot Discriminative Embedding Model

Ju, Yeong-Joon, Lee, Seong-Whan

arXiv.org Artificial Intelligence 

Multimodal Large Language Models (MLLMs) have emerged as a promising solution for universal embedding tasks, yet adapting their generative nature for discriminative representation learning remains a significant challenge. The dominant paradigm of large-scale contrastive pre-training suffers from critical inefficiencies, including prohibitive computational costs and a failure to leverage the intrinsic, instruction-following capabilities of MLLMs. To overcome these limitations, we propose an efficient framework for universal multimodal embeddings, which bridges this gap by centering on two synergistic components. First, our hierarchical embedding prompt template employs a two-level instruction architecture that forces the model to produce discriminative representations. Building on this strong foundation, our second component, self-aware hard negative sampling, redefines the fine-tuning process by leveraging the model's own understanding to efficiently mine challenging negatives while actively filtering out potential false negatives. Our comprehensive experiments show that our hierarchical prompt achieves zero-shot performance competitive with contrastively trained baselines and enhances the fine-tuning process by lifting a simple in-batch negative baseline by 4.8 points on the MMEB benchmark. Our work presents an effective and efficient pathway to adapt MLLMs for universal embedding tasks, significantly reducing training time.

Embeddings [1], [2] are fixed-dimensional representations of inputs, encoded as semantic information within a continuous vector space, underpinning various downstream tasks such as clustering [3], [4], retrieval [5]-[8], and classification [9]. Following the success of instruction-based multi-task training methods [10], [11], the focus of research has shifted toward achieving universal embeddings [12]-[14], where a single model provides robust representations across diverse tasks and domains.
The rapid growth of multimedia applications has further driven the need for universal multimodal embeddings [15]-[18] capable of supporting both uni-modal and cross-modal retrieval.
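The abstract mentions mining challenging negatives while filtering out potential false negatives, but this excerpt does not give the procedure. The following is only a generic sketch of similarity-based hard negative mining with a threshold filter, not the authors' method. The function name `mine_hard_negatives`, the parameters `k` and `margin`, and the rule "discard any candidate whose similarity comes within `margin` of the positive's" are all assumptions made for illustration.

```python
import numpy as np

def mine_hard_negatives(query_emb, cand_emb, pos_idx, k=2, margin=0.05):
    """Illustrative hard negative mining (hypothetical, not the paper's algorithm).

    For each query, rank candidates by cosine similarity and keep the k most
    similar ones that are (a) not the labeled positive and (b) at least
    `margin` less similar than the positive. Candidates that fail (b) are
    treated as potential false negatives and skipped.
    """
    # L2-normalize so that the dot product equals cosine similarity.
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    c = cand_emb / np.linalg.norm(cand_emb, axis=1, keepdims=True)
    sim = q @ c.T  # shape: (num_queries, num_candidates)

    mined = []
    for i, p in enumerate(pos_idx):
        pos_sim = sim[i, p]
        order = np.argsort(-sim[i])  # most similar candidates first
        picked = [
            int(j) for j in order
            if j != p and sim[i, j] < pos_sim - margin
        ][:k]
        mined.append(picked)
    return mined

# Toy usage: candidate 0 is the positive, candidate 1 is a near-duplicate
# of it (likely false negative), candidates 2 to 4 are true negatives.
queries = np.array([[1.0, 0.0]])
candidates = np.array([
    [1.0, 0.0],
    [0.999, 0.04],
    [0.8, 0.6],
    [0.0, 1.0],
    [-1.0, 0.0],
])
hard_negatives = mine_hard_negatives(queries, candidates, pos_idx=[0])
```

In this toy case the near-duplicate candidate is skipped, and the two most similar remaining candidates are returned as hard negatives. In practice the similarity scores would come from the embedding model being fine-tuned, and the threshold would need tuning.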
