Efficient fine-tuning methodology of text embedding models for information retrieval: contrastive learning penalty (CLP)

Yu, Jeongsu

arXiv.org Artificial Intelligence 

Text embedding models play a crucial role in natural language processing, particularly in information retrieval, by mapping text data into a semantically rich vector space. The importance of information retrieval has been further highlighted by the recent use of Retrieval-Augmented Generation (RAG) (Lewis et al., 2020) to address hallucination and outdated information in large language models (LLMs). Text embedding models pre-trained on massive corpora have significantly improved the quality of text representation. BGE M3-Embedding (Chen et al., 2024) is a representative model that shows outstanding performance in multilingual text embedding and information retrieval. This study proposes an efficient fine-tuning methodology that enhances the information retrieval performance of pre-trained text embedding models by specializing them to a specific domain: 1. Efficient Training Data Selection Technique: applies ANCE (Approximate Nearest Neighbor Negative Contrastive Estimation) (Xiong et al., 2020) to select negative samples for the training data.
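The negative selection idea behind ANCE can be sketched as follows. This is a minimal illustration, not the paper's implementation: real ANCE scores the whole corpus with an approximate nearest neighbor index over periodically refreshed embeddings, whereas here a brute-force dot-product search over a toy corpus stands in for that step, and the function name is hypothetical.

```python
import numpy as np

def select_hard_negatives(query_emb, doc_embs, positive_idx, k=2):
    # ANCE-style hard negatives (sketch): score every corpus document
    # against the query with the current embeddings, then keep the
    # highest-scoring documents that are NOT the labeled positive.
    sims = doc_embs @ query_emb          # dot-product relevance scores
    ranked = np.argsort(-sims)           # indices by descending score
    return [int(i) for i in ranked if i != positive_idx][:k]

# Toy corpus: 5 unit-normalized document vectors and one query vector.
rng = np.random.default_rng(0)
docs = rng.normal(size=(5, 8))
docs /= np.linalg.norm(docs, axis=1, keepdims=True)
query = docs[3] + 0.1 * rng.normal(size=8)  # query near document 3
query /= np.linalg.norm(query)

negatives = select_hard_negatives(query, docs, positive_idx=3, k=2)
print(negatives)  # the two hardest negatives, excluding the positive
```

Because these negatives are the documents the current model most confuses with the positive, training on them yields a stronger contrastive signal than random negatives.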