Efficient fine-tuning methodology of text embedding models for information retrieval: contrastive learning penalty (CLP)

Yu, Jeongsu

arXiv.org Artificial Intelligence 

Text embedding models play a crucial role in natural language processing, particularly in information retrieval, by mapping text data into a semantically rich vector space. The importance of information retrieval has been further highlighted by the recent use of Retrieval-Augmented Generation (RAG) (Lewis et al., 2020) to address hallucination and outdated information in large language models (LLMs). Text embedding models pre-trained on massive corpora have significantly improved the quality of text representation. BGE M3-Embedding (Chen et al., 2024) is a representative model that shows outstanding performance in multilingual text embedding and information retrieval. This study proposes an efficient fine-tuning methodology that enhances the information retrieval performance of pre-trained text embedding models by specializing them to a specific domain: 1. Efficient Training Data Selection Technique: applies ANCE (Approximate Nearest Neighbor Negative Contrastive Estimation) (Xiong et al., 2020) to select negative samples for the training data.
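The negative selection idea behind ANCE can be sketched as follows. This is a minimal illustration, not the paper's implementation: real ANCE scores the whole corpus with an approximate nearest neighbor index over periodically refreshed embeddings, whereas here a brute-force dot-product search over a toy corpus stands in for that step, and the function name is hypothetical.

```python
import numpy as np

def select_hard_negatives(query_emb, doc_embs, positive_idx, k=2):
    # ANCE-style hard negatives (sketch): score every corpus document
    # against the query with the current embeddings, then keep the
    # highest-scoring documents that are NOT the labeled positive.
    sims = doc_embs @ query_emb          # dot-product relevance scores
    ranked = np.argsort(-sims)           # indices by descending score
    return [int(i) for i in ranked if i != positive_idx][:k]

# Toy corpus: 5 unit-normalized document vectors and one query vector.
rng = np.random.default_rng(0)
docs = rng.normal(size=(5, 8))
docs /= np.linalg.norm(docs, axis=1, keepdims=True)
query = docs[3] + 0.1 * rng.normal(size=8)  # query near document 3
query /= np.linalg.norm(query)

negatives = select_hard_negatives(query, docs, positive_idx=3, k=2)
print(negatives)  # the two hardest negatives, excluding the positive
```

Because these negatives are the documents the current model most confuses with the positive, training on them yields a stronger contrastive signal than random negatives.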