Adapting General-Purpose Embedding Models to Private Datasets Using Keyword-based Retrieval
Wei, Yubai; Han, Jiale; Yang, Yi
arXiv.org Artificial Intelligence
Text embedding models are a cornerstone of AI applications such as retrieval-augmented generation (RAG). While general-purpose text embedding models perform strongly on generic retrieval benchmarks, their effectiveness diminishes on private datasets (e.g., company-specific proprietary data), which often contain specialized terminology and jargon. In this work, we introduce BMEmbed, a novel method for adapting general-purpose text embedding models to private datasets. Leveraging the well-established keyword-based retrieval technique BM25, we construct supervisory signals from the ranking of keyword-based retrieval results to facilitate model adaptation. We evaluate BMEmbed across a range of domains, datasets, and models, and observe consistent improvements in retrieval performance. Moreover, we provide empirical insights into how BM25-based signals improve embeddings by fostering alignment and uniformity, highlighting the value of this approach for adapting models to domain-specific data. The source code is available to the research community at https://github.com/BaileyWei/BMEmbed.
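To make the idea of a keyword-based supervisory signal concrete, the following is a minimal sketch of how a BM25 ranking could be computed for one query over a small private corpus. It is not the BMEmbed implementation. The whitespace tokenizer, the parameter values `k1=1.5` and `b=0.75`, the smoothed IDF variant, and the function names `tokenize`, `bm25_scores`, and `ranked_training_list` are all assumptions made for illustration. How the ranking is sampled and turned into a training loss is described in the paper and is not shown here.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: a deliberately naive tokenizer (lowercase, split on whitespace).
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document in `docs` against `query` with Okapi BM25."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    query_terms = set(tokenize(query))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Smoothed IDF (assumed variant) that never goes negative.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def ranked_training_list(query, docs, top_k=3):
    """Return (doc_index, score) pairs, best first, usable as a ranking signal."""
    scores = bm25_scores(query, docs)
    order = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in order[:top_k]]
```

Calling `ranked_training_list("zephyr ledger invoices", docs)` on a list of document strings returns document indices ordered by keyword relevance. An adaptation method could treat that ordering as a weak label when fine-tuning an embedding model on data that has no human relevance judgments.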
Oct-22-2025
- Country:
- Asia
- Europe
- Croatia > Dubrovnik-Neretva County
- Dubrovnik (0.04)
- France > Île-de-France
- Ireland > Leinster
- County Dublin > Dublin (0.04)
- Italy > Calabria
- Catanzaro Province > Catanzaro (0.04)
- Slovenia > Drava
- Municipality of Benedikt > Benedikt (0.04)
- North America
- Canada > British Columbia
- Vancouver (0.04)
- Dominican Republic (0.04)
- United States
- Florida > Miami-Dade County
- Miami (0.04)
- Louisiana > Orleans Parish
- New Orleans (0.04)
- Massachusetts > Suffolk County
- Boston (0.04)
- Minnesota > Hennepin County
- Minneapolis (0.14)
- Oregon > Benton County
- Corvallis (0.04)
- Genre:
- Research Report > New Finding (0.68)
- Industry:
- Retail (0.30)
- Technology: