KaLM-Embedding: Superior Training Data Brings A Stronger Embedding Model
Hu, Xinshuo, Shan, Zifei, Zhao, Xinping, Sun, Zetian, Liu, Zhenyu, Li, Dongfang, Ye, Shaolin, Wei, Xinyuan, Chen, Qian, Hu, Baotian, Wang, Haofen, Yu, Jun, Zhang, Min
arXiv.org Artificial Intelligence
As retrieval-augmented generation becomes prevalent in large language model applications, embedding models are becoming increasingly crucial. Despite the growing number of general embedding models, prior work often overlooks the critical role of training data quality. In this work, we introduce KaLM-Embedding, a general multilingual embedding model that leverages a large quantity of cleaner, more diverse, and domain-specific training data. Our model is trained with three key techniques proven to enhance performance: (1) persona-based synthetic data to create diversified examples distilled from LLMs, (2) ranking consistency filtering to remove less informative samples, and (3) semi-homogeneous task batch sampling to improve training efficacy. Departing from traditional BERT-like architectures, we adopt Qwen2-0.5B as the pre-trained model, facilitating the adaptation of auto-regressive language models for general embedding tasks. Extensive evaluations on the MTEB benchmark across multiple languages show that our model outperforms others of comparable size, setting a new standard for multilingual embedding models with fewer than 1B parameters.
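The abstract names ranking consistency filtering without defining it. The sketch below shows one plausible reading, offered as an assumption rather than as the authors' procedure: a (query, positive) pair is kept only if, under some reference embedding model, the positive ranks within the top k of a candidate pool. The function names, the `top_k` parameter, and the toy vectors are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def positive_rank(query_vec, positive_vec, candidate_vecs):
    """1-based rank of the positive: one plus the number of candidates
    that score strictly higher against the query."""
    pos_score = cosine(query_vec, positive_vec)
    better = sum(1 for c in candidate_vecs if cosine(query_vec, c) > pos_score)
    return better + 1


def filter_by_ranking_consistency(samples, candidate_vecs, top_k):
    """Keep samples whose positive ranks within top_k (assumed criterion)."""
    return [
        s for s in samples
        if positive_rank(s["query"], s["positive"], candidate_vecs) <= top_k
    ]


# Toy data: embeddings are made up for illustration only.
candidates = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
samples = [
    {"id": "a", "query": [1.0, 0.0], "positive": [0.9, 0.1]},  # close match
    {"id": "b", "query": [1.0, 0.0], "positive": [0.0, 1.0]},  # poor match
]
kept = filter_by_ranking_consistency(samples, candidates, top_k=2)
print([s["id"] for s in kept])
```

In this toy run, sample "b" is dropped because two candidates score higher against the query than its labeled positive does, which suggests a noisy or uninformative pair under the assumed criterion.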
Jan-14-2025
- Country:
  - South America > Colombia
    - Meta Department > Villavicencio (0.04)
  - Oceania > Australia
  - North America
    - Dominican Republic (0.04)
    - United States
      - District of Columbia > Washington (0.04)
      - Washington > King County
        - Seattle (0.04)
      - Texas > Travis County
        - Austin (0.04)
      - Oregon > Multnomah County
        - Portland (0.04)
      - New York > New York County
        - New York City (0.04)
      - New Mexico > Santa Fe County
        - Santa Fe (0.04)
      - Louisiana > Orleans Parish
        - New Orleans (0.04)
      - Florida > Miami-Dade County
        - Miami (0.14)
    - Mexico > Mexico City
      - Mexico City (0.04)
    - Canada
      - Ontario > Toronto (0.04)
      - British Columbia > Metro Vancouver Regional District
        - Vancouver (0.04)
  - Europe
    - Austria > Vienna (0.14)
    - Germany > Berlin (0.04)
    - Spain
      - Galicia > Madrid (0.04)
      - Catalonia > Barcelona Province
        - Barcelona (0.04)
    - Portugal > Lisbon
      - Lisbon (0.04)
    - Italy
      - Tuscany > Florence (0.04)
      - Calabria > Catanzaro Province
        - Catanzaro (0.04)
    - France > Auvergne-Rhône-Alpes
    - Croatia > Dubrovnik-Neretva County
      - Dubrovnik (0.04)
    - Ireland > Leinster
      - County Dublin > Dublin (0.04)
    - Belgium > Brussels-Capital Region
      - Brussels (0.04)
  - Asia
    - Singapore (0.04)
    - Indonesia > Bali (0.04)
    - Thailand > Bangkok
      - Bangkok (0.04)
    - Taiwan > Taiwan Province
      - Taipei (0.04)
    - Myanmar > Tanintharyi Region
      - Dawei (0.04)
    - Middle East
      - Jordan (0.04)
      - Israel (0.04)
      - UAE > Abu Dhabi Emirate
        - Abu Dhabi (0.04)
    - India > Karnataka
      - Bengaluru (0.04)
    - China
      - Hong Kong (0.04)
      - Guangdong Province > Shenzhen (0.04)
      - Shanghai > Shanghai (0.04)
      - Heilongjiang Province > Harbin (0.04)
  - Africa > Rwanda
- Genre:
  - Research Report > New Finding (0.46)
- Industry:
  - Health & Medicine (0.93)
- Technology: