Multilingual E5 Text Embeddings: A Technical Report
Liang Wang, Nan Yang, Xiaolong Huang, Linjun Yang, Rangan Majumder, Furu Wei
arXiv.org Artificial Intelligence
This technical report presents the training methodology and evaluation results of the open-source multilingual E5 text embedding models, released in mid-2023. Three embedding models of different sizes (small / base / large) are provided, offering a balance between inference efficiency and embedding quality. The training procedure adheres to the English E5 model recipe, involving contrastive pre-training on 1 billion multilingual text pairs followed by fine-tuning on a combination of labeled datasets. Additionally, we introduce a new instruction-tuned embedding model whose performance is on par with state-of-the-art English-only models of similar sizes. Information regarding the model release can be found at https://github.com/microsoft/unilm/tree/master/e5.
Feb-8-2024
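The contrastive pre-training stage described in the abstract follows the English E5 recipe: paired texts are embedded, and an InfoNCE objective with in-batch negatives pulls each pair together while pushing it away from the other texts in the batch. The sketch below is a minimal illustration of that objective, not the authors' training code; the temperature value and the toy inputs are assumptions.

```python
import torch
import torch.nn.functional as F

def infonce_loss(query_emb: torch.Tensor, passage_emb: torch.Tensor,
                 temperature: float = 0.05) -> torch.Tensor:
    """InfoNCE with in-batch negatives: row i of query_emb should match
    row i of passage_emb; every other row in the batch is a negative.
    The 0.05 temperature is an assumed value, not taken from the report."""
    q = F.normalize(query_emb, dim=-1)
    p = F.normalize(passage_emb, dim=-1)
    logits = q @ p.T / temperature                      # (batch, batch) cosine similarities
    labels = torch.arange(q.size(0), device=q.device)   # diagonal entries are the positives
    return F.cross_entropy(logits, labels)

# Toy usage: random vectors stand in for encoder outputs.
q = torch.randn(8, 768)
p = torch.randn(8, 768)
loss = infonce_loss(q, p)
```

With large batches this objective gives each pair many free negatives, which is one reason contrastive pre-training scales well to the billion-pair regime mentioned above.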
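For reference, the released checkpoints can be used as off-the-shelf encoders. The sketch below assumes the Hugging Face transformers library and the intfloat/multilingual-e5-base checkpoint; the "query: " / "passage: " prefixes and average pooling follow the usage documented on the model card, and the example texts are illustrative only.

```python
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

def average_pool(last_hidden, attention_mask):
    # Zero out padding positions, then mean-pool the remaining token vectors.
    hidden = last_hidden.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]

# E5 models expect a "query: " or "passage: " prefix on every input text.
texts = [
    "query: how much protein should a female eat",
    "passage: As a general guideline, adult women need about 46 g of protein per day.",
]

tok = AutoTokenizer.from_pretrained("intfloat/multilingual-e5-base")
model = AutoModel.from_pretrained("intfloat/multilingual-e5-base")

batch = tok(texts, max_length=512, padding=True, truncation=True, return_tensors="pt")
emb = average_pool(model(**batch).last_hidden_state, batch["attention_mask"])
emb = F.normalize(emb, p=2, dim=1)    # unit-norm, so the dot product is cosine similarity
score = (emb[0] @ emb[1]).item()      # query-passage relevance score
```

The instruction-tuned variant mentioned in the abstract instead prepends a natural-language task description to the query rather than a fixed prefix; see the model cards linked from the repository above for the exact format.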