Text Embeddings by Weakly-Supervised Contrastive Pre-training

Wang, Liang, Yang, Nan, Huang, Xiaolong, Jiao, Binxing, Yang, Linjun, Jiang, Daxin, Majumder, Rangan, Wei, Furu

Dec-7-2022–arXiv.org Artificial Intelligence

This paper presents E5, a family of state-of-the-art text embeddings that transfer well to a wide range of tasks. The model is trained in a contrastive manner with weak supervision signals from our curated large-scale text pair dataset (called CCPairs). E5 can be readily used as a general-purpose embedding model for any tasks requiring a single-vector representation of texts such as retrieval, clustering, and classification, achieving strong performance in both zero-shot and fine-tuned settings. We conduct extensive evaluations on 56 datasets from the BEIR and MTEB benchmarks. For zero-shot settings, E5 is the first model that outperforms the strong BM25 baseline on the BEIR retrieval benchmark without using any labeled data. When fine-tuned, E5 obtains the best results on the MTEB benchmark, beating existing embedding models with 40x more parameters.

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

Dec-7-2022

arXiv.org PDF

Add feedback

Country:
- North America
  - Dominican Republic (0.04)
  - United States
    - Washington > King County
      - Seattle (0.04)
    - New York
      - New York County > New York City (0.04)
      - Broome County (0.04)
    - Minnesota > Hennepin County
      - Minneapolis (0.14)
    - Louisiana > Orleans Parish
      - New Orleans (0.04)
    - California > Los Angeles County
      - Long Beach (0.04)
  - Canada
    - Quebec > Montreal (0.04)
    - British Columbia > Metro Vancouver Regional District
      - Vancouver (0.04)
- Europe
  - Austria (0.04)
  - Romania > Sud - Muntenia Development Region
    - Giurgiu County > Giurgiu (0.04)
  - Portugal > Lisbon
    - Lisbon (0.04)
  - France > Provence-Alpes-Côte d'Azur
    - Bouches-du-Rhône > Marseille (0.04)
  - Denmark > Capital Region
    - Copenhagen (0.04)
- Asia
  - Middle East > Jordan (0.04)
  - South Korea (0.04)
  - China > Hong Kong (0.04)
  - Japan > Kyūshū & Okinawa
    - Kyūshū > Miyazaki Prefecture > Miyazaki (0.04)
- Africa > Ethiopia
  - Addis Ababa > Addis Ababa (0.04)

Genre:
- Research Report (0.82)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language
    - Text Processing (1.00)
    - Large Language Model (0.87)
  - Machine Learning
    - Statistical Learning (0.87)
    - Neural Networks > Deep Learning (0.67)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found