C-Pack: Packaged Resources To Advance General Chinese Embedding

Xiao, Shitao; Liu, Zheng; Zhang, Peitian; Muennighoff, Niklas

Dec-15-2023–arXiv.org Artificial Intelligence

We introduce C-Pack, a package of resources that significantly advance the field of general Chinese embeddings. C-Pack includes three critical resources. 1) C-MTEB is a comprehensive benchmark for Chinese text embeddings covering 6 tasks and 35 datasets. 2) C-MTP is a massive text embedding dataset curated from labeled and unlabeled Chinese corpora for training embedding models. 3) C-TEM is a family of embedding models covering multiple sizes. Our models outperform all prior Chinese text embeddings on C-MTEB by up to +10% at the time of release. We also integrate and optimize the entire suite of training methods for C-TEM. Along with our resources on general Chinese embedding, we release our data and models for English text embeddings. The English models achieve state-of-the-art performance on the MTEB benchmark; meanwhile, our released English data is 2 times larger than the Chinese data. All these resources are publicly available at https://github.com/FlagOpen/FlagEmbedding.
