Piccolo2: General Text Embedding with Multi-task Hybrid Loss Training
Huang, Junqin, Hu, Zhongjie, Jing, Zihao, Gao, Mengya, Wu, Yichao
Text embedding models play a pivotal role in natural language processing (NLP) and machine learning. By encoding texts into structured numerical representations, known as text embeddings, these models encapsulate the semantic and contextual information of words, phrases, or entire documents within a dense, low-dimensional vector space [27]. Such embeddings are indispensable for various downstream NLP tasks, including classification, clustering, retrieval, and sentence similarity. Contrastive learning stands out as the most effective technique for training text embedding models [6]. It learns semantic text representations by minimizing the distance between positive pairs and maximizing the distance between negative pairs. Beyond NLP, contrastive learning also proves pivotal in visual [8] [5] and multi-modal [25] representation learning. Recent advanced text embedding works [36] [33] [18] primarily rely on a two-stage pretrain-finetune pipeline to obtain general text embedding models: pre-training uses weakly supervised data sourced from large-scale web crawls, while fine-tuning refines the model on high-quality text pairs obtained through data mining or manual annotation.
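As a point of reference, the contrastive objective described above is commonly realized as an in-batch InfoNCE-style loss, where each query's paired passage is its positive and all other passages in the batch act as negatives. The sketch below (PyTorch) illustrates that general idea only; the temperature value and embedding dimension are illustrative assumptions, not settings taken from the paper.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(query_emb, pos_emb, temperature=0.05):
    """In-batch contrastive (InfoNCE-style) loss.

    query_emb, pos_emb: (batch_size, dim) tensors of text embeddings.
    The positive for query i is pos_emb[i]; all other rows of pos_emb
    in the batch serve as negatives.
    """
    # Cosine similarity via L2-normalized dot products.
    q = F.normalize(query_emb, dim=-1)
    p = F.normalize(pos_emb, dim=-1)
    logits = q @ p.T / temperature                      # (batch, batch) similarity matrix
    labels = torch.arange(q.size(0), device=q.device)   # diagonal entries are the positives
    # Cross-entropy pulls positive pairs (diagonal) together and
    # pushes negative pairs (off-diagonal) apart.
    return F.cross_entropy(logits, labels)

# Toy usage with random stand-in embeddings.
queries = torch.randn(8, 768)
positives = torch.randn(8, 768)
print(info_nce_loss(queries, positives).item())
```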
arXiv.org Artificial Intelligence
May-11-2024