Piccolo2: General Text Embedding with Multi-task Hybrid Loss Training

Junqin Huang, Zhongjie Hu, Zihao Jing, Mengya Gao, Yichao Wu

arXiv.org Artificial Intelligence 

Text embedding models play a pivotal role in natural language processing and machine learning. By encoding texts into structured numerical representations, known as text embeddings, these models encapsulate the semantic and contextual information of words, phrases, or entire documents within a dense, low-dimensional vector space [27]. Such embeddings are indispensable for various downstream NLP tasks, including classification, clustering, retrieval, and sentence similarity. Contrastive learning stands out as the most effective technique for training text embedding models [6]. It learns semantic text representations by minimizing the distance between positive pairs and maximizing the distance between negative pairs. Beyond its application in natural language processing (NLP), contrastive learning also proves pivotal in visual [8, 5] and multi-modal [25] representation learning. Recent advanced text embedding works [36, 33, 18] primarily rely on a two-stage pretrain-finetune pipeline to acquire general text embedding models. Pre-training utilizes weakly supervised data sourced from large-scale crawling efforts, while fine-tuning involves refining the model with high-quality text pairs obtained through data mining or manual annotation.
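To make the contrastive objective concrete, below is a minimal sketch of an InfoNCE-style loss with in-batch negatives, a common formulation in this line of work. It is illustrative only and not the exact loss used by Piccolo2; the function name `in_batch_contrastive_loss` and the `temperature` value are assumptions for the example.

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(query_emb, pos_emb, temperature=0.05):
    """InfoNCE-style contrastive loss with in-batch negatives.

    query_emb, pos_emb: (batch, dim) embeddings of paired texts.
    Each query's positive is its paired passage; every other passage
    in the batch serves as a negative.
    """
    # Cosine similarity via L2-normalized dot products.
    q = F.normalize(query_emb, dim=-1)
    p = F.normalize(pos_emb, dim=-1)
    logits = q @ p.T / temperature                      # (batch, batch) similarity matrix
    labels = torch.arange(q.size(0), device=q.device)   # diagonal entries are the positives
    # Cross-entropy pulls each positive pair together and pushes
    # the in-batch negatives apart.
    return F.cross_entropy(logits, labels)

# Usage with dummy embeddings (e.g., 768-dimensional encoder outputs).
q = torch.randn(8, 768)
p = torch.randn(8, 768)
loss = in_batch_contrastive_loss(q, p)
```

In practice, the same in-batch formulation is typically used in both stages of the pretrain-finetune pipeline described above, with the fine-tuning stage often adding mined hard negatives to the batch.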
