Piccolo2: General Text Embedding with Multi-task Hybrid Loss Training
Huang, Junqin, Hu, Zhongjie, Jing, Zihao, Gao, Mengya, Wu, Yichao
Text embedding models play a pivotal role in natural language processing (NLP) and machine learning. By encoding texts into structured numerical representations, known as text embeddings, these models encapsulate the semantic and contextual information of words, phrases, or entire documents within a dense, low-dimensional vector space [27]. Such embeddings are indispensable for various downstream NLP tasks, including classification, clustering, retrieval, and sentence similarity. Contrastive learning stands out as the most effective technique for training text embedding models [6]. It learns semantic text representations by minimizing the distance between positive pairs and maximizing the distance between negative pairs. Beyond NLP, contrastive learning also proves pivotal in visual [8] [5] and multi-modal [25] representation learning. Recent advanced text embedding works [36] [33] [18] primarily rely on a two-stage pretrain-finetune pipeline to obtain general text embedding models: pre-training uses weakly supervised data sourced from large-scale web crawls, while fine-tuning refines the model on high-quality text pairs obtained through data mining or manual annotation.
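As a point of reference, the contrastive objective described above is commonly realized as an in-batch InfoNCE-style loss, where each query's paired passage is its positive and all other passages in the batch act as negatives. The sketch below (PyTorch) illustrates that general idea only; the temperature value and embedding dimension are illustrative assumptions, not settings taken from the paper.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(query_emb, pos_emb, temperature=0.05):
    """In-batch contrastive (InfoNCE-style) loss.

    query_emb, pos_emb: (batch_size, dim) tensors of text embeddings.
    The positive for query i is pos_emb[i]; all other rows of pos_emb
    in the batch serve as negatives.
    """
    # Cosine similarity via L2-normalized dot products.
    q = F.normalize(query_emb, dim=-1)
    p = F.normalize(pos_emb, dim=-1)
    logits = q @ p.T / temperature                      # (batch, batch) similarity matrix
    labels = torch.arange(q.size(0), device=q.device)   # diagonal entries are the positives
    # Cross-entropy pulls positive pairs (diagonal) together and
    # pushes negative pairs (off-diagonal) apart.
    return F.cross_entropy(logits, labels)

# Toy usage with random stand-in embeddings.
queries = torch.randn(8, 768)
positives = torch.randn(8, 768)
print(info_nce_loss(queries, positives).item())
```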
arXiv.org Artificial Intelligence
May-11-2024