Gao, Mengya
Piccolo2: General Text Embedding with Multi-task Hybrid Loss Training
Huang, Junqin, Hu, Zhongjie, Jing, Zihao, Gao, Mengya, Wu, Yichao
Text embedding models play a pivotal role in natural language processing and machine learning. By encoding texts into structured numerical representations, known as text embeddings, these models encapsulate semantic and contextual information of words, phrases, or entire documents within a dense, low-dimensional vector space [27]. Such embeddings are indispensable for various downstream NLP tasks, including classification, clustering, retrieval, and sentence similarity. Contrastive learning stands out as the most effective technique for training text embedding models [6]. It learns text semantic representations by minimizing the distance between positive pairs and maximizing the distance between negative pairs. Beyond its application in natural language processing (NLP), contrastive learning also proves pivotal in visual [8] [5] and multi-modal [25] representation learning. Recent advanced text embedding works [36] [33] [18] primarily rely on a two-stage pretrain-finetune pipeline to acquire general text embedding models. Pre-training utilizes weakly supervised data sourced from large-scale crawling efforts, while fine-tuning refines the model with high-quality text pairs obtained through data mining or manual annotation.
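The contrastive objective described in the abstract, pulling positive pairs together and pushing negative pairs apart, is commonly realized as an InfoNCE-style loss. The sketch below is a minimal illustration of that general idea, not the paper's exact formulation; the function name, temperature value, and toy data are assumptions for the example.

```python
import numpy as np

def info_nce_loss(anchor, positive, negatives, temperature=0.05):
    """InfoNCE-style contrastive loss for a single anchor embedding.

    The loss is the negative log-softmax of the anchor-positive
    similarity against the anchor-negative similarities, so it pulls
    the positive pair together and pushes the negatives apart.
    """
    def cos(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    # Positive similarity sits at index 0, negatives follow.
    sims = np.array([cos(anchor, positive)] +
                    [cos(anchor, n) for n in negatives]) / temperature
    sims -= sims.max()  # numerically stable log-softmax
    return -(sims[0] - np.log(np.exp(sims).sum()))

rng = np.random.default_rng(0)
anchor = rng.normal(size=8)
positive = anchor + 0.1 * rng.normal(size=8)    # semantically close text
negatives = [rng.normal(size=8) for _ in range(4)]  # unrelated texts
loss = info_nce_loss(anchor, positive, negatives)
```

In practice the temperature is a tuned hyperparameter, and batched implementations treat all other in-batch examples as negatives rather than sampling them explicitly.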
INTERN: A New Learning Paradigm Towards General Vision
Shao, Jing, Chen, Siyu, Li, Yangguang, Wang, Kun, Yin, Zhenfei, He, Yinan, Teng, Jianing, Sun, Qinghong, Gao, Mengya, Liu, Jihao, Huang, Gengshi, Song, Guanglu, Wu, Yichao, Huang, Yuming, Liu, Fenggang, Peng, Huan, Qin, Shuo, Wang, Chengyu, Wang, Yujie, He, Conghui, Liang, Ding, Liu, Yu, Yu, Fengwei, Yan, Junjie, Lin, Dahua, Wang, Xiaogang, Qiao, Yu
Enormous waves of technological innovation over the past several years, marked by advances in AI technologies, are profoundly reshaping industry and society. However, a key challenge lies ahead: our capability to meet rapidly growing scenario-specific demands is severely limited by the cost of acquiring a commensurate amount of training data. This predicament stems from limitations of the mainstream learning paradigm: we need to train a new model for each new scenario, based on a large quantity of well-annotated data and typically from scratch. In tackling this fundamental problem, we move beyond the existing paradigm and develop a new one named INTERN. By learning with supervisory signals from multiple sources in multiple stages, the model being trained develops strong generalizability. We evaluate our model on 26 well-known datasets that cover four categories of tasks in computer vision. In most cases, our models, adapted with only 10% of the training data in the target domain, outperform counterparts trained with the full set of data, often by a significant margin. This is an important step towards a promising prospect where such a model with general vision capability can dramatically reduce our reliance on data, thus expediting the adoption of AI technologies. Furthermore, revolving around our new paradigm, we also introduce a new data system, a new architecture, and a new benchmark, which together form a general vision ecosystem to support its future development in an open and inclusive manner.