Understanding Transferable Representation Learning and Zero-shot Transfer in CLIP
Zixiang Chen, Yihe Deng, Yuanzhi Li, Quanquan Gu
Multi-modal learning (Ngiam et al., 2011) integrates information from a variety of data types, resulting in AI systems that are both robust and precise. Recently, CLIP (Radford et al., 2021) emerged as a milestone work that leverages vision-language contrastive pretraining to jointly learn image and text embeddings from the vast amounts of image-text data available on the web. During training, CLIP treats image-text pairs that appear together as positive pairs and all other combinations as negative pairs, and it maximizes the embedding similarity of the positive pairs while minimizing that of the negative pairs. Remarkably, this approach has achieved significant success in zero-shot transfer (Lei Ba et al., 2015), meaning the model can handle a wide variety of tasks without prior exposure to any of their training data. Inspired by CLIP's groundbreaking zero-shot capabilities, subsequent studies (Yao et al., 2022; Li et al., 2022; Mu et al., 2022; Goel et al., 2022; Zhai et al., 2022; Alayrac et al., 2022) emerged with the primary objective of further enhancing CLIP's zero-shot performance. Despite the empirical success of CLIP in zero-shot transfer, the theoretical understanding of how it works remains elusive. An intriguing question is thus: how does CLIP learn representations that transfer to various downstream tasks?
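To make the training objective described above concrete, here is a minimal sketch of a CLIP-style symmetric contrastive loss and the corresponding zero-shot prediction rule. It is not the authors' implementation; the function names, the fixed `temperature` value, and the assumption that embeddings arrive as aligned batches are illustrative choices.

```python
import torch
import torch.nn.functional as F


def clip_contrastive_loss(image_emb: torch.Tensor, text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE-style loss over a batch of paired embeddings.

    image_emb, text_emb: (batch_size, dim) tensors where row i of each forms a
    positive pair; every other (i, j) combination is treated as a negative pair.
    temperature is a placeholder scale; CLIP learns this parameter during training.
    """
    # Normalize so that dot products are cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise similarity matrix: entry (i, j) compares image i with text j.
    logits = image_emb @ text_emb.t() / temperature

    # Diagonal entries are positives; cross-entropy pushes them above the negatives.
    targets = torch.arange(image_emb.size(0), device=image_emb.device)
    loss_i2t = F.cross_entropy(logits, targets)        # image-to-text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)    # text-to-image direction
    return (loss_i2t + loss_t2i) / 2


def zero_shot_classify(image_emb: torch.Tensor, class_text_emb: torch.Tensor) -> torch.Tensor:
    """Zero-shot transfer: assign each image to the class whose text embedding
    (e.g., an embedded prompt such as "a photo of a dog") is most similar."""
    image_emb = F.normalize(image_emb, dim=-1)
    class_text_emb = F.normalize(class_text_emb, dim=-1)
    return (image_emb @ class_text_emb.t()).argmax(dim=-1)
```

At inference time no task-specific training is needed: class names are embedded through the text encoder once, and images are classified by nearest cosine similarity, which is what makes the zero-shot transfer setting studied in this paper possible.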
Oct-2-2023