Understanding Transferable Representation Learning and Zero-shot Transfer in CLIP
Zixiang Chen, Yihe Deng, Yuanzhi Li, Quanquan Gu
Multi-modal learning (Ngiam et al., 2011) integrates information from a variety of data types, resulting in AI systems that are both robust and precise. Recently, CLIP (Radford et al., 2021) emerged as a milestone work that leverages vision-language contrastive pretraining to jointly learn image and text embeddings from the vast amounts of image-text data available on the web. During training, CLIP treats image-text pairs that appear together as positive pairs and all other combinations as negative pairs, and it maximizes the embedding similarity of the positive pairs while minimizing that of the negative pairs. Remarkably, this approach has achieved significant success in zero-shot transfer (Lei Ba et al., 2015), meaning the model can handle a wide variety of tasks without prior exposure to any of their training data. Inspired by CLIP's groundbreaking zero-shot capabilities, subsequent studies (Yao et al., 2022; Li et al., 2022; Mu et al., 2022; Goel et al., 2022; Zhai et al., 2022; Alayrac et al., 2022) emerged with the primary objective of further enhancing CLIP's zero-shot performance. Despite the empirical success of CLIP in zero-shot transfer, the theoretical understanding of how it works remains elusive. An intriguing question is thus: how does CLIP learn representations that transfer to various downstream tasks?
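To make the training objective described above concrete, here is a minimal sketch of a CLIP-style symmetric contrastive loss and the corresponding zero-shot prediction rule. It is not the authors' implementation; the function names, the fixed `temperature` value, and the assumption that embeddings arrive as aligned batches are illustrative choices.

```python
import torch
import torch.nn.functional as F


def clip_contrastive_loss(image_emb: torch.Tensor, text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE-style loss over a batch of paired embeddings.

    image_emb, text_emb: (batch_size, dim) tensors where row i of each forms a
    positive pair; every other (i, j) combination is treated as a negative pair.
    temperature is a placeholder scale; CLIP learns this parameter during training.
    """
    # Normalize so that dot products are cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise similarity matrix: entry (i, j) compares image i with text j.
    logits = image_emb @ text_emb.t() / temperature

    # Diagonal entries are positives; cross-entropy pushes them above the negatives.
    targets = torch.arange(image_emb.size(0), device=image_emb.device)
    loss_i2t = F.cross_entropy(logits, targets)        # image-to-text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)    # text-to-image direction
    return (loss_i2t + loss_t2i) / 2


def zero_shot_classify(image_emb: torch.Tensor, class_text_emb: torch.Tensor) -> torch.Tensor:
    """Zero-shot transfer: assign each image to the class whose text embedding
    (e.g., an embedded prompt such as "a photo of a dog") is most similar."""
    image_emb = F.normalize(image_emb, dim=-1)
    class_text_emb = F.normalize(class_text_emb, dim=-1)
    return (image_emb @ class_text_emb.t()).argmax(dim=-1)
```

At inference time no task-specific training is needed: class names are embedded through the text encoder once, and images are classified by nearest cosine similarity, which is what makes the zero-shot transfer setting studied in this paper possible.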
Oct-2-2023