How Much Can CLIP Benefit Vision-and-Language Tasks?

Shen, Sheng, Li, Liunian Harold, Tan, Hao, Bansal, Mohit, Rohrbach, Anna, Chang, Kai-Wei, Yao, Zhewei, Keutzer, Kurt

Jul-13-2021, arXiv.org Artificial Intelligence

Most existing Vision-and-Language (V&L) models rely on pre-trained visual encoders that are trained on a relatively small set of manually annotated data (compared to web-crawled data) to perceive the visual world. However, large-scale pre-training has been observed to yield better generalization: for example, CLIP (Contrastive Language-Image Pre-training), trained on a massive number of image-caption pairs, shows strong zero-shot capability on various vision tasks. To further study the advantage brought by CLIP, we propose to use CLIP as the visual encoder in various V&L models in two typical scenarios: 1) plugging CLIP into task-specific fine-tuning; 2) combining CLIP with V&L pre-training and transferring to downstream tasks. We show that CLIP significantly outperforms widely used visual encoders trained with in-domain annotated data, such as BottomUp-TopDown. We achieve competitive or better results on diverse V&L tasks, while establishing new state-of-the-art results on Visual Question Answering, Visual Entailment, and V&L Navigation tasks. We release our code at https://github.com/clip-vil/CLIP-ViL.
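
As an illustration of the zero-shot capability mentioned in the abstract, the sketch below shows how contrastively trained image and text embeddings can be used to classify an image without task-specific training: the image embedding is compared against one text embedding per candidate label by cosine similarity, and a softmax turns the scaled similarities into probabilities. This is not the paper's method (which uses CLIP as a visual encoder inside V&L models). The function names, the toy 3-dimensional embeddings, and the `temperature` value are all assumptions made for illustration; a real setup would obtain the embeddings from a trained encoder.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def zero_shot_probs(image_emb, text_embs, temperature=0.01):
    """Return a probability per candidate label for one image embedding.

    temperature is an assumed illustrative value, not taken from the paper.
    """
    img = l2_normalize(image_emb)
    logits = []
    for text_emb in text_embs:
        txt = l2_normalize(text_emb)
        cosine = sum(a * b for a, b in zip(img, txt))
        logits.append(cosine / temperature)
    # Numerically stable softmax over the scaled similarities.
    max_logit = max(logits)
    exps = [math.exp(z - max_logit) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy embeddings (hypothetical): the image vector points mostly along axis 0,
# so the first label should receive the highest probability.
image_emb = [0.9, 0.1, 0.0]
text_embs = [
    [1.0, 0.0, 0.0],  # e.g. "a photo of a dog"
    [0.0, 1.0, 0.0],  # e.g. "a photo of a cat"
    [0.0, 0.0, 1.0],  # e.g. "a photo of a car"
]
probs = zero_shot_probs(image_emb, text_embs)
best = max(range(len(probs)), key=probs.__getitem__)
```

Normalizing both embeddings first keeps the comparison independent of vector magnitude, and a small temperature sharpens the distribution toward the closest label.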

Country:
- South America > Brazil
  - Paraná > Curitiba (0.04)
- North America > United States
  - North Carolina (0.04)
  - California
    - Los Angeles County > Los Angeles (0.14)
    - Alameda County > Berkeley (0.04)

Genre:
- Research Report (0.50)

Technology:
- Information Technology > Artificial Intelligence
  - Vision (1.00)
  - Natural Language > Large Language Model (0.67)
  - Machine Learning > Neural Networks
    - Deep Learning (0.94)