Object-level Vision-Language Contrastive Pre-training
Since the emergence of CLIP in 2021, contrastive pre-training has been extended to supervised learning on image-text pairs. The goal of this extension is to improve the transferability of the learned visual features by aligning them with the corresponding text features, since text features had proved highly transferable in models such as GPT-3. With this extension, visual models can be adapted to downstream tasks in a zero-/few-shot manner with good performance and data efficiency. The original CLIP pre-training is image-level, and the learned features are mainly used for image-level downstream tasks, e.g.
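The alignment described above is typically trained with a symmetric contrastive (InfoNCE) objective: within a batch of matched image-text pairs, each image embedding should be most similar to its own caption's embedding, and vice versa. The sketch below illustrates that loss with NumPy; the function name, temperature value, and embedding shapes are illustrative assumptions, not CLIP's actual implementation.

```python
import numpy as np

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings (a sketch).

    image_emb, text_emb: (N, D) arrays; row i of each forms a matched pair.
    temperature: assumed scaling constant (CLIP learns this parameter).
    """
    # L2-normalize so the dot product is cosine similarity.
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature       # (N, N) similarity matrix
    labels = np.arange(len(logits))          # matched pairs lie on the diagonal

    def cross_entropy(l):
        # Row-wise softmax cross-entropy against the diagonal targets,
        # with the usual max-subtraction for numerical stability.
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

When the two embedding sets are well aligned (diagonal similarities dominate), the loss approaches zero; for unrelated embeddings it sits near log N, which is what drives the encoders toward the shared image-text space.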
Sep-23-2022, 18:16:42 GMT