CLIP meets Model Zoo Experts: Pseudo-Supervision for Visual Enhancement
Salehi, Mohammadreza, Farajtabar, Mehrdad, Horton, Maxwell, Faghri, Fartash, Pouransari, Hadi, Vemulapalli, Raviteja, Tuzel, Oncel, Farhadi, Ali, Rastegari, Mohammad, Mehta, Sachin
arXiv.org Artificial Intelligence
Contrastive language-image pretraining (CLIP) is a standard method for training vision-language models. While CLIP is scalable, promptable, and robust to distribution shifts on image classification tasks, it lacks object localization capabilities. This paper studies the following question: Can we augment CLIP training with task-specific vision models from model zoos to improve its visual representations? To this end, we leverage open-source task-specific vision models to generate pseudo-labels for an uncurated and noisy image-text dataset. Subsequently, we train CLIP models on these pseudo-labels in addition to the contrastive training on image-text pairs. This simple setup shows substantial improvements of up to 16.3% across different vision tasks, including segmentation, detection, depth estimation, and surface normal estimation. Importantly, these enhancements are achieved without compromising CLIP's existing capabilities, including its proficiency in promptable zero-shot classification.

Foundation Models (FMs) are revolutionizing different domains of artificial intelligence and machine learning, including computer vision (Radford et al., 2021; He et al., 2022; Kirillov et al., 2023b) and natural language processing (Devlin et al., 2018; Brown et al., 2020; Touvron et al., 2023). FMs can be trained on web-crawled data without relying on crowd or expert annotations, and yet they demonstrate strong generalization capabilities (Jia et al., 2021; Schuhmann et al., 2022).
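The training setup described in the abstract (contrastive learning on image-text pairs plus supervision from expert-generated pseudo-labels) can be illustrated with a minimal sketch of a combined objective. This is not the authors' implementation: the function names, the temperature of 0.07, the use of a mean absolute error for the pseudo-label term, and the per-task weights are all assumptions made for illustration, and real tasks such as segmentation or detection would use task-appropriate losses.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax of a list of floats."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    Assumes embeddings are already L2-normalised and that row k of
    image_emb matches row k of text_emb. The temperature is an assumed value.
    """
    n = len(image_emb)
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in text_emb]
        for img in image_emb
    ]
    # Image-to-text direction: each row should pick out its own column.
    i2t = -sum(log_softmax(logits[k])[k] for k in range(n)) / n
    # Text-to-image direction: same computation on the transposed logits.
    columns = [[logits[r][c] for r in range(n)] for c in range(n)]
    t2i = -sum(log_softmax(columns[k])[k] for k in range(n)) / n
    return 0.5 * (i2t + t2i)


def pseudo_label_loss(pred, pseudo):
    """Mean absolute error against pseudo-labels (an assumed stand-in loss)."""
    return sum(abs(p - q) for p, q in zip(pred, pseudo)) / len(pred)


def total_loss(image_emb, text_emb, task_preds, task_pseudo, task_weights):
    """Contrastive loss plus a weighted pseudo-label loss per task.

    task_preds, task_pseudo, and task_weights are hypothetical dicts keyed
    by task name (e.g. "depth").
    """
    loss = contrastive_loss(image_emb, text_emb)
    for name, pred in task_preds.items():
        loss += task_weights[name] * pseudo_label_loss(pred, task_pseudo[name])
    return loss
```

In this sketch the pseudo-labels would be produced once, offline, by frozen expert models, so the extra terms only add task heads and losses on top of the usual contrastive objective.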
Oct-21-2023
- Genre:
- Research Report > New Finding (0.93)
- Technology:
- Information Technology > Artificial Intelligence
- Machine Learning > Neural Networks (0.68)
- Natural Language (1.00)
- Vision > Image Understanding (0.56)