Towards Realistic Unsupervised Fine-tuning with CLIP
Liang, Jian, Sheng, Lijun, Wang, Zhengbo, He, Ran, Tan, Tieniu
arXiv.org Artificial Intelligence
Vision-language models (VLMs) [Radford et al., 2021, Li et al., 2022a, Jia et al., 2021, Li et al., 2022c] pre-trained on web-scale image-text pairs have exhibited robust zero-shot prediction capabilities, which have recently attracted increasing attention from the research community. For example, Contrastive Language-Image Pretraining (CLIP) [Radford et al., 2021] uses a contrastive objective to learn a modality-agnostic embedding space in which paired images and texts are pulled closer and unpaired images and texts are pushed apart. CLIP can then perform zero-shot visual prediction by matching the embeddings of test images against prompt-based textual descriptions (e.g., "a photo of a [CLASS]" and "this is a picture of a [CLASS]"), requiring only the names of the semantic classes in the downstream task.

Apart from the extensive research dedicated to the pre-training stage, numerous studies [Zhou et al., 2022b, Zhang et al., 2022b, Bahng et al., 2022] have concentrated on adapting VLMs to specific downstream tasks using task-specific labeled data. This fine-tuning paradigm enables VLMs to bridge both data and task gaps, leading to improved recognition performance. Beyond multi-class classification, these strategies have also been applied to a range of computer vision tasks, including ordinal regression [Li et al., 2022b], point cloud understanding [Zhang et al., 2022a], and dense prediction [Rao et al., 2022]. Most fine-tuning efforts have focused on fully supervised and few-shot supervised learning scenarios. To pursue annotation efficiency and scalability, several recent studies [Huang et al., 2022, Shu et al., 2022, Tanwisuth et al., 2023] have explored unsupervised fine-tuning for VLMs, remarkably achieving performance on par with few-shot supervised
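The zero-shot matching step described above can be illustrated with a minimal sketch. It is not the authors' code and does not call any real CLIP library: the embeddings are hand-written toy vectors standing in for the outputs of image and text encoders, and the names `build_prompts`, `zero_shot_predict`, and the `temperature` value of 0.01 are assumptions made for illustration. Only the prompt template "a photo of a [CLASS]" comes from the text.

```python
import math

def build_prompts(class_names, template="a photo of a {}"):
    """Turn class names into prompt strings (template taken from the text)."""
    return [template.format(name) for name in class_names]

def l2_normalize(v):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def zero_shot_predict(image_emb, class_text_embs, temperature=0.01):
    """Pick the class whose text embedding is most similar to the image embedding.

    image_emb: list of floats (hypothetical image-encoder output).
    class_text_embs: dict mapping class name -> list of floats
        (hypothetical text-encoder output for that class's prompt).
    temperature: assumed scaling factor applied to the cosine similarities.
    """
    names = list(class_text_embs)
    img = l2_normalize(image_emb)
    logits = []
    for name in names:
        txt = l2_normalize(class_text_embs[name])
        cosine = sum(a * b for a, b in zip(img, txt))
        logits.append(cosine / temperature)
    probs = softmax(logits)
    best = max(range(len(names)), key=lambda i: probs[i])
    return names[best], dict(zip(names, probs))

# Toy stand-ins for encoder outputs; a real model would produce these vectors.
class_names = ["cat", "dog", "car"]
prompts = build_prompts(class_names)
text_embs = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.1, 0.9, 0.0],
    "car": [0.0, 0.1, 0.9],
}
image_emb = [0.8, 0.3, 0.1]

label, probs = zero_shot_predict(image_emb, text_embs)
print(prompts)
print(label)
```

Note that no labeled images are involved: the only task-specific inputs are the class names, which is what makes the prediction zero-shot.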
Aug-24-2023