Towards Realistic Unsupervised Fine-tuning with CLIP

Liang, Jian, Sheng, Lijun, Wang, Zhengbo, He, Ran, Tan, Tieniu

Aug-24-2023 – arXiv.org Artificial Intelligence

Vision-language models (VLMs) [Radford et al., 2021, Li et al., 2022a, Jia et al., 2021, Li et al., 2022c] pre-trained on web-scale image-text pairs have exhibited robust zero-shot prediction capabilities, which have recently attracted increasing attention from the research community. As an example, Contrastive Language-Image Pretraining (CLIP) [Radford et al., 2021] uses a contrastive objective to learn a modality-agnostic embedding space in which paired images and texts are pulled closer together and unpaired images and texts are pushed apart. CLIP can then perform zero-shot visual prediction by matching the embeddings of test images against prompt-based textual descriptions (e.g., "a photo of a [CLASS]" and "this is a picture of a [CLASS]"), requiring only the names of the semantic classes in the downstream task. Apart from the extensive research dedicated to the pre-training stage, numerous studies [Zhou et al., 2022b, Zhang et al., 2022b, Bahng et al., 2022] have concentrated on adapting VLMs to specific downstream tasks using task-specific labeled data. This fine-tuning paradigm enables VLMs to bridge both data and task gaps, leading to improved performance in recognition tasks. In addition to multi-class classification, these strategies have also been applied to a range of computer vision tasks, including ordinal regression [Li et al., 2022b], point cloud understanding [Zhang et al., 2022a], and dense prediction [Rao et al., 2022]. Among fine-tuning setups, most efforts have focused on fully supervised and few-shot supervised learning scenarios. To pursue annotation efficiency and scalability, several recent studies [Huang et al., 2022, Shu et al., 2022, Tanwisuth et al., 2023] have explored unsupervised fine-tuning for VLMs, remarkably achieving performance on par with few-shot supervised
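The zero-shot matching step described for CLIP can be illustrated with a minimal NumPy sketch. It is not the authors' implementation and does not call a real CLIP model: the embeddings are synthetic stand-ins, and the function names (`l2_normalize`, `zero_shot_predict`), the prompt averaging, and the temperature value of 0.01 are assumptions made for illustration only.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products equal cosine similarity."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def zero_shot_predict(image_emb, text_emb, temperature=0.01):
    """Match image embeddings to class prompt embeddings.

    image_emb: (N, D) array, one row per test image.
    text_emb:  (C, P, D) array, P prompt embeddings for each of C classes.
    Returns predicted class indices (N,) and class probabilities (N, C).
    """
    # Average the normalized prompt embeddings per class, then renormalize.
    class_emb = l2_normalize(l2_normalize(text_emb).mean(axis=1))
    # Cosine similarity between every image and every class, scaled by
    # an assumed temperature.
    logits = l2_normalize(image_emb) @ class_emb.T / temperature
    # Numerically stable softmax over classes.
    logits -= logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    return probs.argmax(axis=1), probs

# Prompts a real text encoder would embed; shown only to fix the shapes.
class_names = ["cat", "dog", "car"]
templates = ["a photo of a {}", "this is a picture of a {}"]
prompts = [[t.format(c) for t in templates] for c in class_names]

# Synthetic stand-ins for encoder outputs (hypothetical, not CLIP features):
# each class has an orthogonal prototype, perturbed by small noise.
rng = np.random.default_rng(0)
dim = 8
prototypes = np.eye(len(class_names), dim)
text_emb = prototypes[:, None, :] + 0.05 * rng.normal(
    size=(len(class_names), len(templates), dim)
)
true_labels = np.array([0, 2, 1, 1])
image_emb = prototypes[true_labels] + 0.05 * rng.normal(
    size=(len(true_labels), dim)
)

pred, probs = zero_shot_predict(image_emb, text_emb)
print([class_names[i] for i in pred])
```

Only the class names are needed to build the text side, which is why no labeled images enter the prediction step. With a real model, `text_emb` and `image_emb` would come from the text and image encoders rather than from random prototypes.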


