Enhancing CLIP with CLIP: Exploring Pseudolabeling for Limited-Label Prompt Tuning

Neural Information Processing Systems 

Fine-tuning vision-language models (VLMs) like CLIP for downstream tasks is often necessary to optimize their performance. However, a major obstacle is the limited availability of labeled data. We study the use of pseudolabels, i.e., heuristic labels for unlabeled data, to enhance CLIP via prompt tuning. Conventional pseudolabeling trains a model on labeled data and then uses it to generate labels for unlabeled data. VLMs' zero-shot capabilities enable a "second generation" of pseudolabeling approaches that do not require task-specific training on labeled data.
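To make the idea of zero-shot pseudolabeling concrete, the following is a minimal sketch rather than the method studied in this paper. It assumes that image embeddings and per-class text embeddings have already been computed by a CLIP-like model, and it assigns each unlabeled image the class whose text embedding has the highest cosine similarity. The function names and the toy 2-D vectors are hypothetical stand-ins for real encoder outputs.

```python
import math


def normalize(v):
    """Scale a vector to unit length (assumes a non-zero vector)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def zero_shot_pseudolabels(image_embs, class_embs):
    """Assign each image the index of the most similar class embedding.

    Both arguments are lists of equal-length float vectors. In practice
    they would come from a VLM's image and text encoders; here they are
    plain lists so the sketch stays self-contained.
    """
    class_unit = [normalize(c) for c in class_embs]
    labels = []
    for emb in image_embs:
        u = normalize(emb)
        # Cosine similarity reduces to a dot product of unit vectors.
        sims = [sum(a * b for a, b in zip(u, c)) for c in class_unit]
        labels.append(max(range(len(sims)), key=sims.__getitem__))
    return labels


# Toy embeddings (hypothetical): two classes, three unlabeled images.
class_embs = [[1.0, 0.0], [0.0, 1.0]]
image_embs = [[0.9, 0.1], [0.2, 0.8], [3.0, 1.0]]
pseudolabels = zero_shot_pseudolabels(image_embs, class_embs)
```

No labeled examples are used at any point: the class embeddings alone define the labeling rule, which is what separates this family of approaches from conventional pseudolabeling.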