S-CLIP: Semi-supervised Vision-Language Learning using Few Specialist Captions

Jan-19-2025, 21:11:11 GMT–Neural Information Processing Systems

Vision-language models, such as contrastive language-image pre-training (CLIP), have demonstrated impressive results in natural image domains. However, these models often struggle when applied to specialized domains like remote sensing, and adapting to such domains is challenging due to the limited number of image-text pairs available for training. To address this, we propose S-CLIP, a semi-supervised learning method for training CLIP that utilizes additional unpaired images. S-CLIP employs two pseudo-labeling strategies specifically designed for contrastive learning and the language modality. The caption-level pseudo-label is given by a combination of captions of paired images, obtained by solving an optimal transport problem between unpaired and paired images.
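The caption-level pseudo-label described above can be illustrated with a small sketch. The abstract does not specify the cost function, the marginals, or the solver, so everything below is an assumption for illustration: cosine distance between image embeddings as the cost, uniform marginals, and entropy-regularized Sinkhorn iterations. The function names `sinkhorn_plan` and `caption_pseudo_labels` are hypothetical and are not taken from the paper or its code.

```python
import numpy as np


def sinkhorn_plan(cost, epsilon=0.1, n_iters=200):
    """Entropy-regularized transport plan with uniform marginals.

    Assumption: Sinkhorn iterations stand in for whatever optimal
    transport solver the authors actually use.
    """
    n, m = cost.shape
    a = np.full(n, 1.0 / n)  # mass on each unpaired image
    b = np.full(m, 1.0 / m)  # mass on each paired image
    K = np.exp(-cost / epsilon)
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]


def caption_pseudo_labels(unpaired_emb, paired_emb, epsilon=0.1, n_iters=200):
    """Soft weights over the captions of paired images, one row per unpaired image.

    Assumption: the cost is cosine distance between image embeddings.
    """
    u = unpaired_emb / np.linalg.norm(unpaired_emb, axis=1, keepdims=True)
    p = paired_emb / np.linalg.norm(paired_emb, axis=1, keepdims=True)
    cost = 1.0 - u @ p.T
    plan = sinkhorn_plan(cost, epsilon, n_iters)
    # Normalize each row so it is a distribution over paired captions.
    return plan / plan.sum(axis=1, keepdims=True)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    unpaired = rng.normal(size=(6, 8))  # stand-in embeddings of unpaired images
    paired = rng.normal(size=(4, 8))    # stand-in embeddings of paired images
    weights = caption_pseudo_labels(unpaired, paired)
    print(weights.shape)
```

Each row of the result weights the captions of the paired images for one unpaired image, which is one way to read "a combination of captions of paired images". In a contrastive setup, such a row could serve as a soft target in place of the usual one-hot match.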

s-clip, semi-supervised vision-language learning, specialist caption

Industry:
- Education > Curriculum > Subject-Specific Education (0.40)

Technology:
- Information Technology > Artificial Intelligence > Machine Learning > Inductive Learning (0.62)