un²CLIP: Improving CLIP's Visual Detail Capturing Ability via Inverting unCLIP

Jun-11-2026, 07:42:13 GMT–Neural Information Processing Systems

Contrastive Language-Image Pre-training (CLIP) has become a foundation model and has been applied to a wide range of vision and multimodal tasks. However, recent work indicates that CLIP struggles to distinguish fine-grained differences between images and performs suboptimally on dense-prediction and vision-centric multimodal tasks. This work therefore focuses on improving existing CLIP models so that they capture as many visual details in images as possible. We find that a specific type of generative model, unCLIP, provides a suitable framework for achieving this goal. Specifically, unCLIP trains an image generator conditioned on the CLIP image embedding.

artificial intelligence, machine learning, natural language
