WATT: Weight Average Test Time Adaptation of CLIP

May-27-2025, 01:38:30 GMT–Neural Information Processing Systems

Vision-Language Models (VLMs) such as CLIP have yielded unprecedented performance for zero-shot image classification, yet their generalization capability may still be seriously challenged when confronted with domain shifts. Our method, WATT (Weight Average Test-Time Adaptation), employs a diverse set of templates for text prompts, augmenting the existing framework of CLIP. Predictions are used as pseudo-labels for model updates, followed by weight averaging to consolidate the learned information globally. Furthermore, we introduce a text ensemble strategy that enhances overall test performance by aggregating diverse textual cues. Our findings underscore the effectiveness of WATT across diverse datasets, including CIFAR-10-C and CIFAR-10.1. Notably, these enhancements are achieved without the need for additional model transformations or trainable modules.
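The two aggregation steps named in the abstract, weight averaging and text ensembling, can be illustrated with a small sketch. This is not the authors' implementation: it assumes each template has already produced its own adapted copy of the model parameters, and it omits the pseudo-label update step entirely. The function names (`average_weights`, `ensemble_text_embeddings`, `predict`) and the toy numbers are hypothetical, and plain Python lists stand in for real CLIP tensors.

```python
import math

def average_weights(weight_sets):
    """Element-wise mean of several parameter dicts (name -> list of floats).

    Assumption: one dict per text template, all with identical keys and shapes.
    """
    n = len(weight_sets)
    return {
        name: [sum(ws[name][i] for ws in weight_sets) / n
               for i in range(len(weight_sets[0][name]))]
        for name in weight_sets[0]
    }

def l2_normalize(vec):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def ensemble_text_embeddings(per_template):
    """Average class embeddings over templates, then re-normalize.

    per_template[t][c] is the text embedding of class c under template t.
    """
    n_templates = len(per_template)
    ensembled = []
    for c in range(len(per_template[0])):
        dim = len(per_template[0][c])
        mean = [sum(per_template[t][c][d] for t in range(n_templates)) / n_templates
                for d in range(dim)]
        ensembled.append(l2_normalize(mean))
    return ensembled

def predict(image_embedding, class_embeddings):
    """Return the index of the class with the highest cosine similarity."""
    img = l2_normalize(image_embedding)
    scores = [sum(a * b for a, b in zip(img, cls)) for cls in class_embeddings]
    return max(range(len(scores)), key=scores.__getitem__)

# Toy data: two templates, two classes, 2-D embeddings (all values made up).
adapted = [{"w": [1.0, 3.0]}, {"w": [3.0, 5.0]}]
merged = average_weights(adapted)

text_emb = [
    [[1.0, 0.0], [0.0, 1.0]],   # template 0: class 0, class 1
    [[0.8, 0.6], [0.6, 0.8]],   # template 1: class 0, class 1
]
classes = ensemble_text_embeddings(text_emb)
label = predict([1.0, 0.2], classes)
```

In a real setting the parameter dicts would be model state dictionaries and the embeddings would come from CLIP's text and image encoders; the averaging logic itself stays the same.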

artificial intelligence, natural language, weight average test time adaptation, (3 more...)

Genre:
- Research Report > New Finding (0.62)

Technology:
- Information Technology > Artificial Intelligence
  - Vision (0.62)
  - Natural Language (0.62)