WATT: Weight Average Test Time Adaptation of CLIP
Neural Information Processing Systems
Vision-Language Models (VLMs) such as CLIP have yielded unprecedented performance for zero-shot image classification, yet their generalization capability may still be seriously challenged when confronted with domain shifts. Our method employs a diverse set of templates for text prompts, augmenting the existing CLIP framework. Predictions are utilized as pseudo-labels for model updates, followed by weight averaging to consolidate the learned information globally. Furthermore, we introduce a text ensemble strategy, enhancing overall test performance by aggregating diverse textual cues. Our findings underscore the effectiveness of WATT across diverse datasets, including CIFAR-10-C and CIFAR-10.1. Notably, these enhancements are achieved without the need for additional model transformations or trainable modules.
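The weight-averaging step described above can be sketched as follows. This is a minimal illustration using plain Python dictionaries in place of real model state dicts; the function name `average_weights` and the toy "adapted copies" are assumptions for illustration, not the authors' actual implementation.

```python
# Sketch of the weight-averaging consolidation step: several model copies,
# each adapted with a different text-prompt template, are merged by
# element-wise averaging of their parameters.
# NOTE: illustrative only; real WATT operates on CLIP state_dicts.

def average_weights(state_dicts):
    """Element-wise average of several parameter dictionaries.

    Each dict maps a parameter name to a flat list of floats,
    standing in for one model copy adapted with a different
    text-prompt template.
    """
    n = len(state_dicts)
    return {
        key: [sum(sd[key][i] for sd in state_dicts) / n
              for i in range(len(state_dicts[0][key]))]
        for key in state_dicts[0]
    }

# Three per-template adapted copies of a toy two-parameter "model".
adapted = [
    {"w": [1.0, 2.0], "b": [0.0]},
    {"w": [3.0, 4.0], "b": [3.0]},
    {"w": [5.0, 6.0], "b": [6.0]},
]
merged = average_weights(adapted)
print(merged)  # {'w': [3.0, 4.0], 'b': [3.0]}
```

In practice the averaged weights would be loaded back into the model after the per-template pseudo-label updates, so that information learned under each template is consolidated into a single set of parameters.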
- Genre:
- Research Report > New Finding (0.62)
- Technology:
- Information Technology > Artificial Intelligence
- Natural Language (0.62)
- Vision (0.62)