WATT: Weight Average Test-Time Adaptation of CLIP

Neural Information Processing Systems 

Vision-Language Models (VLMs) such as CLIP have yielded unprecedented performance in zero-shot image classification, yet their generalization capability can still be seriously challenged when they are confronted with domain shifts.