Enhancing CLIP Robustness via Cross-Modality Alignment
Neural Information Processing Systems
Vision-language models (VLMs) such as CLIP demonstrate strong generalization in zero-shot classification but remain highly vulnerable to adversarial perturbations. Existing methods primarily focus on adversarial fine-tuning or prompt optimization and often overlook the gap in CLIP's encoded features: the text and image features lie far apart from each other. This misalignment is significantly amplified under adversarial perturbations, leading to severe degradation in classification performance.
Jun-10-2026, 19:44:14 GMT