Enhancing CLIP Robustness via Cross-Modality Alignment

Jun-10-2026, 19:44:14 GMT – Neural Information Processing Systems

Vision-language models (VLMs) such as CLIP demonstrate strong generalization in zero-shot classification but remain highly vulnerable to adversarial perturbations. Existing methods primarily focus on adversarial fine-tuning or prompt optimization, but they often overlook the gap in CLIP's encoded features: the text and image features lie far apart from each other. This misalignment is significantly amplified under adversarial perturbations, leading to severe degradation in classification performance.
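As a rough illustration of the feature gap described above, the sketch below measures the distance between the centroids of L2-normalized image and text features and runs cosine-similarity zero-shot prediction. It is not the paper's method. The function names (`modality_gap`, `zero_shot_predict`) are hypothetical, and the features are synthetic random stand-ins rather than real CLIP encoder outputs.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along `axis`."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def modality_gap(image_feats, text_feats):
    """Euclidean distance between the centroids of the normalized features.

    One simple way to quantify how far apart the two modalities lie;
    other definitions are possible.
    """
    img_center = l2_normalize(image_feats).mean(axis=0)
    txt_center = l2_normalize(text_feats).mean(axis=0)
    return float(np.linalg.norm(img_center - txt_center))


def zero_shot_predict(image_feats, class_text_feats):
    """Assign each image to the class whose text feature is most similar."""
    sims = l2_normalize(image_feats) @ l2_normalize(class_text_feats).T
    return sims.argmax(axis=1)


# Synthetic stand-ins (assumption): two clusters offset in opposite
# directions along the first axis to mimic separated modalities.
rng = np.random.default_rng(0)
dim = 16
offset = 3.0 * np.eye(dim)[0]
image_feats = rng.normal(size=(8, dim)) + offset
text_feats = rng.normal(size=(4, dim)) - offset

gap = modality_gap(image_feats, text_feats)
preds = zero_shot_predict(image_feats, text_feats)
print(f"modality gap: {gap:.3f}")
print("predicted class indices:", preds)
```

With real encoders, `image_feats` and `text_feats` would come from the image and text towers, and the same gap could be compared on clean versus adversarially perturbed inputs to check whether it widens.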

artificial intelligence, machine learning, natural language

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language (0.62)
  - Machine Learning (0.56)