Enhancing via Cross Modality Alignment

Jun-15-2026, 06:41:15 GMT–Neural Information Processing Systems

Vision-language models (VLMs) such as CLIP demonstrate strong generalization in zero-shot classification but remain highly vulnerable to adversarial perturbations. Existing methods primarily focus on adversarial fine-tuning or prompt optimization, they often overlook the gaps in CLIP's encoded features, which is shown as the text and image features lie far apart from each other. This misalignment is significantly amplified under adversarial perturbations, leading to severe degradation in classification performance. To address this problem, we propose CrOss-modaLity Alignment, dubbed COLA, an optimal transport-based framework that explicitly addresses adversarial misalignment by restoring both global image-text alignment and local structural consistency in the feature space.

artificial intelligence, machine learning, natural language, (20 more...)

Neural Information Processing Systems

Jun-15-2026, 06:41:15 GMT

Conferences PDF

Add feedback

Genre:
- Research Report > Experimental Study (1.00)

Industry:
- Information Technology (0.93)

Technology:
- Information Technology > Artificial Intelligence
  - Vision (1.00)
  - Natural Language (1.00)
  - Machine Learning > Neural Networks (1.00)
  - Representation & Reasoning (0.92)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found