Rethinking Misalignment in Vision-Language Model Adaptation from a Causal Perspective

May-26-2025, 23:19:10 GMT–Neural Information Processing Systems

Foundational vision-language models such as CLIP have exhibited impressive generalization on downstream tasks. However, CLIP suffers from a two-level misalignment issue when adapting to specific tasks: task misalignment and data misalignment. Soft prompt tuning has mitigated task misalignment, yet data misalignment remains a challenge. To analyze the impact of data misalignment, we revisit the pre-training and adaptation processes of CLIP and develop a structural causal model. We discover that, although the goal is to accurately capture task-relevant information for downstream tasks, task-irrelevant knowledge influences the prediction results and hampers the modeling of the true relationships between the images and the predicted classes.

machine learning, natural language, vision-language model adaptation

Technology:
- Information Technology > Artificial Intelligence
  - Vision (0.65)
  - Natural Language (0.65)
  - Representation & Reasoning (0.43)
  - Machine Learning (0.43)