Decomposing and Interpreting Image Representations via Text in ViTs Beyond CLIP

May-31-2025, 11:48:25 GMT – Neural Information Processing Systems

Recent work has explored how individual components of the CLIP-ViT model contribute to the final representation by leveraging the shared image-text representation space of CLIP. These components, such as attention heads and MLPs, have been shown to capture distinct image features such as shape, color, or texture. However, understanding the role of these components in arbitrary vision transformers (ViTs) is challenging. To this end, we introduce a general framework that identifies the roles of various components in ViTs beyond CLIP. Specifically, we (a) automate the decomposition of the final representation into contributions from different model components, and (b) linearly map these contributions to CLIP space to interpret them via text. Additionally, we introduce a novel scoring function to rank components by their importance with respect to specific features.
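The two steps named in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: all data is synthetic, the names `contribs`, `aligner`, and `clip_image_emb` are hypothetical, and fitting the linear map by ordinary least squares is an assumption made only for illustration. The sketch shows why a linear map is convenient here: if the final representation is a sum of component contributions, the mapped representation is the sum of the mapped contributions, so each component can be compared to text embeddings on its own.

```python
import numpy as np

rng = np.random.default_rng(0)
n_images, n_components, d_model, d_clip = 200, 6, 32, 16

# Hypothetical per-component contributions (e.g. one per attention head or MLP).
# By construction they sum to the final representation of each image.
contribs = rng.normal(size=(n_images, n_components, d_model))
final_repr = contribs.sum(axis=1)                      # (n_images, d_model)

# Synthetic stand-in for CLIP image embeddings of the same images.
hidden_map = rng.normal(size=(d_model, d_clip))
clip_image_emb = final_repr @ hidden_map
clip_image_emb += 0.01 * rng.normal(size=clip_image_emb.shape)

# Assumption: fit the linear map to CLIP space with ordinary least squares.
aligner, *_ = np.linalg.lstsq(final_repr, clip_image_emb, rcond=None)

# Map every component contribution into CLIP space.
mapped = contribs @ aligner                            # (n_images, n_components, d_clip)

# Linearity: mapped contributions sum to the mapped final representation.
assert np.allclose(mapped.sum(axis=1), final_repr @ aligner)

def cosine(a, b):
    """Cosine similarity between the last axis of a and each row of b."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

# Synthetic stand-ins for CLIP text embeddings of three feature descriptions.
text_emb = rng.normal(size=(3, d_clip))

sims = cosine(mapped, text_emb)                        # (n_images, n_components, 3)
mean_sims = sims.mean(axis=0)                          # (n_components, 3)
print(mean_sims.round(3))
```

With real models, `contribs` would come from the decomposition of an actual ViT and `text_emb` from a CLIP text encoder. The mean similarity used above is only a placeholder for comparison and is not the scoring function introduced in the paper.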


Country:
- North America > United States > Maryland (0.14)

Genre:
- Research Report > Experimental Study (0.93)

Technology:
- Information Technology
  - Artificial Intelligence
    - Machine Learning > Neural Networks
      - Deep Learning (0.67)
    - Natural Language (1.00)
    - Vision (1.00)
  - Sensing and Signal Processing > Image Processing (1.00)