VL-SAE: Interpreting and Enhancing Vision-Language Alignment with a Unified Concept Set

Jun-11-2026, 21:25:45 GMT–Neural Information Processing Systems

However, the interpretability of the alignment component remains uninvestigated due to the difficulty in mapping the semantics of multi-modal representations into a unified concept set. To address this problem, we propose VL-SAE, a sparse autoencoder that encodes vision-language representations into its hidden activations. Each neuron in the hidden layer correlates with a concept represented by semantically similar images and texts, thereby interpreting these representations with a unified concept set. To establish the neuron-concept correlation, we encourage semantically similar representations to exhibit consistent neuron activations during self-supervised training. First, to measure the semantic similarity of multi-modal representations, we perform their alignment in an explicit form based on cosine similarity.
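
The idea of semantically similar image and text representations activating the same hidden neuron can be illustrated with a minimal sketch. It is not the VL-SAE architecture or training procedure, which this excerpt does not specify. It assumes a hypothetical set of fixed neuron directions (`concept_directions`), a top-k sparsity rule, and toy 2-dimensional vectors, all chosen only for illustration. The one element taken from the text is the use of cosine similarity as the explicit similarity measure.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def topk_sparse_activations(x, concept_directions, k):
    """Score x against each hidden neuron's direction by cosine similarity,
    then keep the k largest scores and zero out the rest (assumed sparsity rule)."""
    scores = [cosine_similarity(x, w) for w in concept_directions]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = set(ranked[:k])
    return [s if i in keep else 0.0 for i, s in enumerate(scores)]

# Hypothetical toy data: 3 hidden neurons over 2-dimensional representations.
concept_directions = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
image_repr = [0.9, 0.1]  # stands in for an image representation
text_repr = [0.8, 0.2]   # stands in for a semantically similar text representation

image_act = topk_sparse_activations(image_repr, concept_directions, k=1)
text_act = topk_sparse_activations(text_repr, concept_directions, k=1)

# Index of the single active neuron for each modality.
image_neuron = max(range(len(image_act)), key=lambda i: image_act[i])
text_neuron = max(range(len(text_act)), key=lambda i: text_act[i])
```

In this toy setup both representations point in nearly the same direction, so `image_neuron` and `text_neuron` coincide. That is the kind of consistent activation the abstract says training should encourage for semantically similar inputs.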

artificial intelligence, machine learning, natural language

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Text Processing (0.59)
  - Machine Learning > Neural Networks (0.59)