Parts of Speech–Grounded Subspaces in Vision-Language Models

Dec-23-2025, 19:42:07 GMT–Neural Information Processing Systems

Latent image representations arising from vision-language models have proved immensely useful for a variety of downstream tasks. However, their utility is limited by their entanglement with respect to different visual attributes. For instance, recent work has shown that CLIP image representations are often biased toward specific visual properties (such as objects or actions) in an unpredictable manner. In this paper, we propose to separate representations of the different visual modalities in CLIP's joint vision-language space by leveraging the association between parts of speech and specific visual modes of variation (e.g.

name change, representation, vision-language model

Conferences Web Page

Technology:
- Information Technology > Artificial Intelligence > Vision (0.79)