Parts of Speech–Grounded Subspaces in Vision-Language Models
Neural Information Processing Systems
Latent image representations arising from vision-language models have proved immensely useful for a variety of downstream tasks. However, their utility is limited by their entanglement with respect to different visual attributes. For instance, recent work has shown that CLIP image representations are often biased toward specific visual properties (such as objects or actions) in an unpredictable manner. In this paper, we propose to separate representations of the different visual modalities in CLIP's joint vision-language space by leveraging the association between parts of speech and specific visual modes of variation (e.g.
Dec-23-2025, 19:42:07 GMT