Object-centric binding in Contrastive Language-Image Pretraining
–Neural Information Processing Systems
Recent advances in vision-language models (VLMs) have been driven by contrastive models such as CLIP, which learn to associate visual information with its corresponding text descriptions. However, these models struggle to understand complex compositional scenes involving multiple objects and their spatial relationships. To address these challenges, we propose a novel approach that diverges from the commonly used strategy of designing fine-grained hard-negative augmentations. Instead, our work focuses on integrating inductive biases into the pretraining of CLIP-like models to improve their compositional understanding. To that end, we introduce a binding module that connects a scene graph, derived from a text description, with a slot-structured image representation, facilitating a structured similarity assessment between the two modalities.
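To make the idea of a structured similarity concrete, the following is a minimal sketch, not the paper's implementation. It assumes that each scene-graph object node and each image slot is already embedded as a plain vector, and it uses a softmax attention over slots as a stand-in for the binding module. The function names (`bind_nodes_to_slots`, `structured_similarity`), the `temperature` value, and the averaging of per-node scores are all hypothetical choices made for illustration; relation edges of the scene graph are omitted.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores, temperature=1.0):
    """Numerically stable softmax with a temperature."""
    m = max(scores)
    exps = [math.exp((s - m) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def bind_nodes_to_slots(node_embs, slot_embs, temperature=0.1):
    """Assumed binding step: each text node attends over all image slots
    and receives a similarity-weighted mixture of slot embeddings."""
    slots = [l2_normalize(s) for s in slot_embs]
    bound = []
    for node in node_embs:
        n = l2_normalize(node)
        weights = softmax([dot(n, s) for s in slots], temperature)
        mixed = [
            sum(w * s[i] for w, s in zip(weights, slots))
            for i in range(len(n))
        ]
        bound.append(l2_normalize(mixed))
    return bound


def structured_similarity(node_embs, slot_embs, temperature=0.1):
    """Assumed scoring step: mean cosine similarity between each node
    and the slot mixture it was bound to."""
    bound = bind_nodes_to_slots(node_embs, slot_embs, temperature)
    sims = [dot(l2_normalize(n), b) for n, b in zip(node_embs, bound)]
    return sum(sims) / len(sims)


# Toy example: two object nodes and two slots in a 3-d embedding space.
nodes = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
matching_slots = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0]]   # same objects, other order
unrelated_slots = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]  # orthogonal to both nodes

score_match = structured_similarity(nodes, matching_slots)
score_unrelated = structured_similarity(nodes, unrelated_slots)
```

In this toy setup the score does not depend on slot order, since each node picks out the slot it matches, and a caption whose objects have no counterpart among the slots scores lower than one whose objects do.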
Jun-14-2026, 02:36:31 GMT