CTRL-O: Language-Controllable Object-Centric Visual Representation Learning
Didolkar, Aniket, Zadaianchuk, Andrii, Awal, Rabiul, Seitzer, Maximilian, Gavves, Efstratios, Agrawal, Aishwarya
arXiv.org Artificial Intelligence
Object-centric representation learning aims to decompose visual scenes into fixed-size vectors called "slots" or "object files", where each slot captures a distinct object. Current state-of-the-art object-centric models have shown remarkable success at object discovery in diverse domains, including complex real-world scenes. However, these models suffer from a key limitation: they lack controllability. Specifically, current object-centric models learn representations based on their preconceived understanding of objects, without allowing user input to guide which objects are represented. Introducing controllability into object-centric models could unlock a range of useful capabilities, such as the ability to extract instance-specific representations from a scene. In this work, we propose a novel approach for user-directed control over slot representations by conditioning slots on language descriptions. The proposed ConTRoLlable Object-centric representation learning approach, which we term CTRL-O, achieves targeted object-language binding in complex real-world scenes without requiring mask supervision. We then apply these controllable slot representations to two downstream vision-language tasks: text-to-image generation and visual question answering. The proposed approach enables instance-specific text-to-image generation and also achieves strong performance on visual question answering.
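The core idea of conditioning slots on language can be illustrated schematically. The sketch below is NOT the authors' implementation; it is a minimal NumPy illustration, assuming a plain dot-product slot-attention update in which slots are initialized from hypothetical text-encoder embeddings (`lang_emb`) rather than sampled from a learned random prior, so each slot is steered toward the object its description names.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def language_conditioned_slot_attention(features, lang_emb, n_iters=3):
    """Schematic slot-attention pass with language-conditioned slots.

    features : (n_feat, d)  flattened visual feature map
    lang_emb : (n_slots, d) one text embedding per queried object (hypothetical)
    """
    slots = lang_emb.copy()  # initialize slots from language, not a random prior
    for _ in range(n_iters):
        # Slots compete for each feature: softmax over the slot axis.
        attn = softmax(features @ slots.T, axis=1)        # (n_feat, n_slots)
        # Normalize per slot so the update is a weighted mean of features.
        attn = attn / (attn.sum(axis=0, keepdims=True) + 1e-8)
        slots = attn.T @ features                         # (n_slots, d)
    return slots

rng = np.random.default_rng(0)
features = rng.normal(size=(16, 8))  # e.g. 4x4 feature map, d=8
lang_emb = rng.normal(size=(4, 8))   # embeddings for 4 object descriptions
slots = language_conditioned_slot_attention(features, lang_emb)
print(slots.shape)  # (4, 8): one slot per language query
```

In the real model the update would use learned projections, a GRU, and training losses that enforce object-language binding; the sketch only shows where the language conditioning enters the slot initialization.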
Mar-27-2025