AITopics | object-centric model

Collaborating Authors

object-centric model

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

SlotDiffusion: Object-Centric Generative Modeling with Diffusion Models Ziyi Wu

Neural Information Processing SystemsNov-19-2025, 12:10:40 GMT

In this paper, we focus on improving slot-to-image decoding, a crucial aspect for high-quality visual generation.

machine learning, natural language, slotdiffusion, (18 more...)

Neural Information Processing Systems

Country: North America > Canada > Ontario > Toronto (0.14)

Industry: Information Technology (0.46)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
(2 more...)

Add feedback

Is an object-centric representation beneficial for robotic manipulation ?

Chapin, Alexandre, Dellandrea, Emmanuel, Chen, Liming

arXiv.org Artificial IntelligenceJun-25-2025

Object-centric representation (OCR) has recently become a subject of interest in the computer vision community for learning a structured representation of images and videos. It has been several times presented as a potential way to improve data-efficiency and generalization capabilities to learn an agent on downstream tasks. However, most existing work only evaluates such models on scene decomposition, without any notion of reasoning over the learned representation. Robotic manipulation tasks generally involve multi-object environments with potential inter-object interaction. We thus argue that they are a very interesting playground to really evaluate the potential of existing object-centric work. To do so, we create several robotic manipulation tasks in simulated environments involving multiple objects (several distractors, the robot, etc.) and a high-level of randomization (object positions, colors, shapes, background, initial positions, etc.). We then evaluate one classical object-centric method across several generalization scenarios and compare its results against several state-of-the-art hollistic representations. Our results exhibit that existing methods are prone to failure in difficult scenarios involving complex scene structures, whereas object-centric methods help overcome these challenges.

artificial intelligence, machine learning, representation, (17 more...)

arXiv.org Artificial Intelligence

2506.19408

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Add feedback

Object-Centric Representations Improve Policy Generalization in Robot Manipulation

Chapin, Alexandre, Machado, Bruno, Dellandrea, Emmanuel, Chen, Liming

arXiv.org Artificial IntelligenceMay-20-2025

Visual representations are central to the learning and generalization capabilities of robotic manipulation policies. While existing methods rely on global or dense features, such representations often entangle task-relevant and irrelevant scene information, limiting robustness under distribution shifts. In this work, we investigate object-centric representations (OCR) as a structured alternative that segments visual input into a finished set of entities, introducing inductive biases that align more naturally with manipulation tasks. We benchmark a range of visual encoders-object-centric, global and dense methods-across a suite of simulated and real-world manipulation tasks ranging from simple to complex, and evaluate their generalization under diverse visual conditions including changes in lighting, texture, and the presence of distractors. Our findings reveal that OCR-based policies outperform dense and global representations in generalization settings, even without task-specific pretraining. These insights suggest that OCR is a promising direction for designing visual systems that generalize effectively in dynamic, real-world robotic environments.

artificial intelligence, machine learning, representation, (19 more...)

arXiv.org Artificial Intelligence

2505.11563

Genre: Research Report > New Finding (0.88)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Add feedback

CTRL-O: Language-Controllable Object-Centric Visual Representation Learning

Didolkar, Aniket, Zadaianchuk, Andrii, Awal, Rabiul, Seitzer, Maximilian, Gavves, Efstratios, Agrawal, Aishwarya

arXiv.org Artificial IntelligenceMar-27-2025

Object-centric representation learning aims to decompose visual scenes into fixed-size vectors called "slots" or "object files", where each slot captures a distinct object. Current state-of-the-art object-centric models have shown remarkable success in object discovery in diverse domains, including complex real-world scenes. However, these models suffer from a key limitation: they lack controllability. Specifically, current object-centric models learn representations based on their preconceived understanding of objects, without allowing user input to guide which objects are represented. Introducing controllability into object-centric models could unlock a range of useful capabilities, such as the ability to extract instance-specific representations from a scene. In this work, we propose a novel approach for user-directed control over slot representations by conditioning slots on language descriptions. The proposed ConTRoLlable Object-centric representation learning approach, which we term CTRL-O, achieves targeted object-language binding in complex real-world scenes without requiring mask supervision. Next, we apply these controllable slot representations on two downstream vision language tasks: text-to-image generation and visual question answering. The proposed approach enables instance-specific text-to-image generation and also achieves strong performance on visual question answering.

large language model, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

2503.21747

Country:

North America > Canada > Quebec > Montreal (0.04)
Europe > Netherlands > North Holland > Amsterdam (0.04)
Europe > Greece (0.04)
(2 more...)

Genre: Research Report > Promising Solution (0.48)

Industry: Health & Medicine (0.49)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Add feedback

Successes and Limitations of Object-centric Models at Compositional Generalisation

Montero, Milton L., Bowers, Jeffrey S., Malhotra, Gaurav

arXiv.org Artificial IntelligenceDec-24-2024

In recent years, it has been shown empirically that standard disentangled latent variable models do not support robust compositional learning in the visual domain. Indeed, in spite of being designed with the goal of factorising datasets into their constituent factors of variations, disentangled models show extremely limited compositional generalisation capabilities. On the other hand, object-centric architectures have shown promising compositional skills, albeit these have 1) not been extensively tested and 2) experiments have been limited to scene composition -- where models must generalise to novel combinations of objects in a visual scene instead of novel combinations of object properties. In this work, we show that these compositional generalisation skills extend to this later setting. Furthermore, we present evidence pointing to the source of these skills and how they can be improved through careful training. Finally, we point to one important limitation that still exists which suggests new directions of research.

artificial intelligence, deep learning, machine learning, (20 more...)

arXiv.org Artificial Intelligence

2412.18743

Country:

Europe (0.46)
North America > United States (0.28)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

Zero-Shot Object-Centric Representation Learning

Didolkar, Aniket, Zadaianchuk, Andrii, Goyal, Anirudh, Mozer, Mike, Bengio, Yoshua, Martius, Georg, Seitzer, Maximilian

arXiv.org Artificial IntelligenceAug-17-2024

The goal of object-centric representation learning is to decompose visual scenes into a structured representation that isolates the entities. Recent successes have shown that object-centric representation learning can be scaled to real-world scenes by utilizing pre-trained self-supervised features. However, so far, object-centric methods have mostly been applied in-distribution, with models trained and evaluated on the same dataset. This is in contrast to the wider trend in machine learning towards general-purpose models directly applicable to unseen data and tasks. Thus, in this work, we study current object-centric methods through the lens of zero-shot generalization by introducing a benchmark comprising eight different synthetic and real-world datasets. We analyze the factors influencing zero-shot performance and find that training on diverse real-world images improves transferability to unseen scenarios. Furthermore, inspired by the success of task-specific fine-tuning in foundation models, we introduce a novel fine-tuning strategy to adapt pre-trained vision encoders for the task of object discovery. We find that the proposed approach results in state-of-the-art performance for unsupervised object discovery, exhibiting strong zero-shot transfer to unseen datasets.

arxiv, dataset, inosaur, (17 more...)

arXiv.org Artificial Intelligence

2408.09162

Country:

Europe > Germany > Baden-Württemberg > Tübingen Region > Tübingen (0.04)
North America > Canada > Quebec > Montreal (0.04)
Europe > Netherlands > North Holland > Amsterdam (0.04)
Asia > China (0.04)

Genre: Research Report (1.00)

Industry: Government (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.92)

Add feedback

Learning Disentangled Representation in Object-Centric Models for Visual Dynamics Prediction via Transformers

Gandhi, Sanket, Atul, null, Mahajan, Samanyu, Sharma, Vishal, Gupta, Rushil, Mondal, Arnab Kumar, Singla, Parag

arXiv.org Artificial IntelligenceJul-3-2024

Recent work has shown that object-centric representations can greatly help improve the accuracy of learning dynamics while also bringing interpretability. In this work, we take this idea one step further, ask the following question: "can learning disentangled representation further improve the accuracy of visual dynamics prediction in object-centric models?" While there has been some attempt to learn such disentangled representations for the case of static images \citep{nsb}, to the best of our knowledge, ours is the first work which tries to do this in a general setting for video, without making any specific assumptions about the kind of attributes that an object might have. The key building block of our architecture is the notion of a {\em block}, where several blocks together constitute an object. Each block is represented as a linear combination of a given number of learnable concept vectors, which is iteratively refined during the learning process. The blocks in our model are discovered in an unsupervised manner, by attending over object masks, in a style similar to discovery of slots \citep{slot_attention}, for learning a dense object-centric representation. We employ self-attention via transformers over the discovered blocks to predict the next state resulting in discovery of visual dynamics. We perform a series of experiments on several benchmark 2-D, and 3-D datasets demonstrating that our architecture (1) can discover semantically meaningful blocks (2) help improve accuracy of dynamics prediction compared to SOTA object-centric models (3) perform significantly better in OOD setting where the specific attribute combinations are not seen earlier during training. Our experiments highlight the importance discovery of disentangled representation for visual dynamics prediction.

artificial intelligence, machine learning, representation, (17 more...)

arXiv.org Artificial Intelligence

2407.03216

Country:

Asia > India > NCT > Delhi (0.05)
North America > Canada > Quebec > Montreal (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (0.94)
Information Technology > Sensing and Signal Processing > Image Processing (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

Tackling the Abstraction and Reasoning Corpus (ARC) with Object-centric Models and the MDL Principle

Ferré, Sébastien

arXiv.org Artificial IntelligenceNov-1-2023

The Abstraction and Reasoning Corpus (ARC) is a challenging benchmark, introduced to foster AI research towards human-level intelligence. It is a collection of unique tasks about generating colored grids, specified by a few examples only. In contrast to the transformation-based programs of existing work, we introduce object-centric models that are in line with the natural programs produced by humans. Our models can not only perform predictions, but also provide joint descriptions for input/output pairs. The Minimum Description Length (MDL) principle is used to efficiently search the large model space. A diverse range of tasks are solved, and the learned models are similar to the natural programs. We demonstrate the generality of our approach by applying it to a different domain.

abstraction and reasoning corpus, mdl principle, object-centric model, (2 more...)

arXiv.org Artificial Intelligence

2311.00545

Genre: Research Report (0.40)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Computational Learning Theory > Minimum Complexity Machines (0.89)

Add feedback

SlotDiffusion: Object-Centric Generative Modeling with Diffusion Models

Wu, Ziyi, Hu, Jingyu, Lu, Wuyue, Gilitschenski, Igor, Garg, Animesh

arXiv.org Artificial IntelligenceSep-21-2023

Object-centric learning aims to represent visual data with a set of object entities (a.k.a. slots), providing structured representations that enable systematic generalization. Leveraging advanced architectures like Transformers, recent approaches have made significant progress in unsupervised object discovery. In addition, slot-based representations hold great potential for generative modeling, such as controllable image generation and object manipulation in image editing. However, current slot-based methods often produce blurry images and distorted objects, exhibiting poor generative modeling capabilities. In this paper, we focus on improving slot-to-image decoding, a crucial aspect for high-quality visual generation. We introduce SlotDiffusion -- an object-centric Latent Diffusion Model (LDM) designed for both image and video data. Thanks to the powerful modeling capacity of LDMs, SlotDiffusion surpasses previous slot models in unsupervised object segmentation and visual generation across six datasets. Furthermore, our learned object features can be utilized by existing object-centric dynamics models, improving video prediction quality and downstream temporal reasoning tasks. Finally, we demonstrate the scalability of SlotDiffusion to unconstrained real-world datasets such as PASCAL VOC and COCO, when integrated with self-supervised pre-trained image encoders.

dataset, decoder, slotdiffusion, (14 more...)

arXiv.org Artificial Intelligence

2305.11281

Country: North America > Canada > Ontario > Toronto (0.14)

Genre: Research Report (0.63)

Industry:

Media > Photography (0.48)
Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Filters

Collaborating Authors

object-centric model

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

SlotDiffusion: Object-Centric Generative Modeling with Diffusion Models Ziyi Wu

9fa03b16dbd6cabc7601fe98c6ec291e-Paper-Conference.pdf

Is an object-centric representation beneficial for robotic manipulation ?

Object-Centric Representations Improve Policy Generalization in Robot Manipulation

CTRL-O: Language-Controllable Object-Centric Visual Representation Learning

Successes and Limitations of Object-centric Models at Compositional Generalisation

Zero-Shot Object-Centric Representation Learning

Learning Disentangled Representation in Object-Centric Models for Visual Dynamics Prediction via Transformers

Tackling the Abstraction and Reasoning Corpus (ARC) with Object-centric Models and the MDL Principle

SlotDiffusion: Object-Centric Generative Modeling with Diffusion Models