Slot-Guided Adaptation of Pre-trained Diffusion Models for Object-Centric Learning and Compositional Generation

Akan, Adil Kaan, Yemez, Yucel

arXiv.org Artificial Intelligence

We present SlotAdapt, an object-centric learning method that combines slot attention with pretrained diffusion models by introducing adapters for slot-based conditioning. Our method preserves the generative power of pretrained diffusion models while avoiding their text-centric conditioning bias. We also incorporate an additional guidance loss into our architecture to align cross-attention from the adapter layers with slot attention. This improves the alignment of the model with the objects in the input image without external supervision. Experimental results show that our method outperforms state-of-the-art techniques in object discovery and image generation tasks across multiple datasets, including those with real images. Furthermore, we demonstrate experimentally that our method performs remarkably well on complex real-world images for compositional generation, in contrast to other slot-based generative methods in the literature.

The real world is inherently structured with distinct, composable parts and objects that can be combined in various ways; this compositional characteristic is essential for cognitive functions such as reasoning, understanding causality, and the ability to generalize beyond training data (Lake et al., 2017; Bottou, 2014; Schölkopf et al., 2021; Bahdanau et al., 2019; Fodor & Pylyshyn, 1988). While language clearly reflects this modularity through sentences made up of distinct words and tokens, the compositional structure is less obvious in visual data. Object-centric learning (OCL) offers a promising approach to uncovering this latent structure by grouping related features into coherent object representations without supervision (Kahneman et al., 1992; Greff et al., 2020). By decomposing complex scenes into separate objects and their interactions, OCL mimics how humans interpret their environment (Spelke & Kinzler, 2007), potentially improving the robustness and interpretability of AI systems (Lake et al., 2017; Schölkopf et al., 2021).
This approach shifts from traditional pixel-based feature extraction to a more meaningful segmentation of visual data, which is key to better generalization and to supporting high-level reasoning tasks. Recent advances in OCL have shown the potential of incorporating powerful generative models, such as transformers and diffusion models, into the OCL framework as image decoders. Notably, models such as Latent Slot Diffusion (LSD) (Jiang et al., 2023) and SlotDiffusion (Wu et al., 2023b) have considerably improved performance on object discovery and visual generation tasks in real-world settings by employing slot-conditioned diffusion models.
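The core conditioning mechanism described above, image features attending to slots via cross-attention, with an auxiliary loss aligning that attention map to slot-attention masks, can be illustrated with a minimal numpy sketch. This is not the papers' actual implementation: the function names, projection matrices, and the cross-entropy form of the guidance loss are simplifying assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def slot_cross_attention(feats, slots, Wq, Wk, Wv):
    """Cross-attention where image features attend to slots.

    feats: (N, D) flattened image features (queries).
    slots: (K, D) slot vectors (keys/values), as in an adapter layer.
    Returns the updated features (N, d) and the (N, K) attention map.
    """
    q = feats @ Wq                                           # (N, d)
    k = slots @ Wk                                           # (K, d)
    v = slots @ Wv                                           # (K, d)
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)  # (N, K)
    return attn @ v, attn

def guidance_loss(adapter_attn, slot_attn_masks, eps=1e-8):
    """Illustrative guidance loss: cross-entropy pushing the adapter's
    per-feature attention distribution over slots toward the
    slot-attention masks (both (N, K), rows summing to 1)."""
    return -np.mean(np.sum(slot_attn_masks * np.log(adapter_attn + eps), axis=-1))
```

In this simplified view, minimizing `guidance_loss` encourages each spatial feature to attend to the same slot in the adapter layers as it does in the slot-attention module, which is one plausible way to realize the alignment objective described in the abstract.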



SlotDiffusion: Object-Centric Generative Modeling with Diffusion Models

Wu, Ziyi, Hu, Jingyu, Lu, Wuyue, Gilitschenski, Igor, Garg, Animesh

arXiv.org Artificial Intelligence

Object-centric learning aims to represent visual data with a set of object entities (a.k.a. slots), providing structured representations that enable systematic generalization. Leveraging advanced architectures like Transformers, recent approaches have made significant progress in unsupervised object discovery. In addition, slot-based representations hold great potential for generative modeling, such as controllable image generation and object manipulation in image editing. However, current slot-based methods often produce blurry images and distorted objects, exhibiting poor generative modeling capabilities. In this paper, we focus on improving slot-to-image decoding, a crucial aspect for high-quality visual generation. We introduce SlotDiffusion -- an object-centric Latent Diffusion Model (LDM) designed for both image and video data. Thanks to the powerful modeling capacity of LDMs, SlotDiffusion surpasses previous slot models in unsupervised object segmentation and visual generation across six datasets. Furthermore, our learned object features can be utilized by existing object-centric dynamics models, improving video prediction quality and downstream temporal reasoning tasks. Finally, we demonstrate the scalability of SlotDiffusion to unconstrained real-world datasets such as PASCAL VOC and COCO, when integrated with self-supervised pre-trained image encoders.
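The slot-based representation both papers build on comes from Slot Attention (Locatello et al., 2020), in which a fixed set of slot vectors compete for input features via attention and are iteratively refined. The sketch below is a deliberately stripped-down numpy version: it keeps only the competitive softmax over slots and the weighted-mean update, omitting the learned projections, GRU update, LayerNorm, and residual MLP of the original algorithm.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def slot_attention(inputs, slots, iters=3, eps=1e-8):
    """Simplified slot-attention refinement (illustrative only).

    inputs: (N, D) encoded image features.
    slots:  (K, D) initial slot vectors (e.g. sampled from a Gaussian).
    Returns refined slots (K, D) and the final (N, K) assignment weights.
    """
    D = inputs.shape[-1]
    for _ in range(iters):
        # Softmax over the slot axis: slots compete for each feature.
        attn = softmax(inputs @ slots.T / np.sqrt(D), axis=-1)  # (N, K)
        # Normalize per slot so the update is a weighted mean of features.
        attn = attn / (attn.sum(axis=0, keepdims=True) + eps)
        slots = attn.T @ inputs                                 # (K, D)
    return slots, attn
```

The resulting per-slot assignment weights are what slot-based methods read out as unsupervised object segmentation masks, and the refined slot vectors are what a slot-conditioned decoder, such as the LDM in SlotDiffusion, maps back to pixels.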