LAVAE: Disentangling Location and Appearance

Andrea Dittadi, Ole Winther

arXiv.org Machine Learning 

ABSTRACT

We propose a probabilistic generative model for unsupervised learning of structured, interpretable, object-based representations of visual scenes. We use amortized variational inference to train the generative model end-to-end. The learned representations of object location and appearance are fully disentangled, and objects are represented independently of each other in the latent space. Unlike previous approaches that disentangle location and appearance, ours generalizes seamlessly to scenes with many more objects than encountered in the training regime. We evaluate the proposed model on multi-MNIST and multi-dSprites data sets.

1 INTRODUCTION

Many hallmarks of human intelligence rely on the capability to perceive the world as a layout of distinct physical objects that endure through time, a skill that infants acquire in early childhood (Spelke, 1990; 2013; Spelke and Kinzler, 2007). Learning compositional, object-based representations of visual scenes, however, is still regarded as an open challenge for artificial systems (Bengio et al., 2013; Garnelo and Shanahan, 2019). Recently, there has been growing interest in the unsupervised learning of disentangled representations (Locatello et al., 2018), which should separate the distinct, informative factors of variation in the data and contain all the information about the data in a compact, interpretable structure (Bengio et al., 2013). This notion is highly relevant to visual scene representation learning, where distinct objects should arguably be represented in a disentangled fashion.
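The "amortized variational inference" mentioned in the abstract refers to the standard VAE-style training setup: a shared inference network q_phi(z|x) is learned jointly with the generative model p_theta(x|z) by maximizing the evidence lower bound (ELBO). For reference, the generic form of that objective is given below; how LAVAE factorizes z into location and appearance parts is specific to the paper, but the bound itself is standard.

```latex
\log p_\theta(x) \;\geq\;
\underbrace{\mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z)\right]}_{\text{reconstruction}}
\;-\;
\underbrace{D_{\mathrm{KL}}\!\left(q_\phi(z \mid x)\,\|\,p(z)\right)}_{\text{regularization}}
\;=\; \mathcal{L}(\theta,\phi;x)
```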
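To make the location/appearance disentanglement concrete, the sketch below shows one way a decoder can keep the two factors separate: an appearance code z_what is decoded into a small object glimpse, while a location code z_where (scale and translation) only controls where the glimpse is pasted onto the canvas via a differentiable affine transform, in the spirit of spatial-transformer-style object decoders from prior work. Everything here, including the class name LocationAppearanceDecoder, the dimensions, and the single-object setting, is an illustrative assumption rather than the paper's actual architecture.

```python
# Minimal single-object sketch of a decoder with disentangled location and
# appearance latents. All names and dimensions are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LocationAppearanceDecoder(nn.Module):
    def __init__(self, z_what_dim=16, glimpse_size=20, canvas_size=50):
        super().__init__()
        self.glimpse_size = glimpse_size
        self.canvas_size = canvas_size
        # Appearance path: z_what -> small grayscale object glimpse.
        self.appearance_net = nn.Sequential(
            nn.Linear(z_what_dim, 128),
            nn.ReLU(),
            nn.Linear(128, glimpse_size * glimpse_size),
            nn.Sigmoid(),
        )

    def forward(self, z_what, z_where):
        """z_what: (B, z_what_dim); z_where: (B, 3) = (scale > 0, tx, ty)."""
        B = z_what.size(0)
        glimpse = self.appearance_net(z_what).view(
            B, 1, self.glimpse_size, self.glimpse_size
        )
        # Inverse affine transform that pastes the glimpse onto the canvas:
        # the location code only moves/scales the object, never its appearance.
        s, tx, ty = z_where[:, 0], z_where[:, 1], z_where[:, 2]
        theta = torch.zeros(B, 2, 3, device=z_what.device)
        theta[:, 0, 0] = 1.0 / s
        theta[:, 1, 1] = 1.0 / s
        theta[:, 0, 2] = -tx / s
        theta[:, 1, 2] = -ty / s
        grid = F.affine_grid(
            theta, (B, 1, self.canvas_size, self.canvas_size), align_corners=False
        )
        # Canvas locations outside the glimpse sample zeros (black background).
        return F.grid_sample(glimpse, grid, align_corners=False)


if __name__ == "__main__":
    dec = LocationAppearanceDecoder()
    z_what = torch.randn(4, 16)                     # appearance only
    z_where = torch.tensor([[0.4, 0.2, -0.3]] * 4)  # scale, tx, ty in [-1, 1]
    print(dec(z_what, z_where).shape)               # torch.Size([4, 1, 50, 50])
```

In a multi-object setting, one such decoded object per latent pair would be composed onto a shared canvas; since each object's latents are independent, adding objects at test time amounts to adding more (z_what, z_where) pairs, which suggests why this kind of factorization can generalize to more objects than were seen during training.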
