Unsupervised Object Learning via Common Fate

Matthias Tangemann, Steffen Schneider, Julius von Kügelgen, Francesco Locatello, Peter Gehler, Thomas Brox, Matthias Kümmerer, Matthias Bethge, Bernhard Schölkopf

arXiv.org Machine Learning 

In human vision, the Gestalt Principle of Common Fate (Wertheimer, 2012) has been shown to play an important role in object learning (Spelke, 1990). It posits that elements moving together tend to be perceived as a single unit, a perceptual bias that may have evolved to enable the recognition of camouflaged predators (Troscianko et al., 2009). In our work, we show that this principle can also be used successfully for machine vision by exploiting it in a multi-stage object learning approach (Figure 1): First, we apply unsupervised motion segmentation to obtain a candidate segmentation of a video frame. Second, we train generative object and background models on this segmentation. While the regions obtained by the motion segmentation are caused by objects moving in 3D, only their visible parts can be segmented. To learn the actual objects (i.e., the causes), a crucial task for the object model is therefore to generalize beyond the occlusions present in its input data. To measure success, we provide a dataset that includes object ground truth. As the last stage, we show that the learned object and background models can be combined into a flexible scene model that allows sampling manipulated novel scenes. Thus, in contrast to existing object-centric models trained end-to-end, our work aims at decomposing object learning into evaluable subproblems and testing the potential of exploiting object motion for building scalable object-centric models that allow for causally meaningful interventions during generation.
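The three stages above can be sketched as follows. This is a minimal toy illustration, not the authors' implementation: the motion segmentation is replaced by simple frame differencing, the "generative" object model by averaging of segmented appearances, and the scene model by mask-based compositing; all function names and array shapes are assumptions made for this sketch.

```python
import numpy as np

def motion_segmentation(frame_a, frame_b, threshold=0.1):
    # Stage 1 stand-in: flag pixels that change between two grayscale
    # frames. A real system would use a learned motion segmentation model.
    return np.abs(frame_b - frame_a) > threshold

def fit_object_model(frames, masks):
    # Stage 2 stand-in: average the segmented appearances across frames.
    # A real generative object model would additionally learn to complete
    # the occluded parts of each object (i.e., recover the full cause).
    masked = np.where(masks, frames, np.nan)
    template = np.nanmean(masked, axis=0)  # NaN where never observed
    return np.nan_to_num(template)

def compose_scene(background, obj_template, placement_mask):
    # Stage 3: recombine the learned object and background into a novel
    # scene; interventions correspond to changing the placement mask.
    return np.where(placement_mask, obj_template, background)
```

A usage example: segmenting a small moving blob in two toy frames, fitting the object template, and pasting it back onto an empty background at a chosen location.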