Dittadi, Andrea
Planning From Pixels in Atari With Learned Symbolic Representations
Dittadi, Andrea, Drachmann, Frederik K., Bolander, Thomas
Width-based planning methods have been shown to yield state-of-the-art performance in the Atari 2600 domain using pixel input. One successful approach, RolloutIW, represents states with the B-PROST boolean feature set. An augmented version of RolloutIW, $\pi$-IW, shows that learned features can be competitive with handcrafted ones for width-based search. In this paper, we leverage variational autoencoders (VAEs) to learn features directly from pixels in a principled manner, and without supervision. The inference model of the trained VAEs extracts boolean features from pixels, and RolloutIW plans with these features. The resulting combination outperforms the original RolloutIW and human professional play on Atari 2600 and drastically reduces the size of the feature set.
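The core mechanism described above — thresholding the inference model's Bernoulli posterior to obtain boolean features for width-based search — can be illustrated with a minimal sketch. All names and the tiny linear encoder below are purely illustrative assumptions (the paper's actual encoder is a trained convolutional VAE); only the thresholding step mirrors the described idea.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical learned encoder parameters (illustrative stand-ins for a
# trained convolutional VAE inference network).
W = rng.normal(scale=0.1, size=(64, 16))  # 64 input pixels -> 16 latents
b = np.zeros(16)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def extract_boolean_features(frame, threshold=0.5):
    """Map a flattened frame to a boolean feature vector by thresholding
    the encoder's Bernoulli posterior probabilities."""
    probs = sigmoid(frame @ W + b)
    return probs > threshold

frame = rng.random(64)  # stand-in for a preprocessed Atari frame
features = extract_boolean_features(frame)
print(features.shape, features.dtype)
```

A planner such as RolloutIW would then operate on `features` in place of the handcrafted B-PROST feature set; note that the resulting 16-dimensional vector is far smaller than B-PROST, which is the size reduction the abstract refers to.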
On the Transfer of Disentangled Representations in Realistic Settings
Dittadi, Andrea, Träuble, Frederik, Locatello, Francesco, Wüthrich, Manuel, Agrawal, Vaibhav, Winther, Ole, Bauer, Stefan, Schölkopf, Bernhard
Learning meaningful representations that disentangle the underlying structure of the data-generating process is considered to be of key importance in machine learning. While disentangled representations were found to be useful for diverse tasks such as abstract reasoning and fair classification, their scalability and real-world impact remain questionable. We introduce a new high-resolution dataset with 1M simulated images and over 1,800 annotated real-world images of the same robotic setup. In contrast to previous work, this new dataset exhibits correlations and a complex underlying structure, and allows evaluation of transfer to unseen simulated and real-world settings where the encoder i) remains in distribution or ii) is out of distribution. We propose new architectures in order to scale disentangled representation learning to realistic high-resolution settings and conduct a large-scale empirical study of disentangled representations on this dataset. We observe that disentanglement is a good predictor for out-of-distribution (OOD) task performance.
Optimal Variance Control of the Score Function Gradient Estimator for Importance Weighted Bounds
Liévin, Valentin, Dittadi, Andrea, Christensen, Anders, Winther, Ole
This paper introduces novel results for the score function gradient estimator of the importance weighted variational bound (IWAE). We prove that in the limit of large $K$ (number of importance samples) one can choose the control variate such that the Signal-to-Noise ratio (SNR) of the estimator grows as $\sqrt{K}$. This is in contrast to the standard pathwise gradient estimator, where the SNR decreases as $1/\sqrt{K}$. Based on our theoretical findings, we develop a novel control variate that builds on VIMCO. Empirically, for the training of both continuous and discrete generative models, the proposed method yields superior variance reduction, resulting in an SNR for IWAE that increases with $K$ without relying on the reparameterization trick. The novel estimator is competitive with state-of-the-art reparameterization-free gradient estimators such as Reweighted Wake-Sleep (RWS) and the thermodynamic variational objective (TVO) when training generative models.
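The control-variate idea in this abstract can be made concrete with a minimal numerical sketch of a VIMCO-style leave-one-out baseline, the family of estimators the paper builds on. The log-weights below are random placeholders, not outputs of a real model, and this is the standard VIMCO construction rather than the paper's extended estimator.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 8
log_w = rng.normal(size=K)  # placeholder log importance weights log w_k

def logmeanexp(x):
    """Numerically stable log of the mean of exp(x)."""
    m = x.max()
    return m + np.log(np.mean(np.exp(x - m)))

# IWAE objective estimate: log (1/K) sum_k w_k
L_hat = logmeanexp(log_w)

# Leave-one-out baselines: for each k, the bound recomputed without
# sample k. Subtracting them yields the per-sample learning signal that
# multiplies the score function, reducing the estimator's variance.
baselines = np.array([logmeanexp(np.delete(log_w, k)) for k in range(K)])
signal = L_hat - baselines
print(signal)
```

Each entry of `signal` is large when the corresponding sample contributes much to the bound, which is what lets the baseline cancel most of the variance of the raw score-function estimator.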
LAVAE: Disentangling Location and Appearance
Dittadi, Andrea, Winther, Ole
We propose a probabilistic generative model for unsupervised learning of structured, interpretable, object-based representations of visual scenes. We use amortized variational inference to train the generative model end-to-end. The learned representations of object location and appearance are fully disentangled, and objects are represented independently of each other in the latent space. Unlike previous approaches that disentangle location and appearance, ours generalizes seamlessly to scenes with many more objects than encountered in the training regime. We evaluate the proposed model on multi-MNIST and multi-dSprites data sets.

Many hallmarks of human intelligence rely on the capability to perceive the world as a layout of distinct physical objects that endure through time, a skill that infants acquire in early childhood (Spelke, 1990; 2013; Spelke and Kinzler, 2007). Learning compositional, object-based representations of visual scenes, however, is still regarded as an open challenge for artificial systems (Bengio et al., 2013; Garnelo and Shanahan, 2019). Recently, there has been growing interest in unsupervised learning of disentangled representations (Locatello et al., 2018), which should separate the distinct, informative factors of variation in the data and contain all the information on the data in a compact, interpretable structure (Bengio et al., 2013). This notion is highly relevant in the context of visual scene representation learning, where distinct objects should arguably be represented in a disentangled fashion.