 Watters, Nicholas


Modeling Human Eye Movements with Neural Networks in a Maze-Solving Task

arXiv.org Artificial Intelligence

From smoothly pursuing moving objects to rapidly shifting gazes during visual search, humans employ a wide variety of eye movement strategies in different contexts. While eye movements provide a rich window into mental processes, building generative models of eye movements is notoriously difficult, and to date the computational objectives guiding eye movements remain largely a mystery. In this work, we tackled these problems in the context of a canonical spatial planning task, maze-solving. We collected eye movement data from human subjects and built deep generative models of eye movements using a novel differentiable architecture for gaze fixations and gaze shifts. We found that human eye movements are best predicted by a model that is optimized not to perform the task as efficiently as possible but instead to run an internal simulation of an object traversing the maze. This not only provides a generative model of eye movements in this task but also suggests a computational theory for how humans solve the task, namely that humans use mental simulation.
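
To make the mental-simulation account concrete, here is a minimal Python sketch (an illustration, not the paper's differentiable architecture): an object is imagined rolling along the solved maze path while a smoothed gaze point pursues it. The grid encoding, BFS solver, and smoothing constant are assumptions for illustration only.

    import numpy as np
    from collections import deque

    def solve_maze(grid, start, goal):
        # Breadth-first search over a 0/1 occupancy grid (0 = open cell).
        h, w = grid.shape
        prev, frontier = {start: None}, deque([start])
        while frontier:
            cell = frontier.popleft()
            if cell == goal:
                break
            r, c = cell
            for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                if (0 <= nxt[0] < h and 0 <= nxt[1] < w
                        and grid[nxt] == 0 and nxt not in prev):
                    prev[nxt] = cell
                    frontier.append(nxt)
        if goal not in prev:
            return [start]  # unreachable goal: stay put
        path, cell = [], goal
        while cell is not None:
            path.append(cell)
            cell = prev[cell]
        return path[::-1]

    def simulated_gaze(grid, start, goal, smoothing=0.7):
        # Fixation smoothly pursues the imagined object traversing the path,
        # rather than jumping straight to the exit: "mental simulation" gaze.
        gaze, trace = np.array(start, dtype=float), []
        for cell in solve_maze(grid, start, goal):
            gaze = smoothing * gaze + (1.0 - smoothing) * np.array(cell, dtype=float)
            trace.append(gaze.copy())
        return np.stack(trace)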


Modular Object-Oriented Games: A Task Framework for Reinforcement Learning, Psychology, and Neuroscience

arXiv.org Artificial Intelligence

In recent years, trends towards studying object-based games have gained momentum in the fields of artificial intelligence, cognitive science, psychology, and neuroscience. In artificial intelligence, interactive physical games are now a common testbed for reinforcement learning (François-Lavet et al., 2018; Leike et al., 2017; Mnih et al., 2013; Sutton and Barto, 2018) and object representations are of particular interest for sample-efficient and generalizable AI (Battaglia et al., 2018; Greff et al., 2020; van Steenkiste et al., 2019). In cognitive science and psychology, object-based games are used to study a variety of cognitive capacities, such as planning, intuitive physics, and intuitive psychology (Chabris, 2017; Ullman et al., 2017). Developmental psychologists also use object-based visual stimuli to probe questions about object-oriented reasoning in infants and young animals (Spelke and Kinzler, 2007; Wood et al., 2020). In neuroscience, object-based computer games have recently been used to study decision-making and physical reasoning in both human and non-human primates (Fischer et al., 2016; McDonald et al., 2019; Rajalingham et al., 2021; Yoo et al., 2020). Furthermore, a growing number of researchers are studying tasks using a combination of approaches from these fields.


A Heuristic for Unsupervised Model Selection for Variational Disentangled Representation Learning

arXiv.org Machine Learning

Disentangled representations have recently been shown to improve data efficiency, generalisation, robustness and interpretability in simple supervised and reinforcement learning tasks. To extend such results to more complex domains, it is important to address a major shortcoming of the current state-of-the-art unsupervised disentangling approaches -- high convergence variance, whereby the same model may achieve different disentanglement quality depending on its initial state. The existing model selection methods require access to ground truth attribute labels, which are not available for most datasets. Hence, the benefits of disentangled representations have not yet been fully explored in practical applications. This paper addresses this problem by introducing a simple yet robust and reliable method for unsupervised disentangled model selection. We show that our approach performs comparably to the existing supervised alternatives across 5400 models from six state-of-the-art unsupervised disentangled representation learning model classes.
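
The abstract leaves the heuristic itself unstated here; as a hedged sketch of selection in this spirit, one can score each trained model by how well its latent responses agree, up to a permutation of dimensions, with those of other random seeds, on the assumption that seeds which converged to a disentangled solution agree with each other more than seeds which did not. The encoder interface, Spearman similarity, and greedy matching below are illustrative assumptions, not the paper's exact procedure.

    import numpy as np
    from scipy.stats import spearmanr

    def pairwise_agreement(z_a, z_b):
        # z_a, z_b: (n_points, n_latents) latent responses of two models to the
        # same probe inputs; assumes both models have the same number of latents.
        p = z_a.shape[1]
        rho, _ = spearmanr(z_a, z_b)          # (2p, 2p) joint correlation matrix
        corr = np.abs(rho[:p, p:])            # cross-model block only
        score, used = 0.0, set()
        for i in np.argsort(-corr.max(axis=1)):   # greedy one-to-one matching
            j = int(np.argmax([c if k not in used else -1.0
                               for k, c in enumerate(corr[i])]))
            used.add(j)
            score += corr[i, j]
        return score / p

    def rank_models(latent_responses):
        # latent_responses: one (n_points, n_latents) array per trained model.
        n = len(latent_responses)
        scores = [np.mean([pairwise_agreement(latent_responses[i], latent_responses[j])
                           for j in range(n) if j != i]) for i in range(n)]
        return np.argsort(scores)[::-1]       # most self-consistent models first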


COBRA: Data-Efficient Model-Based RL through Unsupervised Object Discovery and Curiosity-Driven Exploration

arXiv.org Artificial Intelligence

Recent advances in deep reinforcement learning (RL) have shown remarkable success on challenging tasks (Andrychowicz et al., 2018; Mnih et al., 2015; Silver et al., 2016). However, data efficiency and robustness to new contexts remain persistent challenges for deep RL algorithms, especially when the goal is for agents to learn practical tasks with limited supervision. Drawing inspiration from self-supervised "play" in human development (Gopnik et al., 1999; Settles, 2011), we introduce an agent that learns object-centric representations of its environment without supervision and subsequently harnesses these to learn policies efficiently and robustly. Our agent, which we call Curious Object-Based seaRch Agent (COBRA), brings together three key ingredients: (i) learning representations of the world in terms of objects, (ii) curiosity-driven exploration, and (iii) model-based RL. The benefits of this synthesis are data efficiency and policy robustness. To put this into practice, we introduce the following technical contributions:
- A method for learning action-conditioned dynamics over slot-structured object-centric representations that requires no supervision and is trained from raw pixels.
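
A minimal sketch of the flavor of that contribution: an action-conditioned transition model applied slot-wise, so a single set of parameters predicts each object's next state from its slot vector and the action. The shapes and the shared two-layer MLP are illustrative assumptions, not COBRA's actual network.

    import numpy as np

    rng = np.random.default_rng(0)
    SLOT_DIM, ACTION_DIM, HIDDEN = 8, 2, 32
    W1 = rng.normal(0.0, 0.1, (SLOT_DIM + ACTION_DIM, HIDDEN))
    W2 = rng.normal(0.0, 0.1, (HIDDEN, SLOT_DIM))

    def transition(slots, action):
        # slots: (n_slots, SLOT_DIM); action: (ACTION_DIM,) -> next slot states.
        act = np.broadcast_to(action, (slots.shape[0], ACTION_DIM))
        x = np.concatenate([slots, act], axis=-1)  # condition every slot on action
        delta = np.maximum(x @ W1, 0.0) @ W2       # shared per-slot MLP
        return slots + delta                       # predict a residual slot change

    next_slots = transition(rng.normal(size=(4, SLOT_DIM)), np.array([1.0, 0.0]))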


MONet: Unsupervised Scene Decomposition and Representation

arXiv.org Machine Learning

The ability to decompose scenes in terms of abstract building blocks is crucial for general intelligence. Where those basic building blocks share meaningful properties, interactions and other regularities across scenes, such decompositions can simplify reasoning and facilitate imagination of novel scenarios. In particular, representing perceptual observations in terms of entities should improve data efficiency and transfer performance on a wide range of tasks. Thus we need models capable of discovering useful decompositions of scenes by identifying units with such regularities and representing them in a common format. To address this problem, we have developed the Multi-Object Network (MONet). In this model, a VAE is trained end-to-end together with a recurrent attention network -- in a purely unsupervised manner -- to provide attention masks around, and reconstructions of, regions of images. We show that this model is capable of learning to decompose and represent challenging 3D scenes into semantically meaningful components, such as objects and background elements.
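
The recursion at the heart of this decomposition is compact enough to sketch: at each step the attention network claims a share of the still-unexplained image (the "scope"), and whatever remains is handed to the next step. Below, the learned attention network and component VAE are stood in for by a callable argument; only the scope bookkeeping is shown.

    import numpy as np

    def decompose(image, attend, n_slots):
        # image: (H, W); attend(image, scope) -> per-pixel alpha in [0, 1].
        scope = np.ones_like(image)        # fraction of the image left to explain
        masks = []
        for _ in range(n_slots - 1):
            alpha = attend(image, scope)
            masks.append(scope * alpha)    # region claimed by this component
            scope = scope * (1.0 - alpha)  # shrink the unexplained remainder
        masks.append(scope)                # final slot takes whatever is left
        return masks                       # masks sum to exactly 1 at every pixel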


Spatial Broadcast Decoder: A Simple Architecture for Learning Disentangled Representations in VAEs

arXiv.org Machine Learning

We present a simple neural rendering architecture that helps variational autoencoders (VAEs) learn disentangled representations. Instead of the deconvolutional network typically used in the decoder of VAEs, we tile (broadcast) the latent vector across space, concatenate fixed X- and Y-"coordinate" channels, and apply a fully convolutional network with 1x1 stride. This provides an architectural prior for dissociating positional from non-positional features in the latent distribution of VAEs, yet without providing any explicit supervision to this effect. We show that this architecture, which we term the Spatial Broadcast Decoder, improves disentangling, reconstruction accuracy, and generalization to held-out regions in data space. It provides a particularly dramatic benefit when applied to datasets with small objects. We also emphasize a method for visualizing learned latent spaces that helped us diagnose our models and may prove useful for others aiming to assess data representations. Finally, we show that the Spatial Broadcast Decoder is complementary to state-of-the-art (SOTA) disentangling techniques and, when incorporated, improves their performance.
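
The broadcast operation itself can be stated directly. The sketch below constructs the decoder input as described (the fully convolutional network that consumes it is omitted); the [-1, 1] coordinate range follows the usual convention and is otherwise an assumption.

    import numpy as np

    def spatial_broadcast(z, height, width):
        # z: (latent_dim,) -> (height, width, latent_dim + 2) decoder input.
        tiled = np.tile(z, (height, width, 1))            # broadcast the latents
        ys, xs = np.meshgrid(np.linspace(-1.0, 1.0, height),
                             np.linspace(-1.0, 1.0, width), indexing="ij")
        # Fixed coordinate channels let position be read out by 1x1 convolutions.
        return np.concatenate([tiled, ys[..., None], xs[..., None]], axis=-1)

    decoder_input = spatial_broadcast(np.random.randn(10), 64, 64)  # (64, 64, 12)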


Life-Long Disentangled Representation Learning with Cross-Domain Latent Homologies

Neural Information Processing Systems

Intelligent behaviour in the real world requires the ability to acquire new knowledge from an ongoing sequence of experiences while preserving and reusing past knowledge. We propose a novel algorithm for unsupervised representation learning from piecewise-stationary visual data: Variational Autoencoder with Shared Embeddings (VASE). Based on the Minimum Description Length principle, VASE automatically detects shifts in the data distribution and allocates spare representational capacity to new knowledge, while simultaneously protecting previously learnt representations from catastrophic forgetting. Our approach encourages the learnt representations to be disentangled, which imparts a number of desirable properties: VASE can deal sensibly with ambiguous inputs, it can enhance its own representations through imagination-based exploration, and most importantly, it exhibits semantically meaningful sharing of latents between different datasets. Compared to baselines with entangled representations, our approach is able to reason beyond surface-level statistics and perform semantically meaningful cross-domain inference.
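
As a hedged illustration of one ingredient named above, the sketch below flags a distribution shift when a running description cost (proxied here by per-batch reconstruction error) jumps well above its recent history. The window-and-threshold rule is an assumption for illustration, not VASE's actual Minimum Description Length criterion.

    import numpy as np

    def detect_shift(errors, window=50, factor=3.0):
        # errors: per-batch reconstruction errors observed so far, in order.
        # Returns True when the newest error jumps far above recent history,
        # signalling that spare representational capacity should be allocated.
        if len(errors) <= window:
            return False
        recent = np.asarray(errors[-window - 1:-1])
        return errors[-1] > recent.mean() + factor * recent.std()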


Visual Interaction Networks: Learning a Physics Simulator from Video

Neural Information Processing Systems

From just a glance, humans can make rich predictions about the future of a wide range of physical systems. On the other hand, modern approaches from engineering, robotics, and graphics are often restricted to narrow domains or require information about the underlying state. We introduce the Visual Interaction Network, a general-purpose model for learning the dynamics of a physical system from raw visual observations. Our model consists of a perceptual front-end based on convolutional neural networks and a dynamics predictor based on interaction networks. Through joint training, the perceptual front-end learns to parse a dynamic visual scene into a set of factored latent object representations. The dynamics predictor learns to roll these states forward in time by computing their interactions, producing a predicted physical trajectory of arbitrary length. We found that from just six input video frames the Visual Interaction Network can generate accurate future trajectories of hundreds of time steps on a wide range of physical systems. Our model can also be applied to scenes with invisible objects, inferring their future states from their effects on the visible objects, and can implicitly infer the unknown mass of objects. This work opens new opportunities for model-based decision-making and planning from raw sensory observations in complex physical environments.
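
A minimal sketch of an interaction-network-style dynamics core like the one described: a shared relation function is applied to every ordered pair of objects, its effects are aggregated per object, and a self-dynamics term is added. The linear parameterization and the four-number state layout are illustrative stand-ins for the paper's networks.

    import numpy as np

    rng = np.random.default_rng(0)
    STATE = 4                                      # e.g. (x, y, vx, vy) per object
    W_rel = rng.normal(0.0, 0.1, (2 * STATE, STATE))
    W_self = rng.normal(0.0, 0.1, (STATE, STATE))

    def predict_next(states):
        # states: (n_objects, STATE) -> states rolled forward one time step.
        n = states.shape[0]
        effects = np.zeros_like(states)
        for i in range(n):
            for j in range(n):
                if i != j:                         # relation on each ordered pair
                    pair = np.concatenate([states[i], states[j]])
                    effects[i] += np.maximum(pair @ W_rel, 0.0)
        return states + states @ W_self + effects  # self-dynamics + interactions

    trajectory = [rng.normal(size=(3, STATE))]
    for _ in range(100):                           # roll out an arbitrary horizon
        trajectory.append(predict_next(trajectory[-1]))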