McClelland, James L.
Emergent Systematic Generalization in a Situated Agent
Hill, Felix, Lampinen, Andrew, Schneider, Rosalia, Clark, Stephen, Botvinick, Matthew, McClelland, James L., Santoro, Adam
The question of whether deep neural networks are good at generalising beyond their immediate training experience is of critical importance for learning-based approaches to AI. Here, we demonstrate strong emergent systematic generalisation in a neural network agent and isolate the factors that support this ability. In environments ranging from a grid-world to a rich interactive 3D Unity room, we show that an agent can correctly exploit the compositional nature of a symbolic language to interpret never-seen-before instructions. We observe this capacity not only when instructions refer to object properties (colours and shapes) but also to verb-like motor skills (lifting and putting) and abstract modifying operations (negation). We identify three factors that can contribute to this facility for systematic generalisation: (a) the number of object/word experiences in the training set; (b) the invariances afforded by a first-person, egocentric perspective; and (c) the variety of visual input experienced by an agent that perceives the world actively over time. Thus, while neural nets trained in idealised or reduced situations may fail to exhibit a compositional or systematic understanding of their experience, this competence can readily emerge when, like human learners, they have access to many examples of richly varying, multi-modal observations as they learn.
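A simple way to probe this kind of systematicity is to hold out particular colour-shape combinations during training and test on exactly those: every word is familiar, but the combination is new. A minimal sketch of such a split (instruction strings and names are our own illustration, not the paper's code):

    from itertools import product

    colours = ["red", "blue", "green", "yellow"]
    shapes = ["ball", "box", "toothbrush", "hairbrush"]

    # Combinations the agent never sees together during training.
    held_out = {("red", "toothbrush"), ("blue", "ball")}

    train_instructions = [f"find a {c} {s}" for c, s in product(colours, shapes)
                          if (c, s) not in held_out]
    test_instructions = [f"find a {c} {s}" for c, s in held_out]

    assert not set(test_instructions) & set(train_instructions)
    print(len(train_instructions), "training instructions;",
          len(test_instructions), "held out for testing")

An agent that succeeds on the held-out instructions must be exploiting the compositional structure of the language rather than memorising whole phrases.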
Embedded Meta-Learning: Toward more flexible deep-learning models
Lampinen, Andrew K., McClelland, James L.
Toward the goal of more flexible deep-learning models, we propose a new class of challenges, and a class of architectures that can solve them. The challenges are meta-mappings, which involve systematically transforming task behaviors to adapt to new tasks zero-shot. We suggest that the key to meeting these challenges is representing the task being performed along with the computations used to perform it. We therefore draw inspiration from meta-learning and functional programming to propose a class of Embedded Meta-Learning (EML) architectures that represent both data and tasks in a shared latent space. EML architectures are applicable to any type of machine learning task, including supervised learning and reinforcement learning. We demonstrate the flexibility of these architectures by showing that they can perform meta-mappings, i.e. that they can exhibit zero-shot remapping of behavior to adapt to new tasks.
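As a rough sketch of the shared-latent-space idea (our own toy simplification in PyTorch, not the authors' architecture or code): data and tasks are embedded in the same space, the task embedding modulates the computation on the input, and a meta-mapping is then simply a learned function from task embeddings to task embeddings, applied zero-shot.

    import torch
    import torch.nn as nn

    class EMLSketch(nn.Module):
        """Toy embedded-meta-learning model: data and tasks share a latent space.
        A hypothetical simplification of the EML idea, not the authors' code."""
        def __init__(self, d_in, d_latent, d_out, n_tasks):
            super().__init__()
            self.encode = nn.Linear(d_in, d_latent)          # data -> latent
            self.task_emb = nn.Embedding(n_tasks, d_latent)  # task -> same latent
            self.meta_map = nn.Linear(d_latent, d_latent)    # task -> new task
            self.decode = nn.Linear(d_latent, d_out)

        def forward(self, x, task_id, meta=False):
            z_task = self.task_emb(task_id)
            if meta:                       # zero-shot: transform the task itself
                z_task = self.meta_map(z_task)
            z = self.encode(x) * z_task    # task embedding modulates processing
            return self.decode(z)

    model = EMLSketch(d_in=8, d_latent=16, d_out=2, n_tasks=4)
    y = model(torch.randn(5, 8), torch.tensor([1]), meta=True)
    print(y.shape)  # torch.Size([5, 2])

The point of sharing the space is that the same machinery that transforms data representations can also transform task representations.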
A mathematical theory of semantic development in deep neural networks
Saxe, Andrew M., McClelland, James L., Ganguli, Surya
An extensive body of empirical research has revealed remarkable regularities in the acquisition, organization, deployment, and neural representation of human semantic knowledge, thereby raising a fundamental conceptual question: what are the theoretical principles governing the ability of neural networks to acquire, organize, and deploy abstract knowledge by integrating across many individual experiences? We address this question by mathematically analyzing the nonlinear dynamics of learning in deep linear networks. We find exact solutions to this learning dynamics that yield a conceptual explanation for the prevalence of many disparate phenomena in semantic cognition, including the hierarchical differentiation of concepts through rapid developmental transitions, the ubiquity of semantic illusions between such transitions, the emergence of item typicality and category coherence as factors controlling the speed of semantic processing, changing patterns of inductive projection over development, and the conservation of semantic similarity in neural representations across species. Thus, surprisingly, our simple neural model qualitatively recapitulates many diverse regularities underlying semantic development, while providing analytic insight into how the statistical structure of an environment can interact with nonlinear deep learning dynamics to give rise to these regularities.
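For reference, the exact solutions take a simple closed form: decomposing the input-output correlation matrix by SVD, the strength a(t) of the network's connectivity mode for a dimension with singular value s evolves independently (in a two-layer linear network with learning time constant \tau and small initial strength a_0) as

    a(t) = \frac{s\, e^{2st/\tau}}{e^{2st/\tau} - 1 + s/a_0}

a sigmoidal trajectory whose rise time scales roughly as \tau/s. Structure with large singular values is therefore learned first, which is what produces the stage-like hierarchical differentiation of concepts described above.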
One-shot and few-shot learning of word embeddings
Lampinen, Andrew K., McClelland, James L.
Standard deep learning systems require thousands or millions of examples to learn a concept, and cannot integrate new concepts easily. By contrast, humans have an incredible ability to do one-shot or few-shot learning. For instance, from just hearing a word used in a sentence, humans can infer a great deal about it, by leveraging what the syntax and semantics of the surrounding words tell us. Here, we draw inspiration from this to highlight a simple technique by which deep recurrent networks can similarly exploit their prior knowledge to learn a useful representation for a new word from little data. This could make natural language processing systems much more flexible, by allowing them to learn continually from the new words they encounter.
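A minimal sketch of the technique (our own simplification in PyTorch: a trained network is frozen, and gradient descent updates only the new word's embedding row on the handful of sentences containing it):

    import torch

    def learn_new_word(model, emb, new_word_id, batches, loss_fn,
                       steps=50, lr=0.1):
        """Fit only a new word's embedding; all prior knowledge stays frozen.
        Assumes `emb` is the input embedding layer inside `model`."""
        for p in model.parameters():
            p.requires_grad_(False)
        emb.weight.requires_grad_(True)
        opt = torch.optim.SGD([emb.weight], lr=lr)
        for _ in range(steps):
            for inputs, targets in batches:
                opt.zero_grad()
                loss_fn(model(inputs), targets).backward()
                # Keep the gradient for the new word's row only.
                mask = torch.zeros_like(emb.weight.grad)
                mask[new_word_id] = 1.0
                emb.weight.grad.mul_(mask)
                opt.step()

Because the rest of the network is frozen, the syntax and semantics of the surrounding words act as the supervisory signal that places the new word sensibly within the existing embedding space.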
Exact solutions to the nonlinear dynamics of learning in deep linear neural networks
Saxe, Andrew M., McClelland, James L., Ganguli, Surya
Despite the widespread practical success of deep learning methods, our theoretical understanding of the dynamics of learning in deep neural networks remains quite sparse. We attempt to bridge the gap between the theory and practice of deep learning by systematically analyzing learning dynamics for the restricted case of deep linear neural networks. Despite the linearity of their input-output map, such networks have nonlinear gradient descent dynamics on weights that change with the addition of each new hidden layer. We show that deep linear networks exhibit nonlinear learning phenomena similar to those seen in simulations of nonlinear networks, including long plateaus followed by rapid transitions to lower error solutions, and faster convergence from greedy unsupervised pretraining initial conditions than from random initial conditions. We provide an analytical description of these phenomena by finding new exact solutions to the nonlinear dynamics of deep learning. Our theoretical analysis also reveals the surprising finding that as the depth of a network approaches infinity, learning speed can nevertheless remain finite: for a special class of initial conditions on the weights, very deep networks incur only a finite, depth-independent delay in learning speed relative to shallow networks. We show that, under certain conditions on the training data, unsupervised pretraining can find this special class of initial conditions, while scaled random Gaussian initializations cannot. We further exhibit a new class of random orthogonal initial conditions on weights that, like unsupervised pretraining, enjoys depth-independent learning times. We further show that these initial conditions also lead to faithful propagation of gradients even in deep nonlinear networks, as long as they operate in a special regime known as the edge of chaos.
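The random orthogonal initialisation mentioned at the end is straightforward to construct; one standard recipe (a sketch, not necessarily the paper's exact procedure) takes the Q factor of a QR decomposition of a Gaussian matrix:

    import numpy as np

    def random_orthogonal(n, gain=1.0, seed=None):
        """Random orthogonal weight matrix via QR of a Gaussian matrix.
        Sign-correcting with R's diagonal makes the draw uniform (Haar)."""
        rng = np.random.default_rng(seed)
        a = rng.standard_normal((n, n))
        q, r = np.linalg.qr(a)
        q *= np.sign(np.diag(r))   # fix the signs so the distribution is Haar
        return gain * q

    W = random_orthogonal(256)
    print(np.allclose(W.T @ W, np.eye(256)))  # True: W is orthogonal

For nonlinear networks, multiplying by a suitable gain is what places the network in the edge-of-chaos regime the abstract refers to.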
Information Factorization in Connectionist Models of Perception
Movellan, Javier R., McClelland, James L.
We examine a psychophysical law that describes the influence of stimulus and context on perception. According to this law, choice probability ratios factorize into components independently controlled by stimulus and context. It has been argued that this pattern of results is incompatible with feedback models of perception. In this paper we examine this claim using neural network models defined via stochastic differential equations. We show that the law is related to a condition named channel separability and has little to do with the existence of feedback connections. In essence, channels are separable if they converge into the response units without direct lateral connections to other channels and if their sensors are not directly contaminated by external inputs to the other channels. Implications of the analysis for cognitive and computational neuroscience are discussed.
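In symbols (notation ours), with stimulus s, context c, and response alternatives r_i and r_j, the law states that

    \frac{P(r_i \mid s, c)}{P(r_j \mid s, c)} = \alpha_{ij}(s)\,\beta_{ij}(c)

i.e. the choice odds split into one factor controlled only by the stimulus and one controlled only by the context.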
A Hippocampal Model of Recognition Memory
O'Reilly, Randall C., Norman, Kenneth A., McClelland, James L.
A rich body of data exists showing that recollection of specific information makes an important contribution to recognition memory, which is distinct from the contribution of familiarity, and is not adequately captured by existing unitary memory models. Furthermore, neuropsychological evidence indicates that recollection is subserved by the hippocampus. We present a model, based largely on known features of hippocampal anatomy and physiology, that accounts for the following key characteristics of recollection: 1) false recollection is rare (i.e., participants rarely claim to recollect having studied nonstudied items), and 2) increasing interference leads to less recollection but apparently does not compromise the quality of recollection (i.e., the extent to which recollected information veridically reflects events that occurred at study).
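A toy illustration of the pattern-separation intuition behind such models (our own sketch, not the authors' simulation): sparse conjunctive codes over many units make spurious near-complete matches to nonstudied items vanishingly rare, so a high recollection threshold yields few false alarms even as the number of stored traces grows.

    import numpy as np

    rng = np.random.default_rng(0)
    DIM, ACTIVE = 1000, 20            # sparse codes: 2% of units active

    def sparse_code():
        v = np.zeros(DIM)
        v[rng.choice(DIM, ACTIVE, replace=False)] = 1.0
        return v

    studied = [sparse_code() for _ in range(100)]
    lure = sparse_code()              # a nonstudied item

    def recollected(probe, memory, threshold=0.9):
        # "Recollect" only if some stored trace overlaps almost completely.
        return any(probe @ m / ACTIVE >= threshold for m in memory)

    print(recollected(studied[0], studied))  # True: studied item recollected
    print(recollected(lure, studied))        # False: sparse codes rarely collide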
Learning Sequential Structure in Simple Recurrent Networks
Servan-Schreiber, David, Cleeremans, Axel, McClelland, James L.
The network uses the pattern of activation over a set of hidden units from time-step t-1, together with element t, to predict element t+1. When the network is trained with strings from a particular finite-state grammar, it can learn to be a perfect finite-state recognizer for the grammar. Cluster analyses of the hidden-layer patterns of activation showed that they encode prediction-relevant information about the entire path traversed through the network. We illustrate the phases of learning with cluster analyses performed at different points during training. Several connectionist architectures that are explicitly constrained to capture sequential information have been developed. Examples are Time Delay Networks (e.g.
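The architecture in question is Elman's simple recurrent network, in which the hidden activations from time-step t-1 are fed back as context alongside the current element t. A minimal PyTorch sketch (hyperparameters and names are ours):

    import torch
    import torch.nn as nn

    class SRN(nn.Module):
        """Elman-style simple recurrent network: predict element t+1 from
        element t plus the hidden state carried over from time-step t-1."""
        def __init__(self, n_symbols, n_hidden=16):
            super().__init__()
            self.rnn = nn.RNN(n_symbols, n_hidden, batch_first=True)
            self.readout = nn.Linear(n_hidden, n_symbols)

        def forward(self, onehot_seq, h0=None):
            h_all, h_last = self.rnn(onehot_seq, h0)  # hidden units over time
            return self.readout(h_all), h_all         # next-element logits

    # One-hot strings generated from a finite-state grammar would go here.
    net = SRN(n_symbols=7)
    x = torch.eye(7)[torch.randint(0, 7, (1, 12))]    # a random length-12 string
    logits, hidden = net(x)
    print(logits.shape)  # torch.Size([1, 12, 7]): a prediction at every step

Cluster analyses like those described above would operate on the hidden activations (hidden) collected while the trained network processes grammar strings.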