Hermann, Karl Moritz, Hill, Felix, Green, Simon, Wang, Fumin, Faulkner, Ryan, Soyer, Hubert, Szepesvari, David, Czarnecki, Wojciech Marian, Jaderberg, Max, Teplyashin, Denis, Wainwright, Marcus, Apps, Chris, Hassabis, Demis, Blunsom, Phil
We are increasingly surrounded by artificially intelligent technology that takes decisions and executes actions on our behalf. This creates a pressing need for general means to communicate with, instruct and guide artificial agents, with human language the most compelling means for such communication. To achieve this in a scalable fashion, agents must be able to relate language to the world and to actions; that is, their understanding of language must be grounded and embodied. However, learning grounded language is a notoriously challenging problem in artificial intelligence research. Here we present an agent that learns to interpret language in a simulated 3D environment where it is rewarded for the successful execution of written instructions. Trained via a combination of reinforcement and unsupervised learning, and beginning with minimal prior knowledge, the agent learns to relate linguistic symbols to emergent perceptual representations of its physical surroundings and to pertinent sequences of actions. The agent's comprehension of language extends beyond its prior experience, enabling it to apply familiar language to unfamiliar situations and to interpret entirely novel instructions. Moreover, the speed with which this agent learns new words increases as its semantic knowledge grows. This facility for generalising and bootstrapping semantic knowledge indicates the potential of the present approach for reconciling ambiguous natural language with the complexity of the physical world.
Abstract-- Understanding and following directions provided by humans can enable robots to navigate effectively in unknown situations. FollowNet processes instructions using an attention mechanism conditioned on its visual and depth input to focus on the relevant parts of the command while performing the navigation task. Deep reinforcement learning (RL) a sparse reward learns simultaneously the state representation, the attention function, and control policies. We evaluate our agent on a dataset of complex natural language directions that guide the agent through a rich and realistic dataset of simulated homes. We show that the FollowNet agent learns to execute previously unseen instructions described with a similar vocabulary, and successfully navigates along paths not encountered during training. The agent shows 30% improvement over a baseline model without the attention mechanism, with 52% success rate at novel instructions. Humans often navigate unknown environments by observing their surroundings and following directions. These directions consist predominantly of landmarks and directional instructions and other common words.
Developing systems that can execute symbolic, language-like instructions in the physical world is a long-standing challenge for Artificial Intelligence. Previous attempts to replicate human-like grounded language understanding involved hard-coding linguistic and physical principles, which is notoriously laborious and difficult to scale. Here we show that a simple neural-network based agent without any hard-coded knowledge can exploit general-purpose learning algorithms to infer the meaning of sequential symbolic instructions as they pertain to a simulated 3D world. Beginning with no prior knowledge, the agents learn the meaning of concrete nouns, adjectives, more abstract relational predicates and longer, order-dependent, sequences of symbols. The agent naturally generalises predicates to unfamiliar objects, and can interpret word combinations (phrases) that it has never seen before.
Neural network-based systems can now learn to locate the referents of words and phrases in images, answer questions about visual scenes, and execute symbolic instructions as first-person actors in partially-observable worlds. To achieve this so-called grounded language learning, models must overcome challenges that infants face when learning their first words. While it is notable that models with no meaningful prior knowledge overcome these obstacles, researchers currently lack a clear understanding of how they do so, a problem that we attempt to address in this paper. For maximum control and generality, we focus on a simple neural network-based language learning agent, trained via policy-gradient methods, which can interpret single-word instructions in a simulated 3D world. Whilst the goal is not to explicitly model infant word learning, we take inspiration from experimental paradigms in developmental psychology and apply some of these to the artificial agent, exploring the conditions under which established human biases and learning effects emerge. We further propose a novel method for visualising semantic representations in the agent.
While deep reinforcement learning techniques have led to agents that are successfully able to learn to perform a number of tasks that had been previously unlearnable, these techniques are still susceptible to the longstanding problem of reward sparsity. This is especially true for tasks such as training an agent to play StarCraft II, a real-time strategy game where reward is only given at the end of a game which is usually very long. While this problem can be addressed through reward shaping, such approaches typically require a human expert with specialized knowledge. Inspired by the vision of enabling reward shaping through the more-accessible paradigm of natural-language narration, we develop a technique that can provide the benefits of reward shaping using natural language commands. Our narration-guided RL agent projects sequences of natural-language commands into the same high-dimensional representation space as corresponding goal states. We show that we can get improved performance with our method compared to traditional reward-shaping approaches. Additionally, we demonstrate the ability of our method to generalize to unseen natural-language commands.