Reinforcement Learning
Repeated Inverse Reinforcement Learning
Amin, Kareem, Jiang, Nan, Singh, Satinder
We introduce a novel repeated Inverse Reinforcement Learning problem: the agent has to act on behalf of a human in a sequence of tasks and wishes to minimize the number of tasks in which it surprises the human by acting suboptimally with respect to how the human would have acted. Each time the human is surprised, the human provides the agent with a demonstration of the desired behavior. We formalize this problem, including how the sequence of tasks is chosen, in a few different ways and provide some foundational results.
Adaptive coordination of working-memory and reinforcement learning in non-human primates performing a trial-and-error problem solving task
Viejo, Guillaume, Girard, Benoît, Procyk, Emmanuel, Khamassi, Mehdi
Accumulating evidence suggests that human behavior in trial-and-error learning tasks based on decisions between discrete actions may involve a combination of reinforcement learning (RL) and working memory (WM). While understanding the brain activity at stake in this type of task often involves comparison with non-human primate neurophysiological results, it is not clear whether monkeys use similar combined RL and WM processes to solve these tasks. Here we analyzed the behavior of five monkeys with computational models combining RL and WM. Our model-based analysis approach makes it possible to fit not only trial-by-trial choices but also transient slowdowns in reaction times, indicative of WM use. We found that the behavior of the five monkeys was better explained by a combination of RL and WM, despite inter-individual differences. The same coordination dynamics we used in a previous study in humans best explained the behavior of some monkeys, while the behavior of others showed the opposite pattern, revealing possibly different dynamics of the WM process. We further analyzed different variants of the tested models to open a discussion on how the long pretraining in these tasks may have favored particular coordination dynamics between RL and WM. This points towards either inter-species differences or protocol differences, which could be further tested in humans.
Shallow Updates for Deep Reinforcement Learning
Levine, Nir, Zahavy, Tom, Mankowitz, Daniel J., Tamar, Aviv, Mannor, Shie
Deep reinforcement learning (DRL) methods such as the Deep Q-Network (DQN) have achieved state-of-the-art results in a variety of challenging, high-dimensional domains. This success is mainly attributed to the power of deep neural networks to learn rich domain representations for approximating the value function or policy. Batch reinforcement learning methods with linear representations, on the other hand, are more stable and require less hyperparameter tuning. Yet, substantial feature engineering is necessary to achieve good results. In this work we propose a hybrid approach -- the Least Squares Deep Q-Network (LS-DQN), which combines rich feature representations learned by a DRL algorithm with the stability of a linear least squares method. We do this by periodically re-training the last hidden layer of a DRL network with a batch least squares update. Key to our approach is a Bayesian regularization term for the least squares update, which prevents over-fitting to the more recent data. We tested LS-DQN on five Atari games and demonstrated significant improvement over vanilla DQN and Double-DQN. We also investigated the reasons for the superior performance of our method. Interestingly, we found that the performance improvement can be attributed to the large batch size used by the LS method when optimizing the last layer.
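The regularized last-layer update described in this abstract can be illustrated with a small sketch. It is not the authors' implementation: it assumes the Bayesian regularization term takes the form of a Gaussian prior centered on the network's current last-layer weights, which gives a ridge-style closed-form solution. The function name `regularized_ls_update` and the parameter `lam` are hypothetical, chosen only for this illustration.

```python
import numpy as np


def regularized_ls_update(features, targets, prior_weights, lam=1.0):
    """Least squares fit of last-layer weights, shrunk toward prior_weights.

    Assumption (not taken from the abstract): the regularizer is
    lam * ||w - prior_weights||^2, so the minimizer solves
        (Phi^T Phi + lam * I) w = Phi^T y + lam * prior_weights.
    """
    phi = np.asarray(features, dtype=float)      # shape (N, d): last hidden layer activations
    y = np.asarray(targets, dtype=float)         # shape (N,): regression targets, e.g. Q-targets
    w0 = np.asarray(prior_weights, dtype=float)  # shape (d,): current last-layer weights
    d = phi.shape[1]
    gram = phi.T @ phi + lam * np.eye(d)
    rhs = phi.T @ y + lam * w0
    # Solve the linear system directly rather than forming an explicit inverse.
    return np.linalg.solve(gram, rhs)
```

Under this assumption, a large `lam` keeps the solution close to the weights learned by gradient descent, while a small `lam` lets the batch data dominate.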
On- and Off-Policy Monotonic Policy Improvement
Monotonic policy improvement and off-policy learning are two main desirable properties for reinforcement learning algorithms. In this paper, by lower bounding the performance difference of two policies, we show that monotonic policy improvement is guaranteed from on- and off-policy mixture samples. An optimization procedure which applies the proposed bound can be regarded as an off-policy natural policy gradient method. In order to support the theoretical result, we provide a trust region policy optimization method using experience replay as a naive application of our bound, and evaluate its performance on two classical benchmark problems.
Learning Hard Alignments with Variational Inference
Lawson, Dieterich, Chiu, Chung-Cheng, Tucker, George, Raffel, Colin, Swersky, Kevin, Jaitly, Navdeep
There has recently been significant interest in hard attention models for tasks such as object recognition, visual captioning and speech recognition. Hard attention can offer benefits over soft attention such as decreased computational cost, but training hard attention models can be difficult because of the discrete latent variables they introduce. Previous work used REINFORCE and Q-learning to approach these issues, but those methods can provide high-variance gradient estimates and be slow to train. In this paper, we tackle the problem of learning hard attention for a sequential task using variational inference methods, specifically the recently introduced VIMCO and NVIL. Furthermore, we propose a novel baseline that adapts VIMCO to this setting. We demonstrate our method on a phoneme recognition task in clean and noisy environments and show that our method outperforms REINFORCE, with the difference being greater for a more complicated task.
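For readers unfamiliar with VIMCO, the sketch below shows the standard leave-one-out learning signal for a multi-sample objective. It is the generic VIMCO construction, not the novel baseline this abstract proposes, whose details are not given here. The function names and the input `log_weights` (one log importance weight per sample) are assumptions made for illustration.

```python
import numpy as np


def logmeanexp(x):
    """Numerically stable log of the mean of exp(x)."""
    m = np.max(x)
    return m + np.log(np.mean(np.exp(x - m)))


def vimco_learning_signals(log_weights):
    """Return the multi-sample bound and one learning signal per sample.

    For sample i, the baseline is the bound recomputed with that sample's
    log weight replaced by the arithmetic mean of the other log weights
    (that is, the geometric mean of the other weights).
    """
    lw = np.asarray(log_weights, dtype=float)
    k = lw.shape[0]
    if k < 2:
        raise ValueError("at least two samples are needed for a leave-one-out baseline")
    total = logmeanexp(lw)
    signals = np.empty(k)
    for i in range(k):
        replaced = lw.copy()
        replaced[i] = np.delete(lw, i).mean()
        signals[i] = total - logmeanexp(replaced)
    return total, signals
```

Each signal depends on the other samples only through the baseline, so a sample that scores above its peers receives a positive signal and one that scores below receives a negative one.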
Integrating Knowledge Representation, Reasoning, and Learning for Human-Robot Interaction
Sridharan, Mohan (The University of Auckland)
Robots interacting with humans often have to represent and reason with different descriptions of incomplete domain knowledge and uncertainty, and revise this knowledge over time. Towards achieving these capabilities, the architecture described in this paper combines the complementary strengths of declarative programming, probabilistic graphical models, and reinforcement learning. For any given goal, non-monotonic logical reasoning with a coarse-resolution representation of the domain is used to compute a tentative plan of abstract actions. Each abstract action is implemented as a sequence of concrete actions by reasoning probabilistically over the relevant part of a fine-resolution representation tightly coupled to the coarse-resolution representation. The outcomes of executing the concrete actions are used for subsequent reasoning at the coarse resolution. Furthermore, the task of interactively learning axioms governing action capabilities, preconditions, and effects is posed as a relational reinforcement learning problem, using decision tree regression and sampling to construct and generalize over candidate axioms. These capabilities are illustrated in simulation and on a physical robot moving objects to specific people or locations in an indoor domain.
An Integrated Computational Framework for Attention, Reinforcement Learning, and Working Memory
Stocco, Andrea (University of Washington)
This paper proposes a reinterpretation of selective attention as a form of control of working memory based on self-generated reward signals and model-free reinforcement learning. In addition to being simple and parsimonious, this approach systematizes a number of classic psychological constructs without calling for additional, specific mechanisms. Finally, the paper presents the results of an empirical test of this framework, and elaborates on the implications of our findings for general models of control and intelligent behavior, as well as neurobiological models of the basal ganglia.
A Framework Using Machine Vision and Deep Reinforcement Learning for Self-Learning Moving Objects in a Virtual Environment
Wu, Richard (University of Massachusetts Dartmouth) | Zhao, Ying (Naval Postgraduate School) | Clarke, Alan (Naval Postgraduate School) | Kendall, Anthony (Naval Postgraduate School)
In recent artificial intelligence (AI) research, convolutional neural networks (CNNs) have been used to create artificial agents capable of self-learning. Self-learning autonomous moving objects utilize machine vision techniques based on processing and recognizing objects in digital images. Afterwards, deep reinforcement learning (Deep-RL) is applied to understand and learn intelligent actions and controls. The objective of our research is to study how machine vision and deep machine learning algorithms can be implemented in a virtual world (e.g., a computer game) for moving objects (e.g., vehicles or aircraft) to improve their navigation and detection of threats in real life. In this paper, we create a framework for generating data from computer games and using it in CNNs and Deep-RL to perform intelligent actions. We show the initial results of applying the framework and identify various military applications that may benefit from this research.
Toward Supervised Reinforcement Learning with Partial States for Social HRI
Senft, Emmanuel (Plymouth University) | Lemaignan, Séverin (Plymouth University) | Baxter, Paul (University of Lincoln) | Belpaeme, Tony (Plymouth University)
Social interaction is a complex task for which machine learning holds particular promise. However, as no sufficiently accurate simulator of human interactions exists today, the learning of social interaction strategies has to happen online in the real world. Actions executed by the robot affect humans, and as such have to be carefully selected, making it impossible to rely on random exploration. Additionally, no clear reward function exists for social interactions. This implies that traditional approaches used for Reinforcement Learning cannot be directly applied to learning how to interact with the social world. As such, we argue that robots will profit from human expertise and guidance to learn social interactions. However, as the quantity of input a human can provide is limited, new methods have to be designed to use human input more efficiently. In this paper we describe a setup in which we combine a framework called Supervised Progressively Autonomous Robot Competencies (SPARC), which allows safer online learning with Reinforcement Learning, with the use of partial states rather than full states to accelerate generalisation and obtain a usable action policy more quickly.