Reinforcement Learning
Sequence Tutor: Conservative Fine-Tuning of Sequence Generation Models with KL-control
Jaques, Natasha, Gu, Shixiang, Bahdanau, Dzmitry, Hernández-Lobato, José Miguel, Turner, Richard E., Eck, Douglas
This paper proposes a general method for improving the structure and quality of sequences generated by a recurrent neural network (RNN), while maintaining information originally learned from data, as well as sample diversity. An RNN is first pre-trained on data using maximum likelihood estimation (MLE), and the probability distribution over the next token in the sequence learned by this model is treated as a prior policy. Another RNN is then trained using reinforcement learning (RL) to generate higher-quality outputs that account for domain-specific incentives while retaining proximity to the prior policy of the MLE RNN. To formalize this objective, we derive novel off-policy RL methods for RNNs from KL-control. The effectiveness of the approach is demonstrated on two applications: 1) generating novel musical melodies, and 2) computational molecular generation. For both problems, we show that the proposed method improves the desired properties and structure of the generated sequences, while maintaining information learned from data.
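The trade-off described in this abstract, collecting task reward while staying close to a prior policy, can be illustrated with a small single-step sketch. It is not the paper's algorithm: the function names, the toy numbers, and the temperature-like constant `c` are assumptions made for illustration. The sketch evaluates the KL-regularized objective E[r]/c - KL(policy || prior) and uses the standard closed-form maximizer, policy proportional to prior * exp(r / c).

```python
import math


def kl_divergence(policy, prior):
    """KL(policy || prior) for two discrete distributions over the same actions."""
    return sum(q * math.log(q / p) for q, p in zip(policy, prior) if q > 0.0)


def kl_regularized_objective(policy, prior, rewards, c=1.0):
    """Expected task reward scaled by 1/c, minus the KL penalty to the prior.

    `c` is an assumed trade-off constant: larger values weight the prior more.
    """
    expected_reward = sum(q * r for q, r in zip(policy, rewards))
    return expected_reward / c - kl_divergence(policy, prior)


def optimal_policy(prior, rewards, c=1.0):
    """Closed-form maximizer of the single-step objective: prior * exp(r / c), normalized."""
    weights = [p * math.exp(r / c) for p, r in zip(prior, rewards)]
    total = sum(weights)
    return [w / total for w in weights]


# Toy example (hypothetical numbers): two next-token choices, only the first is rewarded.
prior = [0.5, 0.5]
rewards = [1.0, 0.0]
tilted = optimal_policy(prior, rewards)
greedy = [1.0, 0.0]
```

In this toy case the tilted policy shifts probability toward the rewarded token without collapsing onto it, which is the sense in which the prior preserves sample diversity; the greedy policy earns more reward but pays a larger KL penalty.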
Two AIs Go Head-to-Head on Atari's 'Breakout' to Test Deep Learning
It seems like every day brings a new AI more capable than the last. This was recently apparent with AlphaGo--it was pretty great at beating Breakout, then Google got involved and soon it was capable of beating the world's leading Go champion. To do this, AlphaGo uses what is known as 'deep reinforcement learning'. For example, in Breakout, it will take raw image frames of the game as it's being played. Whether or not the ball is hitting the bricks in those frames will decide whether or not positive reinforcement is registered.
Manifold Regularization for Kernelized LSTD
Yan, Xinyan, Choromanski, Krzysztof, Boots, Byron, Sindhwani, Vikas
Policy evaluation or value function or Q-function approximation is a key procedure in reinforcement learning (RL). It is a necessary component of policy iteration and can be used for variance reduction in policy gradient methods. Therefore its quality has a significant impact on most RL algorithms. Motivated by manifold regularized learning, we propose a novel kernelized policy evaluation method that takes advantage of the intrinsic geometry of the state space learned from data, in order to achieve better sample efficiency and higher accuracy in Q-function approximation. Applying the proposed method in the Least-Squares Policy Iteration (LSPI) framework, we observe superior performance compared to widely used parametric basis functions on two standard benchmarks in terms of policy quality.
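For readers unfamiliar with least-squares policy evaluation, the sketch below shows plain LSTD with linear features on a two-state toy problem. It is not the kernelized, manifold-regularized method proposed in the abstract; the function name `lstd`, the `ridge` parameter, and the toy transitions are assumptions chosen for illustration.

```python
import numpy as np


def lstd(phi, phi_next, rewards, gamma=0.9, ridge=0.0):
    """Plain LSTD: solve A w = b with A = Phi^T (Phi - gamma * Phi') and b = Phi^T r.

    phi, phi_next: feature matrices for the visited and successor states,
    one row per observed transition. `ridge` adds optional L2 regularization.
    """
    phi = np.asarray(phi, dtype=float)
    phi_next = np.asarray(phi_next, dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    a = phi.T @ (phi - gamma * phi_next) + ridge * np.eye(phi.shape[1])
    b = phi.T @ rewards
    return np.linalg.solve(a, b)


# Toy chain (hypothetical): state 0 moves to state 1 with reward 1,
# and state 1 loops on itself with reward 0. Features are one-hot.
phi = [[1.0, 0.0], [0.0, 1.0]]
phi_next = [[0.0, 1.0], [0.0, 1.0]]
rewards = [1.0, 0.0]
weights = lstd(phi, phi_next, rewards, gamma=0.9)
```

With one-hot features the learned weights are the state values themselves, so the result can be checked by hand: the value of state 1 is 0 and the value of state 0 is 1.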
Estimating Dynamic Treatment Regimes in Mobile Health Using V-learning
Luckett, Daniel J., Laber, Eric B., Kahkoska, Anna R., Maahs, David M., Mayer-Davis, Elizabeth, Kosorok, Michael R.
The vision for precision medicine is to use individual patient characteristics to inform a personalized treatment plan that leads to the best healthcare possible for each patient. Mobile technologies have an important role to play in this vision as they offer a means to monitor a patient's health status in real-time and subsequently to deliver interventions if, when, and in the dose that they are needed. Dynamic treatment regimes formalize individualized treatment plans as sequences of decision rules, one per stage of clinical intervention, that map current patient information to a recommended treatment. However, existing methods for estimating optimal dynamic treatment regimes are designed for a small number of fixed decision points occurring on a coarse time-scale. We propose a new reinforcement learning method for estimating an optimal treatment regime that is applicable to data collected using mobile technologies in an outpatient setting. The proposed method accommodates an indefinite time horizon and minute-by-minute decision making that are common in mobile health applications. We show the proposed estimators are consistent and asymptotically normal under mild conditions. The proposed methods are applied to estimate an optimal dynamic treatment regime for controlling blood glucose levels in patients with type 1 diabetes.
video-friday-spoon-robotic-creatures-ros-industrial-machine-knitting
Deep reinforcement learning (DRL) provides a model-agnostic approach to control complex dynamical systems, but has not been shown to scale to high-dimensional dexterous manipulation. Furthermore, deployment of DRL on physical systems remains challenging due to sample inefficiency. In this work, we show that model-free DRL with natural policy gradients can effectively scale up to complex manipulation tasks with a high-dimensional 24-DoF hand, and solve them from scratch in simulated experiments. We demonstrate successful policies for multiple complex tasks: object relocation, in-hand manipulation, tool use, and door opening.
May the Best AI Win: Artificial Intelligence Learns Sumo Wrestling (VIDEO)
RoboSumo, one of the latest OpenAI experiments in machine learning, involves a pair of 'robots' dropped into a virtual arena without even the knowledge necessary to walk, and forced to learn the tricks of sumo wrestling purely by trial and error. The video posted on YouTube shows how the bots initially clash without employing any tactics or strategy, but after a number of bouts their movements start to resemble those of human wrestlers, as they learn to dodge and attack. According to Wired, OpenAI researchers created RoboSumo because the competition apparently generated extra complexity which "could allow faster progress than just giving reinforcement learning software more complex problems to solve alone." "When you interact with other agents you have to adapt; if you don't you'll lose," Maruan Al-Shedivat, one of the RoboSumo creators, said.
Unsupervised Real-Time Control through Variational Empowerment
Karl, Maximilian, Soelch, Maximilian, Becker-Ehmck, Philip, Benbouzid, Djalel, van der Smagt, Patrick, Bayer, Justin
We introduce a methodology for efficiently computing a lower bound to empowerment, allowing it to be used as an unsupervised cost function for policy learning in real-time control. Empowerment, being the channel capacity between actions and states, maximises the influence of an agent on its near future. It has been shown to be a good model of biological behaviour in the absence of an extrinsic goal. But empowerment is also prohibitively hard to compute, especially in nonlinear continuous spaces. We introduce an efficient, amortised method for learning empowerment-maximising policies. We demonstrate that our algorithm can reliably handle continuous dynamical systems using system dynamics learned from raw data. The resulting policies consistently drive the agents into states where they can use their full potential.
Sparse Markov Decision Processes with Causal Sparse Tsallis Entropy Regularization for Reinforcement Learning
Lee, Kyungjae, Choi, Sungjoon, Oh, Songhwai
Markov decision processes (MDPs) have been widely used as a mathematical framework to solve stochastic sequential decision problems, such as autonomous driving [1], path planning [2], and quadrotor control [3]. In general, the goal of an MDP is to find the optimal policy function which maximizes the expected return. The expected return is a performance measure of a policy function and it is often defined as the expected sum of discounted rewards. An MDP is often used to formulate reinforcement learning (RL) [4], which aims to find the optimal policy without the explicit specification of stochasticity of an environment, and inverse reinforcement learning (IRL) [5], whose goal is to search for the proper reward function that can explain the behavior of an expert who follows the underlying optimal policy. Because the optimal solution of an MDP is a deterministic policy, it is not desirable to apply an MDP to problems with multiple optimal actions. From the perspective of RL, the knowledge of multiple optimal actions makes it possible to cope with unexpected situations. For example, suppose that an autonomous vehicle has multiple optimal routes to reach a given goal. If a traffic accident occurs on the currently selected optimal route, it is possible to avoid the accident by choosing another safe optimal route without additional computation of a new optimal route.
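The "sum of discounted rewards" mentioned above can be computed for a single sampled trajectory as in the following minimal sketch. The function name and the default discount factor are assumptions for illustration; the expected return would be the average of this quantity over many trajectories drawn under the policy.

```python
def discounted_return(rewards, gamma=0.99):
    """Return sum over t of gamma**t * rewards[t] for one trajectory.

    Accumulating from the last reward backwards avoids computing powers of gamma.
    """
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Three steps with reward 1 each, discount 0.5: 1 + 0.5 + 0.25.
example = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
```

Setting `gamma=1.0` reduces the function to the plain undiscounted sum of rewards.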
Is Epicurus the father of Reinforcement Learning?
The Epicurean philosophy is commonly thought of as simplistic and hedonistic. Here I discuss how this is a misconception and explore its link to Reinforcement Learning. Based on the letters of Epicurus, I construct an objective function for hedonism which turns out to be equivalent to the Reinforcement Learning objective function when omitting the discount factor. I then discuss how Plato's and Aristotle's views can also be loosely linked to Reinforcement Learning, as well as their weaknesses in relation to it. Finally, I emphasise the close affinity of the Epicurean views and the Bellman equation.
Using Task Descriptions in Lifelong Machine Learning for Improved Performance and Zero-Shot Transfer
Isele, David, Rostami, Mohammad, Eaton, Eric
Knowledge transfer between tasks can improve the performance of learned models, but requires an accurate estimate of the inter-task relationships to identify the relevant knowledge to transfer. These inter-task relationships are typically estimated based on training data for each task, which is inefficient in lifelong learning settings where the goal is to learn each consecutive task rapidly from as little data as possible. To reduce this burden, we develop a lifelong learning method based on coupled dictionary learning that utilizes high-level task descriptions to model the inter-task relationships. We show that using task descriptors improves the performance of the learned task policies, providing both theoretical justification for the benefit and empirical demonstration of the improvement across a variety of learning problems. Given only the descriptor for a new task, the lifelong learner is also able to accurately predict a model for the new task through zero-shot learning using the coupled dictionary, eliminating the need to gather training data before addressing the task.