"Reinforcement learning is learning what to do – how to map situations to actions – so as to maximize a numerical reward signal. The learner is not told which actions to take, as in most forms of machine learning, but instead must discover which actions yield the most reward by trying them."
– Sutton, Richard S. and Andrew G. Barto. Reinforcement Learning: An Introduction. (1.1). MIT Press, Cambridge, MA, 1998.
Welcome to this series on reinforcement learning! We'll first start out by introducing the absolute basics to build a solid ground for us to run. We'll then progress onto more advanced and sophisticated topics that integrate artificial neural networks and deep learning into reinforcement learning. We'll also be getting our hands dirty by implementing some super cool reinforcement learning projects in code! Without further ado, let's get to it!
Sports betting is a popular past-time for many and a great use-case for an important concept known as dynamic programming that I'll introduce in this video. We'll go over concepts like value iteration, the markov decision process, and the bellman optimality principle, all to help create a system that will help US optimally bet on the winning hockey team in order to maximize profits. That's what keeps me going.
When it is not in our power to determine what is true, we ought to act in accordance with what is most probable. While neural networks are responsible for recent breakthroughs in problems like computer vision, machine translation and time series prediction – they can also combine with reinforcement learning algorithms to create something astounding like AlphaGo. Reinforcement learning refers to goal-oriented algorithms, which learn how to attain a complex objective (goal) or maximize along a particular dimension over many steps; for example, maximize the points won in a game over many moves. They can start from a blank slate, and under the right conditions they achieve superhuman performance. Like a child incentivized by spankings and candy, these algorithms are penalized when they make the wrong decisions and rewarded when they make the right ones – this is reinforcement.
Deep reinforcement learning techniques have demonstrated superior performance in a wide variety of environments. As improvements in training algorithms continue at a brisk pace, theoretical or empirical studies on understanding what these networks seem to learn, are far behind. In this paper we propose an interpretable neural network architecture for Q-learning which provides a global explanation of the model's behavior using key-value memories, attention and reconstructible embeddings. With a directed exploration strategy, our model can reach training rewards comparable to the state-of-the-art deep Q-learning models. However, results suggest that the features extracted by the neural network are extremely shallow and subsequent testing using out-of-sample examples shows that the agent can easily overfit to trajectories seen during training.
Reinforcement Learning methods are capable of solving complex problems, but resulting policies might perform poorly in environments that are even slightly different. In robotics especially, training and deployment conditions often vary and data collection is expensive, making retraining undesirable. Simulation training allows for feasible training times, but on the other hand suffers from a reality-gap when applied in real-world settings. This raises the need of efficient adaptation of policies acting in new environments. We consider this as a problem of transferring knowledge within a family of similar Markov decision processes. For this purpose we assume that Q-functions are generated by some low-dimensional latent variable. Given such a Q-function, we can find a master policy that can adapt given different values of this latent variable. Our method learns both the generative mapping and an approximate posterior of the latent variables, enabling identification of policies for new tasks by searching only in the latent space, rather than the space of all policies. The low-dimensional space, and master policy found by our method enables policies to quickly adapt to new environments. We demonstrate the method on both a pendulum swing-up task in simulation, and for simulation-to-real transfer on a pushing task.
As all physical adaptive quantum-enhanced metrology schemes operate under noisy conditions with only partially understood noise characteristics, so a practical control policy must be robust even for unknown noise. We aim to devise a test to evaluate the robustness of AQEM policies and assess the resource used by the policies. The robustness test is performed on adaptive phase estimation by simulating the scheme under four phase noise models corresponding to the normal-distribution noise, the random telegraph noise, the skew-normal-distribution noise, and the log-normal-distribution noise. The control policies are devised either by a reinforcement-learning algorithm in the same noise condition, albeit ignorant of its properties, or a Bayesian-based feedback method that assumes no noise. Our robustness test and resource comparison can be used to determining the efficacy and selecting a suitable policy.
Early detection of cyber-attacks is crucial for a safe and reliable operation of the smart grid. In the literature, outlier detection schemes making sample-by-sample decisions and online detection schemes requiring perfect attack models have been proposed. In this paper, we formulate the online attack/anomaly detection problem as a partially observable Markov decision process (POMDP) problem and propose a universal robust online detection algorithm using the framework of model-free reinforcement learning (RL) for POMDPs. Numerical studies illustrate the effectiveness of the proposed RL-based algorithm in timely and accurate detection of cyber-attacks targeting the smart grid.
Model-based reinforcement learning approaches carry the promise of being data efficient. However, due to challenges in learning dynamics models that sufficiently match the real-world dynamics, they struggle to achieve the same asymptotic performance as model-free methods. We propose Model-Based Meta-Policy-Optimization (MB-MPO), an approach that foregoes the strong reliance on accurate learned dynamics models. Using an ensemble of learned dynamic models, MB-MPO meta-learns a policy that can quickly adapt to any model in the ensemble with one policy gradient step. This steers the meta-policy towards internalizing consistent dynamics predictions among the ensemble while shifting the burden of behaving optimally w.r.t. the model discrepancies towards the adaptation step. Our experiments show that MB-MPO is more robust to model imperfections than previous model-based approaches. Finally, we demonstrate that our approach is able to match the asymptotic performance of model-free methods while requiring significantly less experience.
Unity Technologies on Monday released version 0.5 of its ML-Agents toolkit to make its Unity 3D game development platform better suited for developing and training autonomous agent code via machine learning. Initially rolled out a year ago in beta, version 0.5 comes with a few improvements. There's a wrapper for Gym (a toolkit for developing and testing reinforcement learning algorithms), support for letting agents make multiple action selections at once and for preventing agents from taking certain actions, and a refurbished set of environments called Marathon Environments. In these virtual spaces, AI researchers can teach software agents to perform certain tasks by rewarding them for correct actions. This sort of reinforcement learning can be limited to digital environments like video games or mapped to software-driven machines in the real world.
The reinforcement learning community has made great strides in designing algorithms capable of exceeding human performance on specific tasks. These algorithms are mostly trained one task at the time, each new task requiring to train a brand new agent instance. This means the learning algorithm is general, but each solution is not; each agent can only solve the one task it was trained on. In this work, we study the problem of learning to master not one but multiple sequential-decision tasks at once. A general issue in multi-task learning is that a balance must be found between the needs of multiple tasks competing for the limited resources of a single learning system. Many learning algorithms can get distracted by certain tasks in the set of tasks to solve. Such tasks appear more salient to the learning process, for instance because of the density or magnitude of the in-task rewards. This causes the algorithm to focus on those salient tasks at the expense of generality. We propose to automatically adapt the contribution of each task to the agent's updates, so that all tasks have a similar impact on the learning dynamics. This resulted in state of the art performance on learning to play all games in a set of 57 diverse Atari games. Excitingly, our method learned a single trained policy - with a single set of weights - that exceeds median human performance. To our knowledge, this was the first time a single agent surpassed human-level performance on this multi-task domain. The same approach also demonstrated state of the art performance on a set of 30 tasks in the 3D reinforcement learning platform DeepMind Lab.