Reinforcement Learning
Advantage Updating Applied to a Differential Game
Harmon, Mance E., Baird, Leemon C., III, Klopf, A. Harry
An application of reinforcement learning to a linear-quadratic differential game is presented. The reinforcement learning system uses a recently developed algorithm, the residual gradient form of advantage updating. The game is a Markov Decision Process (MDP) with continuous time, states, and actions, linear dynamics, and a quadratic cost function. The game consists of two players, a missile and a plane; the missile pursues the plane and the plane evades the missile. The reinforcement learning algorithm for optimal control is modified for differential games in order to find the minimax point, rather than the maximum. Simulation results are compared to the optimal solution, demonstrating that the simulated reinforcement learning system converges to the optimal answer. The performance of both the residual gradient and non-residual gradient forms of advantage updating and Q-learning is compared. The results show that advantage updating converges faster than Q-learning in all simulations.
Adaptive Load Balancing: A Study in Multi-Agent Learning
Schaerf, A., Shoham, Y., Tennenholtz, M.
We study the process of multi-agent reinforcement learning in the context of load balancing in a distributed system, without use of either central coordination or explicit communication. We first define a precise framework in which to study adaptive load balancing, important features of which are its stochastic nature and the purely local information available to individual agents. Given this framework, we show illuminating results on the interplay between basic adaptive behavior parameters and their effect on system efficiency. We then investigate the properties of adaptive load balancing in heterogeneous populations, and address the issue of exploration vs. exploitation in that context. Finally, we show that naive use of communication may not improve, and might even harm, system efficiency.
Truncating Temporal Differences: On the Efficient Implementation of TD(lambda) for Reinforcement Learning
Temporal difference (TD) methods constitute a class of methods for learning predictions in multi-step prediction problems, parameterized by a recency factor lambda. Currently the most important application of these methods is to temporal credit assignment in reinforcement learning. Well known reinforcement learning algorithms, such as AHC or Q-learning, may be viewed as instances of TD learning. This paper examines the issues of the efficient and general implementation of TD(lambda) for arbitrary lambda, for use with reinforcement learning algorithms optimizing the discounted sum of rewards. The traditional approach, based on eligibility traces, is argued to suffer from both inefficiency and lack of generality. The TTD (Truncated Temporal Differences) procedure is proposed as an alternative, that indeed only approximates TD(lambda), but requires very little computation per action and can be used with arbitrary function representation methods. The idea from which it is derived is fairly simple and not new, but probably unexplored so far. Encouraging experimental results are presented, suggesting that using lambda > 0 with the TTD procedure allows one to obtain a significant learning speedup at essentially the same cost as usual TD(0) learning.
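The idea of truncating the TD(lambda) return after a fixed number of steps can be illustrated with a short sketch. This is a generic backward recursion for an m-step truncated lambda-return, written under the assumption of a discounted-reward setting; it is not taken from the paper, and the function name `truncated_lambda_return` and its argument layout are hypothetical.

```python
from typing import Sequence


def truncated_lambda_return(
    rewards: Sequence[float],
    next_values: Sequence[float],
    gamma: float,
    lam: float,
) -> float:
    """Return an m-step truncated lambda-return for the first step.

    rewards[k] is the reward at step t+k and next_values[k] is the current
    value estimate of the state reached after that step, for k = 0..m-1.
    (Hypothetical layout chosen for this illustration.)
    """
    if len(rewards) == 0 or len(rewards) != len(next_values):
        raise ValueError("rewards and next_values must be non-empty and equal in length")

    # The last step in the window bootstraps fully from the value estimate.
    z = rewards[-1] + gamma * next_values[-1]

    # Walk backward through the window, mixing the longer return (weight lam)
    # with the one-step bootstrap (weight 1 - lam).
    for r, v in zip(reversed(rewards[:-1]), reversed(next_values[:-1])):
        z = r + gamma * (lam * z + (1.0 - lam) * v)
    return z
```

With lam = 0 this reduces to the one-step TD(0) target, and with lam = 1 it becomes the m-step discounted return plus a bootstrapped tail, so the window length bounds the computation per action.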
Temporal Difference Learning of Position Evaluation in the Game of Go
Schraudolph, Nicol N., Dayan, Peter, Sejnowski, Terrence J.
Computational Neurobiology Laboratory, The Salk Institute for Biological Studies, San Diego, CA 92186-5800. Abstract: The game of Go has a high branching factor that defeats the tree search approach used in computer chess, and long-range spatiotemporal interactions that make position evaluation extremely difficult. Development of conventional Go programs is hampered by their knowledge-intensive nature. We demonstrate a viable alternative by training networks to evaluate Go positions via temporal difference (TD) learning. Our approach is based on network architectures that reflect the spatial organization of both input and reinforcement signals on the Go board, and training protocols that provide exposure to competent (though unlabelled) play. These techniques yield far better performance than undifferentiated networks trained by self-play alone. A network with less than 500 weights learned within 3,000 games of 9x9 Go a position evaluation function that enables a primitive one-ply search to defeat a commercial Go program at a low playing level. 1 INTRODUCTION: Go was developed three to four millennia ago in China; it is the oldest and one of the most popular board games in the world.
The Parti-Game Algorithm for Variable Resolution Reinforcement Learning in Multidimensional State-Spaces
Parti-game is a new algorithm for learning from delayed rewards in high dimensional real-valued state-spaces. In high dimensions it is essential that learning does not explore or plan over state space uniformly. Parti-game maintains a decision-tree partitioning of state-space and applies game-theory and computational geometry techniques to efficiently and reactively concentrate high resolution only on critical areas. Many simulated problems have been tested, ranging from 2-dimensional to 9-dimensional state-spaces, including mazes, path planning, nonlinear dynamics, and uncurling snake robots in restricted spaces. In all cases, a good solution is found in less than twenty trials and a few minutes. 1 REINFORCEMENT LEARNING: Reinforcement learning [Samuel, 1959, Sutton, 1984, Watkins, 1989, Barto et al., 1991] is a promising method for control systems to program and improve themselves.
Foraging in an Uncertain Environment Using Predictive Hebbian Learning
Montague, P. Read, Dayan, Peter, Sejnowski, Terrence J.
Survival is enhanced by an ability to predict the availability of food, the likelihood of predators, and the presence of mates. We present a concrete model that uses diffuse neurotransmitter systems to implement a predictive version of a Hebb learning rule embedded in a neural architecture based on anatomical and physiological studies on bees. The model captured the strategies seen in the behavior of bees and a number of other animals when foraging in an uncertain environment. The predictive model suggests a unified way in which neuromodulatory influences can be used to bias actions and control synaptic plasticity. Successful predictions enhance adaptive behavior by allowing organisms to prepare for future actions, rewards, or punishments. Moreover, it is possible to improve upon behavioral choices if the consequences of executing different actions can be reliably predicted. Although classical and instrumental conditioning results from the psychological literature [1] demonstrate that the vertebrate brain is capable of reliable prediction, how these predictions are computed in brains is not yet known. The brains of vertebrates and invertebrates possess small nuclei which project axons throughout large expanses of target tissue and deliver various neurotransmitters such as dopamine, norepinephrine, and acetylcholine [4]. The activity in these systems may report on reinforcing stimuli in the world or may reflect an expectation of future reward [5, 6, 7, 8].
Convergence of Stochastic Iterative Dynamic Programming Algorithms
Jaakkola, Tommi, Jordan, Michael I., Singh, Satinder P.
Increasing attention has recently been paid to algorithms based on dynamic programming (DP) due to the suitability of DP for learning problems involving control. In stochastic environments where the system being controlled is only incompletely known, however, a unifying theoretical account of these methods has been missing. In this paper we relate DP-based learning algorithms to the powerful techniques of stochastic approximation via a new convergence theorem, enabling us to establish a class of convergent algorithms to which both TD(lambda) and Q-learning belong. 1 INTRODUCTION: Learning to predict the future and to find an optimal way of controlling it are the basic goals of learning systems that interact with their environment. A variety of algorithms are currently being studied for the purposes of prediction and control in incompletely specified, stochastic environments. Here we consider learning algorithms defined in Markov environments. There are actions or controls (u) available for the learner that affect both the state transition probabilities, and the probability distribution for the immediate, state dependent costs (Ci(u)) incurred by the learner.
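To make the link between Q-learning and stochastic approximation concrete, here is a minimal sketch of a single tabular Q-learning step in the cost-minimizing form suggested by the controls u and costs Ci(u) above. It is an illustration only, not the paper's formulation: the function name `q_learning_cost_update`, the nested-list table, and the discount factor `gamma` are assumptions introduced for this example.

```python
from typing import List


def q_learning_cost_update(
    q: List[List[float]],
    state: int,
    control: int,
    cost: float,
    next_state: int,
    alpha: float,
    gamma: float,
) -> float:
    """Apply one Q-learning step for a cost-minimizing learner.

    q[i][u] is the current estimate of the expected discounted cost of
    applying control u in state i. (Hypothetical table layout.)
    """
    # Sampled target: observed cost plus discounted best (lowest) future cost.
    target = cost + gamma * min(q[next_state])

    # Stochastic-approximation step: move the old estimate toward the
    # noisy target by a fraction alpha.
    q[state][control] = (1.0 - alpha) * q[state][control] + alpha * target
    return q[state][control]
```

Each call averages the old estimate with a noisy sample of the dynamic programming backup, which is the form that stochastic approximation arguments analyze. In practice the step size alpha is usually decreased over time for each state-control pair.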
Packet Routing in Dynamically Changing Networks: A Reinforcement Learning Approach
Boyan, Justin A., Littman, Michael L.
The field of reinforcement learning has grown dramatically over the past several years, but with the exception of backgammon [8, 2], has had few successful applications to large-scale, practical tasks. This paper demonstrates that the practical task of routing packets through a communication network is a natural application for reinforcement learning algorithms.