Reinforcement Learning
Generalization in Reinforcement Learning: Safely Approximating the Value Function
Boyan, Justin A., Moore, Andrew W.
Reinforcement learning-the problem of getting an agent to learn to act from sparse, delayed rewards-has been advanced by techniques based on dynamic programming (DP). These algorithms compute a value function which gives, for each state, the minimum possiblelong-term cost commencing in that state. For the high-dimensional and continuous state spaces characteristic of real-world control tasks, a discrete representation ofthe value function is intractable; some form of generalization is required. A natural way to incorporate generalization into DP is to use a function approximator, rather than a lookup table, to represent the value function. This approach, which dates back to uses of Legendre polynomials in DP [Bellman et al., 19631, has recently worked well on several dynamic control problems [Mahadevan and Connell, 1990, Lin, 1993] and succeeded spectacularly on the game of backgammon [Tesauro, 1992, Boyan, 1992].
Reinforcement Learning with Soft State Aggregation
Singh, Satinder P., Jaakkola, Tommi, Jordan, Michael I.
It is widely accepted that the use of more compact representations than lookup tables is crucial to scaling reinforcement learning (RL) algorithms to real-world problems. Unfortunately almost all of the theory of reinforcement learning assumes lookup table representations. Inthis paper we address the pressing issue of combining function approximation and RL, and present 1) a function approximator basedon a simple extension to state aggregation (a commonly used form of compact representation), namely soft state aggregation, 2) a theory of convergence for RL with arbitrary, but fixed, soft state aggregation, 3) a novel intuitive understanding of the effect of state aggregation on online RL, and 4) a new heuristic adaptive state aggregation algorithm that finds improved compact representations by exploiting the non-discrete nature of soft state aggregation. Preliminary empirical results are also presented.
Advantage Updating Applied to a Differential Game
Harmon, Mance E., III, Leemon C. Baird, Klopf, A. Harry
An application of reinforcement learning to a linear-quadratic, differential game is presented. The reinforcement learning system uses a recently developed algorithm, the residual gradient form of advantage updating. The game is a Markov Decision Process (MDP) with continuous time, states, and actions, linear dynamics, and a quadratic cost function. The game consists of two players, a missile and a plane; the missile pursues the plane and the plane evades the missile. The reinforcement learning algorithm for optimal control is modified for differential games in order to find the minimax point, rather than the maximum. Simulation results are compared to the optimal solution, demonstrating that the simulated reinforcement learning system converges to the optimal answer. The performance of both the residual gradient and non-residual gradient forms of advantage updating and Q-learning are compared. The results show that advantage updating converges faster than Q-learning in all simulations.
Reinforcement Learning Algorithm for Partially Observable Markov Decision Problems
Jaakkola, Tommi, Singh, Satinder P., Jordan, Michael I.
Increasing attention has been paid to reinforcement learning algorithms inrecent years, partly due to successes in the theoretical analysis of their behavior in Markov environments. If the Markov assumption is removed, however, neither generally the algorithms nor the analyses continue to be usable. We propose and analyze a new learning algorithm to solve a certain class of non-Markov decision problems. Our algorithm applies to problems in which the environment is Markov, but the learner has restricted access to state information. The algorithm involves a Monte-Carlo policy evaluationcombined with a policy improvement method that is similar to that of Markov decision problems and is guaranteed to converge to a local maximum. The algorithm operates in the space of stochastic policies, a space which can yield a policy that performs considerablybetter than any deterministic policy. Although the space of stochastic policies is continuous-even for a discrete action space-our algorithm is computationally tractable.
A Novel Reinforcement Model of Birdsong Vocalization Learning
Doya, Kenji, Sejnowski, Terrence J.
Songbirds learn to imitate a tutor song through auditory and motor learning. Wehave developed a theoretical framework for song learning that accounts for response properties of neurons that have been observed in many of the nuclei that are involved in song learning. Specifically, we suggest that the anteriorforebrain pathway, which is not needed for song production in the adult but is essential for song acquisition, provides synaptic perturbations and adaptive evaluations for syllable vocalization learning. A computer model based on reinforcement learning was constructed thatcould replicate a real zebra finch song with 90% accuracy based on a spectrographic measure.
Adaptive Load Balancing: A Study in Multi-Agent Learning
Schaerf, A., Shoham, Y., Tennenholtz, M.
We study the process of multi-agent reinforcement learning in the context ofload balancing in a distributed system, without use of either centralcoordination or explicit communication. We first define a precise frameworkin which to study adaptive load balancing, important features of which are itsstochastic nature and the purely local information available to individualagents. Given this framework, we show illuminating results on the interplaybetween basic adaptive behavior parameters and their effect on systemefficiency. We then investigate the properties of adaptive load balancing inheterogeneous populations, and address the issue of exploration vs.exploitation in that context. Finally, we show that naive use ofcommunication may not improve, and might even harm system efficiency.
Truncating Temporal Differences: On the Efficient Implementation of TD(lambda) for Reinforcement Learning
Temporal difference (TD) methods constitute a class of methods for learning predictions in multi-step prediction problems, parameterized by a recency factor lambda. Currently the most important application of these methods is to temporal credit assignment in reinforcement learning. Well known reinforcement learning algorithms, such as AHC or Q-learning, may be viewed as instances of TD learning. This paper examines the issues of the efficient and general implementation of TD(lambda) for arbitrary lambda, for use with reinforcement learning algorithms optimizing the discounted sum of rewards. The traditional approach, based on eligibility traces, is argued to suffer from both inefficiency and lack of generality. The TTD (Truncated Temporal Differences) procedure is proposed as an alternative, that indeed only approximates TD(lambda), but requires very little computation per action and can be used with arbitrary function representation methods. The idea from which it is derived is fairly simple and not new, but probably unexplored so far. Encouraging experimental results are presented, suggesting that using lambda > 0 with the TTD procedure allows one to obtain a significant learning speedup at essentially the same cost as usual TD(0) learning.
Robot Learning: Exploration and Continuous Domains
The goal of this workshop was to discuss two major issues: efficient exploration of a learner's state space, and learning in continuous domains. The common themes that emerged in presentations and in discussion were the importance of choosing one's domain assumptions carefully, mixing controllers/strategies, avoidance of catastrophic failure, new approaches with difficulties with reinforcement learning, and the importance of task transfer. He suggested that neither "fewer assumptions are better" nor "more assumptions are better" is a tenable position, and that we should strive to find and use standard sets of assumptions. With no such commonality, comparison of techniques and results is meaningless. Under Moore's guidance, the group discussed the possibility of designing an algorithm which used a number of well-chosen assumption sets and switched between them according to their empirical validity.
The Parti-Game Algorithm for Variable Resolution Reinforcement Learning in Multidimensional State-Spaces
Parti-game is a new algorithm for learning from delayed rewards in high dimensional real-valued state-spaces. In high dimensions it is essential that learning does not explore or plan over state space uniformly. Part i-game maintains a decision-tree partitioning of state-space and applies game-theory and computational geometry techniques to efficiently and reactively concentrate high resolution only on critical areas. Many simulated problems have been tested, ranging from 2-dimensional to 9-dimensional state-spaces, including mazes, path planning, nonlinear dynamics, and uncurling snake robots in restricted spaces. In all cases, a good solution is found in less than twenty trials and a few minutes. 1 REINFORCEMENT LEARNING Reinforcement learning [Samuel, 1959, Sutton, 1984, Watkins, 1989, Barto et al., 1991] is a promising method for control systems to program and improve themselves.