Reinforcement Learning
Finding Structure in Reinforcement Learning
Thrun, Sebastian, Schwartz, Anton
Reinforcement learning addresses the problem of learning to select actions in order to maximize one's performance in unknown environments. To scale reinforcement learning to complex real-world tasks, such as typically studied in AI, one must ultimately be able to discover the structure in the world, in order to abstract away the myriad of details and to operate in more tractable problem spaces. This paper presents the SKILLS algorithm. SKILLS discovers skills, which are partially defined action policies that arise in the context of multiple, related tasks.
An Actor/Critic Algorithm that is Equivalent to Q-Learning
Crites, Robert H., Barto, Andrew G.
We prove the convergence of an actor/critic algorithm that is equivalent to Q-Iearning by construction. Its equivalence is achieved by encoding Q-values within the policy and value function of the actor and critic. The resultant actor/critic algorithm is novel in two ways: it updates the critic only when the most probable action is executed from any given state, and it rewards the actor using criteria that depend on the relative probability of the action that was executed.
Reinforcement Learning Methods for Continuous-Time Markov Decision Problems
Bradtke, Steven J., Duff, Michael O.
Semi-Markov Decision Problems are continuous time generalizations of discrete time Markov Decision Problems. A number of reinforcement learning algorithms have been developed recently for the solution of Markov Decision Problems, based on the ideas of asynchronous dynamic programming and stochastic approximation. Among these are TD(,x), Q-Iearning, and Real-time Dynamic Programming. After reviewing semi-Markov Decision Problems and Bellman's optimality equation in that context, we propose algorithms similar to those named above, adapted to the solution of semi-Markov Decision Problems. We demonstrate these algorithms by applying them to the problem of determining the optimal control for a simple queueing system. We conclude with a discussion of circumstances under which these algorithms may be usefully applied.
Instance-Based State Identification for Reinforcement Learning
This paper presents instance-based state identification, an approach to reinforcement learning and hidden state that builds disambiguating amounts of short-term memory online, and also learns with an order of magnitude fewer training steps than several previous approaches. Inspired by a key similarity between learning with hidden state and learning in continuous geometrical spaces, this approach uses instance-based (or "memory-based") learning, a method that has worked well in continuous spaces. 1 BACKGROUND AND RELATED WORK When a robot's next course of action depends on information that is hidden from the sensors because of problems such as occlusion, restricted range, bounded field of view and limited attention, the robot suffers from hidden state. More formally, we say a reinforcement learning agent suffers from the hidden state problem if the agent's state representation is non-Markovian with respect to actions and utility. The hidden state problem arises as a case of perceptual aliasing: the mapping between states of the world and sensations of the agent is not one-to-one [Whitehead, 1992]. If the agent's perceptual system produces the same outputs for two world states in which different actions are required, and if the agent's state representation consists only of its percepts, then the agent will fail to choose correct actions.
Reinforcement Learning with Soft State Aggregation
Singh, Satinder P., Jaakkola, Tommi, Jordan, Michael I.
It is widely accepted that the use of more compact representations than lookup tables is crucial to scaling reinforcement learning (RL) algorithms to real-world problems. Unfortunately almost all of the theory of reinforcement learning assumes lookup table representations. In this paper we address the pressing issue of combining function approximation and RL, and present 1) a function approximator based on a simple extension to state aggregation (a commonly used form of compact representation), namely soft state aggregation, 2) a theory of convergence for RL with arbitrary, but fixed, soft state aggregation, 3) a novel intuitive understanding of the effect of state aggregation on online RL, and 4) a new heuristic adaptive state aggregation algorithm that finds improved compact representations by exploiting the non-discrete nature of soft state aggregation. Preliminary empirical results are also presented.
Advantage Updating Applied to a Differential Game
Harmon, Mance E., III, Leemon C. Baird, Klopf, A. Harry
An application of reinforcement learning to a linear-quadratic, differential game is presented. The reinforcement learning system uses a recently developed algorithm, the residual gradient form of advantage updating. The game is a Markov Decision Process (MDP) with continuous time, states, and actions, linear dynamics, and a quadratic cost function. The game consists of two players, a missile and a plane; the missile pursues the plane and the plane evades the missile. The reinforcement learning algorithm for optimal control is modified for differential games in order to find the minimax point, rather than the maximum. Simulation results are compared to the optimal solution, demonstrating that the simulated reinforcement learning system converges to the optimal answer. The performance of both the residual gradient and non-residual gradient forms of advantage updating and Q-learning are compared. The results show that advantage updating converges faster than Q-learning in all simulations.
Reinforcement Learning Algorithm for Partially Observable Markov Decision Problems
Jaakkola, Tommi, Singh, Satinder P., Jordan, Michael I.
Increasing attention has been paid to reinforcement learning algorithms in recent years, partly due to successes in the theoretical analysis of their behavior in Markov environments. If the Markov assumption is removed, however, neither generally the algorithms nor the analyses continue to be usable. We propose and analyze a new learning algorithm to solve a certain class of non-Markov decision problems. Our algorithm applies to problems in which the environment is Markov, but the learner has restricted access to state information. The algorithm involves a Monte-Carlo policy evaluation combined with a policy improvement method that is similar to that of Markov decision problems and is guaranteed to converge to a local maximum. The algorithm operates in the space of stochastic policies, a space which can yield a policy that performs considerably better than any deterministic policy. Although the space of stochastic policies is continuous-even for a discrete action space-our algorithm is computationally tractable.
A Novel Reinforcement Model of Birdsong Vocalization Learning
Doya, Kenji, Sejnowski, Terrence J.
Songbirds learn to imitate a tutor song through auditory and motor learning. We have developed a theoretical framework for song learning that accounts for response properties of neurons that have been observed in many of the nuclei that are involved in song learning. Specifically, we suggest that the anteriorforebrain pathway, which is not needed for song production in the adult but is essential for song acquisition, provides synaptic perturbations and adaptive evaluations for syllable vocalization learning. A computer model based on reinforcement learning was constructed that could replicate a real zebra finch song with 90% accuracy based on a spectrographic measure. The second generation of the birdsong model replicated the tutor song with 96% accuracy.