Reinforcement Learning
Learning Decision Theoretic Utilities through Reinforcement Learning
Stensmo, Magnus, Sejnowski, Terrence J.
Probability models can be used to predict outcomes and compensate for missing data, but even a perfect model cannot be used to make decisions unless the utility of the outcomes, or preferences between them, are also provided. This arises in many real-world problems, such as medical diagnosis, wherethe cost of the test as well as the expected improvement in the outcome must be considered. Relatively little work has been done on learning the utilities of outcomes for optimal decision making. In this paper, we show how temporal-difference reinforcement learning (TO(Aยป can be used to determine decision theoretic utilities within the context of a mixture model and apply this new approach to a problem in medical diagnosis. TO(A) learning of utilities reduces the number of tests that have to be done to achieve the same level of performance compared with the probability model alone, which results in significant cost savings and increased efficiency.
Why did TD-Gammon Work?
Pollack, Jordan B., Blair, Alan D.
Although TD-Gammon is one of the major successes in machine learning, it has not led to similar impressive breakthroughs in temporal difference We werelearning for other applications or even other games. Instead we apply simple hill-climbing in a relative fitness environment. These results and further analysis suggest of Tesauro's program had more to do with thethat the surprising success of the learning task and the dynamics of theco-evolutionary structure backgammon game itself. 1 INTRODUCTION It took great chutzpah for Gerald Tesauro to start wasting computer cycles on temporal of Backgammon (Tesauro, 1992). After all, the dream ofprogram play itself in the hopes computers mastering a domain by self-play or "introspection" had been around since the early days of AI, forming part of Samuel's checker player (Samuel, 1959) and used in Donald Michie's MENACE tictac-toe learner (Michie, 1961). However such self-conditioning or nonexistent internal representations, had generally beensystems, with weak of scale and abandoned by the field of AI.
Multi-Grid Methods for Reinforcement Learning in Controlled Diffusion Processes
The optimal control problem reduces to a boundary value problem for a fully nonlinear second-order elliptic differential equation of Hamilton Jacobi-Bellman (HJB-) type. Numerical analysis provides multigrid methodsfor this kind of equation. In the case of Learning Control, however,the systems of equations on the various grid-levels are obtained using observed information (transitions and local cost). To ensure consistency, special attention needs to be directed toward thetype of time and space discretization during the observation. Analgorithm for multi-grid observation is proposed.
Analytical Mean Squared Error Curves in Temporal Difference Learning
Singh, Satinder P., Dayan, Peter
We have calculated analytical expressions for how the bias and variance of the estimators provided by various temporal difference value estimation algorithms change with offline updates over trials in absorbing Markov chains using lookup table representations. We illustrate classes of learning curve behavior in various chains, and show the manner in which TD is sensitive to the choice of its stepsize andeligibility trace parameters. 1 INTRODUCTION A reassuring theory of asymptotic convergence is available for many reinforcement learning (RL) algorithms. What is not available, however, is a theory that explains the finite-term learning curve behavior of RL algorithms, e.g., what are the different kinds of learning curves, what are their key determinants, and how do different problem parameters effect rate of convergence. Answering these questions is crucial not only for making useful comparisons between algorithms, but also for developing hybrid and new RL methods. In this paper we provide preliminary answers to some of the above questions for the case of absorbing Markov chains, where mean square error between the estimated and true predictions is used as the quantity of interest in learning curves.
Approximate Solutions to Optimal Stopping Problems
Tsitsiklis, John N., Roy, Benjamin Van
We propose and analyze an algorithm that approximates solutions to the problem of optimal stopping in a discounted irreducible aperiodic Markovchain. The scheme involves the use of linear combinations offixed basis functions to approximate a Q-function. The weights of the linear combination are incrementally updated through an iterative process similar to Q-Iearning, involving simulation ofthe underlying Markov chain. Due to space limitations, we only provide an overview of a proof of convergence (with probability 1)and bounds on the approximation error. This is the first theoretical result that establishes the soundness of a Q-Iearninglike algorithmwhen combined with arbitrary linear function approximators tosolve a sequential decision problem.
Analysis of Temporal-Diffference Learning with Function Approximation
Tsitsiklis, John N., Roy, Benjamin Van
The algorithm weanalyze performs online updating of a parameter vector during a single endless trajectory of an aperiodic irreducible finite state Markov chain. Results include convergence (with probability 1), a characterization of the limit of convergence, and a bound on the resulting approximation error. In addition to establishing new and stronger results than those previously available, our analysis is based on a new line of reasoning that provides new intuition about the dynamics of temporal-difference learning. Furthermore, we discuss the implications of two counterexamples with regards to the Significance of online updating and linearly parameterized function approximators. 1 INTRODUCTION The problem of predicting the expected long-term future cost (or reward) of a stochastic dynamic system manifests itself in both time-series prediction and control. Anexample in time-series prediction is that of estimating the net present value of a corporation, as a discounted sum of its future cash flows, based on the current state of its operations. In control, the ability to predict long-term future cost as a function of state enables the ranking of alternative states in order to guide decision-making. Indeed, such predictions constitute the cost-to-go function that is central to dynamic programming and optimal control (Bertsekas, 1995). Temporal-difference learning, originally proposed by Sutton (1988), is a method for approximating long-term future cost as a function of current state.
On-line Policy Improvement using Monte-Carlo Search
Tesauro, Gerald, Galperin, Gregory R.
Policy iteration is known to have rapid and robust convergence properties, and for Markov tasks with lookup-table state-space representations, it is guaranteed to convergence to the optimal policy. Online Policy Improvement using Monte-Carlo Search 1069 In typical uses of policy iteration, the policy improvement step is an extensive off-line procedure. For example, in dynamic programming, one performs a sweep through all states in the state space. Reinforcement learning provides another approach topolicy improvement; recently, several authors have investigated using RL in conjunction with nonlinear function approximators to represent the value functions and/orpolicies (Tesauro, 1992; Crites and Barto, 1996; Zhang and Dietterich, 1996). These studies are based on following actual state-space trajectories rather than sweeps through the full state space, but are still too slow to compute improved policies in real time.
Exploiting Model Uncertainty Estimates for Safe Dynamic Control Learning
Model learning combined with dynamic programming has been shown to be effective for learning control of continuous state dynamic systems. The simplest method assumes the learned model is correct and applies dynamic programming to it, but many approximators provide uncertainty estimates on the fit. How can they be exploited? This paper addresses the case where the system must be prevented from having catastrophic failures during learning.We propose a new algorithm adapted from the dual control literature and use Bayesian locally weighted regression models with dynamic programming.A common reinforcement learning assumption is that aggressive exploration should be encouraged. This paper addresses the converse casein which the system has to reign in exploration.
Reinforcement Learning for Mixed Open-loop and Closed-loop Control
Hansen, Eric A., Barto, Andrew G., Zilberstein, Shlomo
Closed-loop control relies on sensory feedback that is usually assumed tobe free . But if sensing incurs a cost, it may be costeffective totake sequences of actions in open-loop mode. We describe a reinforcement learning algorithm that learns to combine open-loop and closed-loop control when sensing incurs a cost. Although weassume reliable sensors, use of open-loop control means that actions must sometimes be taken when the current state of the controlled system is uncertain. This is a special case of the hidden-state problem in reinforcement learning, and to cope, our algorithm relies on short-term memory.