Technology
Convergence of Stochastic Iterative Dynamic Programming Algorithms
Jaakkola, Tommi, Jordan, Michael I., Singh, Satinder P.
Increasing attention has recently been paid to algorithms based on dynamic programming (DP) due to the suitability of DP for learning problems involving control. In stochastic environments where the system being controlled is only incompletely known, however, a unifying theoretical account of these methods has been missing. In this paper we relate DP-based learning algorithms to the powerful techniques of stochastic approximation via a new convergence theorem, enabling us to establish a class of convergent algorithms to which both TD(λ) and Q-learning belong.
1 INTRODUCTION
Learning to predict the future and to find an optimal way of controlling it are the basic goals of learning systems that interact with their environment. A variety of algorithms are currently being studied for the purposes of prediction and control in incompletely specified, stochastic environments. Here we consider learning algorithms defined in Markov environments. There are actions or controls (u) available for the learner that affect both the state transition probabilities and the probability distribution for the immediate, state-dependent costs (c_i(u)) incurred by the learner.
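As a rough illustration of the kind of algorithm the abstract refers to, the following is a minimal sketch of tabular Q-learning written as a stochastic approximation iteration for a cost-minimizing Markov environment. It is not the authors' construction: the two-state example, the names `q_learning`, `exact_q`, and `sample_next`, the discount factor, and the step-size schedule n^(-0.7) (chosen only because it sums to infinity while its squares sum to a finite value) are all assumptions made for this example.

```python
import random

def sample_next(dist, rng):
    """Draw a next state from a list of (state, probability) pairs."""
    r, acc = rng.random(), 0.0
    for state, p in dist:
        acc += p
        if r < acc:
            return state
    return dist[-1][0]  # guard against floating-point round-off

def q_learning(P, C, gamma=0.5, steps=200_000, seed=0):
    """Tabular Q-learning with costs (the learner minimizes).

    P[i][u] lists (next_state, probability) pairs; C[i][u] is the
    immediate cost. Both are hypothetical inputs for this sketch.
    """
    rng = random.Random(seed)
    n = len(P)
    Q = [[0.0] * len(P[i]) for i in range(n)]
    visits = [[0] * len(P[i]) for i in range(n)]
    for _ in range(steps):
        # Assumption: state-action pairs are sampled uniformly, so every
        # pair is updated infinitely often in the limit.
        i = int(rng.random() * n)
        u = int(rng.random() * len(P[i]))
        j = sample_next(P[i][u], rng)
        visits[i][u] += 1
        alpha = visits[i][u] ** -0.7  # decaying step size
        target = C[i][u] + gamma * min(Q[j])
        Q[i][u] += alpha * (target - Q[i][u])
    return Q

def exact_q(P, C, gamma=0.5, sweeps=200):
    """Reference Q-values from value iteration with the true model."""
    n = len(P)
    Q = [[0.0] * len(P[i]) for i in range(n)]
    for _ in range(sweeps):
        Q = [[C[i][u] + gamma * sum(p * min(Q[j]) for j, p in P[i][u])
              for u in range(len(P[i]))] for i in range(n)]
    return Q

# Hypothetical two-state, two-action environment.
P = [
    [[(0, 0.8), (1, 0.2)], [(0, 0.1), (1, 0.9)]],
    [[(0, 0.5), (1, 0.5)], [(0, 0.3), (1, 0.7)]],
]
C = [[1.0, 2.0], [0.5, 0.3]]

if __name__ == "__main__":
    print("learned:", q_learning(P, C))
    print("exact:  ", exact_q(P, C))
```

The learned table can be compared against `exact_q`, which uses the true transition probabilities that the learner itself never sees.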
Convergence of Indirect Adaptive Asynchronous Value Iteration Algorithms
Gullapalli, Vijaykumar, Barto, Andrew G.
Reinforcement learning methods based on approximating dynamic programming (DP) are receiving increased attention due to their utility in forming reactive control policies for systems embedded in dynamic environments. Environments are usually modeled as controlled Markov processes, but when the environment model is not known a priori, adaptive methods are necessary. Adaptive control methods are often classified as being direct or indirect. Direct methods directly adapt the control policy from experience, whereas indirect methods adapt a model of the controlled process and compute control policies based on the latest model. In this paper we focus on indirect adaptive DP-based methods. We present a convergence result for indirect adaptive asynchronous value iteration algorithms for the case in which a lookup table is used to store the value function. Our result implies convergence of several existing reinforcement learning algorithms such as adaptive real-time dynamic programming (ARTDP) (Barto, Bradtke, & Singh, 1993) and prioritized sweeping (Moore & Atkeson, 1993). Although the emphasis of researchers studying DP-based reinforcement learning has been on direct adaptive methods such as Q-Learning (Watkins, 1989) and methods using TD algorithms (Sutton, 1988), it is not clear that these direct methods are preferable in practice to indirect methods such as those analyzed in this paper.
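The distinction between direct and indirect methods can be made concrete with a small sketch of an indirect scheme: the learner keeps transition counts as its model, and after each observed transition it backs up the value of the current state only, using the latest model estimate and a lookup table for the value function. This is an illustrative reading of the abstract, not the algorithm analyzed in the paper; the three-state example, the function names, the random exploration rule, and the assumption that immediate costs are known are all hypothetical.

```python
import random

def value_iteration(P, C, gamma, sweeps=500):
    """Reference solution computed with the true model P[i][u][j]."""
    n, m = len(P), len(P[0])
    V = [0.0] * n
    for _ in range(sweeps):
        V = [min(C[i][u] + gamma * sum(P[i][u][j] * V[j] for j in range(n))
                 for u in range(m)) for i in range(n)]
    return V

def indirect_adaptive_avi(P, C, gamma=0.8, steps=100_000, seed=0):
    """Asynchronous value iteration driven by an estimated model.

    P is used only to simulate the environment; the learner sees
    sampled transitions and keeps counts[i][u][j] as its model.
    """
    rng = random.Random(seed)
    n, m = len(P), len(P[0])
    counts = [[[0] * n for _ in range(m)] for _ in range(n)]
    V = [0.0] * n  # lookup-table value function
    i = 0
    for _ in range(steps):
        u = int(rng.random() * m)  # assumption: uniformly random exploration
        # Simulate one environment transition from (i, u).
        r, acc, j = rng.random(), 0.0, n - 1
        for k in range(n):
            acc += P[i][u][k]
            if r < acc:
                j = k
                break
        counts[i][u][j] += 1  # adapt the model
        # Back up state i only, using the current model estimate.
        backups = []
        for a in range(m):
            total = sum(counts[i][a])
            if total == 0:
                continue  # action not yet tried in this state
            expected = sum(counts[i][a][k] * V[k] for k in range(n)) / total
            backups.append(C[i][a] + gamma * expected)
        V[i] = min(backups)
        i = j
    return V

# Hypothetical three-state, two-action controlled Markov process.
P = [
    [[0.7, 0.2, 0.1], [0.1, 0.6, 0.3]],
    [[0.3, 0.4, 0.3], [0.2, 0.2, 0.6]],
    [[0.5, 0.3, 0.2], [0.1, 0.1, 0.8]],
]
C = [[1.0, 2.0], [0.5, 1.5], [2.0, 0.2]]

if __name__ == "__main__":
    print("adaptive:", indirect_adaptive_avi(P, C))
    print("exact:   ", value_iteration(P, C, 0.8))
```

Because every transition probability in the example is positive, random exploration visits each state repeatedly, which is the condition this sketch relies on for the model estimate and the values to settle.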
Monte Carlo Matrix Inversion and Reinforcement Learning
We describe the relationship between certain reinforcement learning (RL) methods based on dynamic programming (DP) and a class of unorthodox Monte Carlo methods for solving systems of linear equations proposed in the 1950s. These methods recast the solution of the linear system as the expected value of a statistic suitably defined over sample paths of a Markov chain. The significance of our observations lies in arguments (Curtiss, 1954) that these Monte Carlo methods scale better with respect to state-space size than do standard, iterative techniques for solving systems of linear equations. This analysis also establishes convergence rate estimates. Because methods used in RL systems for approximating the evaluation function of a fixed control policy also approximate solutions to systems of linear equations, the connection to these Monte Carlo methods establishes that algorithms very similar to TD algorithms (Sutton, 1988) are asymptotically more efficient in a precise sense than other methods for evaluating policies. Further, all DP-based RL methods have some of the properties of these Monte Carlo algorithms, which suggests that although RL is often perceived to be slow, for sufficiently large problems, it may in fact be more efficient than other known classes of methods capable of producing the same results.
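To show what "the expected value of a statistic over sample paths of a Markov chain" can look like, here is a minimal sketch of a random-walk estimator for a single component of the solution of x = Hx + b. It follows the general importance-weighted scheme associated with the 1950s Monte Carlo literature, but the specific choices (uniform transitions, a fixed stopping probability `stop`, the 3x3 example, and the names `mc_solve_component` and `iterate_solve`) are assumptions made for this example rather than details taken from the paper.

```python
import random

def mc_solve_component(H, b, i, n_paths=20_000, stop=0.2, seed=0):
    """Estimate x[i], where x = Hx + b, by averaging over random walks.

    Each walk starts in state i. At every step it stops with probability
    `stop`; otherwise it jumps to a uniformly chosen state. The weight
    corrects for the mismatch between H and the walk's probabilities.
    """
    rng = random.Random(seed)
    n = len(b)
    move_prob = (1.0 - stop) / n  # probability of each particular jump
    total = 0.0
    for _ in range(n_paths):
        state, weight, score = i, 1.0, 0.0
        while True:
            score += weight * b[state]
            if weight == 0.0 or rng.random() < stop:
                break  # later terms are zero, or the walk is absorbed
            nxt = int(rng.random() * n)
            weight *= H[state][nxt] / move_prob
            state = nxt
        total += score
    return total / n_paths

def iterate_solve(H, b, sweeps=200):
    """Standard fixed-point iteration x <- Hx + b, for comparison."""
    n = len(b)
    x = [0.0] * n
    for _ in range(sweeps):
        x = [b[i] + sum(H[i][j] * x[j] for j in range(n)) for i in range(n)]
    return x

# Hypothetical system; entries are small enough that the iteration
# converges and the estimator has finite variance.
H = [[0.1, 0.2, 0.0],
     [-0.3, 0.0, 0.1],
     [0.0, 0.2, 0.2]]
b = [1.0, 2.0, 0.5]

if __name__ == "__main__":
    print("monte carlo:", [mc_solve_component(H, b, i) for i in range(3)])
    print("iterative:  ", iterate_solve(H, b))
```

Note that the estimator produces one component at a time without touching the others, whereas the iterative solver must update the whole vector on every sweep; this is the kind of structural difference the scaling argument in the abstract concerns.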
Packet Routing in Dynamically Changing Networks: A Reinforcement Learning Approach
Boyan, Justin A., Littman, Michael L.
The field of reinforcement learning has grown dramatically over the past several years, but with the exception of backgammon [8, 2], has had few successful applications to large-scale, practical tasks. This paper demonstrates that the practical task of routing packets through a communication network is a natural application for reinforcement learning algorithms.
Using Local Trajectory Optimizers to Speed Up Global Optimization in Dynamic Programming
Dynamic programming provides a methodology to plan trajectories and design controllers and estimators for nonlinear systems. However, general dynamic programming is computationally intractable. We have developed procedures that allow more complex planning problems to be solved. We have modified the State Increment Dynamic Programming approach of Larson (1968) in several ways: 1. In State Increment DP, a constant action is integrated to form a trajectory segment from the center of a cell to its boundary. We use second-order local trajectory optimization (Differential Dynamic Programming) to generate an optimal trajectory and form an optimal policy in a tube surrounding the optimal trajectory within a cell. The trajectory segment and local policy are globally optimal, up to the resolution of the representation of the value function on the boundary of the cell.
Robust Reinforcement Learning in Motion Planning
Singh, Satinder P., Barto, Andrew G., Grupen, Roderic, Connolly, Christopher
While exploring to find better solutions, an agent performing online reinforcement learning (RL) can perform worse than is acceptable. In some cases, exploration might have unsafe, or even catastrophic, results, often modeled in terms of reaching 'failure' states of the agent's environment. This paper presents a method that uses domain knowledge to reduce the number of failures during exploration. This method formulates the set of actions from which the RL agent composes a control policy to ensure that exploration is conducted in a policy space that excludes most of the unacceptable policies. The resulting action set has a more abstract relationship to the task being solved than is common in many applications of RL. Although the cost of this added safety is that learning may result in a suboptimal solution, we argue that this is an appropriate tradeoff in many problems. We illustrate this method in the domain of motion planning.
*This work was done while the first author was finishing his Ph.D. in computer science at the University of Massachusetts, Amherst.
Exploiting Chaos to Control the Future
Flake, Gary W., Sun, Guo-Zhen, Lee, Yee-Chun
Recently, Ott, Grebogi and Yorke (OGY) [6] found an effective method to control chaotic systems to unstable fixed points by using only small control forces; however, OGY's method is based on and limited to a linear theory and requires considerable knowledge of the dynamics of the system to be controlled. In this paper we use two radial basis function networks: one as a model of an unknown plant and the other as the controller. The controller is trained with a recurrent learning algorithm to minimize a novel objective function such that the controller can locate an unstable fixed point and drive the system into the fixed point with no a priori knowledge of the system dynamics. Our results indicate that the neural controller offers many advantages over OGY's technique.
Synchronization, oscillations, and 1/f noise in networks of spiking neurons
Stemmler, Martin, Usher, Marius, Koch, Christof, Olami, Zeev
The model consists of a two-dimensional sheet of leaky integrate-and-fire neurons with feedback connectivity consisting of local excitation and surround inhibition. Each neuron is independently driven by homogeneous external noise. Spontaneous symmetry breaking occurs, resulting in the formation of "hotspots" of activity in the network. These localized patterns of excitation appear as clusters that coalesce, disintegrate, or fluctuate in size while simultaneously moving in a random walk constrained by the interaction with other clusters. The emergent cross-correlation functions have a dual structure, with a sharp peak around zero on top of a much broader hill.
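For readers unfamiliar with the building block of this model, the following is a minimal sketch of a single leaky integrate-and-fire unit, integrated with a forward Euler step. It deliberately omits everything that makes the paper's network interesting (the two-dimensional sheet, local excitation, surround inhibition, and the external noise); the function name `simulate_lif` and all parameter values are hypothetical.

```python
def simulate_lif(drive, t_max=100.0, dt=0.1, tau=10.0,
                 threshold=1.0, v_reset=0.0):
    """Return spike times (ms) of one leaky integrate-and-fire unit.

    The membrane variable v relaxes toward `drive` with time constant
    `tau` (ms). When v reaches `threshold`, a spike is recorded and v
    is reset. Assumption: constant, noise-free input.
    """
    v = v_reset
    spikes = []
    steps = int(round(t_max / dt))
    for k in range(steps):
        v += dt * (drive - v) / tau  # leaky integration (Euler step)
        if v >= threshold:
            spikes.append((k + 1) * dt)
            v = v_reset
    return spikes

if __name__ == "__main__":
    print("suprathreshold drive:", simulate_lif(1.5))
    print("subthreshold drive:  ", simulate_lif(0.5))
```

A drive above threshold yields regular firing, while a drive below threshold yields none; in the network described above, it is the noise and the recurrent connections that move units across this boundary.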
An Analog VLSI Model of Central Pattern Generation in the Leech
The biological network is small and relatively well understood, and the silicon model can therefore span three levels of organization in the leech nervous system (neuron, ganglion, system); it represents one of the first comprehensive models of leech swimming operating in real time. The circuit employs biophysically motivated analog neurons networked to form multiple biologically inspired silicon ganglia. These ganglia are coupled using known interganglionic connections. Thus the model retains the flavor of its biological counterpart, and though simplified, the output of the silicon circuit is similar to the output of the leech swim central pattern generator. The model operates on the same time and spatial scales as the leech nervous system and will provide an excellent platform with which to explore real-time adaptive locomotion in the leech and other "simple" invertebrate nervous systems.