Reinforcement Learning
On-Line Estimation of the Optimal Value Function: HJB-Estimators
In this paper, we discuss online estimation strategies that model the optimal value function of a typical optimal control problem. We present a general strategy that uses local corridor solutions obtained via dynamic programming to provide local optimal control sequence training data for a neural architecture model of the optimal value function.
Explanation-Based Neural Network Learning for Robot Control
Mitchell, Tom M., Thrun, Sebastian B.
How can artificial neural nets generalize better from fewer examples? In order to generalize successfully, neural network learning methods typically require large training data sets. We introduce a neural network learning method that generalizes rationally from many fewer data points, relying instead on prior knowledge encoded in previously learned neural networks. For example, in robot control learning tasks reported here, previously learned networks that model the effects of robot actions are used to guide subsequent learning of robot control functions. For each observed training example of the target function (e.g. the robot control policy), the learner explains the observed example in terms of its prior knowledge, then analyzes this explanation to infer additional information about the shape, or slope, of the target function. This shape knowledge is used to bias generalization when learning the target function. Results are presented applying this approach to a simulated robot task based on reinforcement learning.
Using Aperiodic Reinforcement for Directed Self-Organization During Development
Montague, P. R., Dayan, P., Nowlan, S. J., Pouget, A., Sejnowski, T. J.
We present a local learning rule in which Hebbian learning is conditional on an incorrect prediction of a reinforcement signal. We propose a biological interpretation of such a framework and display its utility through examples in which the reinforcement signal is cast as the delivery of a neuromodulator to its target. Three examples are presented which illustrate how this framework can be applied to the development of the oculomotor system. 1 INTRODUCTION Activity-dependent accounts of the self-organization of the vertebrate brain have relied ubiquitously on correlational (mainly Hebbian) rules to drive synaptic learning. In the brain, a major problem for any such unsupervised rule is that many different kinds of correlations exist at approximately the same time scales and each is effectively noise to the next. For example, relationships within and between the retinae among variables such as color, motion, and topography may mask one another and disrupt their appropriate segregation at the level of the thalamus or cortex.
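One possible reading of the rule in the abstract above is a Hebbian product of pre- and postsynaptic activity scaled by the reinforcement prediction error, so that a correctly predicted reinforcement leaves the weights unchanged. The sketch below is only an illustration under that assumption; the function name `gated_hebbian_step`, the linear postsynaptic unit, and the learning rate are hypothetical and are not taken from the paper.

```python
def gated_hebbian_step(weights, pre, reward, predicted_reward, lr=0.1):
    """One weight update: Hebbian term gated by the reinforcement prediction error.

    All names and the linear unit are illustrative assumptions.
    """
    # Postsynaptic activity of a simple linear unit.
    post = sum(w * x for w, x in zip(weights, pre))
    # The gate is zero when the reinforcement was predicted correctly.
    error = reward - predicted_reward
    # Hebbian product (pre * post) scaled by the prediction error.
    return [w + lr * error * x * post for w, x in zip(weights, pre)]
```

For instance, `gated_hebbian_step([0.5, 0.5], [1.0, 0.0], reward=1.0, predicted_reward=1.0)` returns the weights unchanged, while an unpredicted reward strengthens only the weight of the active input.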
Weight Space Probability Densities in Stochastic Learning: I. Dynamics and Equilibria
The ensemble dynamics of stochastic learning algorithms can be studied using theoretical techniques from statistical physics. We develop the equations of motion for the weight space probability densities for stochastic learning algorithms. We discuss equilibria in the diffusion approximation and provide expressions for special cases of the LMS algorithm. The equilibrium densities are not in general thermal (Gibbs) distributions in the objective function being minimized, but rather depend upon an effective potential that includes diffusion effects. Finally, we present an exact analytical expression for the time evolution of the density for a learning algorithm with weight updates proportional to the sign of the gradient.
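The ensemble view can be illustrated numerically. The sketch below simulates many independent copies of a one-dimensional LMS learner whose weight moves by a fixed step against the sign of the gradient, and returns the final weights as an empirical sample of the weight-space density. It is a Monte Carlo illustration, not the analytical result of the paper; the target weight `w_star`, the Gaussian inputs and noise, and all parameter values are assumptions.

```python
import random

def sign_lms_ensemble(n_learners=500, n_steps=200, step=0.05,
                      w_star=1.0, noise_sd=0.5, seed=0):
    """Simulate an ensemble of scalar sign-of-gradient LMS learners.

    Every parameter value here is an illustrative assumption.
    """
    rng = random.Random(seed)
    weights = [0.0] * n_learners
    for _ in range(n_steps):
        for i in range(n_learners):
            x = rng.gauss(0.0, 1.0)
            target = w_star * x + rng.gauss(0.0, noise_sd)
            error = target - weights[i] * x
            grad = -error * x  # gradient of 0.5 * error**2 w.r.t. the weight
            # Fixed-size step against the sign of the gradient.
            if grad > 0:
                weights[i] -= step
            elif grad < 0:
                weights[i] += step
    return weights
```

A histogram of the returned list approximates the density of the ensemble after `n_steps` updates; varying `n_steps` shows how that density evolves over time.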
Learning Control Under Extreme Uncertainty
A peg-in-hole insertion task is used as an example to illustrate the utility of direct associative reinforcement learning methods for learning control under real-world conditions of uncertainty and noise. Task complexity due to the use of an unchamfered hole and a clearance of less than 0.2 mm is compounded by the presence of positional uncertainty of magnitude exceeding 10 to 50 times the clearance. Despite this extreme degree of uncertainty, our results indicate that direct reinforcement learning can be used to learn a robust reactive control strategy that results in skillful peg-in-hole insertions.
Reinforcement Learning Applied to Linear Quadratic Regulation
Recent research on reinforcement learning has focused on algorithms based on the principles of Dynamic Programming (DP). One of the most promising areas of application for these algorithms is the control of dynamical systems, and some impressive results have been achieved. However, there are significant gaps between practice and theory. In particular, there are no convergence proofs for problems with continuous state and action spaces, or for systems involving nonlinear function approximators (such as multilayer perceptrons). This paper presents research applying DP-based reinforcement learning theory to Linear Quadratic Regulation (LQR), an important class of control problems involving continuous state and action spaces and requiring a simple type of nonlinear function approximator. We describe an algorithm based on Q-learning that is proven to converge to the optimal controller for a large class of LQR problems. We also describe a slightly different algorithm that is only locally convergent to the optimal Q-function, demonstrating one of the possible pitfalls of using a nonlinear function approximator with DP-based learning.
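For LQR, the Q-function of a linear policy is quadratic in the state and action, which is why a simple quadratic approximator suffices. The sketch below shows policy iteration on that quadratic Q-function for a scalar system. It assumes the model coefficients are known and computes the Q-function exactly, whereas the algorithm described in the abstract estimates it from data; the function name and the scalar, undiscounted setting are illustrative assumptions.

```python
def lqr_q_policy_iteration(a, b, q, r, k0, n_iters=20):
    """Policy iteration for scalar LQR via a quadratic Q-function.

    Dynamics x' = a*x + b*u, stage cost q*x**2 + r*u**2, policy u = -k*x.
    Assumes a known model (an illustrative simplification).
    """
    k = k0
    p = None
    for _ in range(n_iters):
        closed_loop = a - b * k
        if abs(closed_loop) >= 1.0:
            raise ValueError("policy u = -k*x does not stabilize the system")
        # Policy evaluation: the cost-to-go is V(x) = p * x**2.
        p = (q + r * k * k) / (1.0 - closed_loop ** 2)
        # Q(x, u) = h_xx*x**2 + 2*h_xu*x*u + h_uu*u**2, with
        # Q(x, u) = q*x**2 + r*u**2 + p*(a*x + b*u)**2.
        h_xu = p * a * b
        h_uu = r + p * b * b
        # Policy improvement: minimizing Q over u gives u = -(h_xu / h_uu) * x.
        k = h_xu / h_uu
    return k, p
```

Starting from any stabilizing gain `k0`, the returned gain approaches the optimal feedback gain; the function returns the final gain together with the last evaluated cost coefficient `p`.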
Memory-Based Reinforcement Learning: Efficient Computation with Prioritized Sweeping
Moore, Andrew W., Atkeson, Christopher G.
We present a new algorithm, Prioritized Sweeping, for efficient prediction and control of stochastic Markov systems. Incremental learning methods such as Temporal Differencing and Q-learning have fast real-time performance. Classical methods are slower, but more accurate, because they make full use of the observations. Prioritized Sweeping aims for the best of both worlds. It uses all previous experiences both to prioritize important dynamic programming sweeps and to guide the exploration of state space.
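The prioritization idea can be sketched as value backups processed in order of how much they would change the current estimate, with the predecessors of each updated state re-queued. The sketch below covers only this backup ordering on a given model; it does not learn the model from experience or guide exploration as the abstract describes. The `transitions` data layout, the function name, and the threshold `theta` are assumptions made for illustration.

```python
import heapq

def prioritized_sweeping(transitions, gamma=0.9, theta=1e-6, max_updates=10000):
    """Prioritized value backups on a known model.

    transitions[s][a] is a list of (prob, next_state, reward) tuples; every
    state must be a key, and terminal states map to an empty dict. States must
    be mutually comparable (e.g. all ints) so heap ties can be broken.
    """
    values = {s: 0.0 for s in transitions}
    # Record which states can lead to each state.
    preds = {s: set() for s in transitions}
    for s, actions in transitions.items():
        for outcomes in actions.values():
            for prob, s2, _ in outcomes:
                if prob > 0:
                    preds[s2].add(s)

    def backup(s):
        # One-step Bellman backup; terminal states have value 0.
        if not transitions[s]:
            return 0.0
        return max(sum(p * (r + gamma * values[s2]) for p, s2, r in outcomes)
                   for outcomes in transitions[s].values())

    # Seed the queue with every state whose backup would change its value.
    queue = []
    for s in transitions:
        err = abs(backup(s) - values[s])
        if err > theta:
            heapq.heappush(queue, (-err, s))  # max-priority via negation

    updates = 0
    while queue and updates < max_updates:
        _, s = heapq.heappop(queue)
        values[s] = backup(s)
        updates += 1
        # Re-queue predecessors whose estimates are now out of date.
        for sp in preds[s]:
            err = abs(backup(sp) - values[sp])
            if err > theta:
                heapq.heappush(queue, (-err, sp))
    return values
```

On a three-step chain where only the final transition is rewarded, the queue starts with the state next to the goal and works backwards, so each state is backed up once rather than being swept repeatedly.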
Q-Learning with Hidden-Unit Restarting
Platt's resource-allocation network (RAN) (Platt, 1991a, 1991b) is modified for a reinforcement-learning paradigm and to "restart" existing hidden units rather than adding new units. After restarting, units continue to learn via back-propagation. The resulting restart algorithm is tested in a Q-learning network that learns to solve an inverted pendulum problem. Solutions are found faster on average with the restart algorithm than without it.
Prioritized sweeping—Reinforcement learning with less data and less time
We present a new algorithm, prioritized sweeping, for efficient prediction and control of stochastic Markov systems. Incremental learning methods such as temporal differencing and Q-learning have real-time performance. Classical methods are slower, but more accurate, because they make full use of the observations. Prioritized sweeping uses all previous experiences both to prioritize important dynamic programming sweeps and to guide the exploration of state-space. We compare prioritized sweeping with other reinforcement learning schemes for a number of different stochastic optimal control problems.
Tight performance bounds on greedy policies based on imperfect value functions
Williams, R. J. | Baird, L. C. I.
Reinforcement learning is an effective technique for learning action policies in discrete stochastic environments, but its efficiency can decay exponentially with the size of the state space. In many situations, significant portions of a large state space may be irrelevant to a specific goal and can be aggregated into a few relevant states. The U Tree algorithm generates a tree-based state discretization that efficiently finds the relevant state chunks of large propositional domains. In this paper, we extend the U Tree algorithm to challenging domains with a continuous state space for which there is no initial discretization.