Reinforcement Learning
Motivated Reinforcement Learning
Competition between actions is based on the motivating characteristics of their consequent states in this sense. Substantial, careful, experiments reviewed in Dickinson & Balleine,12,13 into the neurobiology and psychology ofmotivation shows that this view is incomplete. In many cases, animals are faced with the choice not between many different actionsat a given state, but rather whether a single response isworth executing at all. Evidence suggests that the motivational process underlying this choice has different psychological andneural properties from that underlying action choice. We describe and model these motivational systems, and consider the way they interact.
Stabilizing Value Function Approximation with the BFBP Algorithm
Wang, Xin, Dietterich, Thomas G.
Our BFBP (Batch Fit to Best Paths) algorithm alternates between an exploration phase (during which trajectories are generated to try to find fragments of the optimal policy) and a function fitting phase (during which a function approximator is fit to the best known paths from start states to terminal states). An advantage of this approach is that batch value-function fitting is a global process, which allows it to address the tradeoffs in function approximation that cannot be handled by local, online algorithms.
Direct value-approximation for factored MDPs
Schuurmans, Dale, Patrascu, Relu
We present a simple approach for computing reasonable policies for factored Markov decision processes (MDPs), when the optimal valuefunction can be approximated by a compact linear form. Our method is based on solving a single linear program that approximates thebest linear fit to the optimal value function. By applying an efficient constraint generation procedure we obtain an iterative solution method that tackles concise linear programs. This direct linear programming approach experimentally yields a significant reductionin computation time over approximate value-and policy-iteration methods (sometimes reducing several hours to a few seconds). However, the quality of the solutions produced by linear programming is weaker-usually about twice the approximation errorfor the same approximating class. Nevertheless, the speed advantage allows one to use larger approximation classes to achieve similar error in reasonable time. 1 Introduction Markov decision processes (MDPs) form a foundation for control in uncertain and stochastic environments and reinforcement learning. Standard methods such as value-iteration, policy-iteration and linear programming can be used to produce optimal control policies for MDPs that are expressed in explicit form; that is, the policy, value function and state transition model are all represented in a tabular manner that explicitly enumerates the state space.
The Steering Approach for Multi-Criteria Reinforcement Learning
We consider the problem of learning to attain multiple goals in a dynamic environment, whichis initially unknown. In addition, the environment may contain arbitrarily varying elements related to actions of other agents or to non-stationary moves of Nature. This problem is modelled as a stochastic (Markov) game between the learning agent and an arbitrary player, with a vector-valued reward function. The objective of the learning agent is to have its long-term average reward vector belong to a given target set. We devise an algorithm for achieving this task, which is based on the theory of approachability for stochastic games. This algorithm combines, inan appropriate way, a finite set of standard, scalar-reward learning algorithms. Sufficientconditions are given for the convergence of the learning algorithm to a general target set. The specialization of these results to the single-controller Markov decision problem are discussed as well.
Model-Free Least-Squares Policy Iteration
Lagoudakis, Michail G., Parr, Ronald
We propose a new approach to reinforcement learning which combines least squares function approximation with policy iteration. Our method is model-free and completely off policy. We are motivated by the least squares temporal difference learning algorithm (LSTD), which is known for its efficient use of sample experiences compared to pure temporal difference algorithms. LSTD is ideal for prediction problems, however it heretofore has not had a straightforward application to control problems. Moreover, approximations learned by LSTD are strongly influenced by the visitation distribution over states.
Variance Reduction Techniques for Gradient Estimates in Reinforcement Learning
Greensmith, Evan, Bartlett, Peter L., Baxter, Jonathan
We consider the use of two additive control variate methods to reduce the variance of performance gradient estimates in reinforcement learning problems.The first approach we consider is the baseline method, in which a function of the current state is added to the discounted value estimate. We relate the performance of these methods, which use sample paths,to the variance of estimates based on iid data. We derive the baseline function that minimizes this variance, and we show that the variance forany baseline is the sum of the optimal variance and a weighted squared distance to the optimal baseline. We show that the widely used average discounted value baseline (where the reward is replaced by the difference between the reward and its expectation) is suboptimal. The second approach we consider is the actor-critic method, which uses an approximate value function. We give bounds on the expected squared error of its estimates. We show that minimizing distance to the true value function is suboptimal in general; we provide an example for which the true value function gives an estimate with positive variance, but the optimal valuefunction gives an unbiased estimate with zero variance. Our bounds suggest algorithms to estimate the gradient of the performance of parameterized baseline or value functions.
Convergence of Optimistic and Incremental Q-Learning
Even-dar, Eyal, Mansour, Yishay
The first is the widely used optimistic Q-learning, which initializes the Q-values to large initial values and then follows a greedy policy with respect to the Q-values. We show that setting the initial value sufficiently large guarantees the converges to an E optimal policy. The second is a new and novel algorithm incremental Q-learning,which gradually promotes the values of actions that are not taken. We show that incremental Q-learning converges, in the limit, to the optimal policy. Our incremental Q-learning algorithm canbe viewed as derandomization of the E-greedy Q-learning. 1 Introduction One of the challenges of Reinforcement Learning is learning in an unknown environment.
Batch Value Function Approximation via Support Vectors
Dietterich, Thomas G., Wang, Xin
One formulation is based on SVM regression; the second is based on the Bellman equation; and the third seeks only to ensure that good moves have an advantage over bad moves. All formulations attemptto minimize the number of support vectors while fitting the data. Experiments in a difficult, synthetic maze problem show that all three formulations give excellent performance, but the advantage formulation is much easier to train. Unlike policy gradient methods,the kernel methods described here can easily'adjust the complexity of the function approximator to fit the complexity of the value function.
Cobot: A Social Reinforcement Learning Agent
Jr., Charles Lee Isbell, Shelton, Christian R.
We report on the use of reinforcement learning with Cobot, a software agent residing in the well-known online community LambdaMOO. Our initial work on Cobot (Isbell et al.2000) provided him with the ability to collect social statistics and report them to users. Here we describe an application of RL allowing Cobot to take proactive actions in this complex social environment, and adapt behavior from multiple sources of human reward. After 5 months of training, and 3171 reward and punishment events from 254 different LambdaMOO users, Cobot learned nontrivial preferences for a number of users, modifing his behavior based on his current state. Here we describe LambdaMOO and the state and action spaces of Cobot, and report the statistical results of the learning experiment.