Fuzzy Logic
Rates of Convergence of Performance Gradient Estimates Using Function Approximation and Bias in Reinforcement Learning
We address two open theoretical questions in Policy Gradient Reinforcement Learning. The first concerns the efficacy of using function approximation to represent the state action value function, Q. Theory is presented showing that linear function approximation representations of Q can degrade the rate of convergence of performance gradient estimates, relative to when no function approximation of Q is used, by a factor that depends on the number of basis functions in the function approximation representation. The second concerns the use of a bias term in estimating the state action value function. Theory is presented showing that a non-zero bias term can improve the rate of convergence of performance gradient estimates by a factor that depends on the number of possible actions.
Batch Value Function Approximation via Support Vectors
We present three ways of combining linear programming with the kernel trick to find value function approximations for reinforcement learning. One formulation is based on SVM regression; the second is based on the Bellman equation; and the third seeks only to ensure that good moves have an advantage over bad moves. All formulations attempt to minimize the number of support vectors while fitting the data. Experiments in a difficult, synthetic maze problem show that all three formulations give excellent performance, but the advantage formulation is much easier to train. Unlike policy gradient methods, the kernel methods described here can easily adjust the complexity of the function approximator to fit the complexity of the value function.
Linking Motor Learning to Function Approximation: Learning in an Unlearnable Force Field
It appears that in constructing the motor commands to guide the arm toward a target, the brain relies on an internal model (IM) of the dynamics of the task that it learns through practice [1]. The IM is presumably a system that transforms a desired limb trajectory in sensory coordinates to motor commands. The motor commands in turn create the complex activation of muscles necessary to cause action. A major issue in motor control is to infer characteristics of the IM from the actions of subjects. Recently, we took a first step toward mathematically characterizing the IM's representation in the brain [2]. We analyzed the sequence of errors made by subjects on successive movements as they reached to targets while holding a robotic arm. The robot produced a force field and subjects learned to compensate for the field (presumably by constructing an IM) and eventually produced straight movements within the field. Our analysis sought to draw conclusions about the structure of the IM from the sequence of errors generated by the subjects.
A Note on the Representational Incompatibility of Function Approximation and Factored Dynamics
We establish a new hardness result that shows that the difficulty of planning in factored Markov decision processes is representational rather than just computational. More precisely, we give a fixed family of factored MDPs with linear rewards whose optimal policies and value functions simply cannot be represented succinctly in any standard parametric form. Previous hardness results indicated that computing good policies from the MDP parameters was difficult, but left open the possibility of succinct function approximation for any fixed factored MDP. Our result applies even to policies which yield a polynomially poor approximation to the optimal value, and highlights interesting connections with the complexity class of Arthur-Merlin games.
Convergent Combinations of Reinforcement Learning with Linear Function Approximation
Convergence for iterative reinforcement learning algorithms like TD(0) depends on the sampling strategy for the transitions. However, in practical applications it is convenient to take transition data from arbitrary sources without losing convergence. In this paper we investigate the problem of repeated synchronous updates based on a fixed set of transitions. This allows us to analyse whether a certain reinforcement learning algorithm and a certain function approximator are compatible. For the combination of the residual gradient algorithm with grid-based linear interpolation we show that there exists a universal constant learning rate such that the iteration converges independently of the concrete transition data.
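As a rough illustration of repeated synchronous updates on a fixed set of transitions, the sketch below applies residual-gradient sweeps to a three-state toy problem. Everything concrete in it is an assumption made for illustration and is not taken from the paper: the transitions, the one-hot features (standing in for grid-based linear interpolation), the discount factor, and the learning rate of 0.1.

```python
# Sketch only: synchronous residual-gradient sweeps over a fixed transition set.
# The toy data, one-hot features, GAMMA and ALPHA are illustrative assumptions.
GAMMA = 0.9   # assumed discount factor
ALPHA = 0.1   # assumed constant learning rate

# One-hot features: a trivial stand-in for a linear interpolation scheme.
FEATURES = {0: [1.0, 0.0, 0.0], 1: [0.0, 1.0, 0.0], 2: [0.0, 0.0, 1.0]}

# Fixed batch of (state, reward, next_state) transitions.
TRANSITIONS = [(0, 0.0, 1), (1, 0.0, 2), (2, 1.0, 2)]


def value(theta, state):
    """Linear value estimate: inner product of weights and features."""
    return sum(t * f for t, f in zip(theta, FEATURES[state]))


def bellman_residuals(theta):
    """One Bellman residual per stored transition."""
    return [r + GAMMA * value(theta, s2) - value(theta, s)
            for s, r, s2 in TRANSITIONS]


def squared_residual(theta):
    """Sum of squared Bellman residuals over the fixed batch."""
    return sum(d * d for d in bellman_residuals(theta))


def synchronous_sweep(theta):
    """Accumulate the residual gradient over all transitions, then step once."""
    grad = [0.0] * len(theta)
    for (s, _r, s2), delta in zip(TRANSITIONS, bellman_residuals(theta)):
        for k in range(len(theta)):
            grad[k] += delta * (GAMMA * FEATURES[s2][k] - FEATURES[s][k])
    return [t - ALPHA * g for t, g in zip(theta, grad)]


theta = [0.0, 0.0, 0.0]
history = [squared_residual(theta)]
for _ in range(500):
    theta = synchronous_sweep(theta)
    history.append(squared_residual(theta))
```

Because each sweep is a plain gradient step on a quadratic objective, the recorded squared residual in `history` should not increase for a sufficiently small learning rate. The paper's claim concerns a learning rate that works for any transition data, which this toy run does not demonstrate.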
Optimality of Reinforcement Learning Algorithms with Linear Function Approximation
There are several reinforcement learning algorithms that yield approximate solutions for the problem of policy evaluation when the value function is represented with a linear function approximator. In this paper we show that each of the solutions is optimal with respect to a specific objective function. The results presented here will be useful for comparing the algorithms in terms of the error they achieve relative to the error of the optimal approximate solution.
Value Function Approximation with Diffusion Wavelets and Laplacian Eigenfunctions
We investigate the problem of automatically constructing efficient representations or basis functions for approximating value functions based on analyzing the structure and topology of the state space. In particular, two novel approaches to value function approximation are explored based on automatically constructing basis functions on state spaces that can be represented as graphs or manifolds: one approach uses the eigenfunctions of the Laplacian, in effect performing a global Fourier analysis on the graph; the second approach is based on diffusion wavelets, which generalize classical wavelets to graphs using multiscale dilations induced by powers of a diffusion operator or random walk on the graph. Together, these approaches form the foundation of a new generation of methods for solving large Markov decision processes, in which the underlying representation and policies are simultaneously learned.
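To make the Laplacian eigenfunction idea concrete, the sketch below uses the simplest possible state graph, a chain of states. For a chain, the eigenvectors of the combinatorial graph Laplacian are known in closed form (cosine modes), so no numerical eigensolver is needed. The chain graph, the toy target value function, and all function names are assumptions for illustration; the paper's method applies to general graphs and manifolds.

```python
import math

# Sketch only: Laplacian eigenfunctions on a chain (path) graph used as a basis.
# The chain graph and the toy target function are illustrative assumptions.


def path_laplacian(n):
    """Combinatorial Laplacian L = D - A of a path graph with n nodes."""
    lap = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in (i - 1, i + 1):
            if 0 <= j < n:
                lap[i][j] = -1.0
                lap[i][i] += 1.0
    return lap


def eigenfunction(n, k):
    """k-th Laplacian eigenvector of the path graph (a cosine mode)."""
    return [math.cos(math.pi * k * (i + 0.5) / n) for i in range(n)]


def eigen_residual(n, k):
    """Largest entry of |L v - lambda v|; near zero if v is an eigenvector."""
    lap, v = path_laplacian(n), eigenfunction(n, k)
    lam = 2.0 - 2.0 * math.cos(math.pi * k / n)
    return max(abs(sum(lap[i][j] * v[j] for j in range(n)) - lam * v[i])
               for i in range(n))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def project(values, num_basis):
    """Least-squares fit of values onto the first num_basis eigenfunctions.

    The eigenfunctions are mutually orthogonal, so each coefficient can be
    computed independently.
    """
    n = len(values)
    approx = [0.0] * n
    for k in range(num_basis):
        basis = eigenfunction(n, k)
        coeff = dot(values, basis) / dot(basis, basis)
        approx = [a + coeff * b for a, b in zip(approx, basis)]
    return approx


def squared_error(values, num_basis):
    approx = project(values, num_basis)
    return sum((v - a) ** 2 for v, a in zip(values, approx))


N = 20
# Toy target: discounted distance to a goal at the right end of the chain.
target = [0.9 ** (N - 1 - i) for i in range(N)]
errors = {m: squared_error(target, m) for m in (2, 5, 10, 20)}
```

The low-order eigenfunctions vary slowly along the chain, which is why they act like the low-frequency terms of a Fourier series. Using more of them can only reduce the squared fitting error, and using all `N` reproduces the target up to rounding.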
Simplifying Mixture Models through Function Approximation
The finite mixture model is a powerful tool in many statistical learning problems. In this paper, we propose a general, structure-preserving approach to reduce its model complexity, which can bring significant computational benefits in many applications. The basic idea is to group the original mixture components into compact clusters, and then minimize an upper bound on the approximation error between the original and simplified models. By adopting the L2 norm as the distance measure between mixture models, we can derive closed-form solutions that are more robust and reliable than using the KL-based distance measure. Moreover, the complexity of our algorithm is only linear in the sample size and dimensionality.
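One reason the L2 norm is convenient here is that the squared L2 distance between two Gaussian mixtures has a closed form, since the integral of a product of two Gaussian densities is itself a Gaussian density evaluated at the difference of the means. The sketch below computes that distance for one-dimensional mixtures. The example mixtures and the moment-matching merge are assumptions for illustration only; the merge is a simple stand-in, not the paper's simplification algorithm.

```python
import math

# Sketch only: closed-form squared L2 distance between 1-D Gaussian mixtures.
# A mixture is a list of (weight, mean, variance) tuples. The example data and
# the moment-matching merge are illustrative assumptions.


def gaussian_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def cross_term(f, g):
    """Integral of f(x) * g(x) dx, using the Gaussian product identity."""
    return sum(wf * wg * gaussian_pdf(mf, mg, vf + vg)
               for wf, mf, vf in f
               for wg, mg, vg in g)


def l2_squared(f, g):
    """Integral of (f(x) - g(x))^2 dx in closed form."""
    return cross_term(f, f) - 2.0 * cross_term(f, g) + cross_term(g, g)


def merge(components):
    """Collapse components into one Gaussian by matching mean and variance."""
    weight = sum(w for w, _, _ in components)
    mean = sum(w * m for w, m, _ in components) / weight
    second = sum(w * (v + m * m) for w, m, v in components) / weight
    return (weight, mean, second - mean * mean)


original = [(0.25, -3.0, 0.5), (0.25, -2.5, 0.5),
            (0.25, 2.5, 0.5), (0.25, 3.0, 0.5)]

# Group the components into two compact clusters, then merge each cluster.
two_component = [merge(original[:2]), merge(original[2:])]
# For comparison, ignore the cluster structure and merge everything.
one_component = [merge(original)]

err_two = l2_squared(original, two_component)
err_one = l2_squared(original, one_component)
```

In this toy case the cluster-respecting simplification `two_component` stays much closer to the original mixture than the single merged Gaussian, which is the intuition behind grouping components into compact clusters before simplifying.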
Robust Value Function Approximation Using Bilinear Programming
Existing value function approximation methods have been successfully used in many applications, but they often lack useful a priori error bounds. We propose approximate bilinear programming, a new formulation of value function approximation that provides strong a priori guarantees. In particular, it provably finds an approximate value function that minimizes the Bellman residual. Solving a bilinear program optimally is NP-hard, but this is unavoidable because the Bellman-residual minimization itself is NP-hard. We therefore employ and analyze a common approximate algorithm for bilinear programs.
Convergent Fitted Value Iteration with Linear Function Approximation
Fitted value iteration (FVI) with ordinary least squares regression is known to diverge. We present a new method, "Expansion-Constrained Ordinary Least Squares" (ECOLS), that produces a linear approximation but also guarantees convergence when used with FVI. To ensure convergence, we constrain the least squares regression operator to be a non-expansion in the infinity-norm. We show that the space of function approximators that satisfy this constraint is richer than the space of "averagers," we prove a minimax property of the ECOLS residual error, and we give an efficient algorithm for computing the coefficients of ECOLS based on constraint generation. We illustrate the algorithmic convergence of FVI with ECOLS in a suite of experiments, and discuss its properties.
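The non-expansion constraint can be checked directly when the regression operator is written as a matrix that maps target values to fitted values: its infinity-norm is the largest absolute row sum, and the operator is a non-expansion when that norm is at most one. The sketch below shows this check on three small hand-picked matrices. The matrices, the function names, and the reading of "averager" as a matrix with nonnegative weights whose rows sum to at most one are assumptions for illustration; this is not the ECOLS algorithm itself.

```python
# Sketch only: testing whether a linear operator is an infinity-norm
# non-expansion. The matrices below are hand-picked illustrative assumptions.


def infinity_norm(matrix):
    """Induced infinity-norm of a matrix: the largest absolute row sum."""
    return max(sum(abs(x) for x in row) for row in matrix)


def is_non_expansion(matrix, tol=1e-12):
    return infinity_norm(matrix) <= 1.0 + tol


def is_averager(matrix, tol=1e-12):
    """Assumed definition: nonnegative weights, each row summing to at most 1."""
    return all(min(row) >= -tol and sum(row) <= 1.0 + tol for row in matrix)


def apply(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def max_abs(vector):
    return max(abs(v) for v in vector)


averaging = [[0.5, 0.5], [0.25, 0.75]]    # averager, hence a non-expansion
signed = [[0.6, -0.4], [-0.3, 0.7]]       # non-expansion with negative weights
extrapolating = [[1.5, -0.5], [0.0, 1.0]]  # absolute row sum 2: can expand

# A non-expansion never increases the max-norm distance between two targets.
a, b = [1.0, -2.0], [0.5, 3.0]
gap_before = max_abs([x - y for x, y in zip(a, b)])
gap_after = max_abs([x - y for x, y in zip(apply(signed, a), apply(signed, b))])
```

The `signed` matrix passes the non-expansion check while failing the averager check, a small instance of the abstract's claim that the constrained space is richer than the space of averagers. The `extrapolating` matrix shows the kind of operator that the constraint rules out.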