Country
An Actor/Critic Algorithm that is Equivalent to Q-Learning
Crites, Robert H., Barto, Andrew G.
We prove the convergence of an actor/critic algorithm that is equivalent toQ-Iearning by construction. Its equivalence is achieved by encoding Q-values within the policy and value function of the actor andcritic. The resultant actor/critic algorithm is novel in two ways: it updates the critic only when the most probable action is executed from any given state, and it rewards the actor using criteria thatdepend on the relative probability of the action that was executed.
Finding Structure in Reinforcement Learning
Thrun, Sebastian, Schwartz, Anton
Reinforcement learning addresses the problem of learning to select actions in order to maximize one's performance in unknown environments. To scale reinforcement learning to complex real-world tasks, such as typically studied in AI, one must ultimately be able to discover the structure in the world, in order to abstract away the myriad of details and to operate in more tractable problem spaces. This paper presents the SKILLS algorithm. SKILLS discovers skills, which are partially defined action policies that arise in the context of multiple, related tasks.
Instance-Based State Identification for Reinforcement Learning
This paper presents instance-based state identification, an approach to reinforcement learning and hidden state that builds disambiguating amountsof short-term memory online, and also learns with an order of magnitude fewer training steps than several previous approaches. Inspiredby a key similarity between learning with hidden state and learning in continuous geometrical spaces, this approach uses instance-based (or "memory-based") learning, a method that has worked well in continuous spaces. 1 BACKGROUND AND RELATED WORK When a robot's next course of action depends on information that is hidden from the sensors because of problems such as occlusion, restricted range, bounded field of view and limited attention, the robot suffers from hidden state. More formally, we say a reinforcement learning agent suffers from the hidden state problem if the agent's state representation is non-Markovian with respect to actions and utility. The hidden state problem arises as a case of perceptual aliasing: the mapping between statesof the world and sensations of the agent is not one-to-one [Whitehead, 1992]. If the agent's perceptual system produces the same outputs for two world states in which different actions are required, and if the agent's state representation consists only of its percepts, then the agent will fail to choose correct actions.
Generalization in Reinforcement Learning: Safely Approximating the Value Function
Boyan, Justin A., Moore, Andrew W.
Reinforcement learning-the problem of getting an agent to learn to act from sparse, delayed rewards-has been advanced by techniques based on dynamic programming (DP). These algorithms compute a value function which gives, for each state, the minimum possiblelong-term cost commencing in that state. For the high-dimensional and continuous state spaces characteristic of real-world control tasks, a discrete representation ofthe value function is intractable; some form of generalization is required. A natural way to incorporate generalization into DP is to use a function approximator, rather than a lookup table, to represent the value function. This approach, which dates back to uses of Legendre polynomials in DP [Bellman et al., 19631, has recently worked well on several dynamic control problems [Mahadevan and Connell, 1990, Lin, 1993] and succeeded spectacularly on the game of backgammon [Tesauro, 1992, Boyan, 1992].
Reinforcement Learning with Soft State Aggregation
Singh, Satinder P., Jaakkola, Tommi, Jordan, Michael I.
It is widely accepted that the use of more compact representations than lookup tables is crucial to scaling reinforcement learning (RL) algorithms to real-world problems. Unfortunately almost all of the theory of reinforcement learning assumes lookup table representations. Inthis paper we address the pressing issue of combining function approximation and RL, and present 1) a function approximator basedon a simple extension to state aggregation (a commonly used form of compact representation), namely soft state aggregation, 2) a theory of convergence for RL with arbitrary, but fixed, soft state aggregation, 3) a novel intuitive understanding of the effect of state aggregation on online RL, and 4) a new heuristic adaptive state aggregation algorithm that finds improved compact representations by exploiting the non-discrete nature of soft state aggregation. Preliminary empirical results are also presented.
Advantage Updating Applied to a Differential Game
Harmon, Mance E., III, Leemon C. Baird, Klopf, A. Harry
An application of reinforcement learning to a linear-quadratic, differential game is presented. The reinforcement learning system uses a recently developed algorithm, the residual gradient form of advantage updating. The game is a Markov Decision Process (MDP) with continuous time, states, and actions, linear dynamics, and a quadratic cost function. The game consists of two players, a missile and a plane; the missile pursues the plane and the plane evades the missile. The reinforcement learning algorithm for optimal control is modified for differential games in order to find the minimax point, rather than the maximum. Simulation results are compared to the optimal solution, demonstrating that the simulated reinforcement learning system converges to the optimal answer. The performance of both the residual gradient and non-residual gradient forms of advantage updating and Q-learning are compared. The results show that advantage updating converges faster than Q-learning in all simulations.
Reinforcement Learning Algorithm for Partially Observable Markov Decision Problems
Jaakkola, Tommi, Singh, Satinder P., Jordan, Michael I.
Increasing attention has been paid to reinforcement learning algorithms inrecent years, partly due to successes in the theoretical analysis of their behavior in Markov environments. If the Markov assumption is removed, however, neither generally the algorithms nor the analyses continue to be usable. We propose and analyze a new learning algorithm to solve a certain class of non-Markov decision problems. Our algorithm applies to problems in which the environment is Markov, but the learner has restricted access to state information. The algorithm involves a Monte-Carlo policy evaluationcombined with a policy improvement method that is similar to that of Markov decision problems and is guaranteed to converge to a local maximum. The algorithm operates in the space of stochastic policies, a space which can yield a policy that performs considerablybetter than any deterministic policy. Although the space of stochastic policies is continuous-even for a discrete action space-our algorithm is computationally tractable.
A Rigorous Analysis of Linsker-type Hebbian Learning
Feng, J., Pan, H., Roychowdhury, V. P.
His simulations have shown that for appropriate parameter regimes, several structured connection patterns (e.g., centre-surround and oriented afferent receptive fields (aRFs)) occur progressively as the Hebbian evolution of the weights is carried out layer by layer. The behavior of Linsker's model is determined by the underlying nonlinear dynamics which are parameterized by a set of parameters originating from the Hebbian rule and the arbor density of the synapses.
Learning Stochastic Perceptrons Under k-Blocking Distributions
Marchand, Mario, Hadjifaradji, Saeed
Such distributions represent an important stepbeyond the case where each input variable is statistically independent since the 2k-blocking family contains all the Markov distributions of order k. By stochastic perceptron we mean a perceptron which,upon presentation of input vector x, outputs 1 with probability fCLJi WiXi - B).
Temporal Dynamics of Generalization in Neural Networks
Wang, Changfeng, Venkatesh, Santosh S.
This paper presents a rigorous characterization of how a general nonlinear learning machine generalizes during the training process when it is trained on a random sample using a gradient descent algorithm based on reduction of training error. It is shown, in particular, that best generalization performance occurs, in general, before the global minimum of the training error is achieved. The different roles played by the complexity of the machine class and the complexity of the specific machine in the class during learning are also precisely demarcated. 1 INTRODUCTION In learning machines such as neural networks, two major factors that affect the'goodness of fit' of the examples are network size (complexity) and training time. These are also the major factors that affect the generalization performance of the network. Many theoretical studies exploring the relation between generalization performance and machine complexity support the parsimony heuristics suggested by Occam's razor, towit that amongst machines with similar training performance one should opt for the machine of least complexity.