Technology
Exploring Unknown Environments with Real-Time Search or Reinforcement Learning
Learning Real-Time A* (LRTA*) is a popular control method that interleaves planning andplan execution and has been shown to solve search problems in known environments efficiently. In this paper, we apply LRTA* to the problem of getting to a given goal location in an initially unknown environment. Uninformed LRTA* with maximal lookahead always moves on a shortest path to the closest unvisited state, that is, to the closest potential goal state. This was believed to be a good exploration heuristic, but we show that it does not minimize the worst-case plan-execution time compared to other uninformed exploration methods. This result is also of interest to reinforcement-learning researchers since many reinforcement learning methods use asynchronous dynamic programming, interleave planning and plan execution, and exhibit optimism in the face of uncertainty, just like LRTA*.
Finite-Sample Convergence Rates for Q-Learning and Indirect Algorithms
Kearns, Michael J., Singh, Satinder P.
In this paper, we address two issues of longstanding interest in the reinforcement learningliterature. First, what kinds of performance guarantees can be made for Q-learning after only a finite number of actions? Second, what quantitative comparisons can be made between Q-learning and model-based (indirect) approaches, which use experience to estimate next-state distributions for off-line value iteration? We first show that both Q-learning and the indirect approach enjoy rather rapid convergence to the optimal policy as a function of the number ofstate transitions observed.
Viewing Classifier Systems as Model Free Learning in POMDPs
Hayashi, Akira, Suematsu, Nobuo
Classifier systems are now viewed disappointing because of their problems suchas the rule strength vs rule set performance problem and the credit assignment problem. In order to solve the problems, we have developed ahybrid classifier system: GLS (Generalization Learning System). In designing GLS, we view CSs as model free learning in POMDPs and take a hybrid approach to finding the best generalization, given the total number of rules. GLS uses the policy improvement procedure by Jaakkola et al. for an locally optimal stochastic policy when a set of rule conditions is given. GLS uses GA to search for the best set of rule conditions. 1 INTRODUCTION Classifier systems (CSs) (Holland 1986) have been among the most used in reinforcement learning.
Optimizing Admission Control while Ensuring Quality of Service in Multimedia Networks via Reinforcement Learning
Brown, Timothy X., Tong, Hui, Singh, Satinder P.
This paper examines the application of reinforcement learning to a telecommunications networking problem. The problem requires that revenue bemaximized while simultaneously meeting a quality of service constraint that forbids entry into certain states. We present a general solution to this multi-criteria problem that is able to earn significantly higher revenues than alternatives.
Gradient Descent for General Reinforcement Learning
III, Leemon C. Baird, Moore, Andrew W.
A simple learning rule is derived, the VAPS algorithm, which can be instantiated to generate a wide range of new reinforcementlearning algorithms.These algorithms solve a number of open problems, define several new approaches to reinforcement learning, and unify different approaches to reinforcement learning under a single theory. These algorithms all have guaranteed convergence, and include modifications of several existing algorithms that were known to fail to converge on simple MOPs. These include Q learning, SARSA, and advantage learning. In addition to these value-based algorithms it also generates pure policy-search reinforcement-learning algorithms, which learn optimal policies without learning a value function. In addition, it allows policysearch andvalue-based algorithms to be combined, thus unifying two very different approaches to reinforcement learning into a single Value and Policy Search (V APS) algorithm.
Using Collective Intelligence to Route Internet Traffic
Wolpert, David, Tumer, Kagan, Frank, Jeremy
A COllective INtelligence (COIN) is a set of interacting reinforcement learning(RL) algorithms designed in an automated fashion so that their collective behavior optimizes a global utility function. We summarize the theory of COINs, then present experiments using thattheory to design COINs to control internet traffic routing. These experiments indicate that COINs outperform all previously investigated RL-based, shortest path routing algorithms. 1 INTRODUCTION COllective INtelligences (COINs) are large, sparsely connected recurrent neural networks, whose "neurons" are reinforcement learning (RL) algorithms. The distinguishing featureof COINs is that their dynamics involves no centralized control, but only the collective effects of the individual neurons each modifying their behavior viatheir individual RL algorithms. This restriction holds even though the goal of the COIN concerns the system's global behavior.
Robot Docking Using Mixtures of Gaussians
Williamson, Matthew M., Murray-Smith, Roderick, Hansen, Volker
This paper applies the Mixture of Gaussians probabilistic model, combined withExpectation Maximization optimization to the task of summarizing threedimensional range data for a mobile robot. This provides a flexible way of dealing with uncertainties in sensor information, and allows theintroduction of prior knowledge into low-level perception modules. Problemswith the basic approach were solved in several ways: the mixture of Gaussians was reparameterized to reflect the types of objects expected in the scene, and priors on model parameters were included in the optimization process. Both approaches force the optimization to find'interesting' objects, given the sensor and object characteristics. A higher level classifier was used to interpret the results provided by the model, and to reject spurious solutions.
Graphical Models for Recognizing Human Interactions
Oliver, Nuria, Rosario, Barbara, Pentland, Alex
We describe a real-time computer vision and machine learning system for modeling and recognizing human behaviors in two different scenarios: (1) complex, twohanded actionrecognition in the martial art of Tai Chi and (2) detection and recognition of individual human behaviors and multiple-person interactions in a visual surveillance task. In the latter case, the system is particularly concerned with detecting when interactions between people occur, and classifying them. Graphical models, such as Hidden Markov Models (HMMs) [6] and Coupled Hidden MarkovModels (CHMMs) [3, 2], seem appropriate for modeling and, classifying human behaviors because they offer dynamic time warping, a well-understood training algorithm, and a clear Bayesian semantics for both individual (HMMs) and interacting or coupled (CHMMs) generative processes. A major problem with this data-driven statistical approach, especially when modeling rare or anomalous behaviors, is the limited number of training examples. A major emphasis of our work, therefore, is on efficient Bayesian integration of both prior knowledge with evidence from data.