Reinforcement Learning
Optimizing Admission Control while Ensuring Quality of Service in Multimedia Networks via Reinforcement Learning
Brown, Timothy X., Tong, Hui, Singh, Satinder P.
This paper examines the application of reinforcement learning to a telecommunications networking problem. The problem requires that revenue be maximized while simultaneously meeting a quality of service constraint that forbids entry into certain states. We present a general solution to this multi-criteria problem that is able to earn significantly higher revenues than alternatives.
Gradient Descent for General Reinforcement Learning
III, Leemon C. Baird, Moore, Andrew W.
A simple learning rule is derived, the VAPS algorithm, which can be instantiated to generate a wide range of new reinforcementlearning algorithms. These algorithms solve a number of open problems, define several new approaches to reinforcement learning, and unify different approaches to reinforcement learning under a single theory. These algorithms all have guaranteed convergence, and include modifications of several existing algorithms that were known to fail to converge on simple MOPs. These include Q learning, SARSA, and advantage learning. In addition to these value-based algorithms it also generates pure policy-search reinforcement-learning algorithms, which learn optimal policies without learning a value function. In addition, it allows policysearch and value-based algorithms to be combined, thus unifying two very different approaches to reinforcement learning into a single Value and Policy Search (V APS) algorithm.
Robust, Efficient, Globally-Optimized Reinforcement Learning with the Parti-Game Algorithm
Al-Ansari, Mohammad A., Williams, Ronald J.
Parti-game (Moore 1994a; Moore 1994b; Moore and Atkeson 1995) is a reinforcement learning (RL) algorithm that has a lot of promise in overcoming the curse of dimensionality that can plague RL algorithms when applied to high-dimensional problems. In this paper we introduce modifications to the algorithm that further improve its performance and robustness. In addition, while parti-game solutions can be improved locally by standard local path-improvement techniques, we introduce an add-on algorithm in the same spirit as parti-game that instead tries to improve solutions in a non-local manner. 1 INTRODUCTION Parti-game operates on goal problems by dynamically partitioning the space into hyperrectangular cells of varying sizes, represented using a k-d tree data structure. It assumes the existence of a pre-specified local controller that can be commanded to proceed from the current state to a given state. The algorithm uses a game-theoretic approach to assign costs to cells based on past experiences using a minimax algorithm.
Using Collective Intelligence to Route Internet Traffic
Wolpert, David, Tumer, Kagan, Frank, Jeremy
A COllective INtelligence (COIN) is a set of interacting reinforcement learning (RL) algorithms designed in an automated fashion so that their collective behavior optimizes a global utility function. We summarize the theory of COINs, then present experiments using that theory to design COINs to control internet traffic routing. These experiments indicate that COINs outperform all previously investigated RL-based, shortest path routing algorithms. 1 INTRODUCTION COllective INtelligences (COINs) are large, sparsely connected recurrent neural networks, whose "neurons" are reinforcement learning (RL) algorithms. The distinguishing feature of COINs is that their dynamics involves no centralized control, but only the collective effects of the individual neurons each modifying their behavior via their individual RL algorithms. This restriction holds even though the goal of the COIN concerns the system's global behavior.
Reinforcement Learning for Trading
Moody, John E., Saffell, Matthew
In this paper, we propose to use recurrent reinforcement learning to directly optimize such trading system performance functions, and we compare two different reinforcement learning methods. The first, Recurrent Reinforcement Learning, uses immediate rewards to train the trading systems, while the second (Q-Learning (Watkins 1989)) approximates discounted future rewards. These methodologies can be applied to optimizing systems designed to trade a single security or to trade portfolios . In addition, we propose a novel value function for risk-adjusted return that enables learning to be done online: the differential Sharpe ratio. Trading system profits depend upon sequences of interdependent decisions, and are thus path-dependent. Optimal trading decisions when the effects of transactions costs, market impact and taxes are included require knowledge of the current system state. In Moody, Wu, Liao & Saffell (1998), we demonstrate that reinforcement learning provides a more elegant and effective means for training trading systems when transaction costs are included, than do more standard supervised approaches.
Scheduling Straight-Line Code Using Reinforcement Learning and Rollouts
McGovern, Amy, Moss, J. Eliot B.
In 1986, Tanner and Mead [1] implemented an interesting constraint satisfaction circuit for global motion sensing in a VLSI. We report here a new and improved a VLSI implementation that provides smooth optical flow as well as global motion in a two dimensional visual field. The computation of optical flow is an ill-posed problem, which expresses itself as the aperture problem. However, the optical flow can be estimated by the use of regularization methods, in which additional constraints are introduced in terms of a global energy functional that must be minimized. We show how the algorithmic constraints of Hom and Schunck [2] on computing smooth optical flow can be mapped onto the physical constraints of an equivalent electronic network.
Risk Sensitive Reinforcement Learning
Neuneier, Ralph, Mihatsch, Oliver
A directed generative model for binary data using a small number of hidden continuous units is investigated. The relationships between the correlations of the underlying continuous Gaussian variables and the binary output variables are utilized to learn the appropriate weights of the network. The advantages of this approach are illustrated on a translationally invariant binary distribution and on handwritten digit images. Introduction Principal Components Analysis (PCA) is a widely used statistical technique for representing data with a large number of variables [1]. It is based upon the assumption that although the data is embedded in a high dimensional vector space, most of the variability in the data is captured by a much lower climensional manifold. In particular for PCA, this manifold is described by a linear hyperplane whose characteristic directions are given by the eigenvectors of the correlation matrix with the largest eigenvalues. The success of PCA and closely related techniques such as Factor Analysis (FA) and PCA mixtures clearly indicate that much real world data exhibit the low dimensional manifold structure assumed by these models [2, 3].
Barycentric Interpolators for Continuous Space and Time Reinforcement Learning
Munos, Rรฉmi, Moore, Andrew W.
In order to find the optimal control of continuous state-space and time reinforcement learning (RL) problems, we approximate the value function (VF) with a particular class of functions called the barycentric interpolators. We establish sufficient conditions under which a RL algorithm converges to the optimal VF, even when we use approximate models of the state dynamics and the reinforcement functions.
A Reinforcement Learning Algorithm in Partially Observable Environments Using Short-Term Memory
Suematsu, Nobuo, Hayashi, Akira
Since BLHT learns a stochastic model based on Bayesian Learning, the overfitting problemis reasonably solved. Moreover, BLHT has an efficient implementation. This paper shows that the model learned by BLHT converges toone which provides the most accurate predictions of percepts and rewards, given short-term memory. 1 INTRODUCTION Research on Reinforcement Learning (RL) problem forpartially observable environments is gaining more attention recently. This is mainly because the assumption that perfect and complete perception of the state of the environment is available for the learning agent, which many previous RL algorithms require, is not valid for many realistic environments.
Robust, Efficient, Globally-Optimized Reinforcement Learning with the Parti-Game Algorithm
Al-Ansari, Mohammad A., Williams, Ronald J.
Parti-game (Moore 1994a; Moore 1994b; Moore and Atkeson 1995) is a reinforcement learning (RL) algorithm that has a lot of promise in overcoming thecurse of dimensionality that can plague RL algorithms when applied to high-dimensional problems. In this paper we introduce modifications tothe algorithm that further improve its performance and robustness. In addition, while parti-game solutions can be improved locally by standard local path-improvement techniques, we introduce an add-on algorithm in the same spirit as parti-game that instead tries to improve solutions in a non-local manner. 1 INTRODUCTION Parti-game operates on goal problems by dynamically partitioning the space into hyperrectangular cellsof varying sizes, represented using a k-d tree data structure. It assumes the existence of a pre-specified local controller that can be commanded to proceed from the current state to a given state. The algorithm uses a game-theoretic approach to assign costs to cells based on past experiences using a minimax algorithm.