Reinforcement Learning
Scaling Up Robust MDPs by Reinforcement Learning
Tamar, Aviv, Xu, Huan, Mannor, Shie
We consider large-scale Markov decision processes (MDPs) with parameter uncertainty, under the robust MDP paradigm. Previous studies showed that robust MDPs, based on a minimax approach to handle uncertainty, can be solved using dynamic programming for small to medium sized problems. However, due to the "curse of dimensionality", MDPs that model real-life problems are typically prohibitively large for such approaches. In this work we employ a reinforcement learning approach to tackle this planning problem: we develop a robust approximate dynamic programming method based on a projected fixed point equation to approximately solve large scale robust MDPs. We show that the proposed method provably succeeds under certain technical conditions, and demonstrate its effectiveness through simulation of an option pricing problem. To the best of our knowledge, this is the first attempt to scale up the robust MDPs paradigm.
Direct Uncertainty Estimation in Reinforcement Learning
Rodionov, Sergey, Potapov, Alexey, Vinogradov, Yurii
Absence of prior knowledge about the environment (absence of its precise model) is naturally characterized by the intuitive notion of uncertainty. However, no generally accepted accurate formal description of this notion exists. Probability theory is the most traditional way of describing uncertainty, but adequate interpretation of probability itself is not that clear. This can be seen from numerous paradoxes in probability theory, such as the grue emerald paradox. That is why some attempts of extending probability theory were made. The most well-known one is fuzzy set theory. However, fuzzy operations can be considered as probabilistic operations with some additional assumptions about operands (e.g.
The Arcade Learning Environment: An Evaluation Platform for General Agents
Bellemare, Marc G., Naddaf, Yavar, Veness, Joel, Bowling, Michael
In this article we introduce the Arcade Learning Environment (ALE): both a challenge problem and a platform and methodology for evaluating the development of general, domain-independent AI technology. ALE provides an interface to hundreds of Atari 2600 game environments, each one different, interesting, and designed to be a challenge for human players. ALE presents significant research challenges for reinforcement learning, model learning, model-based planning, imitation learning, transfer learning, and intrinsic motivation. Most importantly, it provides a rigorous testbed for evaluating and comparing approaches to these problems. We illustrate the promise of ALE by developing and benchmarking domain-independent agents designed using well-established AI techniques for both reinforcement learning and planning. In doing so, we also propose an evaluation methodology made possible by ALE, reporting empirical results on over 55 different games. All of the software, including the benchmark agents, is publicly available.
The Arcade Learning Environment: An Evaluation Platform for General Agents
Bellemare, M. G., Naddaf, Y., Veness, J., Bowling, M.
In this article we introduce the Arcade Learning Environment (ALE): both a challenge problem and a platform and methodology for evaluating the development of general, domain-independent AI technology. ALE provides an interface to hundreds of Atari 2600 game environments, each one different, interesting, and designed to be a challenge for human players. ALE presents significant research challenges for reinforcement learning, model learning, model-based planning, imitation learning, transfer learning, and intrinsic motivation. Most importantly, it provides a rigorous testbed for evaluating and comparing approaches to these problems. We illustrate the promise of ALE by developing and benchmarking domain-independent agents designed using well-established AI techniques for both reinforcement learning and planning. In doing so, we also propose an evaluation methodology made possible by ALE, reporting empirical results on over 55 different games. All of the software, including the benchmark agents, is publicly available.
Temporal-Difference Search in Computer Go
Silver, David (University College London) | Sutton, Richard (University of Alberta) | Mueller, Martin (University of Alberta)
Temporal-difference (TD) learning is one of the most successful and broadly applied solutions to the reinforcement learning problem; it has been used to achieve master-level play in chess, checkers and backgammon. Monte-Carlo tree search is a recent algorithm for simulation-based search, which has been used to achieve master-level play in Go. We have introduced a new approach to high-performance planning. Our method, TD search, combines TD learning with simulation-based search. Like Monte-Carlo tree search, value estimates are updated by learning online from simulated experience. Like TD learning, it uses value function approximation and bootstrapping to efficiently generalise between related states. We applied TD search to the game of 9x9 Go, using a million binary features matching simple patterns of stones. Without any explicit search tree, our approach outperformed a vanilla Monte-Carlo tree search with the same number of simulations. When combined with a simple alpha-beta search, our program also outperformed all traditional (pre-Monte-Carlo) search and machine learning programs on the 9x9 Computer Go Server.
Analysis of Watson's Strategies for Playing Jeopardy!
Tesauro, G., Gondek, D. C., Lenchner, J., Fan, J., Prager, J. M.
Major advances in Question Answering technology were needed for IBM Watson to play Jeopardy! at championship level -- the show requires rapid-fire answers to challenging natural language questions, broad general knowledge, high precision, and accurate confidence estimates. In addition, Jeopardy! features four types of decision making carrying great strategic importance: (1) Daily Double wagering; (2) Final Jeopardy wagering; (3) selecting the next square when in control of the board; (4) deciding whether to attempt to answer, i.e., "buzz in." Using sophisticated strategies for these decisions, that properly account for the game state and future event probabilities, can significantly boost a player's overall chances to win, when compared with simple "rule of thumb" strategies. This article presents our approach to developing Watson's game-playing strategies, comprising development of a faithful simulation model, and then using learning and Monte-Carlo methods within the simulator to optimize Watson's strategic decision-making. After giving a detailed description of each of our game-strategy algorithms, we then focus in particular on validating the accuracy of the simulator's predictions, and documenting performance improvements using our methods. Quantitative performance benefits are shown with respect to both simple heuristic strategies, and actual human contestant performance in historical episodes. We further extend our analysis of human play to derive a number of valuable and counterintuitive examples illustrating how human contestants may improve their performance on the show.
Reinforcement Learning for the Soccer Dribbling Task
Carvalho, Arthur, Oliveira, Renato
We propose a reinforcement learning solution to the \emph{soccer dribbling task}, a scenario in which a soccer agent has to go from the beginning to the end of a region keeping possession of the ball, as an adversary attempts to gain possession. While the adversary uses a stationary policy, the dribbler learns the best action to take at each decision point. After defining meaningful variables to represent the state space, and high-level macro-actions to incorporate domain knowledge, we describe our application of the reinforcement learning algorithm \emph{Sarsa} with CMAC for function approximation. Our experiments show that, after the training period, the dribbler is able to accomplish its task against a strong adversary around 58% of the time.
Efficient Reinforcement Learning for High Dimensional Linear Quadratic Systems
Ibrahimi, Morteza, Javanmard, Adel, Van Roy, Benjamin
We study the problem of adaptive control of a high dimensional linear quadratic (LQ) system. Previous work established the asymptotic convergence to an optimal controller for various adaptive control schemes. More recently, for the average cost LQ problem, a regret bound of ${O}(\sqrt{T})$ was shown, apart form logarithmic factors. However, this bound scales exponentially with $p$, the dimension of the state space. In this work we consider the case where the matrices describing the dynamic of the LQ system are sparse and their dimensions are large. We present an adaptive control scheme that achieves a regret bound of ${O}(p \sqrt{T})$, apart from logarithmic factors. In particular, our algorithm has an average cost of $(1+\eps)$ times the optimum cost after $T = \polylog(p) O(1/\eps^2)$. This is in comparison to previous work on the dense dynamics where the algorithm requires time that scales exponentially with dimension in order to achieve regret of $\eps$ times the optimal cost. We believe that our result has prominent applications in the emerging area of computational advertising, in particular targeted online advertising and advertising in social networks.
Automatic Abstraction in Reinforcement Learning Using Ant System Algorithm
Ghafoorian, Mohsen (Sharif University of Technology) | Taghizadeh, Nasrin (Sharif University of Technology) | Beigy, Hamid (Sharif University of Technology)
Nowadays developing autonomous systems, which can act in various environments and interactively perform their assigned tasks, are intensively desirable. These systems would be ready to be applied in different fields such as medicine, controller robots and social life. Reinforcement learning is an attractive area of machine learning which addresses these concerns. In large scales, learning performance of an agent can be improved by using hierarchical Reinforcement Learning techniques and temporary extended actions. The higher level of abstraction helps the learning agent approach lifelong learning goals. In this paper a new method is presented for discovering subgoal states and constructing useful skills. The method utilizes Ant System optimization algorithm to identify bottleneck edges, which act like bridges between different connected areas of the problem space. Using discovered subgoals, the agent creates temporal abstractions, which enable it to explore more effectively. Experimental Results show that the proposed method can significantly improve the learning performance of the agent.
Learning Sensorimotor Concepts Without Reinforcement
Mohammad, Yasser F. O. (Kyoto University) | Nishida, Toyoaki (Kyoto University)
Agents engaged in lifelong learning can benefit from the ability to acquire new concepts from continuous interaction with objects in their environments which is a ubiquitous ability in humans. This paper advocates the use of sensorimotor concepts that combine perceptual and actuation patterns.Related representations to sensorimotor concepts are Predictive State Representation in dynamical systems, Affordance Based Concepts in language and Skills in reinforcement learning. The paper proposes a system for learning generalized sensorimotor concepts from unsegmented interactions between the agent and the objects in its environment that works in continuous action and observation spaces and in the same time require no reinforcement signals. A proof-of-concept experiment with the proposed system on a simulated e-puck robot is reported to support the applicability of the proposed approach.