Reinforcement Learning
Conditions for Stability and Convergence of Set-Valued Stochastic Approximations: Applications to Approximate Value and Fixed point Iterations
Ramaswamy, Arunselvan, Bhatnagar, Shalabh
The main aim of this paper is the development of easily verifiable sufficient conditions for stability (almost sure boundedness) and convergence of stochastic approximation algorithms (SAAs) with set-valued mean-fields, a class of model-free algorithms that have become important in recent times. In this paper we provide a complete analysis of such algorithms under three different, yet related, sets of sufficient conditions, based on the existence of an associated global/local Lyapunov function. Unlike previous Lyapunov function based approaches, we provide a simple recipe for explicitly constructing the Lyapunov function needed for analysis. Our work builds on the works of Abounadi, Bertsekas and Borkar (2002), Munos (2005), and Ramaswamy and Bhatnagar (2016). An important motivation for the flavor of our assumptions comes from the need to understand dynamic programming and reinforcement learning algorithms that use deep neural networks (DNNs) for function approximation and parameterization. These algorithms are popularly known as deep learning algorithms. As an important application of our theory, we provide a complete analysis of the stochastic approximation counterpart of approximate value iteration (AVI), an important dynamic programming method designed to tackle Bellman's curse of dimensionality. Further, the assumptions involved are significantly weaker, easily verifiable and truly model-free. The theory presented in this paper is also used to develop and analyze the first SAA for finding fixed points of contractive set-valued maps.
AI supercomputer creates its own 'AI child' that can outperform man-made rivals
A GOOGLE supercomputer has created an "AI child" which can outperform its man-made rivals. The incredible machine named NASNet becomes smarter through "reinforcement learning" which sees it report back to its "parent" computer when completing tasks. The AI (artificial intelligence), which was created earlier this year, is able to recognise objects such as people and cars while watching real-time video. NASNet is controlled by a neural network called AutoML which was created by humans at Google Brain. The parent AI teaches its offspring to do specific tasks which are repeated thousands of times.
[R] Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm • r/MachineLearning
One thing I was curious about is whether AlphaZero can play endgames. For example, a friend brought up whether AlphaZero could learn how to play Nim. For anybody who isn't familiar (https://en.wikipedia.org/wiki/Nim): the optimal strategy for Nim involves computing the xor of all the heap sizes. I thought no, largely due to the lack of gradient information/lack of structure/MCTS not being a good heuristic for the quality of the move. However, this game of Nim doesn't seem that different from, say, a knight-bishop endgame mating scenario in chess.
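For readers who have not seen the xor rule mentioned above, here is a minimal Python sketch of it. It assumes the normal-play convention (whoever takes the last object wins); the function names `nim_sum` and `optimal_move` are made up for illustration.

```python
from functools import reduce
from operator import xor


def nim_sum(heaps):
    """Xor of all heap sizes (the 'nim-sum')."""
    return reduce(xor, heaps, 0)


def optimal_move(heaps):
    """Return (heap_index, new_size) for a winning move under normal play,
    or None when the nim-sum is already 0 (every move then loses against
    perfect play)."""
    total = nim_sum(heaps)
    if total == 0:
        return None
    for i, size in enumerate(heaps):
        target = size ^ total
        # Shrinking this heap to `target` makes the overall nim-sum 0.
        if target < size:
            return i, target
    return None  # not reached when total != 0
```

The rule is a discrete parity condition with no smooth notion of a move being "almost right", which may be part of why it seems hard to pick up from gradient signals.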
A Novel Model for Arbitration between Planning and Habitual Control Systems
Fard, Farzaneh S., Trappenberg, Thomas P.
It is well established that human decision making and instrumental control use multiple systems, some of which use habitual action selection and some of which require deliberate planning. Deliberate planning systems use predictions of action-outcomes using an internal model of the agent's environment, while habitual action selection systems learn to automate by repeating previously rewarded actions. Habitual control is computationally efficient but may be inflexible in changing environments. Conversely, deliberate planning may be computationally expensive, but flexible in dynamic environments. This paper proposes a general architecture comprising both control paradigms by introducing an arbitrator that controls which subsystem is used at any time. This system is implemented for a target-reaching task with a simulated two-joint robotic arm that comprises a supervised internal model and deep reinforcement learning. Through permutation of target-reaching conditions, we demonstrate that the proposed model is capable of rapidly learning the kinematics of the system without a priori knowledge, and is robust to (A) changing environmental reward and kinematics, and (B) occluded vision. The arbitrator model is compared to exclusive deliberate planning with the internal model and exclusive habitual control instances of the model. The results show how such a model can harness the benefits of both systems, using fast decisions in reliable circumstances while optimizing performance in changing environments. In addition, the proposed model learns very fast.
Keywords: Machine Learning, Reinforcement Learning, Supervised Learning, Habitual controller, Planning, Internal Models, Decision Making
1. Introduction
Much of the current reinforcement learning (RL) literature is in the domain of model-free control.
Such a learning agent learns a value function from interacting with the environment, usually updating a proposed value function from a temporal difference between the previous expectation and a new experience [1, 2]. The value function is like a big lookup-table that can quickly supply evaluations for possible actions and hence provide guidance for actions in a fast and somewhat automated way. Such a decision system can be characterized as habitual. While habitual action selection takes time to learn and requires that similar previous situations have been encountered sufficiently, the advantage is that decisions and correspondingly actions can be generated in a timely manner. In contrast, a system that has some internal models of the environment can be used to derive a value function on demand for a specific situation.
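The temporal-difference update on a lookup table described above can be sketched as follows. This is a generic tabular TD(0) illustration, not the architecture used in the paper; the function name `td0_update`, the step size `alpha` and the discount `gamma` are assumptions chosen for the example.

```python
from collections import defaultdict


def td0_update(values, state, reward, next_state,
               alpha=0.1, gamma=0.9, terminal=False):
    """One tabular TD(0) step on a lookup-table value function.

    The TD error is the gap between the new experience
    (reward + discounted value of the next state) and the
    previous expectation values[state].
    """
    bootstrap = 0.0 if terminal else gamma * values[next_state]
    td_error = reward + bootstrap - values[state]
    values[state] += alpha * td_error
    return td_error


# Unseen states default to a value of 0.0 (an assumption of this sketch).
values = defaultdict(float)
td0_update(values, "s0", reward=1.0, next_state="s1")
```

Once learned, reading `values[state]` is a constant-time lookup, which is the sense in which such a system supplies evaluations quickly.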
Google supercomputer creates its own 'AI child'
Google researchers have created an 'AI child' that can outperform its human-made counterparts. The machine learns through 'reinforcement learning', which means it trains for a task, reports back to its AI 'parent' and then learns how it can do it better. The creation of this AI child is proof some machine-made programmes are now more accurate than ones created by humans.
#Exploration: A Study of Count-Based Exploration for Deep Reinforcement Learning
Tang, Haoran, Houthooft, Rein, Foote, Davis, Stooke, Adam, Chen, Xi, Duan, Yan, Schulman, John, De Turck, Filip, Abbeel, Pieter
Count-based exploration algorithms are known to perform near-optimally when used in conjunction with tabular reinforcement learning (RL) methods for solving small discrete Markov decision processes (MDPs). It is generally thought that count-based methods cannot be applied in high-dimensional state spaces, since most states will only occur once. Recent deep RL exploration strategies are able to deal with high-dimensional continuous state spaces through complex heuristics, often relying on optimism in the face of uncertainty or intrinsic motivation. In this work, we describe a surprising finding: a simple generalization of the classic count-based approach can reach near state-of-the-art performance on various high-dimensional and/or continuous deep RL benchmarks. States are mapped to hash codes, which allows their occurrences to be counted with a hash table. These counts are then used to compute a reward bonus according to the classic count-based exploration theory. We find that simple hash functions can achieve surprisingly good results on many challenging tasks. Furthermore, we show that a domain-dependent learned hash code may further improve these results. Detailed analysis reveals important aspects of a good hash function: 1) having appropriate granularity and 2) encoding information relevant to solving the MDP. This exploration strategy achieves near state-of-the-art performance on both continuous control tasks and Atari 2600 games, hence providing a simple yet powerful baseline for solving MDPs that require considerable exploration.
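A rough sketch of the idea of hashing states and counting them might look like the following. The abstract does not specify the hash function or the bonus formula, so the sign-of-random-projection hash and the bonus of the form beta / sqrt(count) used here are assumptions, as are the class name `HashCountBonus` and its parameters.

```python
import math
import random
from collections import Counter


class HashCountBonus:
    """Count hashed states and return an exploration bonus (illustrative only)."""

    def __init__(self, state_dim, n_bits=8, beta=0.1, seed=0):
        rng = random.Random(seed)
        # Fixed random projection; n_bits controls the hash granularity.
        self.projection = [
            [rng.gauss(0.0, 1.0) for _ in range(state_dim)]
            for _ in range(n_bits)
        ]
        self.beta = beta
        self.counts = Counter()

    def hash_code(self, state):
        """Map a real-valued state vector to a tuple of n_bits binary digits."""
        return tuple(
            1 if sum(w * x for w, x in zip(row, state)) >= 0 else 0
            for row in self.projection
        )

    def bonus(self, state):
        """Increment the count of the state's hash code and return the bonus."""
        code = self.hash_code(state)
        self.counts[code] += 1
        return self.beta / math.sqrt(self.counts[code])
```

In this sketch, fewer bits merge more states into one bucket and more bits separate them, which is one way to read the granularity point made in the abstract.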
Learning a Generative Model for Validity in Complex Discrete Structures
Janz, David, van der Westhuizen, Jos, Paige, Brooks, Kusner, Matt J., Hernández-Lobato, José Miguel
Deep generative models have been successfully used to learn representations for high-dimensional discrete spaces by representing discrete objects as sequences, for which powerful sequence-based deep models can be employed. Unfortunately, these techniques are significantly hindered by the fact that these generative models often produce invalid sequences: sequences which do not represent any underlying discrete structure. As a step towards solving this problem, we propose to learn a deep recurrent validator model, which can estimate whether a partial sequence can function as the beginning of a full, valid sequence. This model not only discriminates between valid and invalid sequences, but also provides insight as to how individual sequence elements influence the validity of the overall sequence, and the existence of a corresponding discrete object. To learn this model we propose a reinforcement learning approach, where an oracle which can evaluate validity of complete sequences provides a sparse reward signal. We believe this is a key step toward learning generative models that faithfully produce valid sequences which represent discrete objects. We demonstrate its effectiveness in evaluating the validity of Python 3 source code for mathematical expressions, and improving the ability of a variational autoencoder trained on SMILES strings to decode valid molecular structures.
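To make the sparse reward signal concrete, the sketch below shows one possible oracle for complete sequences of Python 3 expression source. It uses the standard library's `ast.parse` as a stand-in and checks syntax only; the oracle actually used by the authors is not described in the abstract, and the name `validity_reward` is hypothetical.

```python
import ast


def validity_reward(sequence: str) -> float:
    """Return 1.0 if `sequence` parses as a single Python expression, else 0.0.

    Assumption of this sketch: 'valid' means syntactically valid. Any
    well-formed expression passes, not only mathematical ones.
    """
    try:
        ast.parse(sequence, mode="eval")
    except (SyntaxError, ValueError):
        return 0.0
    return 1.0
```

Such an oracle only scores complete sequences, which is why the abstract describes the reward as sparse: a validator for partial sequences has to be learned from these end-of-sequence signals.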
Using Options and Covariance Testing for Long Horizon Off-Policy Policy Evaluation
Guo, Zhaohan Daniel, Thomas, Philip S., Brunskill, Emma
Evaluating a policy by deploying it in the real world can be risky and costly. Off-policy policy evaluation (OPE) algorithms use historical data collected from running a previous policy to evaluate a new policy, which provides a means for evaluating a policy without requiring it to ever be deployed. Importance sampling is a popular OPE method because it is robust to partial observability and works with continuous states and actions. However, the amount of historical data required by importance sampling can scale exponentially with the horizon of the problem: the number of sequential decisions that are made. We propose using policies over temporally extended actions, called options, and show that combining these policies with importance sampling can significantly improve performance for long-horizon problems. In addition, we can take advantage of special cases that arise due to options-based policies to further improve the performance of importance sampling. We further generalize these special cases to a general covariance testing rule that can be used to decide which weights to drop in an IS estimate, and derive a new IS algorithm called Incremental Importance Sampling that can provide significantly more accurate estimates for a broad class of domains.
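As an illustration of why the horizon matters, here is a minimal sketch of the ordinary (trajectory-wise) importance sampling estimator. It is not the options-based method or the Incremental Importance Sampling algorithm proposed in the paper. The function name `is_estimate` and the representation of a trajectory as a list of (state, action, reward) tuples are assumptions.

```python
def is_estimate(trajectories, pi_e, pi_b, gamma=1.0):
    """Ordinary importance sampling estimate of the evaluation policy's return.

    trajectories: list of trajectories collected under the behavior policy,
                  each a list of (state, action, reward) tuples.
    pi_e, pi_b:   functions (state, action) -> action probability under the
                  evaluation and behavior policies respectively.
    """
    total = 0.0
    for trajectory in trajectories:
        weight = 1.0      # product of per-step probability ratios
        ret = 0.0         # discounted return of this trajectory
        discount = 1.0
        for state, action, reward in trajectory:
            weight *= pi_e(state, action) / pi_b(state, action)
            ret += discount * reward
            discount *= gamma
        total += weight * ret
    return total / len(trajectories)
```

The weight is a product with one factor per decision, so its spread can grow multiplicatively with the number of steps. Temporally extended actions reduce the number of factors in that product.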