Reinforcement Learning
New Trend of Data Science: Reinforcement Learning Analytics Insight
Reinforcement Learning (RL) is a machine learning method that enables an agent to learn in an interactive environment by trial and error, using feedback from its own actions and experiences. Although both supervised and reinforcement learning map inputs to outputs, they differ in the feedback they use: in supervised learning the feedback given to the agent is the correct set of actions for performing a task, whereas reinforcement learning uses rewards and punishments as signals for positive and negative behavior. Reinforcement learning also differs from unsupervised learning in its goals. While the goal in unsupervised learning is to find similarities and differences between data points, in reinforcement learning the goal is to find a suitable action model that maximizes the agent's total cumulative reward. Reinforcement learning will be a huge thing in data science in 2019. While RL has been around for quite a while in academia, it has seen hardly any industry adoption.
TD-Regularized Actor-Critic Methods
Parisi, Simone, Tangkaratt, Voot, Peters, Jan, Khan, Mohammad Emtiyaz
Actor-critic methods can achieve incredible performance on difficult reinforcement learning problems, but they are also prone to instability. This is partly due to the interaction between the actor and critic during learning, e.g., an inaccurate step taken by one of them might adversely affect the other and destabilize the learning. To avoid such issues, we propose to regularize the learning objective of the actor by penalizing the temporal difference (TD) error of the critic. This improves stability by avoiding large steps in the actor update whenever the critic is highly inaccurate. The resulting method, which we call the TD-regularized actor-critic method, is a simple plug-and-play approach to improve stability and overall performance of the actor-critic methods. Evaluations on standard benchmarks confirm this.
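The idea of penalizing the critic's TD error in the actor's objective can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes a softmax policy over discrete actions and a score-function (likelihood-ratio) gradient estimate, and the function name `td_regularized_actor_grad`, the penalty weight `eta`, and the per-sample form of the weight are all hypothetical choices made for this example.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def td_regularized_actor_grad(logits, action, q_sa, reward, q_next,
                              gamma=0.99, eta=0.1):
    """One-sample gradient (w.r.t. the logits) of a TD-penalized actor objective.

    Assumed objective for this sketch: E[Q(s, a)] - eta * E[td_error^2],
    differentiated with the score-function trick. All names are illustrative.
    """
    probs = softmax(logits)
    # TD error of the critic for the sampled transition.
    td_error = reward + gamma * q_next - q_sa
    # Score function of a softmax policy: d log pi(a) / d logit_k = 1[k == a] - pi(k).
    score = [(1.0 if k == action else 0.0) - p for k, p in enumerate(probs)]
    # The critic's value is discounted by the squared TD error, so actions whose
    # value estimate is unreliable contribute less (or negatively) to the update.
    weight = q_sa - eta * td_error ** 2
    grad = [weight * s for s in score]
    return grad, td_error
```

With `eta=0` this reduces to a plain score-function actor gradient; increasing `eta` makes the update more conservative wherever the critic's TD error is large.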
Escape Room: A Configurable Testbed for Hierarchical Reinforcement Learning
Recent successes in Reinforcement Learning have encouraged a fast-growing network of RL researchers and a number of breakthroughs in RL research. As the RL community and the body of RL work grows, so does the need for widely applicable benchmarks that can fairly and effectively evaluate a variety of RL algorithms. This need is particularly apparent in the realm of Hierarchical Reinforcement Learning (HRL). While many existing test domains may exhibit hierarchical action or state structures, modern RL algorithms still exhibit great difficulty in solving domains that necessitate hierarchical modeling and action planning, even when such domains are seemingly trivial. These difficulties highlight both the need for more focus on HRL algorithms themselves, and the need for new testbeds that will encourage and validate HRL research. Existing HRL testbeds exhibit a Goldilocks problem; they are often either too simple (e.g. Taxi) or too complex (e.g. Montezuma's Revenge from the Arcade Learning Environment). In this paper we present the Escape Room Domain (ERD), a new flexible, scalable, and fully implemented testing domain for HRL that bridges the "moderate complexity" gap left behind by existing alternatives. ERD is open-source and freely available through GitHub, and conforms to widely-used public testing interfaces for simple integration and testing with a variety of public RL agent implementations. We show that the ERD presents a suite of challenges with scalable difficulty to provide a smooth learning gradient from Taxi to the Arcade Learning Environment.
Estimating Rationally Inattentive Utility Functions with Deep Clustering for Framing - Applications in YouTube Engagement Dynamics
Hoiles, William, Krishnamurthy, Vikram
We consider a framework involving behavioral economics and machine learning. Rationally inattentive Bayesian agents make decisions based on their posterior distribution, utility function and information acquisition cost (Renyi divergence, which generalizes Shannon mutual information). By observing these decisions, how can an observer estimate the utility function and information acquisition cost? Using deep learning, we estimate framing information (essential extrinsic features) that determines the agent's attention strategy. Then we present a preference based inverse reinforcement learning algorithm to test for rational inattention: is the agent a utility maximizer, attention maximizer, and does an information cost function exist that rationalizes the data? The test imposes a Renyi mutual information constraint which impacts how the agent can select attention strategies to maximize their expected utility. The test provides constructive estimates of the utility function and information acquisition cost of the agent. We illustrate these methods on a massive YouTube dataset for characterizing the commenting behavior of users.
Reinforcement Learning for Adaptive Caching with Dynamic Storage Pricing
Sadeghi, Alireza, Sheikholeslami, Fatemeh, Marques, Antonio G., Giannakis, Georgios B.
Small base stations (SBs) of fifth-generation (5G) cellular networks are envisioned to have storage devices to locally serve requests for reusable and popular contents by caching them at the edge of the network, close to the end users. The ultimate goal is to shift part of the predictable load on the back-haul links, from on-peak to off-peak periods, contributing to a better overall network performance and service experience. To enable the SBs with efficient fetch-cache decision-making schemes operating in dynamic settings, this paper introduces simple but flexible generic time-varying fetching and caching costs, which are then used to formulate a constrained minimization of the aggregate cost across files and time. Since caching decisions per time slot influence the content availability in future slots, the novel formulation for optimal fetch-cache decisions falls into the class of dynamic programming. Under this generic formulation, first by considering stationary distributions for the costs and file popularities, an efficient reinforcement learning-based solver known as value iteration algorithm can be used to solve the emerging optimization problem. Later, it is shown that practical limitations on cache capacity can be handled using a particular instance of the generic dynamic pricing formulation. Under this setting, to provide a light-weight online solver for the corresponding optimization, the well-known reinforcement learning algorithm, Q-learning, is employed to find optimal fetch-cache decisions. Numerical tests corroborating the merits of the proposed approach wrap up the paper.
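To make the role of Q-learning in fetch-cache decisions more concrete, here is a minimal tabular sketch on a toy problem. It is not the paper's formulation: the single-file setting, the two-valued state, the fixed request probability, and the cost values (`fetch_cost`, `cache_cost`) are all assumptions invented for illustration, and the function name `q_learning_cache` is hypothetical.

```python
import random

def q_learning_cache(p_request=0.9, fetch_cost=10.0, cache_cost=1.0,
                     steps=20000, alpha=0.1, gamma=0.9, epsilon=0.2, seed=0):
    """Tabular Q-learning on a hypothetical one-file caching toy problem.

    State  s in {0, 1}: whether the file is cached in the current slot.
    Action a in {0, 1}: whether to hold the file in the cache for the next slot.
    Per-slot cost: cache_cost if we store the file, plus fetch_cost if the
    file is requested while it is not cached. Q-values estimate discounted cost.
    """
    rng = random.Random(seed)
    q = [[0.0, 0.0], [0.0, 0.0]]  # q[state][action]
    state = 0
    for _ in range(steps):
        # Epsilon-greedy over costs: explore at random, otherwise take the cheapest action.
        if rng.random() < epsilon:
            action = rng.randrange(2)
        else:
            action = min((0, 1), key=lambda a: q[state][a])
        requested = rng.random() < p_request
        cost = cache_cost * action
        if requested and state == 0:
            cost += fetch_cost  # cache miss: fetch over the back-haul link
        next_state = action
        # Q-learning update for a cost-minimization problem (min instead of max).
        target = cost + gamma * min(q[next_state])
        q[state][action] += alpha * (target - q[state][action])
        state = next_state
    return q
```

In this toy setting fetching is expensive relative to storage, so the learned Q-table should prefer caching (action 1) in both states; lowering `p_request` or `fetch_cost` far enough would reverse that preference.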
Learning to Navigate the Web
Gur, Izzeddin, Rueckert, Ulrich, Faust, Aleksandra, Hakkani-Tur, Dilek
Learning in environments with large state and action spaces, and sparse rewards, can hinder a Reinforcement Learning (RL) agent's learning through trial-and-error. For instance, following natural language instructions on the Web (such as booking a flight ticket) leads to RL settings where the input vocabulary and the number of actionable elements on a page can grow very large. Even though recent approaches improve the success rate on relatively simple environments with the help of human demonstrations to guide the exploration, they still fail in environments where the set of possible instructions can reach millions. We approach the aforementioned problems from a different perspective and propose guided RL approaches that can generate an unbounded amount of experience for an agent to learn from. Instead of learning from a complicated instruction with a large vocabulary, we decompose it into multiple sub-instructions and schedule a curriculum in which an agent is tasked with a gradually increasing subset of these relatively easier sub-instructions. In addition, when the expert demonstrations are not available, we propose a novel meta-learning framework that generates new instruction-following tasks and trains the agent more effectively. We train a DQN, a deep reinforcement learning agent, with the Q-value function approximated by a novel QWeb neural network architecture, on these smaller, synthetic instructions. We evaluate the ability of our agent to generalize to new instructions on the World of Bits benchmark, on forms with up to 100 elements, supporting 14 million possible instructions. The QWeb agent outperforms the baseline without using any human demonstration, achieving a 100% success rate on several difficult environments.
Optimizing Quantum Error Correction Codes with Reinforcement Learning
Nautrup, Hendrik Poulsen, Delfosse, Nicolas, Dunjko, Vedran, Briegel, Hans J., Friis, Nicolai
Quantum error correction is widely thought to be the key to fault-tolerant quantum computation. However, determining the most suited encoding for unknown error channels or specific laboratory setups is highly challenging. Here, we present a reinforcement learning framework for optimizing and fault-tolerantly adapting quantum error correction codes. We consider a reinforcement learning agent tasked with modifying a quantum memory until a desired logical error rate is reached. Using efficient simulations of a surface code quantum memory with about 70 physical qubits, we demonstrate that such a reinforcement learning agent can determine near-optimal solutions, in terms of the number of physical qubits, for various error models of interest. Moreover, we show that agents trained on one task are able to transfer their experience to similar tasks. This ability for transfer learning showcases the inherent strengths of reinforcement learning and the applicability of our approach for optimization both in off-line simulations and on-line under laboratory conditions.
Pre-training with Non-expert Human Demonstration for Deep Reinforcement Learning
de la Cruz, Gabriel V., Du, Yunshu, Taylor, Matthew E.
Deep reinforcement learning (deep RL) has achieved superior performance in complex sequential tasks by using deep neural networks as function approximators to learn directly from raw input images. However, learning directly from raw images is data inefficient. The agent must learn feature representation of complex states in addition to learning a policy. As a result, deep RL typically suffers from slow learning speeds and often requires a prohibitively large amount of training time and data to reach reasonable performance, making it inapplicable to real-world settings where data is expensive. In this work, we improve data efficiency in deep RL by addressing one of the two learning goals, feature learning. We leverage supervised learning to pre-train on a small set of non-expert human demonstrations and empirically evaluate our approach using the asynchronous advantage actor-critic algorithms (A3C) in the Atari domain. Our results show significant improvements in learning speed, even when the provided demonstration is noisy and of low quality.
RL -- Deep Reinforcement Learning (Learn effectively like a human)
With the brute force of GPUs and a better understanding of AI, we beat the Go champions, and Face ID comes with every new iPhone. But in the robotics world, training a robot to peel lettuce makes the news. Even with an unfair advantage in computation speed, a computer still cannot manage tasks that we take for granted. The dilemma is that AI does not learn as effectively as humans do. We may be just a couple of papers away from another breakthrough, or we may need to learn more effectively.
Derivative-Free Methods for Policy Optimization: Guarantees for Linear Quadratic Systems
Malik, Dhruv, Pananjady, Ashwin, Bhatia, Kush, Khamaru, Koulik, Bartlett, Peter L., Wainwright, Martin J.
We study derivative-free methods for policy optimization over the class of linear policies. We focus on characterizing the convergence rate of these methods when applied to linear-quadratic systems, and study various settings of driving noise and reward feedback. We show that these methods provably converge to within any pre-specified tolerance of the optimal policy with a number of zero-order evaluations that is an explicit polynomial of the error tolerance, dimension, and curvature properties of the problem. Our analysis reveals some interesting differences between the settings of additive driving noise and random initialization, as well as the settings of one-point and two-point reward feedback. Our theory is corroborated by extensive simulations of derivative-free methods on these systems. Along the way, we derive convergence rates for stochastic zero-order optimization algorithms when applied to a certain class of non-convex problems.
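A derivative-free policy search with two-point feedback can be sketched on a scalar linear-quadratic system as follows. This is only an illustration under stated assumptions, not the paper's algorithm or analysis: the system parameters, the fixed initial state, the step size `lr`, the smoothing radius `delta`, and the function names `lq_cost` and `two_point_policy_search` are all hypothetical choices.

```python
import random

def lq_cost(k, a=1.0, b=1.0, q=1.0, r=1.0, x0=1.0, horizon=50):
    """Finite-horizon cost of the linear policy u = -k * x on x' = a*x + b*u."""
    x, total = x0, 0.0
    for _ in range(horizon):
        u = -k * x
        total += q * x * x + r * u * u
        x = a * x + b * u
    return total

def two_point_policy_search(k0=0.2, delta=0.01, lr=0.01, steps=300, seed=0):
    """Zero-order descent on the gain k using two cost evaluations per step.

    Only cost values are queried (no derivatives). In one dimension the random
    direction is just a random sign, so the estimate reduces to a central
    difference; in higher dimensions one would sample a random unit vector.
    """
    rng = random.Random(seed)
    k = k0
    for _ in range(steps):
        direction = rng.choice((-1.0, 1.0))
        # Two-point estimate: difference of costs at symmetric perturbations.
        diff = lq_cost(k + delta * direction) - lq_cost(k - delta * direction)
        grad_est = diff / (2.0 * delta) * direction
        k -= lr * grad_est
    return k
```

For the default parameters (a = b = q = r = 1), the infinite-horizon optimal gain obtained from the scalar Riccati equation is about 0.618, so the returned gain can be compared against that value as a sanity check.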