Reinforcement Learning: Overviews

Deep Reinforcement Learning with TensorFlow 2.0


In this tutorial, I will showcase the upcoming TensorFlow 2.0 features through the lens of deep reinforcement learning (DRL) by implementing an advantage actor-critic (A2C) agent to solve the classic CartPole-v0 environment. While the goal is to showcase TensorFlow 2.0, I will do my best to make the DRL aspect approachable as well, including a brief overview of the field. In fact, since the main focus of the 2.0 release is making developers' lives easier, it's a great time to get into DRL with TensorFlow -- our full agent source is under 150 lines! The code is available as a notebook here and online on Google Colab here. As TensorFlow 2.0 is still in an experimental stage, I recommend installing it in a separate (virtual) environment.

A framework for deep energy-based reinforcement learning with quantum speed-up Artificial Intelligence

In the past decade, deep learning methods have seen tremendous success in various supervised and unsupervised learning tasks such as classification and generative modeling. More recently, deep neural networks have emerged in the domain of reinforcement learning as a tool to solve decision-making problems of unprecedented complexity, e.g., navigation problems or game-playing AI. Despite the successful combinations of ideas from quantum computing with machine learning methods, there have been relatively few attempts to design quantum algorithms that would enhance deep reinforcement learning. This is partly due to the fact that quantum enhancements of deep neural networks, in general, have not been as extensively investigated as other quantum machine learning methods. In contrast, projective simulation is a reinforcement learning model inspired by the stochastic evolution of physical systems that enables a quantum speed-up in decision making. In this paper, we develop a unifying framework that connects deep learning and projective simulation, opening the route to quantum improvements in deep reinforcement learning. Our approach is based on so-called generative energy-based models to design reinforcement learning methods with a computational advantage in solving complex and large-scale decision-making problems.

Reinforcement Learning Explained: Overview, Comparisons and Applications in Business


RL algorithm learns how to act best through many attempts and failures. Trial-and-error learning is connected with the so-called long-term reward. This reward is the ultimate goal the agent learns while interacting with an environment through numerous trials and errors. The algorithm gets short-term rewards that together lead to the cumulative, long-term one. So, the key goal of reinforcement learning used today is to define the best sequence of decisions that allow the agent to solve a problem while maximizing a long-term reward. And that set of coherent actions is learned through the interaction with environment and observation of rewards in every state. Reinforcement learning is distinguished from other training styles, including supervised and unsupervised learning, by its goal and, consequently, the learning approach. Three ML training styles compared.

IPO: Interior-point Policy Optimization under Constraints Machine Learning

In this paper, we study reinforcement learning (RL) algorithms to solve real-world decision problems with the objective of maximizing the long-term reward as well as satisfying cumulative constraints. We propose a novel first-order policy optimization method, Interior-point Policy Optimization (IPO), which augments the objective with logarithmic barrier functions, inspired by the interior-point method. Our proposed method is easy to implement with performance guarantees and can handle general types of cumulative multiconstraint settings. We conduct extensive evaluations to compare our approach with state-of-the-art baselines. Our algorithm outperforms the baseline algorithms, in terms of reward maximization and constraint satisfaction.

Explainable AI: Deep Reinforcement Learning Agents for Residential Demand Side Cost Savings in Smart Grids Artificial Intelligence

Motivated by the recent advancements in deep Reinforcement Learning (RL), we develop an RL agent to manage the operation of storage devices in a household designed to maximize demand-side cost savings. The proposed technique is data-driven, and the RL agent learns from scratch on how to efficiently use the energy storage device under variable tariff-structures Contracting the concept of the "black box" where the techniques learned by the agent are ignored. We explain the learning progression of the RL agent, and the strategies it follows based on the capacity of the storage device.

Actor-Critic Provably Finds Nash Equilibria of Linear-Quadratic Mean-Field Games Machine Learning

We study discrete-time mean-field Markov games with infinite numbers of agents where each agent aims to minimize its ergodic cost. We consider the setting where the agents have identical linear state transitions and quadratic cost functions, while the aggregated effect of the agents is captured by the population mean of their states, namely, the mean-field state. For such a game, based on the Nash certainty equivalence principle, we provide sufficient conditions for the existence and uniqueness of its Nash equilibrium. Moreover, to find the Nash equilibrium, we propose a mean-field actor-critic algorithm with linear function approximation, which does not require knowing the model of dynamics. Specifically, at each iteration of our algorithm, we use the single-agent actor-critic algorithm to approximately obtain the optimal policy of the each agent given the current mean-field state, and then update the mean-field state. In particular, we prove that our algorithm converges to the Nash equilibrium at a linear rate. To the best of our knowledge, this is the first success of applying model-free reinforcement learning with function approximation to discrete-time mean-field Markov games with provable non-asymptotic global convergence guarantees.

Model-free Reinforcement Learning in Infinite-horizon Average-reward Markov Decision Processes Artificial Intelligence

Model-free reinforcement learning is known to be memory and computation efficient and more amendable to large scale problems. In this paper, two model-free algorithms are introduced for learning infinite-horizon average-reward Markov Decision Processes (MDPs). The first algorithm reduces the problem to the discounted-reward version and achieves $\mathcal{O}(T^{2/3})$ regret after $T$ steps, under the minimal assumption of weakly communicating MDPs. The second algorithm makes use of recent advances in adaptive algorithms for adversarial multi-armed bandits and improves the regret to $\mathcal{O}(\sqrt{T})$, albeit with a stronger ergodic assumption. To the best of our knowledge, these are the first model-free algorithms with sub-linear regret (that is polynomial in all parameters) in the infinite-horizon average-reward setting.

Extracting Incentives from Black-Box Decisions Artificial Intelligence

An algorithmic decision-maker incentivizes people to act in certain ways to receive better decisions. These incentives can dramatically influence subjects' behaviors and lives, and it is important that both decision-makers and decision-recipients have clarity on which actions are incentivized by the chosen model. While for linear functions, the changes a subject is incentivized to make may be clear, we prove that for many non-linear functions (e.g. neural networks, random forests), classical methods for interpreting the behavior of models (e.g. input gradients) provide poor advice to individuals on which actions they should take. In this work, we propose a mathematical framework for understanding algorithmic incentives as the challenge of solving a Markov Decision Process, where the state includes the set of input features, and the reward is a function of the model's output. We can then leverage the many toolkits for solving MDPs (e.g. tree-based planning, reinforcement learning) to identify the optimal actions each individual is incentivized to take to improve their decision under a given model. We demonstrate the utility of our method by estimating the maximally-incentivized actions in two real-world settings: a recidivism risk predictor we train using ProPublica's COMPAS dataset, and an online credit scoring tool published by the Fair Isaac Corporation (FICO).

Autonomous Driving using Safe Reinforcement Learning by Incorporating a Regret-based Human Lane-Changing Decision Model Artificial Intelligence

-- It is expected that many human drivers will still prefer to drive themselves even if the self-driving technologies are ready. T o enable A Vs to safely and efficiently maneuver in this mixed traffic, it is critical that the A Vs can understand how humans cope with risks and make driving-related decisions. On the other hand, the driving environment is highly dynamic and ever-changing, and it is thus difficult to enumerate all the scenarios and hard-code the controllers. T o face up these challenges, in this work, we incorporate a human decision-making model in reinforcement learning to control A Vs for safe and efficient operations. Specifically, we adapt regret theory to describe a human driver's lane-changing behavior, and fit the personalized models to individual drivers for predicting their lane-changing decisions. The predicted decisions are incorporated in the safety constraints for reinforcement learning in training and in implementation. We then use an extended version of double deep Q-network (DDQN) to train our A V controller within the safety set. By doing so, the amount of collisions in training is reduced to zero, while the training accuracy is not impinged. I. INTRODUCTION Autonomous driving has attracted significant research interest in the past two decades as it offers the potential to release drivers from exhausting driving. While great progresses have been made in the field of perception, path planning, and controls, high-level decision-making remains a big challenge due to the involvement of complex, cluttered environment and the dynamic, uncertain behaviors of other traffic users. Some recent works have been applying reinforcement learning (RL) methods to autonomous driving and promising performance [1] has been reported. RL-based methods can learn the decision-making and driving behaviors which are hard, if not infeasible, for traditional rule-based designs, and often with much less human effort. However, it is reported in [2] that when using RL-based methods lots of collisions happen before the agent starts to behave properly.

RLCard: A Toolkit for Reinforcement Learning in Card Games Artificial Intelligence

RLCard is an open-source toolkit for reinforcement learning research in card games. It supports various card environments with easy-to-use interfaces, including Blackjack, Leduc Hold'em, Texas Hold'em, UNO, Dou Dizhu and Mahjong. The goal of RLCard is to bridge reinforcement learning and imperfect information games, and push forward the research of reinforcement learning in domains with multiple agents, large state and action space, and sparse reward. In this paper, we provide an overview of the key components in RLCard, a discussion of the design principles, a brief introduction of the interfaces, and comprehensive evaluations of the environments.