Reinforcement Learning
Deep Reinforcement Learning in Ice Hockey for Context-Aware Player Evaluation
Liu, Guiliang, Schulte, Oliver
A variety of machine learning models have been proposed to assess the performance of players in professional sports. However, they have only a limited ability to model how player performance depends on the game context. This paper proposes a new approach to capturing game context: we apply Deep Reinforcement Learning (DRL) to learn an action-value Q function from 3M play-by-play events in the National Hockey League (NHL). The neural network representation integrates both continuous context signals and game history, using a possession-based LSTM. The learned Q-function is used to value players' actions under different game contexts. To assess a player's overall performance, we introduce a novel Game Impact Metric (GIM) that aggregates the values of the player's actions. Empirical Evaluation shows GIM is consistent throughout a play season, and correlates highly with standard success measures and future salary.
GAN Q-learning โ Arxiv Vanity
Up to now, deep learning methods in RL used multiple function approximators (typically a network with shared hidden layers) to fit a state value or state-action value distribution. For instance, bootstrappedDQN () used k-heads on the state-action value function Q for every available action and used it to model a distribution. In bayesianpol (), a Bayesian framework was applied to the actor-critic architecture by fitting a Gaussian Process (GP) instead of the critic, hence allowing for a closed-form derivation of update rules. More recently, bellemare2017distributional () introduced a distributional algorithm C51 which aimed to solve the RL problem by learning a categorical probability vector over returns Q. Unlike GANRL () which uses a generative network to learn the underlying transition model of the environment, we utilize a generative network to model the distribution approximation of the Bellman updates.
Visceral Machines: Reinforcement Learning with Intrinsic Rewards that Mimic the Human Nervous System
McDuff, Daniel, Kapoor, Ashish
The human autonomic nervous system has evolved over millions of years and is essential for survival and responding to threats. As people learn to navigate the world, "fight or flight" responses provide intrinsic feedback about the potential consequence of action choices (e.g., becoming nervous when close to a cliff edge or driving fast around a bend.) We present a novel approach to reinforcement learning that leverages a task-independent intrinsic reward function that mimics human autonomic nervous system responses based on peripheral pulse measurements. Our hypothesis is that such reward functions can circumvent the challenges associated with sparse and skewed rewards in reinforcement learning settings and can help improve sample efficiency. We test this in a simulated driving environment and show that it can increase the speed of learning and reduce the number of collisions during the learning stage.
Finite Sample Analysis of LSTD with Random Projections and Eligibility Traces
Li, Haifang, Xia, Yingce, Zhang, Wensheng
Policy evaluation, commonly referred to as value function approximation, is an important and central part in many reinforcement learning (RL) algorithms [27], whose task is to estimate value functions for a fixed policy in a discounted Markov Decision Process (MDP) environment. The value function of each state specifies the accumulated reward an agent would receive in the future by following the fixed policy from that state. Value functions have been widely investigated in RL applications, and it can provide insightful and important information for the agent to obtain an optimal policy, such as important board configurations in Go [24], failure probabilities of large telecommunication networks [9], taxi-out times at large airports [2] and so on. Despite the value functions can be approximated by different ways, the simplest form, linear approximations, are still widely adopted and studied due to their good generalization abilities, relatively efficient computation and solid theoretical guarantees[27, 7, 13, 16]. Temporal Difference (TD) learning is a common approach to this policy evaluation with linear function approximation problem[27]. These typical TD algorithms can be divided into two categories: gradient based methods (e.g., GTD(ฮป) [28]) and least-square (LS) based methods (e.g., LSTD(ฮป)[4]). A good survey on these algorithms can be found in [17, 6, 12, 7, 13]. 1 As the development of information technologies, high-dimensional data is widely seen in RL applications [26, 30, 23], which brings serious challenges to design scalable and computationally efficient algorithms for the linear value function approximation problem. To address this practical issue, several approaches have been developed for efficient and effective value function approximation.
Virtual-Taobao: Virtualizing Real-world Online Retail Environment for Reinforcement Learning
Shi, Jing-Cheng, Yu, Yang, Da, Qing, Chen, Shi-Yong, Zeng, An-Xiang
Applying reinforcement learning in physical-world tasks is extremely challenging. It is commonly infeasible to sample a large number of trials, as required by current reinforcement learning methods, in a physical environment. This paper reports our project on using reinforcement learning for better commodity search in Taobao, one of the largest online retail platforms and meanwhile a physical environment with a high sampling cost. Instead of training reinforcement learning in Taobao directly, we present our approach: first we build Virtual Taobao, a simulator learned from historical customer behavior data through the proposed GAN-SD (GAN for Simulating Distributions) and MAIL (multi-agent adversarial imitation learning), and then we train policies in Virtual Taobao with no physical costs in which ANC (Action Norm Constraint) strategy is proposed to reduce over-fitting. In experiments, Virtual Taobao is trained from hundreds of millions of customers' records, and its properties are compared with the real environment. The results disclose that Virtual Taobao faithfully recovers important properties of the real environment. We also show that the policies trained in Virtual Taobao can have significantly superior online performance to the traditional supervised approaches. We hope our work could shed some light on reinforcement learning applications in complex physical environments.
Fast Policy Learning through Imitation and Reinforcement
Cheng, Ching-An, Yan, Xinyan, Wagener, Nolan, Boots, Byron
Imitation learning (IL) consists of a set of tools that leverage expert demonstrations to quickly learn policies. However, if the expert is suboptimal, IL can yield policies with inferior performance compared to reinforcement learning (RL). In this paper, we aim to provide an algorithm that combines the best aspects of RL and IL. We accomplish this by formulating several popular RL and IL algorithms in a common mirror descent framework, showing that these algorithms can be viewed as a variation on a single approach. We then propose LOKI, a strategy for policy learning that first performs a small but random number of IL iterations before switching to a policy gradient RL method. We show that if the switching time is properly randomized, LOKI can learn to outperform a suboptimal expert and converge faster than running policy gradient from scratch. Finally, we evaluate the performance of LOKI experimentally in several simulated environments.
Learning Self-Imitating Diverse Policies
Gangwani, Tanmay, Liu, Qiang, Peng, Jian
Deep reinforcement learning algorithms, including policy gradient methods and Q-learning, have been widely applied to a variety of decision-making problems. Their success has relied heavily on having very well designed dense reward signals, and therefore, they often perform badly on the sparse or episodic reward settings. Trajectory-based policy optimization methods, such as cross-entropy method and evolution strategies, do not take into consideration the temporal nature of the problem and often suffer from high sample complexity. Scaling up the efficiency of RL algorithms to real-world problems with sparse or episodic rewards is therefore a pressing need. In this work, we present a new perspective of policy optimization and introduce a self-imitation learning algorithm that exploits and explores well in the sparse and episodic reward settings. First, we view each policy as a state-action visitation distribution and formulate policy optimization as a divergence minimization problem. Then, we show that, with Jensen-Shannon divergence, this divergence minimization problem can be reduced into a policy-gradient algorithm with dense reward learned from experience replays. Experimental results indicate that our algorithm works comparable to existing algorithms in the dense reward setting, and significantly better in the sparse and episodic settings. To encourage exploration, we further apply the Stein variational policy gradient descent with the Jensen-Shannon kernel to learn multiple diverse policies and demonstrate its effectiveness on a number of challenging tasks.
Imitation Learning in Unity: The Workflow โ Unity Blog
With the release of ML-Agents v0.3 Beta, there are lots of new ways to use Machine Learning in your projects. Whether you're working on games, simulations, academic or any other sort of projects, your work can benefit from the use of neural networks in the virtual environment. If you've been using ML-Agents before this latest release, you will already be familiar with Reinforcement Learning. If not, I wrote a beginner's guide to get you started. This blog post will help you get up to speed with one of the major features that represent an alternative to Reinforcement Learning: Imitation Learning.
5 Minute Guide to AI in Cyber Security
There is another way to categorize the machine learning models above, which is supervised learning (where machine learns from past data that humans have already labeled as good or bad, attack or false positive, fraud or normal data), unsupervised learning (where no past labeled data exist) or reinforcement learning (where machine learns from feedback from its longer-term results). Supervised learning will include classification, regression and deep learning. Unsupervised learning includes clustering, association rules and pattern matching. Diagram 2 will now become diagram 3 below.
Meta-Gradient Reinforcement Learning
Xu, Zhongwen, van Hasselt, Hado, Silver, David
The goal of reinforcement learning algorithms is to estimate and/or optimise the value function. However, unlike supervised learning, no teacher or oracle is available to provide the true value function. Instead, the majority of reinforcement learning algorithms estimate and/or optimise a proxy for the value function. This proxy is typically based on a sampled and bootstrapped approximation to the true value function, known as a return. The particular choice of return is one of the chief components determining the nature of the algorithm: the rate at which future rewards are discounted; when and how values should be bootstrapped; or even the nature of the rewards themselves. It is well-known that these decisions are crucial to the overall success of RL algorithms. We discuss a gradient-based meta-learning algorithm that is able to adapt the nature of the return, online, whilst interacting and learning from the environment. When applied to 57 games on the Atari 2600 environment over 200 million frames, our algorithm achieved a new state-of-the-art performance.