AITopics | Reinforcement Learning

Reinforcement Learning

"Reinforcement learning is learning what to do – how to map situations to actions – so as to maximize a numerical reward signal. The learner is not told which actions to take, as in most forms of machine learning, but instead must discover which actions yield the most reward by trying them."
– Sutton, Richard S. and Andrew G. Barto. Reinforcement Learning: An Introduction. (1.1). MIT Press, Cambridge, MA, 1998.

Deep Reinforcement Learning in Ice Hockey for Context-Aware Player Evaluation

Liu, Guiliang, Schulte, Oliver

arXiv.org Artificial Intelligence, May-26-2018

A variety of machine learning models have been proposed to assess the performance of players in professional sports. However, they have only a limited ability to model how player performance depends on the game context. This paper proposes a new approach to capturing game context: we apply Deep Reinforcement Learning (DRL) to learn an action-value Q-function from 3M play-by-play events in the National Hockey League (NHL). The neural network representation integrates both continuous context signals and game history, using a possession-based LSTM. The learned Q-function is used to value players' actions under different game contexts. To assess a player's overall performance, we introduce a novel Game Impact Metric (GIM) that aggregates the values of the player's actions. Empirical evaluation shows that GIM is consistent throughout a playing season and correlates highly with standard success measures and future salary.

machine learning, player evaluation, reinforcement learning, (13 more...)

arXiv.org Artificial Intelligence

1805.11088

Genre: Research Report (1.00)

Industry: Leisure & Entertainment > Sports > Hockey (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.57)

GAN Q-learning – Arxiv Vanity

#artificialintelligence, May-25-2018, 15:21:42 GMT

Up to now, deep learning methods in RL have used multiple function approximators (typically a network with shared hidden layers) to fit a state value or state-action value distribution. For instance, [bootstrappedDQN] used k heads on the state-action value function Q for every available action and used them to model a distribution. In [bayesianpol], a Bayesian framework was applied to the actor-critic architecture by fitting a Gaussian Process (GP) instead of the critic, hence allowing for a closed-form derivation of update rules. More recently, [bellemare2017distributional] introduced a distributional algorithm, C51, which aimed to solve the RL problem by learning a categorical probability vector over returns. Unlike [GANRL], which uses a generative network to learn the underlying transition model of the environment, we utilize a generative network to model the distribution approximation of the Bellman updates.
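To make the phrase "categorical probability vector over returns" concrete, here is a minimal sketch of how such a vector can be collapsed into a scalar Q-value per action. It assumes a fixed, evenly spaced support between hypothetical bounds `v_min` and `v_max`; the function name and the numbers are illustrative. It is not the GAN Q-learning method itself.

```python
import numpy as np

def expected_q_from_categorical(probs, v_min, v_max):
    """Collapse per-action categorical return distributions into scalar Q-values.

    probs: array of shape (num_actions, num_atoms); each row sums to 1.
    v_min, v_max: assumed bounds of the fixed return support.
    """
    probs = np.asarray(probs, dtype=float)
    atoms = np.linspace(v_min, v_max, probs.shape[-1])  # evenly spaced return values
    return probs @ atoms  # expectation of the return under each row


# Two actions, three atoms at returns -1, 0 and 1 (illustrative numbers).
probs = [[0.5, 0.0, 0.5],
         [0.0, 0.0, 1.0]]
q_values = expected_q_from_categorical(probs, v_min=-1.0, v_max=1.0)
greedy_action = int(np.argmax(q_values))  # action with the highest expected return
```

Keeping the whole distribution rather than only its mean is what distinguishes the distributional view; the mean is only needed when an action has to be picked.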

arxiv vanity, deep learning, reinforcement learning, (4 more...)

#artificialintelligence

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.76)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.61)

Visceral Machines: Reinforcement Learning with Intrinsic Rewards that Mimic the Human Nervous System

McDuff, Daniel, Kapoor, Ashish

arXiv.org Artificial Intelligence, May-25-2018

The human autonomic nervous system has evolved over millions of years and is essential for survival and responding to threats. As people learn to navigate the world, "fight or flight" responses provide intrinsic feedback about the potential consequences of action choices (e.g., becoming nervous when close to a cliff edge or driving fast around a bend). We present a novel approach to reinforcement learning that leverages a task-independent intrinsic reward function that mimics human autonomic nervous system responses based on peripheral pulse measurements. Our hypothesis is that such reward functions can circumvent the challenges associated with sparse and skewed rewards in reinforcement learning settings and can help improve sample efficiency. We test this in a simulated driving environment and show that it can increase the speed of learning and reduce the number of collisions during the learning stage.

artificial intelligence, machine learning, reinforcement learning, (14 more...)

arXiv.org Artificial Intelligence

1805.09975

Country: North America > United States > Washington > King County > Redmond (0.04)

Genre: Research Report (1.00)

Industry: Health & Medicine > Therapeutic Area > Psychiatry/Psychology > Mental Health (0.89)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.92)

Finite Sample Analysis of LSTD with Random Projections and Eligibility Traces

Li, Haifang, Xia, Yingce, Zhang, Wensheng

arXiv.org Artificial Intelligence, May-25-2018

Policy evaluation, commonly referred to as value function approximation, is a central part of many reinforcement learning (RL) algorithms [27]; its task is to estimate value functions for a fixed policy in a discounted Markov Decision Process (MDP) environment. The value function of each state specifies the accumulated reward an agent would receive in the future by following the fixed policy from that state. Value functions have been widely investigated in RL applications, and they can provide insightful and important information for the agent to obtain an optimal policy, such as important board configurations in Go [24], failure probabilities of large telecommunication networks [9], taxi-out times at large airports [2], and so on. Although value functions can be approximated in different ways, the simplest form, linear approximation, is still widely adopted and studied due to its good generalization ability, relatively efficient computation, and solid theoretical guarantees [27, 7, 13, 16]. Temporal Difference (TD) learning is a common approach to policy evaluation with linear function approximation [27]. Typical TD algorithms can be divided into two categories: gradient-based methods (e.g., GTD(λ) [28]) and least-squares (LS) based methods (e.g., LSTD(λ) [4]). A good survey of these algorithms can be found in [17, 6, 12, 7, 13]. With the development of information technologies, high-dimensional data is widely seen in RL applications [26, 30, 23], which brings serious challenges to designing scalable and computationally efficient algorithms for the linear value function approximation problem. To address this practical issue, several approaches have been developed for efficient and effective value function approximation.
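As a rough illustration of least-squares policy evaluation with a linear value function, the sketch below implements plain LSTD without eligibility traces (λ = 0) and without random projections, so it is a simplification rather than the algorithm analysed in the paper. The function name, the ridge term `reg`, and the two-state example are assumptions made for illustration.

```python
import numpy as np

def lstd(transitions, phi, gamma, reg=1e-6):
    """Estimate weights theta such that V(s) is approximately phi(s) @ theta.

    transitions: list of (state, reward, next_state) samples from a fixed policy.
    phi: feature map returning a 1-D NumPy array.
    reg: small ridge term (an assumption here) that keeps A invertible.
    """
    d = phi(transitions[0][0]).shape[0]
    A = reg * np.eye(d)
    b = np.zeros(d)
    for s, r, s_next in transitions:
        f, f_next = phi(s), phi(s_next)
        A += np.outer(f, f - gamma * f_next)  # accumulate phi(s) (phi(s) - gamma phi(s'))^T
        b += f * r                            # accumulate phi(s) * reward
    return np.linalg.solve(A, b)


# Hypothetical two-state chain: 0 -> 1 with reward 0, 1 -> 0 with reward 1.
one_hot = lambda s: np.eye(2)[s]
theta = lstd([(0, 0.0, 1), (1, 1.0, 0)], one_hot, gamma=0.5)
# With one-hot features, theta approximates the true values V(0) = 2/3, V(1) = 4/3.
```

With one-hot features the solve recovers the exact Bellman solution up to the ridge term; with high-dimensional features, the cost of building and solving the d-by-d system is what motivates dimensionality reduction.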

artificial intelligence, machine learning, reinforcement learning, (16 more...)

arXiv.org Artificial Intelligence

1805.10005

Country:

North America > Canada > Alberta (0.14)
Asia > China > Beijing > Beijing (0.04)
North America > United States > Florida (0.04)
(3 more...)

Genre: Research Report > New Finding (0.46)

Industry: Leisure & Entertainment > Games (0.87)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.34)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.34)

Virtual-Taobao: Virtualizing Real-world Online Retail Environment for Reinforcement Learning

Shi, Jing-Cheng, Yu, Yang, Da, Qing, Chen, Shi-Yong, Zeng, An-Xiang

arXiv.org Artificial Intelligence, May-25-2018

Applying reinforcement learning in physical-world tasks is extremely challenging. It is commonly infeasible to sample a large number of trials, as required by current reinforcement learning methods, in a physical environment. This paper reports our project on using reinforcement learning for better commodity search in Taobao, one of the largest online retail platforms and, at the same time, a physical environment with a high sampling cost. Instead of training reinforcement learning in Taobao directly, we present our approach: first we build Virtual Taobao, a simulator learned from historical customer behavior data through the proposed GAN-SD (GAN for Simulating Distributions) and MAIL (multi-agent adversarial imitation learning), and then we train policies in Virtual Taobao at no physical cost, using the proposed ANC (Action Norm Constraint) strategy to reduce over-fitting. In experiments, Virtual Taobao is trained from hundreds of millions of customers' records, and its properties are compared with the real environment. The results show that Virtual Taobao faithfully recovers important properties of the real environment. We also show that the policies trained in Virtual Taobao can have significantly superior online performance to traditional supervised approaches. We hope our work can shed some light on reinforcement learning applications in complex physical environments.

artificial intelligence, machine learning, reinforcement learning, (13 more...)

arXiv.org Artificial Intelligence

1805.1

Country:

Europe (0.68)
North America > United States > California (0.28)

Genre: Research Report (1.00)

Industry: Retail > Online (0.60)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Fast Policy Learning through Imitation and Reinforcement

Cheng, Ching-An, Yan, Xinyan, Wagener, Nolan, Boots, Byron

arXiv.org Machine Learning, May-25-2018

Imitation learning (IL) consists of a set of tools that leverage expert demonstrations to quickly learn policies. However, if the expert is suboptimal, IL can yield policies with inferior performance compared to reinforcement learning (RL). In this paper, we aim to provide an algorithm that combines the best aspects of RL and IL. We accomplish this by formulating several popular RL and IL algorithms in a common mirror descent framework, showing that these algorithms can be viewed as a variation on a single approach. We then propose LOKI, a strategy for policy learning that first performs a small but random number of IL iterations before switching to a policy gradient RL method. We show that if the switching time is properly randomized, LOKI can learn to outperform a suboptimal expert and converge faster than running policy gradient from scratch. Finally, we evaluate the performance of LOKI experimentally in several simulated environments.
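The switching idea in this abstract can be sketched as a training schedule: draw a random switch time, run imitation updates before it, and run policy-gradient updates after it. The sketch below only builds the schedule; the function name, the bounds `n_min` and `n_max`, and the uniform draw are assumptions for illustration, since the abstract does not state how the switching time is randomized.

```python
import random

def imitation_then_rl_schedule(total_iters, n_min, n_max, rng):
    """Return a list of phase labels, "IL" before a random switch time and "RL" after.

    n_min, n_max: hypothetical bounds on the number of imitation iterations.
    rng: a random.Random instance; the switch time is drawn uniformly (an assumption).
    """
    switch = rng.randint(n_min, n_max)  # inclusive on both ends
    return ["IL" if t < switch else "RL" for t in range(total_iters)]


rng = random.Random(0)
plan = imitation_then_rl_schedule(total_iters=10, n_min=2, n_max=5, rng=rng)
num_il = plan.count("IL")  # how many imitation iterations were scheduled
```

In a full training loop, each "IL" entry would trigger an update toward the expert's actions and each "RL" entry a policy-gradient update on the environment reward.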

artificial intelligence, machine learning, reinforcement learning, (16 more...)

arXiv.org Machine Learning

1805.10413

Country: North America > United States (0.14)

Genre: Research Report > New Finding (0.46)

Industry: Transportation (0.46)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.49)

Learning Self-Imitating Diverse Policies

Gangwani, Tanmay, Liu, Qiang, Peng, Jian

arXiv.org Machine Learning, May-25-2018

Deep reinforcement learning algorithms, including policy gradient methods and Q-learning, have been widely applied to a variety of decision-making problems. Their success has relied heavily on having very well designed dense reward signals, and therefore they often perform badly in sparse or episodic reward settings. Trajectory-based policy optimization methods, such as the cross-entropy method and evolution strategies, do not take into consideration the temporal nature of the problem and often suffer from high sample complexity. Scaling up the efficiency of RL algorithms to real-world problems with sparse or episodic rewards is therefore a pressing need. In this work, we present a new perspective on policy optimization and introduce a self-imitation learning algorithm that exploits and explores well in sparse and episodic reward settings. First, we view each policy as a state-action visitation distribution and formulate policy optimization as a divergence minimization problem. Then, we show that, with Jensen-Shannon divergence, this divergence minimization problem can be reduced to a policy-gradient algorithm with dense reward learned from experience replays. Experimental results indicate that our algorithm performs comparably to existing algorithms in the dense reward setting, and significantly better in the sparse and episodic settings. To encourage exploration, we further apply Stein variational policy gradient descent with the Jensen-Shannon kernel to learn multiple diverse policies and demonstrate its effectiveness on a number of challenging tasks.

artificial intelligence, machine learning, reinforcement learning, (13 more...)

arXiv.org Machine Learning

1805.10309

Genre: Research Report (0.50)

Industry: Leisure & Entertainment (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Imitation Learning in Unity: The Workflow – Unity Blog

#artificialintelligence, May-24-2018, 16:47:01 GMT

With the release of ML-Agents v0.3 Beta, there are lots of new ways to use Machine Learning in your projects. Whether you're working on games, simulations, academic or any other sort of projects, your work can benefit from the use of neural networks in the virtual environment. If you've been using ML-Agents before this latest release, you will already be familiar with Reinforcement Learning. If not, I wrote a beginner's guide to get you started. This blog post will help you get up to speed with one of the major features that represent an alternative to Reinforcement Learning: Imitation Learning.

artificial intelligence, machine learning, reinforcement learning, (15 more...)

#artificialintelligence

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.46)

5 Minute Guide to AI in Cyber Security

#artificialintelligence, May-24-2018, 11:35:50 GMT

There is another way to categorize the machine learning models above: supervised learning (where the machine learns from past data that humans have already labeled as good or bad, attack or false positive, fraud or normal), unsupervised learning (where no labeled past data exists), or reinforcement learning (where the machine learns from feedback on its longer-term results). Supervised learning includes classification, regression and deep learning. Unsupervised learning includes clustering, association rules and pattern matching. Diagram 2 will now become diagram 3 below.

cyber security, deep learning, reinforcement learning, (3 more...)

#artificialintelligence

Industry: Information Technology > Security & Privacy (0.76)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.36)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.36)

Meta-Gradient Reinforcement Learning

Xu, Zhongwen, van Hasselt, Hado, Silver, David

arXiv.org Artificial Intelligence, May-24-2018

The goal of reinforcement learning algorithms is to estimate and/or optimise the value function. However, unlike supervised learning, no teacher or oracle is available to provide the true value function. Instead, the majority of reinforcement learning algorithms estimate and/or optimise a proxy for the value function. This proxy is typically based on a sampled and bootstrapped approximation to the true value function, known as a return. The particular choice of return is one of the chief components determining the nature of the algorithm: the rate at which future rewards are discounted; when and how values should be bootstrapped; or even the nature of the rewards themselves. It is well-known that these decisions are crucial to the overall success of RL algorithms. We discuss a gradient-based meta-learning algorithm that is able to adapt the nature of the return, online, whilst interacting and learning from the environment. When applied to 57 games on the Atari 2600 environment over 200 million frames, our algorithm achieved a new state-of-the-art performance.
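The "sampled and bootstrapped" return described here can be illustrated with a standard n-step return: sum a few observed rewards with discounting, then substitute a value estimate for the remaining future. This is a minimal sketch with a hypothetical function name and made-up numbers; it shows where the discount rate enters, not the meta-gradient algorithm that adapts it.

```python
def n_step_return(rewards, bootstrap_value, gamma):
    """Discounted sum of sampled rewards, closed off with a bootstrapped value estimate.

    rewards: rewards observed over the next n steps, in time order.
    bootstrap_value: estimated value of the state reached after those n steps.
    gamma: discount rate, one of the choices that shape the return.
    """
    g = bootstrap_value
    for r in reversed(rewards):  # fold from the last step back to the first
        g = r + gamma * g
    return g


# Three sampled rewards of 1, then a value estimate of 10 (illustrative numbers).
g_short_sighted = n_step_return([1.0, 1.0, 1.0], bootstrap_value=10.0, gamma=0.5)
g_far_sighted = n_step_return([1.0, 1.0, 1.0], bootstrap_value=10.0, gamma=1.0)
```

The same trajectory yields 3.0 with gamma = 0.5 and 13.0 with gamma = 1.0, which shows how strongly the choice of return changes the learning target.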

artificial intelligence, machine learning, reinforcement learning, (15 more...)

arXiv.org Artificial Intelligence

1805.09801

Genre: Research Report (0.40)

Industry:

Leisure & Entertainment > Sports (0.46)
Leisure & Entertainment > Games (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
