AITopics | Reinforcement Learning

Reinforcement Learning

"Reinforcement learning is learning what to do – how to map situations to actions – so as to maximize a numerical reward signal. The learner is not told which actions to take, as in most forms of machine learning, but instead must discover which actions yield the most reward by trying them."
– Sutton, Richard S. and Andrew G. Barto. Reinforcement Learning: An Introduction. (1.1). MIT Press, Cambridge, MA, 1998.

The Uncertainty Bellman Equation and Exploration

O'Donoghue, Brendan, Osband, Ian, Munos, Remi, Mnih, Volodymyr

arXiv.org Machine Learning, Sep-15-2017

We consider the exploration/exploitation problem in reinforcement learning. For exploitation, it is well known that the Bellman equation connects the value at any time-step to the expected value at subsequent time-steps. In this paper we consider a similar uncertainty Bellman equation (UBE), which connects the uncertainty at any time-step to the expected uncertainties at subsequent time-steps, thereby extending the potential exploratory benefit of a policy beyond individual time-steps. We prove that the unique fixed point of the UBE yields an upper bound on the variance of the estimated value of any fixed policy. This bound can be much tighter than traditional count-based bonuses that compound standard deviation rather than variance. Importantly, and unlike several existing approaches to optimism, this method scales naturally to large systems with complex generalization. Substituting our UBE-exploration strategy for $\epsilon$-greedy improves DQN performance on 51 out of 57 games in the Atari suite.
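The recursion described in this abstract can be illustrated with a small tabular sketch. It assumes, hypothetically, that the uncertainty equation takes the linear form u = nu + gamma^2 * P_pi u, where `local_uncertainty` (nu) is a per-state uncertainty term, `P_pi` is the transition matrix under a fixed policy, and the squared discount reflects that variances scale quadratically. These names and this exact form are illustrative assumptions, not the paper's definitions.

```python
import numpy as np

def solve_uncertainty_fixed_point(P_pi, local_uncertainty, gamma):
    """Solve u = nu + gamma^2 * P_pi @ u directly (assumed form, see lead-in)."""
    n = P_pi.shape[0]
    return np.linalg.solve(np.eye(n) - gamma**2 * P_pi, local_uncertainty)

def iterate_uncertainty(P_pi, local_uncertainty, gamma, num_sweeps=500):
    """Reach the same fixed point by repeated backups, as with value iteration."""
    u = np.zeros_like(local_uncertainty, dtype=float)
    for _ in range(num_sweeps):
        # Each sweep propagates uncertainty one time-step further back.
        u = local_uncertainty + gamma**2 * (P_pi @ u)
    return u

# Hypothetical 3-state chain under a fixed policy (rows sum to 1).
P_pi = np.array([[0.5, 0.5, 0.0],
                 [0.0, 0.5, 0.5],
                 [0.0, 0.0, 1.0]])
nu = np.array([1.0, 0.5, 0.0])  # made-up local uncertainties
u_direct = solve_uncertainty_fixed_point(P_pi, nu, gamma=0.9)
u_iter = iterate_uncertainty(P_pi, nu, gamma=0.9)
```

Because gamma^2 < 1 and `P_pi` is row-stochastic, the backup is a contraction, so the fixed point is unique and both routines agree.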

bellman equation, neural network, upstream oil & gas, (17 more...)

arXiv.org Machine Learning

1709.0538

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.28)
North America > United States (0.15)

Genre: Research Report (0.40)

Industry:

Energy > Oil & Gas > Upstream (0.54)
Leisure & Entertainment > Sports (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)
Information Technology > Artificial Intelligence > Representation & Reasoning > Mathematical & Statistical Methods (0.82)

Inverse Reinforcement Learning from Incomplete Observation Data

Kangasrääsiö, Antti, Kaski, Samuel

arXiv.org Machine Learning, Sep-14-2017

Inverse reinforcement learning (IRL) aims to explain observed strategic behavior by fitting reinforcement learning models to behavioral data. However, traditional IRL methods are only applicable when the observations are in the form of state-action paths. This assumption may not hold in many real-world modelling settings, where only partial observations are available. In general, we may assume that there is a summarizing function $\sigma$, which acts as a filter between us and the true state-action paths that constitute the demonstration. Some initial approaches to extending IRL to such situations have been presented, but with very specific assumptions about the structure of $\sigma$, such as that only certain state observations are missing. This paper instead focuses on the most general case of the problem, where no assumptions are made about the summarizing function, except that it can be evaluated. We demonstrate that inference is still possible. The paper presents exact and approximate inference algorithms that allow full posterior inference, which is particularly important for assessing parameter uncertainty in this challenging inference situation. Empirical scalability is demonstrated to reasonably sized problems, and practical applicability is demonstrated by estimating the posterior for a cognitive science RL model based on observed user's task completion time only.

likelihood, machine learning, reinforcement learning, (14 more...)

arXiv.org Machine Learning

1703.097

Genre: Research Report (0.50)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Towards personalized human AI interaction - adapting the behavior of AI agents using neural signatures of subjective interest

Shih, Victor, Jangraw, David C, Sajda, Paul, Saproo, Sameer

arXiv.org Machine Learning, Sep-13-2017

The use of Artificial Neural Networks (ANNs) towards developing Artificial Intelligence (AI) has undergone a renaissance in the past decade. Out of the many emergent techniques for training ANNs that are collectively referred to as 'Deep Learning', Deep Reinforcement Learning (DRL) is proving to be a particularly general and powerful method, with applications ranging from video games [1] to autonomous driving [2]. Most applications of reinforcement learning have traditionally used reinforcement signals derived from performance measures that are explicit to the task (e.g. the score in a game or grammatical errors in a translation). However, when considering AI systems that are required to have a significant interaction with humans (e.g. the autonomous vehicle), it is critical to consider how the human's preference for objects, events, or actions can be incorporated into the behavioral reinforcement for the AI, particularly in ways that are minimally obtrusive [3], [4]. Such behavioral adaptations occur naturally during social interactions and form the bedrock of social mechanisms that build trust and rapport between strangers [5], [6]. In this paper, we present a novel approach that uses decoded human neurophysiological and ocular time-series data as an implicit reinforcement signal for an AI agent that is driving a virtual automobile.

machine learning, reinforcement, reinforcement learning, (19 more...)

arXiv.org Machine Learning

1709.04574

Country: North America > United States (0.47)

Genre: Research Report > New Finding (0.46)

Industry:

Health & Medicine > Therapeutic Area (0.93)
Transportation > Ground > Road (0.87)
Automobiles & Trucks (0.87)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.96)

Guiding Reinforcement Learning Exploration Using Natural Language

Harrison, Brent, Ehsan, Upol, Riedl, Mark O.

arXiv.org Machine Learning, Sep-13-2017

In this work we present a technique to use natural language to help reinforcement learning generalize to unseen environments. This technique uses neural machine translation, specifically the use of encoder-decoder networks, to learn associations between natural language behavior descriptions and state-action information. We then use this learned model to guide agent exploration using a modified version of policy shaping to make it more effective at learning in unseen environments. We evaluate this technique using the popular arcade game, Frogger, under ideal and non-ideal conditions. This evaluation shows that our modified policy shaping algorithm improves over a Q-learning agent as well as a baseline version of policy shaping.

machine learning, natural language, reinforcement learning, (17 more...)

arXiv.org Machine Learning

1707.08616

Country: North America > United States > Kentucky (0.28)

Genre: Research Report (0.50)

Industry:

Education (0.48)
Leisure & Entertainment > Games (0.35)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Linear Stochastic Approximation: Constant Step-Size and Iterate Averaging

Lakshminarayanan, Chandrashekar, Szepesvári, Csaba

arXiv.org Machine Learning, Sep-12-2017

We consider $d$-dimensional linear stochastic approximation algorithms (LSAs) with a constant step-size and the so-called Polyak-Ruppert (PR) averaging of iterates. LSAs are widely applied in machine learning and reinforcement learning (RL), where the aim is to compute an appropriate $\theta_{*} \in \mathbb{R}^d$ (that is an optimum or a fixed point) using noisy data and $O(d)$ updates per iteration. In this paper, we are motivated by the problem (in RL) of policy evaluation from experience replay using the temporal difference (TD) class of learning algorithms that are also LSAs. For LSAs with a constant step-size, and PR averaging, we provide bounds for the mean squared error (MSE) after $t$ iterations. We assume that data is i.i.d. with finite variance (underlying distribution being $P$) and that the expected dynamics is Hurwitz. For a given LSA with PR averaging, and data distribution $P$ satisfying the said assumptions, we show that there exists a range of constant step-sizes such that its MSE decays as $O(\frac{1}{t})$. We examine the conditions under which a constant step-size can be chosen uniformly for a class of data distributions $\mathcal{P}$, and show that not all data distributions 'admit' such a uniform constant step-size. We also suggest a heuristic step-size tuning algorithm to choose a constant step-size of a given LSA for a given data distribution $P$. We compare our results with related work and also discuss the implication of our results in the context of TD algorithms that are LSAs.
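A minimal sketch of a constant step-size LSA with Polyak-Ruppert averaging may help make the setting concrete. It assumes an update of the form theta <- theta + step_size * (b_t - A_t theta) with i.i.d. noisy samples (A_t, b_t) around a fixed pair (A, b). The Gaussian noise model, the matrix values, and the function name are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def lsa_with_pr_averaging(A, b, step_size, num_iters, noise_std=0.1, seed=0):
    """Run a constant step-size LSA and return (last iterate, averaged iterate)."""
    rng = np.random.default_rng(seed)
    d = b.shape[0]
    theta = np.zeros(d)
    theta_bar = np.zeros(d)
    for t in range(1, num_iters + 1):
        # i.i.d. noisy observations of A and b (assumed Gaussian noise).
        A_t = A + noise_std * rng.standard_normal((d, d))
        b_t = b + noise_std * rng.standard_normal(d)
        theta = theta + step_size * (b_t - A_t @ theta)
        # Polyak-Ruppert averaging: running mean of the iterates.
        theta_bar += (theta - theta_bar) / t
    return theta, theta_bar

# Hypothetical 2-dimensional problem; the target solves A theta = b.
A = np.array([[2.0, 0.5], [0.5, 1.0]])
b = np.array([1.0, 1.0])
theta_star = np.linalg.solve(A, b)
theta_last, theta_avg = lsa_with_pr_averaging(A, b, step_size=0.1, num_iters=20000)
```

With a constant step-size the last iterate keeps fluctuating around `theta_star`, while the running average smooths those fluctuations out, which is the effect the MSE bounds in the abstract quantify.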

artificial intelligence, machine learning, reinforcement learning, (19 more...)

arXiv.org Machine Learning

1709.04073

Genre: Research Report > New Finding (0.55)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Variational inference for the multi-armed contextual bandit

Urteaga, Iñigo, Wiggins, Chris H.

arXiv.org Machine Learning, Sep-10-2017

In many biomedical, science, and engineering problems, one must sequentially decide which action to take next so as to maximize rewards. Reinforcement learning is an area of machine learning that studies how this maximization balances exploration and exploitation, optimizing interactions with the world while simultaneously learning how the world operates. One general class of algorithms for this type of learning is the multi-armed bandit setting and, in particular, the contextual bandit case, in which observed rewards are dependent on each action as well as on given information or 'context' available at each interaction with the world. The Thompson sampling algorithm has recently been shown to perform well in real-world settings and to enjoy provable optimality properties for this set of problems. It facilitates generative and interpretable modeling of the problem at hand, though complexity of the model limits its application, since one must both sample from the distributions modeled and calculate their expected rewards. We here show how these limitations can be overcome using variational approximations, applying to the reinforcement learning case advances developed for the inference case in the machine learning community over the past two decades. We consider bandit applications where the true reward distribution is unknown and approximate it with a mixture model, whose parameters are inferred via variational inference.
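For readers unfamiliar with Thompson sampling, the sketch below shows the basic sample-then-act loop in its simplest setting: a non-contextual Bernoulli bandit with conjugate Beta posteriors. This is a deliberately simplified stand-in, not the variational mixture-model approach of the paper. The arm probabilities and the function name are made up for illustration.

```python
import numpy as np

def thompson_sampling(true_probs, num_rounds, seed=0):
    """Beta-Bernoulli Thompson sampling; returns how often each arm was pulled."""
    rng = np.random.default_rng(seed)
    k = len(true_probs)
    successes = np.ones(k)  # Beta(1, 1) uniform prior for every arm
    failures = np.ones(k)
    pulls = np.zeros(k, dtype=int)
    for _ in range(num_rounds):
        # Draw one plausible success rate per arm from its posterior.
        samples = rng.beta(successes, failures)
        # Act greedily with respect to the sampled rates.
        arm = int(np.argmax(samples))
        reward = rng.random() < true_probs[arm]
        # Conjugate posterior update for the pulled arm only.
        successes[arm] += reward
        failures[arm] += 1 - reward
        pulls[arm] += 1
    return pulls

pulls = thompson_sampling([0.2, 0.5, 0.8], num_rounds=2000)
```

The two steps the abstract identifies as bottlenecks for richer models, sampling from the posterior and computing expected rewards, are both trivial here because the Beta posterior is available in closed form.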

big data, thompson, upstream oil & gas, (20 more...)

arXiv.org Machine Learning

1709.03163

Country:

North America > United States > New York (0.15)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.14)

Genre: Research Report (0.64)

Industry: Energy > Oil & Gas > Upstream (0.34)

Technology:

Information Technology > Data Science > Data Mining > Big Data (0.72)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.69)

tensorflow/agents

@machinelearnbot, Sep-8-2017, 23:46:27 GMT

This project provides optimized infrastructure for reinforcement learning. It extends the OpenAI Gym interface to multiple parallel environments and allows agents to be implemented in TensorFlow and perform batched computation. As a starting point, we provide BatchPPO, an optimized implementation of Proximal Policy Optimization. The algorithm to use is defined in the configuration, and the pendulum configuration uses the included PPO implementation. Check out more pre-defined configurations in agents/scripts/configs.py.

large language model, machine learning, reinforcement learning, (8 more...)

@machinelearnbot

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.35)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.35)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.35)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.31)

Machine Learning for Humans, Part 5: Reinforcement Learning

#artificialintelligence, Sep-7-2017, 10:15:23 GMT

In supervised learning, training data comes with an answer key from some godlike "supervisor". If only life worked that way! In reinforcement learning (RL) there's no answer key, but your reinforcement learning agent still has to decide how to act to perform its task. In the absence of existing training data, the agent learns from experience. It collects the training examples ("this action was good, that action was bad") through trial-and-error as it attempts its task, with the goal of maximizing long-term reward.

artificial intelligence, machine learning, reinforcement learning, (17 more...)

#artificialintelligence

Industry: Leisure & Entertainment > Games (0.32)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.32)

Two-Timescale Stochastic Approximation Convergence Rates with Applications to Reinforcement Learning

Dalal, Gal, Szorenyi, Balazs, Thoppe, Gugan, Mannor, Shie

arXiv.org Artificial Intelligence, Sep-7-2017

Two-timescale Stochastic Approximation (SA) algorithms are widely used in Reinforcement Learning (RL). Their iterates have two parts that are updated with distinct stepsizes. In this work we provide a recipe for analyzing two-timescale SA. Using it, we develop the first convergence rate result for them. From this result we extract key insights on stepsize selection. As an application, we obtain convergence rates for two-timescale RL algorithms such as GTD(0), GTD2, and TDC.

artificial intelligence, machine learning, reinforcement learning, (19 more...)

arXiv.org Artificial Intelligence

1703.05376

Genre: Research Report (0.50)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Evolution Strategies as a Scalable Alternative to Reinforcement Learning

Salimans, Tim, Ho, Jonathan, Chen, Xi, Sidor, Szymon, Sutskever, Ilya

arXiv.org Artificial Intelligence, Sep-7-2017

We explore the use of Evolution Strategies (ES), a class of black box optimization algorithms, as an alternative to popular MDP-based RL techniques such as Q-learning and Policy Gradients. Experiments on MuJoCo and Atari show that ES is a viable solution strategy that scales extremely well with the number of CPUs available: By using a novel communication strategy based on common random numbers, our ES implementation only needs to communicate scalars, making it possible to scale to over a thousand parallel workers. This allows us to solve 3D humanoid walking in 10 minutes and obtain competitive results on most Atari games after one hour of training. In addition, we highlight several advantages of ES as a black box optimization technique: it is invariant to action frequency and delayed rewards, tolerant of extremely long horizons, and does not need temporal discounting or value function approximation.
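The idea of communicating only scalars via common random numbers can be sketched in a single process. In this toy version each "worker" is identified by a seed, evaluates one perturbed parameter vector, and reports only its scalar return; the update step then regenerates every perturbation from the known seeds. The quadratic objective, the reward standardization, and all function names and hyperparameters are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def objective(theta, target):
    """Toy black-box return: higher is better, maximal at theta == target."""
    return -float(np.sum((theta - target) ** 2))

def perturbation(seed, dim):
    """Noise vector that any process can rebuild from the seed alone."""
    return np.random.default_rng(seed).standard_normal(dim)

def es_step(theta, target, seeds, sigma, lr):
    # "Worker" phase: each seed yields one scalar return.
    returns = np.array([
        objective(theta + sigma * perturbation(s, theta.size), target)
        for s in seeds
    ])
    # Standardize returns so the step size does not depend on reward scale.
    weights = (returns - returns.mean()) / (returns.std() + 1e-8)
    # "Update" phase: perturbations are regenerated from seeds, never sent.
    grad = sum(w * perturbation(s, theta.size) for w, s in zip(weights, seeds))
    return theta + lr * grad / (len(seeds) * sigma)

def run_es(target, iters=300, pop=50, sigma=0.1, lr=0.001):
    theta = np.zeros_like(target)
    for it in range(iters):
        seeds = range(it * pop, (it + 1) * pop)  # fresh seeds every iteration
        theta = es_step(theta, target, seeds, sigma, lr)
    return theta

target = np.array([0.5, 0.1, -0.3])
theta_final = run_es(target)
```

In a distributed setting only the (seed, return) pairs would cross the network, which is why the communication cost stays independent of the number of parameters.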

evolutionary algorithm, machine learning, reinforcement learning, (16 more...)

arXiv.org Artificial Intelligence

1703.03864

Genre: Research Report (0.50)

Industry:

Transportation > Air (0.88)
Leisure & Entertainment > Games > Computer Games (0.69)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Evolutionary Systems (0.89)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
