AITopics | Reinforcement Learning

Reinforcement Learning

"Reinforcement learning is learning what to do – how to map situations to actions – so as to maximize a numerical reward signal. The learner is not told which actions to take, as in most forms of machine learning, but instead must discover which actions yield the most reward by trying them."
– Sutton, Richard S. and Andrew G. Barto. Reinforcement Learning: An Introduction. (1.1). MIT Press, Cambridge, MA, 1998.

(More) Efficient Reinforcement Learning via Posterior Sampling

Osband, Ian, Russo, Daniel, Van Roy, Benjamin

Neural Information Processing Systems, Feb-14-2020, 19:13:16 GMT

Most provably efficient learning algorithms introduce optimism about poorly-understood states and actions to encourage exploration. We study an alternative approach for efficient exploration, posterior sampling for reinforcement learning (PSRL). This algorithm proceeds in repeated episodes of known duration. At the start of each episode, PSRL updates a prior distribution over Markov decision processes and takes one sample from this posterior. PSRL then follows the policy that is optimal for this sample during the episode.

algorithm, efficient reinforcement learning, posterior sampling

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.85)
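
The PSRL loop described in the abstract above can be sketched for a small tabular MDP. This is a minimal illustration, not the authors' implementation, and several details are assumptions: rewards `R` are treated as known, transition uncertainty is modeled with an independent Dirichlet(1, ..., 1) prior per state-action pair, every episode starts in state 0, and the names `sample_dirichlet`, `plan`, and `psrl` are hypothetical.

```python
import random


def sample_dirichlet(counts, rng):
    """Draw one probability vector from Dirichlet(counts) via normalized Gamma draws."""
    draws = [rng.gammavariate(c, 1.0) for c in counts]
    total = sum(draws)
    return [d / total for d in draws]


def plan(P, R, horizon):
    """Finite-horizon value iteration; returns policy[t][s] for the given MDP."""
    n_states, n_actions = len(R), len(R[0])
    V = [0.0] * n_states
    policy = []
    for _ in range(horizon):  # work backwards from the last step
        Q = [[R[s][a] + sum(p * v for p, v in zip(P[s][a], V))
              for a in range(n_actions)] for s in range(n_states)]
        policy.append([max(range(n_actions), key=lambda a: Q[s][a])
                       for s in range(n_states)])
        V = [max(q) for q in Q]
    policy.reverse()  # policy[0] is now the rule for the first step
    return policy


def psrl(true_P, R, horizon, episodes, seed=0):
    """Run PSRL against the environment true_P; returns (posterior counts, total reward)."""
    rng = random.Random(seed)
    n_states, n_actions = len(R), len(R[0])
    # Assumed prior: Dirichlet(1, ..., 1) over next states for each (s, a).
    counts = [[[1.0] * n_states for _ in range(n_actions)]
              for _ in range(n_states)]
    total_reward = 0.0
    for _ in range(episodes):
        # Take ONE sample from the current posterior at the start of the episode.
        P_hat = [[sample_dirichlet(counts[s][a], rng) for a in range(n_actions)]
                 for s in range(n_states)]
        # Follow the policy that is optimal for that sample for the whole episode.
        policy = plan(P_hat, R, horizon)
        s = 0  # assumed fixed start state
        for t in range(horizon):
            a = policy[t][s]
            total_reward += R[s][a]
            s_next = rng.choices(range(n_states), weights=true_P[s][a])[0]
            counts[s][a][s_next] += 1.0  # posterior update from the observed transition
            s = s_next
    return counts, total_reward
```

Exploration here comes only from the randomness of the posterior sample; no optimism bonus is added to poorly understood state-action pairs.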

Bayesian Adversarial Learning

Ye, Nanyang, Zhu, Zhanxing

Neural Information Processing Systems, Feb-14-2020, 19:10:52 GMT

Deep neural networks are known to be vulnerable to adversarial attacks, which raises many security concerns in practical deployment. Popular defensive approaches can be formulated as a (distributionally) robust optimization problem, which minimizes a "point estimate" of worst-case loss derived from either per-datum perturbation or an adversary data-generating distribution within certain pre-defined constraints. This point estimate ignores potential test adversaries that are beyond the pre-defined constraints. The model robustness might deteriorate sharply in the scenario of stronger test adversarial data. In this work, a novel robust training framework, Bayesian Robust Learning, is proposed to alleviate this issue: a distribution is put on the adversarial data-generating distribution to account for the uncertainty of the adversarial data-generating process.

bayesian adversarial learning, data-generating distribution, pre-defined constraint

Industry: Information Technology > Security & Privacy (0.62)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.62)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.40)

M-Walk: Learning to Walk over Graphs using Monte Carlo Tree Search

Shen, Yelong, Chen, Jianshu, Huang, Po-Sen, Guo, Yuqing, Gao, Jianfeng

Neural Information Processing Systems, Feb-14-2020, 18:57:02 GMT

Learning to walk over a graph towards a target node for a given query and a source node is an important problem in applications such as knowledge base completion (KBC). It can be formulated as a reinforcement learning (RL) problem with a known state transition model. To overcome the challenge of sparse rewards, we develop a graph-walking agent called M-Walk, which consists of a deep recurrent neural network (RNN) and Monte Carlo Tree Search (MCTS). The RNN encodes the state (i.e., history of the walked path) and maps it separately to a policy and Q-values. In order to effectively train the agent from sparse rewards, we combine MCTS with the neural policy to generate trajectories yielding more positive rewards.

learning, m-walk, monte carlo tree search

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.65)
Information Technology > Artificial Intelligence > Representation & Reasoning > Search (0.64)
Information Technology > Artificial Intelligence > Representation & Reasoning > Planning & Scheduling (0.64)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.62)

Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments

Lowe, Ryan, Wu, Yi, Tamar, Aviv, Harb, Jean, Abbeel, Pieter, Mordatch, Igor

Neural Information Processing Systems, Feb-14-2020, 18:56:13 GMT

We explore deep reinforcement learning methods for multi-agent domains. We begin by analyzing the difficulty of traditional algorithms in the multi-agent case: Q-learning is challenged by an inherent non-stationarity of the environment, while policy gradient suffers from a variance that increases as the number of agents grows. We then present an adaptation of actor-critic methods that considers action policies of other agents and is able to successfully learn policies that require complex multi-agent coordination. Additionally, we introduce a training regimen utilizing an ensemble of policies for each agent that leads to more robust multi-agent policies. We show the strength of our approach compared to existing methods in cooperative as well as competitive scenarios, where agent populations are able to discover various physical and informational coordination strategies.

cooperative-competitive environment, multi-agent actor-critic

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Boltzmann Exploration Done Right

Cesa-Bianchi, Nicolò, Gentile, Claudio, Lugosi, Gabor, Neu, Gergely

Neural Information Processing Systems, Feb-14-2020, 18:43:39 GMT

Boltzmann exploration is a classic strategy for sequential decision-making under uncertainty, and is one of the most standard tools in Reinforcement Learning (RL). Despite its widespread use, there is virtually no theoretical understanding about the limitations or the actual benefits of this exploration scheme. Does it drive exploration in a meaningful way? Is it prone to misidentifying the optimal actions or spending too much time exploring the suboptimal ones? What is the right tuning for the learning rate?

boltzmann exploration done right, delta, variant

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.61)
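
For reference, Boltzmann (softmax) exploration in its basic textbook form picks each action with probability proportional to exp(eta × estimated reward). The sketch below shows only that basic rule, not the variant the paper proposes; `eta` stands in for the learning rate the abstract asks about, and the function names `boltzmann_probs` and `boltzmann_select` are hypothetical.

```python
import math
import random


def boltzmann_probs(estimates, eta):
    """Softmax over reward estimates; eta = 0 is uniform, large eta is near-greedy."""
    m = max(estimates)  # subtract the max so exp() cannot overflow
    weights = [math.exp(eta * (x - m)) for x in estimates]
    z = sum(weights)
    return [w / z for w in weights]


def boltzmann_select(estimates, eta, rng=None):
    """Sample an action index according to the Boltzmann probabilities."""
    rng = rng or random
    probs = boltzmann_probs(estimates, eta)
    return rng.choices(range(len(estimates)), weights=probs)[0]
```

A small `eta` spends more pulls on apparently suboptimal actions, while a large `eta` commits early to the current best estimate and risks misidentifying the optimal action, which is the trade-off the abstract's questions point at.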

Policy Shaping: Integrating Human Feedback with Reinforcement Learning

Griffith, Shane, Subramanian, Kaushik, Scholz, Jonathan, Isbell, Charles L., Thomaz, Andrea L.

Neural Information Processing Systems, Feb-14-2020, 18:42:52 GMT

A long-term goal of Interactive Reinforcement Learning is to incorporate non-expert human feedback to solve complex tasks. State-of-the-art methods have approached this problem by mapping human information to reward and value signals to indicate preferences and then iterating over them to compute the necessary control policy. In this paper we argue for an alternate, more effective characterization of human feedback: Policy Shaping. We introduce Advise, a Bayesian approach that attempts to maximize the information gained from human feedback by utilizing it as direct labels on the policy. We compare Advise to state-of-the-art approaches and highlight scenarios where it outperforms them and, importantly, is robust to infrequent and inconsistent human feedback.

integrating human feedback, policy shaping, reinforcement learning

Genre: Research Report > Promising Solution (0.68)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.67)

Value Prediction Network

Oh, Junhyuk, Singh, Satinder, Lee, Honglak

Neural Information Processing Systems, Feb-14-2020, 18:28:34 GMT

This paper proposes a novel deep reinforcement learning (RL) architecture, called Value Prediction Network (VPN), which integrates model-free and model-based RL methods into a single neural network. In contrast to typical model-based RL methods, VPN learns a dynamics model whose abstract states are trained to make option-conditional predictions of future values (discounted sum of rewards) rather than of future observations. Our experimental results show that VPN has several advantages over both model-free and model-based baselines in a stochastic environment where careful planning is required but building an accurate observation-prediction model is difficult. Furthermore, VPN outperforms Deep Q-Network (DQN) on several Atari games even with short-lookahead planning, demonstrating its potential as a new way of learning a good state representation.

model-based rl method, value prediction network

Genre: Research Report (0.67)

Industry: Leisure & Entertainment > Games > Computer Games (0.67)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.67)

Exponentially Weighted Imitation Learning for Batched Historical Data

Wang, Qing, Xiong, Jiechao, Han, Lei, Sun, Peng, Liu, Han, Zhang, Tong

Neural Information Processing Systems, Feb-14-2020, 18:12:57 GMT

We consider deep policy learning with only batched historical trajectories. The main challenge of this problem is that the learner no longer has a simulator or "environment oracle" as in most reinforcement learning settings. To solve this problem, we propose a monotonic advantage reweighted imitation learning strategy that is applicable to problems with complex nonlinear function approximation and works well with hybrid (discrete and continuous) action spaces. The method does not rely on knowledge of the behavior policy, and thus can be used to learn from data generated by an unknown policy. Under mild conditions, our algorithm, though surprisingly simple, has a policy improvement bound and outperforms most competing methods empirically.

batched historical data, exponentially weighted imitation learning

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.31)
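
One way to read "advantage reweighted imitation learning" together with the "exponentially weighted" of the title is as behavior cloning in which each logged action's log-likelihood is weighted by exp(beta × advantage). The sketch below illustrates that reading only and is not the authors' implementation. It assumes that advantage estimates for the logged actions are already available, that `beta` is a non-negative temperature chosen by the user, and that `log_probs` holds the current policy's log-probability of each logged action; all names are hypothetical.

```python
import math


def advantage_weights(advantages, beta=1.0):
    """Monotonic (exponential) weights: larger advantage means larger weight."""
    return [math.exp(beta * a) for a in advantages]


def exp_weighted_imitation_loss(log_probs, advantages, beta=1.0):
    """Weighted negative log-likelihood over a batch of logged (state, action) pairs.

    log_probs[i]  : log pi(a_i | s_i) under the policy being trained
    advantages[i] : assumed, precomputed advantage estimate for (s_i, a_i)
    """
    if len(log_probs) != len(advantages):
        raise ValueError("log_probs and advantages must have the same length")
    weights = advantage_weights(advantages, beta)
    return -sum(w * lp for w, lp in zip(weights, log_probs)) / len(log_probs)
```

With `beta = 0` every weight is 1 and the loss reduces to plain behavior cloning. No probability under the behavior policy appears anywhere in the loss, which matches the abstract's point that the data may come from an unknown policy.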

Projected Natural Actor-Critic

Thomas, Philip S., Dabney, William C., Giguere, Stephen, Mahadevan, Sridhar

Neural Information Processing Systems, Feb-14-2020, 18:11:58 GMT

Natural actor-critics are a popular class of policy search algorithms for finding locally optimal policies for Markov decision processes. In this paper we address a drawback of natural actor-critics that limits their real-world applicability: their lack of safety guarantees. We present a principled algorithm for performing natural gradient descent over a constrained domain. In the context of reinforcement learning, this allows for natural actor-critic algorithms that are guaranteed to remain within a known safe region of policy space. While deriving our class of constrained natural actor-critic algorithms, which we call Projected Natural Actor-Critics (PNACs), we also elucidate the relationship between natural gradient descent and mirror descent.

actor-critic, algorithm, natural actor-critic

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.32)

Transfer of Value Functions via Variational Methods

Tirinzoni, Andrea, Sanchez, Rafael Rodriguez, Restelli, Marcello

Neural Information Processing Systems, Feb-14-2020, 18:10:35 GMT

We consider the problem of transferring value functions in reinforcement learning. We propose an approach that uses the given source tasks to learn a prior distribution over optimal value functions and provide an efficient variational approximation of the corresponding posterior in a new target task. We show our approach to be general, in the sense that it can be combined with complex parametric function approximators and distribution models, while providing two practical algorithms based on Gaussians and Gaussian mixtures. We theoretically analyze them by deriving a finite-sample analysis and provide a comprehensive empirical evaluation in four different domains.

value function, variational method

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.32)
