AITopics | Reinforcement Learning

Collaborating Authors

Reinforcement Learning

"Reinforcement learning is learning what to do – how to map situations to actions – so as to maximize a numerical reward signal. The learner is not told which actions to take, as in most forms of machine learning, but instead must discover which actions yield the most reward by trying them."
– Sutton, Richard S. and Andrew G. Barto. Reinforcement Learning: An Introduction. (1.1). MIT Press, Cambridge, MA, 1998.

News Overviews Instructional Materials AI-Alerts Classics

Sample-Efficient Deep Reinforcement Learning via Episodic Backward Update

Lee, Su Young, Sungik, Choi, Chung, Sae-Young

Neural Information Processing SystemsMar-18-2020, 21:16:16 GMT

We propose Episodic Backward Update (EBU) – a novel deep reinforcement learning algorithm with a direct value propagation. Our computationally efficient recursive algorithm allows sparse and delayed rewards to propagate directly through all transitions of the sampled episode. We theoretically prove the convergence of the EBU method and experimentally demonstrate its performance in both deterministic and stochastic environments. Especially in 49 games of Atari 2600 domain, EBU achieves the same mean and median human normalized performance of DQN by using only 5% and 10% of samples, respectively. Papers published at the Neural Information Processing Systems Conference.

episodic backward update, sample-efficient deep reinforcement learning

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Add feedback

Generalized Off-Policy Actor-Critic

Zhang, Shangtong, Boehmer, Wendelin, Whiteson, Shimon

Neural Information Processing SystemsMar-18-2020, 21:15:42 GMT

We propose a new objective, the counterfactual objective, unifying existing objectives for off-policy policy gradient algorithms in the continuing reinforcement learning (RL) setting. Compared to the commonly used excursion objective, which can be misleading about the performance of the target policy when deployed, our new objective better predicts such performance. We prove the Generalized Off-Policy Policy Gradient Theorem to compute the policy gradient of the counterfactual objective and use an emphatic approach to get an unbiased sample from this policy gradient, yielding the Generalized Off-Policy Actor-Critic (Geoff-PAC) algorithm. We demonstrate the merits of Geoff-PAC over existing algorithms in Mujoco robot simulation tasks, the first empirical success of emphatic algorithms in prevailing deep RL benchmarks. Papers published at the Neural Information Processing Systems Conference.

algorithm, generalized off-policy actor-critic, objective, (3 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.31)

Add feedback

Hierarchical Reinforcement Learning with Advantage-Based Auxiliary Rewards

Li, Siyuan, Wang, Rui, Tang, Minxue, Zhang, Chongjie

Neural Information Processing SystemsMar-18-2020, 21:01:22 GMT

Hierarchical Reinforcement Learning (HRL) is a promising approach to solving long-horizon problems with sparse and delayed rewards. Many existing HRL algorithms either use pre-trained low-level skills that are unadaptable, or require domain-specific information to define low-level rewards. In this paper, we aim to adapt low-level skills to downstream tasks while maintaining the generality of reward design. We propose an HRL framework which sets auxiliary rewards for low-level skill training based on the advantage function of the high-level policy. This auxiliary reward enables efficient, simultaneous learning of the high-level policy and low-level skills without using task-specific knowledge.

advantage-based auxiliary reward, hierarchical reinforcement learning, low-level skill, (1 more...)

Neural Information Processing Systems

Genre: Research Report (0.64)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.65)

Add feedback

Finding Friend and Foe in Multi-Agent Games

Serrino, Jack, Kleiman-Weiner, Max, Parkes, David C., Tenenbaum, Josh

Neural Information Processing SystemsMar-18-2020, 20:47:42 GMT

Recent breakthroughs in AI for multi-agent games like Go, Poker, and Dota, have seen great strides in recent years. Yet none of these games address the real-life challenge of cooperation in the presence of unknown and uncertain teammates. This challenge is a key game mechanism in hidden role games. Here we develop the DeepRole algorithm, a multi-agent reinforcement learning agent that we test on "The Resistance: Avalon", the most popular hidden role game. DeepRole combines counterfactual regret minimization (CFR) with deep value networks trained through self-play.

deeprole outperform, friend and foe, multi-agent game, (4 more...)

Neural Information Processing Systems

Industry: Leisure & Entertainment > Games (0.48)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Add feedback

Value Propagation for Decentralized Networked Deep Multi-agent Reinforcement Learning

Qu, Chao, Mannor, Shie, Xu, Huan, Qi, Yuan, Song, Le, Xiong, Junwu

Neural Information Processing SystemsMar-18-2020, 20:47:21 GMT

We consider the networked multi-agent reinforcement learning (MARL) problem in a fully decentralized setting, where agents learn to coordinate to achieve joint success. This problem is widely encountered in many areas including traffic control, distributed control, and smart grids. We assume each agent is located at a node of a communication network and can exchange information only with its neighbors. Using softmax temporal consistency, we derive a primal-dual decentralized optimization method and obtain a principled and data-efficient iterative algorithm named {\em value propagation}. We prove a non-asymptotic convergence rate of $\mathcal{O}(1/T)$ with nonlinear function approximation.

function approximation, networked deep multi-agent reinforcement learning, value propagation, (1 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Add feedback

Reconciling λ-Returns with Experience Replay

Daley, Brett, Amato, Christopher

Neural Information Processing SystemsMar-18-2020, 20:47:04 GMT

Modern deep reinforcement learning methods have departed from the incremental learning required for eligibility traces, rendering the implementation of the λ-return difficult in this context. In particular, off-policy methods that utilize experience replay remain problematic because their random sampling of minibatches is not conducive to the efficient calculation of λ-returns. Yet replay-based methods are often the most sample efficient, and incorporating λ-returns into them is a viable way to achieve new state-of-the-art performance. Towards this, we propose the first method to enable practical use of λ-returns in arbitrary replay-based methods without relying on other forms of decorrelation such as asynchronous gradient updates. By promoting short sequences of past transitions into a small cache within the replay memory, adjacent λ-returns can be efficiently precomputed by sharing Q-values.

experience replay, replay-based method, transition

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Add feedback

Regret Minimization for Reinforcement Learning with Vectorial Feedback and Complex Objectives

Cheung, Wang Chi

Neural Information Processing SystemsMar-18-2020, 20:32:55 GMT

We consider an agent who is involved in an online Markov decision process, and receives a vector of outcomes every round. The agent aims to simultaneously optimize multiple objectives associated with the multi-dimensional outcomes. Due to state transitions, it is challenging to balance the vectorial outcomes for achieving near-optimality. In particular, contrary to the single objective case, stationary policies are generally sub-optimal. We propose a no-regret algorithm based on the Frank-Wolfe algorithm (Frank and Wolfe 1956), UCRL2 (Jaksch et al. 2010), as well as a crucial and novel gradient threshold procedure.

regret minimization, reinforcement learning, vectorial feedback and complex objective, (1 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.40)

Add feedback

Trust Region-Guided Proximal Policy Optimization

Wang, Yuhui, He, Hao, Tan, Xiaoyang, Gan, Yaozhong

Neural Information Processing SystemsMar-18-2020, 20:32:24 GMT

Proximal policy optimization (PPO) is one of the most popular deep reinforcement learning (RL) methods, achieving state-of-the-art performance across a wide range of challenging tasks. However, as a model-free RL method, the success of PPO relies heavily on the effectiveness of its exploratory policy search. In this paper, we give an in-depth analysis on the exploration behavior of PPO, and show that PPO is prone to suffer from the risk of lack of exploration especially under the case of bad initialization, which may lead to the failure of training or being trapped in bad local optima. To address these issues, we proposed a novel policy optimization method, named Trust Region-Guided PPO (TRGPPO), which adaptively adjusts the clipping range within the trust region. We formally show that this method not only improves the exploration ability within the trust region but enjoys a better performance bound compared to the original PPO as well.

ppo, trust region-guided proximal policy optimization

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.64)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.64)

Add feedback

Staying up to Date with Online Content Changes Using Reinforcement Learning for Scheduling

Kolobov, Andrey, Peres, Yuval, Lu, Cheng, Horvitz, Eric J.

Neural Information Processing SystemsMar-18-2020, 20:32:10 GMT

From traditional Web search engines to virtual assistants and Web accelerators, services that rely on online information need to continually keep track of remote content changes by explicitly requesting content updates from remote sources (e.g., web pages). We propose a novel optimization objective for this setting that has several practically desirable properties, and efficient algorithms for it with optimality guarantees even in the face of mixed content change observability and initially unknown change model parameters. Experiments on 18.5M URLs crawled daily for 14 weeks show significant advantages of this approach over prior art. Papers published at the Neural Information Processing Systems Conference.

online content change, reinforcement learning, scheduling

Neural Information Processing Systems

Technology:

Information Technology > Information Management > Search (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.40)

Add feedback

Experience Replay for Continual Learning

Rolnick, David, Ahuja, Arun, Schwarz, Jonathan, Lillicrap, Timothy, Wayne, Gregory

Neural Information Processing SystemsMar-18-2020, 20:30:57 GMT

Interacting with a complex world involves continual learning, in which tasks and data distributions change over time. A continual learning system should demonstrate both plasticity (acquisition of new knowledge) and stability (preservation of old knowledge). Catastrophic forgetting is the failure of stability, in which new experience overwrites previous experience. In the brain, replay of past experience is widely believed to reduce forgetting, yet it has been largely overlooked as a solution to forgetting in deep reinforcement learning. Here, we introduce CLEAR, a replay-based method that greatly reduces catastrophic forgetting in multi-task reinforcement learning.

continual learning, experience replay, stability, (3 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.91)

Add feedback