Adapting Behaviour for Learning Progress

Schaul, Tom, Borsa, Diana, Ding, David, Szepesvari, David, Ostrovski, Georg, Dabney, Will, Osindero, Simon

arXiv.org Artificial Intelligence 

ABSTRACT

Determining what experience to generate to best facilitate learning (i.e. exploration) [...]. The advent of distributed agents that interact with parallel instances of the environment has enabled larger scales and greater flexibility, but has not removed the need to tune exploration to the task, because the ideal data for the learning algorithm necessarily depends on its process of learning. We propose to dynamically adapt the data generation by using a non-stationary multi-armed bandit to optimize a proxy of the learning progress. The data distribution is controlled by modulating multiple parameters of the policy (such as stochasticity, consistency or optimism) without significant overhead. The adaptation speed of the bandit can be increased by exploiting the factored modulation structure. We demonstrate on a suite of Atari 2600 games how this unified approach produces results comparable to per-task tuning at a fraction of the cost.

1 INTRODUCTION

Reinforcement learning (RL) is a general formalism modelling sequential decision making, which supports making minimal assumptions about the task at hand and reducing the need for prior knowledge. By learning behaviour from scratch, RL agents have the potential to surpass human expertise or tackle complex domains where human intuition is not applicable. In practice, however, generality is often traded for performance and efficiency, with RL practitioners tuning algorithms, architectures and hyper-parameters to the task at hand (Hessel et al., 2019). A side-effect is that the resulting methods can be brittle, or difficult to reliably reproduce (Nagarajan et al., 2018). Exploration is one of the main aspects commonly designed or tuned specifically for the task being solved.
Previous work has shown that large sample-efficiency gains are possible, for example, when the exploratory behaviour's level of stochasticity is adjusted to the environment's hazard rate (García & Fernández, 2015), or when an appropriate prior is used in large action spaces (Dulac-Arnold et al., 2015; Czarnecki et al., 2018; Vinyals et al., 2019). Exploration in the presence of function approximation should ideally be agent-centred. It ought to focus more on generating data that supports the agent's learning at its current parameters θ, rather than making progress on objective measurements of information gathering.
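
To make the abstract's proposal more concrete, the following is a minimal sketch of a non-stationary bandit that picks a behaviour modulation and is rewarded with a proxy of learning progress. It is an illustration under stated assumptions, not the paper's implementation: it uses a sliding-window UCB rule (one common non-stationary bandit, chosen here for brevity), the set of ε values is hypothetical, and the proxy signal is a synthetic stand-in whose best arm switches halfway through the run.

```python
import math
import random
from collections import deque


class SlidingWindowUCB:
    """Non-stationary bandit: UCB statistics use only the most recent pulls."""

    def __init__(self, num_arms, window=50, bonus=1.0):
        self.num_arms = num_arms
        self.bonus = bonus
        # Old (arm, reward) pairs fall out of the deque automatically.
        self.history = deque(maxlen=window)

    def select(self):
        counts = [0] * self.num_arms
        sums = [0.0] * self.num_arms
        for arm, reward in self.history:
            counts[arm] += 1
            sums[arm] += reward
        # Any arm that is absent from the window is tried first.
        for arm in range(self.num_arms):
            if counts[arm] == 0:
                return arm
        total = len(self.history)
        scores = [
            sums[a] / counts[a]
            + self.bonus * math.sqrt(math.log(total) / counts[a])
            for a in range(self.num_arms)
        ]
        return max(range(self.num_arms), key=scores.__getitem__)

    def update(self, arm, reward):
        self.history.append((arm, reward))


def run_demo(steps=400, seed=0):
    rng = random.Random(seed)
    # Hypothetical modulation set: candidate epsilon values for the policy.
    epsilons = [0.0, 0.01, 0.1, 0.5]
    bandit = SlidingWindowUCB(num_arms=len(epsilons), window=50, bonus=0.5)
    choices = []
    for t in range(steps):
        arm = bandit.select()
        # Synthetic proxy (an assumption): the most useful modulation
        # changes from arm 3 to arm 1 halfway through the run.
        best = 3 if t < steps // 2 else 1
        proxy = (1.0 if arm == best else 0.2) + rng.gauss(0.0, 0.05)
        bandit.update(arm, proxy)
        choices.append(arm)
    return epsilons, choices
```

Calling `run_demo()` returns the candidate ε values and the arm chosen at each step. Because the statistics are windowed, the bandit moves away from the initially preferred modulation once the synthetic proxy stops favouring it. In an actual agent the proxy would come from the learner itself rather than from a fixed schedule.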
