Sample-Efficient Automated Deep Reinforcement Learning
Jörg K. H. Franke, Gregor Köhler, André Biedenkapp, Frank Hutter
Despite significant progress on challenging problems across various domains, applying state-of-the-art deep reinforcement learning (RL) algorithms remains difficult due to their sensitivity to the choice of hyperparameters. This sensitivity can partly be attributed to the non-stationarity of the RL problem, which may require different hyperparameter settings at different stages of the learning process. Additionally, in the RL setting, hyperparameter optimization (HPO) requires a large number of environment interactions, hindering the transfer of RL's successes to real-world applications. In this work, we tackle the issues of sample-efficient and dynamic HPO in RL. We propose a population-based automated RL (AutoRL) framework to meta-optimize arbitrary off-policy RL algorithms. By sharing the collected experience across the population, we substantially increase the sample efficiency of the meta-optimization. We demonstrate the capabilities of our sample-efficient AutoRL approach in a case study with the popular TD3 algorithm on the MuJoCo benchmark suite, where we reduce the number of environment interactions needed for meta-optimization by up to an order of magnitude compared to population-based training.

Deep reinforcement learning (RL) algorithms are often sensitive to the choice of internal hyperparameters (Jaderberg et al., 2017; Mahmood et al., 2018) and to the hyperparameters of the neural network architecture (Islam et al., 2017; Henderson et al., 2018), preventing them from being applied out-of-the-box to new environments. Tuning the hyperparameters of RL algorithms can quickly become very expensive, both in terms of computational cost and the number of required environment interactions. Sample efficiency is crucial especially in real-world applications (Lee et al., 2019). Hyperparameter optimization (HPO; Snoek et al., 2012; Feurer & Hutter, 2019) approaches often treat the algorithm under optimization as a black box, which in the RL setting requires a full training run every time a configuration is evaluated. This leads to poor sample efficiency in terms of environment interactions.
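The key mechanism behind the sample-efficiency gain is that every member of the population writes its environment transitions into a single shared replay buffer and trains off-policy from it, while a population-based exploit/explore step adapts hyperparameters between evaluation rounds. Below is a minimal, illustrative Python sketch of that loop; `SharedReplayBuffer`, `ToyAgent`, the toy reward function, and the single `lr` hyperparameter are hypothetical stand-ins for an actual off-policy learner such as TD3, not the authors' implementation.

```python
import random


class SharedReplayBuffer:
    """One transition buffer shared by the whole population: every member's
    environment interactions are stored here, and every member trains from it."""

    def __init__(self, capacity=100_000):
        self.capacity = capacity
        self.storage = []

    def add(self, transition):
        if len(self.storage) >= self.capacity:
            self.storage.pop(0)  # drop the oldest transition when full
        self.storage.append(transition)

    def sample(self, batch_size):
        return random.sample(self.storage, min(batch_size, len(self.storage)))


class ToyAgent:
    """Stand-in for an off-policy learner such as TD3: 'param' plays the role
    of the policy weights, 'lr' is the hyperparameter under meta-optimization."""

    def __init__(self, lr):
        self.lr = lr
        self.param = 0.0

    def act(self):
        return self.param + random.gauss(0.0, 0.1)  # noisy action for exploration

    def update(self, batch):
        # Nudge the 'policy' toward the highest-reward action in the batch.
        best_action = max(batch, key=lambda t: t[1])[0]
        self.param += self.lr * (best_action - self.param)


def reward(action):
    return -abs(action - 1.0)  # toy task: reward peaks at action == 1.0


def run(pop_size=4, generations=20, steps_per_gen=200, batch_size=64):
    buffer = SharedReplayBuffer()
    population = [ToyAgent(lr=10 ** random.uniform(-3, -1)) for _ in range(pop_size)]
    for _ in range(generations):
        scores = []
        for agent in population:
            total = 0.0
            for _ in range(steps_per_gen):
                action = agent.act()
                r = reward(action)
                buffer.add((action, r))  # experience is shared across the population
                agent.update(buffer.sample(batch_size))
                total += r
            scores.append(total / steps_per_gen)
        # Exploit/explore between generations: the worst member copies the
        # best member's weights and perturbs its hyperparameters.
        best, worst = scores.index(max(scores)), scores.index(min(scores))
        population[worst].param = population[best].param
        population[worst].lr = population[best].lr * random.choice([0.8, 1.25])
    return population


if __name__ == "__main__":
    for agent in run():
        print(f"lr={agent.lr:.4f}  policy param={agent.param:.3f}")
```

Because transitions gathered by any one member can be reused by every other member, a meta-optimization round costs roughly the environment interactions of a single training run rather than one run per population member, which is the source of the reported order-of-magnitude saving.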
October 5, 2020