Meta-Gradient Reinforcement Learning
Zhongwen Xu, Hado P. van Hasselt, David Silver
Neural Information Processing Systems
The goal of reinforcement learning algorithms is to estimate and/or optimise the value function. However, unlike supervised learning, no teacher or oracle is available to provide the true value function. Instead, the majority of reinforcement learning algorithms estimate and/or optimise a proxy for the value function. This proxy is typically based on a sampled and bootstrapped approximation to the true value function, known as a return. The particular choice of return is one of the chief components determining the nature of the algorithm: the rate at which future rewards are discounted; when and how values should be bootstrapped; or even the nature of the rewards themselves.
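To make the notion of a return concrete, the following is a minimal sketch of an n-step bootstrapped return, showing how the discount rate and the bootstrapping point together define the learning target. The function name `n_step_return` and its parameters (`rewards`, `bootstrap_value`, `gamma`) are illustrative assumptions and are not taken from the paper.

```python
from typing import Sequence


def n_step_return(rewards: Sequence[float], bootstrap_value: float, gamma: float) -> float:
    """Discounted sum of sampled rewards plus a discounted bootstrap value.

    rewards:         sampled rewards r_1, ..., r_n (hypothetical input)
    bootstrap_value: current value estimate at the state reached after n steps
    gamma:           discount rate in [0, 1]
    """
    g = bootstrap_value
    # Work backwards: each step adds its reward and discounts everything after it.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


if __name__ == "__main__":
    # Two sampled rewards, then bootstrap from a value estimate of 10.
    print(n_step_return([1.0, 1.0], bootstrap_value=10.0, gamma=0.5))
```

Changing `gamma` alters how strongly future rewards count, and changing the length of `rewards` alters when the value estimate is bootstrapped. These are two of the choices the abstract identifies as determining the nature of the algorithm.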
Oct-7-2024, 07:01:06 GMT