Dynamic Learning Rate for Deep Reinforcement Learning: A Bandit Approach
Donâncio, Henrique, Barrier, Antoine, South, Leah F., Forbes, Florence
Reinforcement Learning (RL), when combined with function approximators such as Artificial Neural Networks (ANNs), has shown success in learning policies that outperform humans in complex games by leveraging extensive datasets (see, e.g., [33, 19, 39, 40]). While ANNs had previously been used as value function approximators [29], the introduction of Deep Q-Networks (DQN) [24, 25] marked a significant breakthrough by improving learning stability through two mechanisms: the target network and experience replay. Experience replay (see [22]) stores the agent's interactions with the environment, allowing past interactions to be sampled at random, which breaks their temporal correlation. The target network further stabilizes learning by periodically copying the parameters of the learning network. This is crucial because the Bellman update (using estimates to update other estimates) would otherwise rely on the very network being updated, potentially causing divergence. With a target network, gradient steps are directed towards a periodically fixed target, yielding a more stable learning process. Additionally, the learning rate hyperparameter controls the magnitude of these gradient steps in optimizers such as stochastic gradient descent, affecting training convergence. The learning rate is one of the most important hyperparameters, with previous work demonstrating that decreasing its value during policy fine-tuning can improve performance by up to 25% in vanilla DQN [3].
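As a concrete illustration of these mechanisms, the following minimal sketch shows a replay buffer, a periodically synchronized target network, and the learning rate passed to a stochastic gradient descent optimizer. It assumes a PyTorch-style setup; the names QNetwork, ReplayBuffer, dqn_update, and sync_target are hypothetical and do not correspond to the paper's code.

# Minimal illustrative sketch (not the paper's code) of the two DQN
# stabilization mechanisms described above, plus the learning-rate
# hyperparameter, assuming a PyTorch-style setup.
import copy
import random
from collections import deque

import torch
import torch.nn as nn


class QNetwork(nn.Module):
    """Small MLP approximating Q(s, a)."""

    def __init__(self, obs_dim: int, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions)
        )

    def forward(self, x):
        return self.net(x)


class ReplayBuffer:
    """Stores past transitions; sampling them at random breaks their
    temporal correlation."""

    def __init__(self, capacity: int = 100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size: int):
        return random.sample(self.buffer, batch_size)


obs_dim, n_actions, gamma = 4, 2, 0.99
online = QNetwork(obs_dim, n_actions)
target = copy.deepcopy(online)  # periodically refreshed copy of the online network

# The learning rate (lr) sets the magnitude of each gradient step; the
# abstract's point is that adapting it (e.g. decaying it during policy
# fine-tuning) can substantially affect performance.
optimizer = torch.optim.SGD(online.parameters(), lr=1e-3)


def dqn_update(obs, act, rew, next_obs, done):
    """One gradient step on a sampled mini-batch
    (obs/next_obs: [B, obs_dim], act: [B] long, rew/done: [B] float)."""
    q = online(obs).gather(1, act.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # Bellman target computed with the periodically fixed network,
        # not the network being updated, to avoid chasing a moving target.
        max_next_q = target(next_obs).max(dim=1).values
        bellman_target = rew + gamma * (1.0 - done) * max_next_q
    loss = nn.functional.mse_loss(q, bellman_target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()


def sync_target():
    """Copy the learning network's parameters into the target network
    (called every K environment steps)."""
    target.load_state_dict(online.state_dict())

If a decaying learning rate is desired, a standard scheduler such as torch.optim.lr_scheduler.StepLR can be attached to the optimizer to reduce the rate over the course of training or fine-tuning.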
arXiv.org Artificial Intelligence
Oct-16-2024