n-Step Temporal Difference Learning with Optimal n
Lakshmi Mandal and Shalabh Bhatnagar
arXiv.org Artificial Intelligence
We consider the problem of finding the optimal value of n in the n-step temporal difference (TD) learning algorithm. We find the optimal n by resorting to a model-free optimization technique involving a one-simulation simultaneous perturbation stochastic approximation (SPSA) based procedure, which we adapt to the discrete optimization setting by using a random projection approach. We prove the convergence of our proposed algorithm, SDPSA, using a differential inclusions approach and show that it finds the optimal value of n in n-step TD. Through experiments, we show that the optimal value of n is achieved with SDPSA for arbitrary initial values.

I. INTRODUCTION

Reinforcement learning (RL) algorithms are widely used for solving problems of sequential decision-making under uncertainty. An RL agent typically makes decisions based on data that it collects through interactions with the environment in order to maximize a certain long-term reward [1], [2].
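To make concrete what quantity SDPSA is tuning, the following minimal sketch implements plain n-step TD prediction on a hypothetical 5-state random-walk chain (a standard illustrative MRP; the environment, step sizes, and episode counts here are illustrative assumptions, not the paper's experimental setup). The n-step return bootstraps from the value estimate n steps ahead, and the choice of n trades off bias and variance, which is exactly the discrete parameter the paper's SPSA-based procedure optimizes.

```python
import random

# Hypothetical random-walk MRP: states 1..5 are nonterminal, 0 and 6 are
# terminal. Episodes start in state 3 and move left/right uniformly at
# random; reward is +1 on reaching state 6, else 0. The true value of
# state i under this dynamics is i/6.

def run_episode(rng):
    """Simulate one episode; return the visited states and rewards."""
    states, rewards = [3], []
    s = 3
    while s not in (0, 6):
        s += rng.choice((-1, 1))
        rewards.append(1.0 if s == 6 else 0.0)
        states.append(s)
    return states, rewards

def n_step_td(n, episodes=5000, alpha=0.1, gamma=1.0, seed=0):
    """n-step TD prediction: update V(S_t) toward the n-step return
    G_t = R_{t+1} + ... + gamma^{n-1} R_{t+n} + gamma^n V(S_{t+n})."""
    rng = random.Random(seed)
    V = [0.0] * 7  # V[0] and V[6] are terminal and stay 0
    for _ in range(episodes):
        states, rewards = run_episode(rng)
        T = len(rewards)
        for t in range(T):
            # Sum up to n discounted rewards, truncating at episode end.
            end = min(t + n, T)
            G = sum(gamma ** (i - t) * rewards[i] for i in range(t, end))
            # Bootstrap from the value estimate n steps ahead, if any.
            if t + n < T:
                G += gamma ** n * V[states[t + n]]
            V[states[t]] += alpha * (G - V[states[t]])
    return V

V = n_step_td(n=4)
```

Rerunning `n_step_td` for different values of n and comparing the resulting value-estimation error is the brute-force version of the search that SDPSA performs adaptively via one-simulation SPSA with random projections onto the discrete set of admissible n.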
Apr-14-2023