n-Step Temporal Difference Learning with Optimal n

Mandal, Lakshmi; Bhatnagar, Shalabh

arXiv.org Artificial Intelligence 

We consider the problem of finding the optimal value of n in the n-step temporal difference (TD) learning algorithm. We find the optimal n using a model-free optimization technique: a one-simulation simultaneous perturbation stochastic approximation (SPSA) based procedure that we adapt to the discrete optimization setting by using a random projection approach. We prove the convergence of our proposed algorithm, SDPSA, using a differential inclusions approach and show that it finds the optimal value of n in n-step TD. Through experiments, we show that the optimal value of n is achieved with SDPSA for arbitrary initial values.

I. INTRODUCTION

Reinforcement learning (RL) algorithms are widely used for solving problems of sequential decision-making under uncertainty. An RL agent typically makes decisions based on data that it collects through interactions with the environment in order to maximize a certain long-term reward [1], [2].
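For reference, the n-step TD update whose step count n is being tuned can be sketched as follows. This is a minimal illustration of the standard textbook n-step return for value prediction, not the SDPSA procedure itself. The function names `n_step_return` and `n_step_td_update`, the list-based episode layout, and the parameters `gamma` (discount factor) and `alpha` (step size) are assumptions made for this example and do not come from the paper.

```python
from typing import List, Sequence


def n_step_return(rewards: Sequence[float], values: Sequence[float],
                  t: int, n: int, gamma: float) -> float:
    """n-step return from time t (illustrative layout, assumed here).

    rewards[k] is the reward received on the transition out of state k,
    values[k] is the current estimate V(S_k) for the k-th visited state,
    and the episode terminates after len(rewards) transitions.
    """
    T = len(rewards)
    end = min(t + n, T)  # truncate the lookahead at the end of the episode
    g = 0.0
    for k in range(t, end):
        g += gamma ** (k - t) * rewards[k]
    if end < T:
        # Bootstrap from the value estimate n steps ahead (terminal value is 0).
        g += gamma ** (end - t) * values[end]
    return g


def n_step_td_update(values: List[float], rewards: Sequence[float],
                     t: int, n: int, gamma: float, alpha: float) -> float:
    """Return the updated estimate for the state visited at time t."""
    target = n_step_return(rewards, values, t, n, gamma)
    return values[t] + alpha * (target - values[t])


if __name__ == "__main__":
    rewards = [1.0, 1.0, 1.0]
    values = [0.5, 0.5, 0.5]
    for n in (1, 2, 5):
        print(n, n_step_return(rewards, values, t=0, n=n, gamma=0.5))
```

Small n leans heavily on the bootstrapped estimate, while large n approaches a Monte Carlo return. The choice of n therefore trades bias against variance, which is why selecting it is posed as an optimization problem.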
