An Adiabatic Theorem for Policy Tracking with TD-learning

Walton, Neil

arXiv.org Artificial Intelligence 

Policy evaluation, and in particular temporal difference (TD) learning, is a key ingredient in reinforcement learning. In policy evaluation, the expected value of future rewards is estimated from simulations of a given policy. When a stationary policy is fixed, the simulated process is a time-homogeneous Markov chain. The convergence of the policy evaluation algorithm is then analyzed using stochastic approximation techniques under asynchronous Markovian updates. There is a well-developed theory of stochastic approximation that establishes the convergence of a variety of policy evaluation schemes.
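To make the fixed-policy setting concrete, the following is a minimal sketch of tabular TD(0) run along a single trajectory of a time-homogeneous Markov chain. It is not taken from the paper: the two-state transition matrix `P`, the reward vector `r`, the discount factor `gamma`, the step-size exponent 0.7, and the function name `td0_evaluate` are all assumptions chosen for illustration. Updates are asynchronous in the sense that only the currently visited state is changed at each step.

```python
import random


def td0_evaluate(P, r, gamma, n_steps, seed=0):
    """Tabular TD(0) along one trajectory of a fixed Markov chain.

    P[s] is the row of transition probabilities out of state s (the chain
    induced by a fixed stationary policy), r[s] is the reward collected in
    state s, and gamma is the discount factor. All inputs are illustrative.
    """
    rng = random.Random(seed)
    n_states = len(r)
    V = [0.0] * n_states        # value estimates, one per state
    visits = [0] * n_states     # per-state visit counts for the step size
    s = 0
    for _ in range(n_steps):
        # Simulate one transition of the chain.
        s_next = rng.choices(range(n_states), weights=P[s])[0]
        # Assumed step-size schedule: alpha = (visit count)^(-0.7).
        visits[s] += 1
        alpha = visits[s] ** -0.7
        # Asynchronous update: only the visited state s is changed.
        td_error = r[s] + gamma * V[s_next] - V[s]
        V[s] += alpha * td_error
        s = s_next
    return V


# Hypothetical two-state example.
P = [[0.9, 0.1],
     [0.5, 0.5]]
r = [1.0, 0.0]
gamma = 0.9

# Solving V = r + gamma * P V by hand for this chain gives
# V = (8.59375, 7.03125); the TD(0) estimates should land near these values.
values = td0_evaluate(P, r, gamma, n_steps=200_000)
print(values)
```

The per-state, polynomially decaying step size is one common choice in stochastic approximation; other schedules that sum to infinity while their squares sum to a finite value would serve equally well in this sketch.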
