An Adiabatic Theorem for Policy Tracking with TD-learning
arXiv.org Artificial Intelligence
Policy evaluation, and in particular temporal difference (TD) learning, is a key ingredient in reinforcement learning. In policy evaluation, the expected value of future rewards is estimated from simulations of a given policy. When a stationary policy is fixed, the simulated process is a time-homogeneous Markov chain. The convergence of the policy evaluation algorithm is then analyzed using stochastic approximation techniques under asynchronous Markovian updates. There is a well-developed theory of stochastic approximation that establishes the convergence of a variety of policy evaluation schemes.
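The fixed-policy setting described above can be illustrated with a minimal sketch of tabular TD(0). It is not the paper's algorithm or experimental setup. The two-state chain `P`, the reward vector `r`, the discount `gamma`, and the step-size schedule `visits ** -0.7` are all hypothetical choices made for illustration. Only the state visited along the single simulated trajectory is updated at each step, which is the sense in which the updates are asynchronous and Markovian.

```python
import random


def exact_values(P, r, gamma, sweeps=2000):
    """Reference solution of V = r + gamma * P V by repeated Bellman sweeps."""
    n = len(P)
    V = [0.0] * n
    for _ in range(sweeps):
        V = [r[s] + gamma * sum(P[s][t] * V[t] for t in range(n)) for s in range(n)]
    return V


def td0_evaluate(P, r, gamma, num_steps, seed=0):
    """Tabular TD(0) along one trajectory of a time-homogeneous Markov chain.

    P[s][t] is the transition probability under the fixed policy and r[s] is
    the reward collected in state s (both are illustrative assumptions).
    """
    rng = random.Random(seed)
    n = len(P)
    V = [0.0] * n
    visits = [0] * n
    s = 0
    for _ in range(num_steps):
        # Simulate one step of the chain induced by the fixed policy.
        s_next = rng.choices(range(n), weights=P[s])[0]
        visits[s] += 1
        # Assumed per-state step size: decays slowly enough that the steps sum
        # to infinity while their squares sum to a finite value.
        alpha = visits[s] ** -0.7
        # TD(0) update applied only to the state just visited.
        td_error = r[s] + gamma * V[s_next] - V[s]
        V[s] += alpha * td_error
        s = s_next
    return V


if __name__ == "__main__":
    P = [[0.9, 0.1], [0.2, 0.8]]  # hypothetical two-state chain
    r = [1.0, 0.0]                # hypothetical per-state rewards
    gamma = 0.9
    print("TD(0) estimate:", td0_evaluate(P, r, gamma, num_steps=200_000))
    print("Reference     :", exact_values(P, r, gamma))
```

With enough simulated steps, the TD(0) estimate should approach the reference values. How closely it does so depends on the seed and on the assumed step-size schedule.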
Oct-30-2020