An Adiabatic Theorem for Policy Tracking with TD-learning
Policy evaluation, and in particular temporal difference (TD) learning, is a key ingredient in reinforcement learning. Here the expected value of future rewards is estimated from simulations of a given policy. When a stationary policy is fixed, the simulated process is a time-homogeneous Markov chain, and the convergence of the policy evaluation algorithm is analyzed using stochastic approximation techniques under asynchronous Markovian updates. There is a well-developed theory of stochastic approximation that establishes the convergence of a variety of policy evaluation schemes in this setting.
We evaluate the ability of temporal difference learning to track the reward function of a policy as it changes over time. Our results apply a new adiabatic theorem that bounds the mixing time of time-inhomogeneous Markov chains. We derive finite-time bounds for tabular temporal difference learning and $Q$-learning when the policy used for training changes over time. To achieve this, we develop bounds for stochastic approximation under asynchronous adiabatic updates.
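To make the tracking setting concrete, the following is a minimal sketch of tabular TD(0) run along a single trajectory while the policy drifts slowly. It is an illustration only, not the algorithm or analysis of the paper: the two-state MDP (`P`, `r`), the linear drift schedule `drifting_policy`, the constant step size `alpha`, and the helper names `policy_value` and `td0_tracking` are all hypothetical choices made for this example. Updates are asynchronous in the sense that only the currently visited state is updated at each step.

```python
import numpy as np

def policy_value(P, r, pi, gamma):
    """Exact value of a fixed policy pi[s, a], by solving the Bellman equation."""
    P_pi = np.einsum("sa,sat->st", pi, P)   # state-to-state transition matrix
    r_pi = np.einsum("sa,sa->s", pi, r)     # expected one-step reward per state
    return np.linalg.solve(np.eye(len(r_pi)) - gamma * P_pi, r_pi)

def td0_tracking(P, r, policy_at, gamma, alpha, steps, seed=0):
    """Tabular TD(0) along one trajectory; the policy may change with t."""
    rng = np.random.default_rng(seed)
    n_states, n_actions = r.shape
    V = np.zeros(n_states)
    s = 0
    for t in range(steps):
        pi = policy_at(t)                         # policy in force at time t
        a = rng.choice(n_actions, p=pi[s])
        s_next = rng.choice(n_states, p=P[s, a])
        # Asynchronous update: only the visited state s is changed.
        V[s] += alpha * (r[s, a] + gamma * V[s_next] - V[s])
        s = s_next
    return V

# Hypothetical two-state, two-action MDP: P[s, a, s'] and r[s, a].
P = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.6, 0.4], [0.2, 0.8]]])
r = np.array([[1.0, 0.0],
              [0.0, 2.0]])
steps = 20000

def drifting_policy(t):
    """Assumed drift: probability of action 1 grows linearly from 0 to 1."""
    w = t / steps
    return np.array([[1 - w, w], [1 - w, w]])

V = td0_tracking(P, r, drifting_policy, gamma=0.9, alpha=0.05, steps=steps)
V_target = policy_value(P, r, drifting_policy(steps), gamma=0.9)
print("TD estimate after drift:", V)
print("value of final policy:  ", V_target)
```

Comparing the TD estimate with the exact value of the policy in force at the end of the run gives a rough empirical view of the tracking error. Intuitively, slower drift (more steps for the same total change in the policy) leaves the estimate more time to follow the moving target, which is the regime the abstract's adiabatic bounds address.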