An Adiabatic Theorem for Policy Tracking with TD-learning

Oct-30-2020–arXiv.org Artificial Intelligence

Policy evaluation and, in particular, temporal difference (TD) learning is a key ingredient in reinforcement learning. Here the expected value of future rewards is estimated from simulations of a given policy. When a stationary policy is fixed, the simulated process is a time-homogeneous Markov chain. The convergence of the policy evaluation algorithm is analyzed using stochastic approximation techniques under asynchronous Markovian updates. There is a well-developed theory of stochastic approximation that establishes the convergence of a variety of policy evaluation schemes.
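As a rough illustration of the fixed-policy setting described above, the following is a minimal sketch of tabular TD(0) run along a single trajectory of a time-homogeneous Markov chain, with only the currently visited state updated at each step. It is not the paper's algorithm or analysis. The two-state transition matrix, the rewards, the discount factor, the step-size schedule, and the function name `td0_policy_evaluation` are all assumptions chosen for illustration.

```python
import numpy as np


def td0_policy_evaluation(P, r, gamma=0.9, n_steps=100_000, seed=0):
    """Tabular TD(0) along one trajectory of a fixed Markov chain.

    P is the transition matrix induced by a fixed stationary policy,
    r[s] is the reward received on leaving state s (an assumed convention).
    """
    rng = np.random.default_rng(seed)
    n_states = P.shape[0]
    V = np.zeros(n_states)        # value estimates, one per state
    visits = np.zeros(n_states)   # per-state visit counts
    s = 0                         # arbitrary start state
    for _ in range(n_steps):
        s_next = rng.choice(n_states, p=P[s])
        visits[s] += 1
        # Assumed step size: decays polynomially in the visit count,
        # so the steps sum to infinity while their squares are summable.
        alpha = visits[s] ** -0.7
        # Asynchronous update: only the visited state changes.
        V[s] += alpha * (r[s] + gamma * V[s_next] - V[s])
        s = s_next
    return V


# Hypothetical two-state chain and rewards.
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
r = np.array([1.0, 0.0])

V_td = td0_policy_evaluation(P, r)
# Reference solution of the Bellman equation V = r + gamma * P V.
V_exact = np.linalg.solve(np.eye(2) - 0.9 * P, r)
print("TD(0) estimate:", V_td)
print("Exact values:  ", V_exact)
```

Comparing `V_td` with `V_exact` gives a quick sanity check that the simulated estimates approach the solution of the linear Bellman equation for this small chain.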

convergence, Markov chain, transition matrix, (13 more...)

Country:
- Europe
  - Montenegro (0.04)
  - United Kingdom > England
    - Cambridgeshire > Cambridge (0.04)

Genre:
- Research Report > New Finding (0.46)

Technology:
- Information Technology > Artificial Intelligence > Machine Learning
  - Reinforcement Learning (1.00)
  - Learning Graphical Models > Undirected Networks
    - Markov Models (0.52)
