One-Shot Averaging for Distributed TD($\lambda$) Under Markov Sampling
Tian, Haoxing, Paschalidis, Ioannis Ch., Olshevsky, Alex
arXiv.org Artificial Intelligence
Actor-critic methods achieve state-of-the-art performance in many domains, including robotics, game playing, and control systems (LeCun et al. (2015); Mnih et al. (2016); Silver et al. (2017)). Temporal Difference (TD) Learning may be thought of as a component of actor-critic, and better bounds for TD Learning are usually ingredients of actor-critic analyses. We consider the problem of policy evaluation in reinforcement learning: given a Markov Decision Process (MDP) and a policy, we need to estimate the value of each state (the expected discounted sum of all future rewards) under this policy. Policy evaluation is important because it is effectively a subroutine of many other algorithms, such as policy iteration and actor-critic. The main challenges for policy evaluation are that we usually do not know the underlying MDP directly and can only interact with it, and that the number of states is typically too large, forcing us to maintain a low-dimensional approximation of the true vector of state values.
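To make the policy evaluation setting concrete, the following is a minimal sketch of TD($\lambda$) with a linear (low-dimensional) value approximation, run along a single Markov trajectory, followed by a plain average of the final parameters from several independent agents. It is an illustration under stated assumptions, not the paper's algorithm or analysis: the function names `td_lambda` and `one_shot_average`, the constant step size, the toy three-state chain, and the feature vectors are all hypothetical, and the averaging step is only a guess at what "one-shot averaging" in the title refers to.

```python
import random


def td_lambda(P, r, phi, gamma, lam, alpha, steps, seed):
    """TD(lambda) with linear function approximation on one Markov trajectory.

    P[s] is the transition distribution from state s under the fixed policy,
    r[s] the reward collected in state s, phi[s] the feature vector of s.
    The value of s is approximated by the inner product of theta and phi[s].
    All of these inputs are illustrative assumptions.
    """
    rng = random.Random(seed)
    n_states = len(P)
    theta = [0.0] * len(phi[0])  # parameter vector
    z = [0.0] * len(phi[0])      # eligibility trace
    s = rng.randrange(n_states)
    for _ in range(steps):
        # Sample the next state from the chain (Markov sampling, not i.i.d.).
        s_next = rng.choices(range(n_states), weights=P[s])[0]
        v = sum(t * f for t, f in zip(theta, phi[s]))
        v_next = sum(t * f for t, f in zip(theta, phi[s_next]))
        delta = r[s] + gamma * v_next - v  # TD error
        z = [gamma * lam * zi + fi for zi, fi in zip(z, phi[s])]
        theta = [t + alpha * delta * zi for t, zi in zip(theta, z)]
        s = s_next
    return theta


def one_shot_average(thetas):
    """Average the agents' final parameter vectors once, at the end."""
    n = len(thetas)
    return [sum(col) / n for col in zip(*thetas)]


if __name__ == "__main__":
    # Hypothetical 3-state chain with 2-dimensional features.
    P = [[0.5, 0.5, 0.0], [0.25, 0.5, 0.25], [0.0, 0.5, 0.5]]
    r = [0.0, 1.0, 2.0]
    phi = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
    # Each agent runs independently on its own trajectory (its own seed).
    agents = [
        td_lambda(P, r, phi, gamma=0.9, lam=0.5, alpha=0.01, steps=5000, seed=k)
        for k in range(4)
    ]
    print(one_shot_average(agents))
```

In this sketch the agents never communicate while learning; the only interaction is the single average at the end, which is why each call to `td_lambda` takes its own seed.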
May-31-2024
- Country:
- North America > United States
- New York (0.04)
- Massachusetts > Suffolk County
- Boston (0.04)
- Genre:
- Research Report (0.40)