Performance Bounds for Policy-Based Average Reward Reinforcement Learning Algorithms

Neural Information Processing Systems 

Many policy-based reinforcement learning (RL) algorithms can be viewed as instantiations of approximate policy iteration (PI), i.e., PI in which both the policy improvement and policy evaluation steps are performed approximately. In applications where the average reward objective is the meaningful performance metric, discounted reward formulations are often used with a discount factor close to 1, which is equivalent to making the expected horizon very large.
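
To make the last equivalence concrete, here is the standard calculation (a minimal sketch, assuming the common interpretation of discounting as geometric termination with continuation probability $\gamma$, a convention not stated above): a discount factor $\gamma \in (0,1)$ corresponds to a random horizon $T$ with $\Pr(T = t) = (1-\gamma)\,\gamma^{t-1}$, whose mean is
\[
  \mathbb{E}[T] \;=\; \sum_{t=1}^{\infty} t\,(1-\gamma)\,\gamma^{t-1} \;=\; \frac{1}{1-\gamma},
\]
so taking $\gamma \to 1$ makes the expected horizon $1/(1-\gamma)$ arbitrarily large.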