A reinterpretation of the policy oscillation phenomenon in approximate policy iteration

Feb-15-2020, 00:10:55 GMT–Neural Information Processing Systems

A majority of approximate dynamic programming approaches to the reinforcement learning problem can be categorized into greedy value function methods and value-based policy gradient methods. The former approach, although fast, is well known to be susceptible to the policy oscillation phenomenon. We take a fresh view of this phenomenon by casting a considerable subset of the former approach as a limiting special case of the latter. We explain the phenomenon in terms of this view and illustrate the underlying mechanism with artificial examples. We also use this view to derive the constrained natural actor-critic algorithm, which can interpolate between the two approaches.
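The idea of a greedy method being a limiting case of a smoother policy can be illustrated with a minimal sketch. It assumes a Gibbs (softmax) policy over action values with a temperature parameter, which the abstract itself does not specify; the function names `soft_greedy_policy` and `greedy_policy` and the action values are hypothetical and chosen only for illustration. As the temperature shrinks, the stochastic policy concentrates on the action with the highest value, which is the greedy choice.

```python
import math


def soft_greedy_policy(q_values, temperature):
    """Gibbs (softmax) action probabilities for one state (illustrative assumption)."""
    best = max(q_values)
    # Subtract the maximum before exponentiating to keep the exponent non-positive.
    weights = [math.exp((q - best) / temperature) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


def greedy_policy(q_values):
    """Deterministic policy that puts all probability on the highest-valued action."""
    best_action = max(range(len(q_values)), key=lambda a: q_values[a])
    return [1.0 if a == best_action else 0.0 for a in range(len(q_values))]


# Hypothetical action values for a single state.
q_values = [1.0, 1.2, 0.5]

# Lower temperatures move the soft policy toward the greedy one.
for temperature in (1.0, 0.1, 0.01):
    probabilities = soft_greedy_policy(q_values, temperature)
    print(temperature, [round(p, 3) for p in probabilities])

print("greedy", greedy_policy(q_values))
```

In this sketch the temperature acts as the interpolation knob: large values give a smooth policy that changes gradually with the value estimates, while values near zero reproduce the abrupt switching of a greedy policy.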

approximate policy iteration, policy oscillation phenomenon, reinterpretation

Technology:
- Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.69)