Compatible Natural Gradient Policy Search
Joni Pajarinen, Hong Linh Thai, Riad Akrour, Jan Peters, Gerhard Neumann
Trust-region methods have yielded state-of-the-art results in policy search. A common approach is to use the KL divergence to bound the trust region, which results in a natural gradient policy update. We show that natural gradient and trust-region optimization are equivalent if we use the natural parameterization of a standard exponential policy distribution in combination with compatible value function approximation. Moreover, we show that standard natural gradient updates may reduce the entropy of the policy according to a wrong schedule, leading to premature convergence. To control entropy reduction, we introduce a new policy search method called compatible policy search (COPOS), which bounds entropy loss. The experimental results show that COPOS yields state-of-the-art results in challenging continuous control tasks and in discrete partially observable tasks.
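To make the link between a KL bound and a natural gradient step concrete, the sketch below shows the generic textbook construction, not the COPOS algorithm itself. It assumes the KL divergence is approximated to second order by 0.5 * d^T F d, where F is the Fisher information matrix, and it computes the entropy of a Gaussian policy, the quantity whose loss the abstract says COPOS bounds. The function names, the example matrix, and the bound `epsilon` are illustrative assumptions.

```python
import numpy as np

def natural_gradient_step(grad, fisher, epsilon):
    """Step d that maximizes grad @ d subject to 0.5 * d @ fisher @ d <= epsilon.

    Assumes `fisher` is symmetric positive definite and that the KL
    divergence is replaced by its second-order approximation.
    """
    nat_grad = np.linalg.solve(fisher, grad)            # F^{-1} g
    scale = np.sqrt(2.0 * epsilon / (grad @ nat_grad))  # makes the bound tight
    return scale * nat_grad

def gaussian_entropy(cov):
    """Differential entropy of a multivariate Gaussian with covariance `cov`."""
    k = cov.shape[0]
    _, logdet = np.linalg.slogdet(cov)
    return 0.5 * (k * np.log(2.0 * np.pi * np.e) + logdet)

# Hypothetical numbers, chosen only to exercise the functions.
F = np.array([[2.0, 0.5], [0.5, 1.0]])
g = np.array([1.0, -1.0])
d = natural_gradient_step(g, F, epsilon=0.01)
approx_kl = 0.5 * d @ F @ d  # equals epsilon up to rounding
```

In this construction the step direction is the natural gradient and the trust-region size only sets its length. Nothing in the step itself limits how quickly the policy covariance, and hence the entropy, shrinks, which is the gap the abstract's entropy bound is meant to address.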
Feb-7-2019
- Country:
- North America > United States
- Massachusetts > Middlesex County > Cambridge (0.04)
- Europe
- United Kingdom > England
- Lincolnshire > Lincoln (0.04)
- Cambridgeshire > Cambridge (0.04)
- Germany
- Hesse > Darmstadt Region
- Darmstadt (0.04)
- Baden-Württemberg > Tübingen Region
- Tübingen (0.04)
- Genre:
- Research Report > New Finding (0.88)