Zeroth-Order Supervised Policy Improvement

Hao Sun, Ziping Xu, Yuhang Song, Meng Fang, Jiechao Xiong, Bo Dai, Zhengyou Zhang, Bolei Zhou

arXiv.org Artificial Intelligence 

Despite the remarkable progress made by policy gradient algorithms in reinforcement learning (RL), sub-optimal policies usually result from the local exploration property of the policy gradient update. In this work, we propose a method referred to as Zeroth-Order Supervised Policy Improvement (ZOSPI) that exploits the estimated value function Q globally while preserving the local exploitation of policy gradient methods. We prove that, given a suitable function structure, the zeroth-order optimization strategy combining both local and global sampling can find the global minima within a polynomial number of samples. To improve exploration efficiency in unknown environments, ZOSPI is further combined with bootstrapped Q networks. Different from standard policy gradient methods, the policy learning of ZOSPI is conducted in a self-supervised manner, so the policy can be implemented with gradient-free non-parametric models in addition to neural network approximators. Experiments show that ZOSPI achieves competitive results on MuJoCo locomotion tasks with remarkable sample efficiency.
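The abstract describes the core idea of ZOSPI: evaluate globally and locally sampled actions under the learned Q function, then regress the policy toward the best-scoring action with a supervised loss instead of a policy gradient. Below is a minimal sketch of such an update step, assuming a learned Q network `q_net(s, a)` and a deterministic policy `policy(s)`; the function name, argument names, and hyperparameters (`num_global`, `num_local`, `sigma`) are illustrative and not taken from the paper's code.

```python
# Hypothetical sketch of a ZOSPI-style policy improvement step (not the authors' code).
import torch


def zospi_policy_update(policy, q_net, policy_opt, states,
                        action_low, action_high,
                        num_global=16, num_local=16, sigma=0.1):
    batch, state_dim = states.shape
    act_dim = action_low.shape[0]

    with torch.no_grad():
        # Global samples: actions drawn uniformly over the action space.
        global_a = torch.rand(batch, num_global, act_dim) * (action_high - action_low) + action_low
        # Local samples: Gaussian perturbations around the current policy's action.
        mu = policy(states).unsqueeze(1)
        local_a = (mu + sigma * torch.randn(batch, num_local, act_dim)).clamp(
            min=action_low, max=action_high)
        candidates = torch.cat([global_a, local_a], dim=1)  # shape (B, K, A)

        # Score every candidate with the learned Q function and keep the best one per state.
        s_rep = states.unsqueeze(1).expand(-1, candidates.shape[1], -1)
        q_vals = q_net(s_rep.reshape(-1, state_dim),
                       candidates.reshape(-1, act_dim)).reshape(batch, -1)
        best_a = candidates[torch.arange(batch), q_vals.argmax(dim=1)]

    # Supervised regression update: move the policy toward the best-Q action
    # instead of following the policy gradient.
    loss = torch.nn.functional.mse_loss(policy(states), best_a)
    policy_opt.zero_grad()
    loss.backward()
    policy_opt.step()
    return loss.item()
```

In this sketch the global samples provide the coarse, exploration-friendly coverage of the action space, while the local perturbations around the current policy retain the fine-grained exploitation that the abstract attributes to policy gradient methods.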
