Supervised Policy Update

Vuong, Quan Ho, Zhang, Yiming, Ross, Keith W.

May-29-2018–arXiv.org Artificial Intelligence

We propose a new sample-efficient methodology, called Supervised Policy Update (SPU), for deep reinforcement learning. Starting with data generated by the current policy, SPU optimizes over the proximal policy space to find a non-parameterized policy. It then solves a supervised regression problem to convert the non-parameterized policy to a parameterized policy, from which it draws new samples. There is significant flexibility in setting the labels in the supervised regression problem, with different settings corresponding to different underlying optimization problems. We develop a methodology for finding an optimal policy in the non-parameterized policy space, and show how Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO) can be addressed by this methodology. In terms of sample efficiency, our experiments show SPU can outperform PPO for simulated robotic locomotion tasks.

artificial intelligence, machine learning, reinforcement learning, (17 more...)

arXiv.org Artificial Intelligence

May-29-2018

arXiv.org PDF

Add feedback

Country:
- North America > United States (0.28)
- Asia (0.28)

Genre:
- Research Report (0.82)

Technology:
- Information Technology > Artificial Intelligence
  - Representation & Reasoning > Optimization (0.91)
  - Machine Learning
    - Reinforcement Learning (1.00)
    - Neural Networks (0.69)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found