Efficient Exploration with Self-Imitation Learning via Trajectory-Conditioned Policy
Yijie Guo, Jongwook Choi, Marcin Moczulski, Samy Bengio, Mohammad Norouzi, Honglak Lee
arXiv.org Artificial Intelligence
This paper proposes a method for learning a trajectory-conditioned policy to imitate diverse demonstrations from the agent's own past experiences. We demonstrate that such self-imitation drives exploration in diverse directions and increases the chance of finding a globally optimal solution in reinforcement learning problems, especially when the reward is sparse and deceptive. Our method significantly outperforms existing self-imitation learning and count-based exploration methods on various sparse-reward reinforcement learning tasks with local optima. In particular, we report a state-of-the-art score of more than 25,000 points on Montezuma's Revenge without using expert demonstrations or resetting to arbitrary states.
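The abstract describes imitating diverse demonstrations drawn from the agent's own past experience. One way to picture this is a buffer that keeps the best past trajectory per discretized state region and samples across regions as demonstration targets. The sketch below is a hypothetical illustration under assumed details (the `cell_fn` discretization, the keep-highest-return rule, and uniform sampling over cells are illustrative choices, not the paper's exact design):

```python
import random

class TrajectoryBuffer:
    """Illustrative store of diverse past trajectories, one per "cell".

    A cell is a coarse discretization of a state; cell_fn is an assumed
    helper the caller supplies.
    """

    def __init__(self):
        self.best = {}  # cell -> (return, trajectory)

    def add(self, trajectory, ret, cell_fn):
        # Index the trajectory by the cell of its final state.
        cell = cell_fn(trajectory[-1])
        # Keep the higher-return trajectory reaching this cell.
        if cell not in self.best or ret > self.best[cell][0]:
            self.best[cell] = (ret, trajectory)

    def sample_demonstration(self):
        # Sampling uniformly over cells (rather than over trajectories)
        # revisits rarely reached regions, pushing exploration in
        # diverse directions as the abstract suggests.
        cell = random.choice(list(self.best.keys()))
        return self.best[cell][1]
```

A trajectory-conditioned policy would then be trained to reproduce (and extend past) the sampled demonstration; that conditioning step is omitted here.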
Jul-24-2019