Efficient Exploration with Self-Imitation Learning via Trajectory-Conditioned Policy
Yijie Guo, Jongwook Choi, Marcin Moczulski, Samy Bengio, Mohammad Norouzi, Honglak Lee
arXiv.org Artificial Intelligence
This paper proposes a method for learning a trajectory-conditioned policy to imitate diverse demonstrations from the agent's own past experiences. We demonstrate that such self-imitation drives exploration in diverse directions and increases the chance of finding a globally optimal solution in reinforcement learning problems, especially when the reward is sparse and deceptive. Our method significantly outperforms existing self-imitation learning and count-based exploration methods on various sparse-reward reinforcement learning tasks with local optima. In particular, we report a state-of-the-art score of more than 25,000 points on Montezuma's Revenge without using expert demonstrations or resetting to arbitrary states.
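The abstract describes imitating diverse demonstrations drawn from the agent's own past experience. One way to picture this is a buffer that keeps the best past trajectory per discretized state region and samples across regions as demonstration targets. The sketch below is a hypothetical illustration under assumed details (the `cell_fn` discretization, the keep-highest-return rule, and uniform sampling over cells are illustrative choices, not the paper's exact design):

```python
import random

class TrajectoryBuffer:
    """Illustrative store of diverse past trajectories, one per "cell".

    A cell is a coarse discretization of a state; cell_fn is an assumed
    helper the caller supplies.
    """

    def __init__(self):
        self.best = {}  # cell -> (return, trajectory)

    def add(self, trajectory, ret, cell_fn):
        # Index the trajectory by the cell of its final state.
        cell = cell_fn(trajectory[-1])
        # Keep the higher-return trajectory reaching this cell.
        if cell not in self.best or ret > self.best[cell][0]:
            self.best[cell] = (ret, trajectory)

    def sample_demonstration(self):
        # Sampling uniformly over cells (rather than over trajectories)
        # revisits rarely reached regions, pushing exploration in
        # diverse directions as the abstract suggests.
        cell = random.choice(list(self.best.keys()))
        return self.best[cell][1]
```

A trajectory-conditioned policy would then be trained to reproduce (and extend past) the sampled demonstration; that conditioning step is omitted here.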
Jul-24-2019