Provably Efficient Maximum Entropy Exploration
Hazan, Elad, Kakade, Sham M., Singh, Karan, Van Soest, Abby
Suppose an agent is in a (possibly unknown) Markov decision process (MDP) in the absence of a reward signal. What might we hope the agent can efficiently learn to do? One natural, intrinsically defined objective is for the agent to learn a policy that induces a distribution over the state space that is as uniform as possible, as measured in an entropic sense. Despite the corresponding mathematical program being non-convex, our main result provides a provably efficient method (in terms of both sample size and computational complexity) to construct such a maximum-entropy exploratory policy. Key to our algorithmic methodology is the conditional gradient method (a.k.a. the Frank-Wolfe algorithm), which relies on access to an approximate MDP solver.
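The sketch below illustrates, on a small tabular MDP, the kind of conditional-gradient loop the abstract describes: repeatedly compute the state-visitation distribution of the current policy mixture, use the gradient of its entropy as a synthetic reward, and call a planning oracle to obtain the next policy in the mixture. The horizon, step-size schedule, value-iteration planner, and smoothing constant are illustrative assumptions, not the paper's exact algorithm or constants.

```python
import numpy as np

def state_distribution(P, pi, mu0, horizon):
    """Average state-visitation distribution of policy pi over `horizon` steps.
    P: (S, A, S) transition tensor, pi: (S, A) policy, mu0: (S,) start dist."""
    d, mu = np.zeros_like(mu0), mu0.copy()
    for _ in range(horizon):
        d += mu
        # one-step state marginal under pi
        mu = np.einsum('s,sa,sat->t', mu, pi, P)
    return d / horizon

def approximate_mdp_solver(P, r, horizon):
    """Stand-in planning oracle: finite-horizon value iteration on a
    state-based reward r(s), returning a deterministic (S, A) one-hot policy."""
    S, A, _ = P.shape
    V = np.zeros(S)
    Q = np.zeros((S, A))
    for _ in range(horizon):
        Q = r[:, None] + np.einsum('sat,t->sa', P, V)
        V = Q.max(axis=1)
    pi = np.zeros((S, A))
    pi[np.arange(S), Q.argmax(axis=1)] = 1.0
    return pi

def max_ent_explore(P, mu0, horizon=50, iters=100, eps=1e-6):
    """Frank-Wolfe style loop: reward states in inverse proportion to how
    often the current policy mixture visits them."""
    S, A, _ = P.shape
    policies, weights = [np.full((S, A), 1.0 / A)], [1.0]
    for t in range(iters):
        # visitation distribution induced by the current mixture of policies
        d = sum(w * state_distribution(P, pi, mu0, horizon)
                for w, pi in zip(weights, policies))
        # gradient of the entropy H(d) = -sum_s d(s) log d(s), used as reward
        r = -np.log(d + eps) - 1.0
        pi_new = approximate_mdp_solver(P, r, horizon)
        step = 2.0 / (t + 2.0)  # standard Frank-Wolfe step-size schedule
        weights = [w * (1.0 - step) for w in weights] + [step]
        policies.append(pi_new)
    return policies, weights
```

Under these assumptions, the returned mixture of policies should visit states far more evenly than any single greedy policy, which is the entropic notion of uniform coverage the abstract refers to.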
arXiv.org Artificial Intelligence
Dec-6-2018