Worst-Case Regret Bounds for Exploration via Randomized Value Functions
–arXiv.org Artificial Intelligence
Exploration is one of the central challenges in reinforcement learning (RL). A large theoretical literature treats exploration in simple finite state and action MDPs, showing that it is possible to efficiently learn a near-optimal policy through interaction alone [5, 8, 10, 11, 13-16, 24, 25]. Overwhelmingly, this literature focuses on optimistic algorithms, with most algorithms explicitly maintaining uncertainty sets that are likely to contain the true MDP. It has been difficult to adapt these exploration algorithms to the more complex problems investigated in the applied RL literature. Most applied papers seem to generate exploration through ε-greedy or Boltzmann exploration. Those simple methods are compatible with practical value function learning algorithms, which use parametric approximations to value functions to generalize across high-dimensional state spaces. Unfortunately, such exploration algorithms can fail catastrophically in simple finite state MDPs [See e.g.
Jun-6-2019