DeepMind papers at ICML 2017 (part one) DeepMind
We consider the problem of provably optimal exploration in reinforcement learning for finite horizon MDPs. We show that an optimistic modification to value iteration achieves a regret bound of order (HSAT)1/2 (up to a logarithmic factor) where H is the time horizon, S the number of states, A the number of actions and T the number of time-steps. This result improves over the best previous known bound HS(AT)1/2 achieved by the UCRL2 algorithm of [Jaksch, Ortner, Auer, 2010]. The key significance of our new results is that for large T, the sample complexity of our algorithm matches the optimal lower bound of Ω(HSAT)1/2. Our analysis contains two key insights.
Aug-7-2017, 17:10:05 GMT
- Technology: