DeepMind papers at ICML 2017 (part one) DeepMind

@machinelearnbot 

We consider the problem of provably optimal exploration in reinforcement learning for finite horizon MDPs. We show that an optimistic modification to value iteration achieves a regret bound of order (HSAT)1/2 (up to a logarithmic factor) where H is the time horizon, S the number of states, A the number of actions and T the number of time-steps. This result improves over the best previous known bound HS(AT)1/2 achieved by the UCRL2 algorithm of [Jaksch, Ortner, Auer, 2010]. The key significance of our new results is that for large T, the sample complexity of our algorithm matches the optimal lower bound of Ω(HSAT)1/2. Our analysis contains two key insights.

Duplicate Docs Excel Report

Title
None found

Similar Docs  Excel Report  more

TitleSimilaritySource
None found