Near-optimal Regret Bounds for Reinforcement Learning

Auer, Peter, Jaksch, Thomas, Ortner, Ronald

Dec-31-2009–Neural Information Processing Systems

For undiscounted reinforcement learning in Markov decision processes (MDPs) we consider the total regret of a learning algorithm with respect to an optimal policy. In order to describe the transition structure of an MDP we propose a new parameter: An MDP has diameter D if for any pair of states s1,s2 there is a policy which moves from s1 to s2 in at most D steps (on average). We present a reinforcement learning algorithm with total regret O(DSAT) after T steps for any unknown MDP with S states, A actions per state, and diameter D. This bound holds with high probability. We also present a corresponding lower bound of Omega(DSAT) on the total regret of any learning algorithm. Both bounds demonstrate the utility of the diameter as structural parameter of the MDP.

artificial intelligence, machine learning, reinforcement learning, (16 more...)

Neural Information Processing Systems

Dec-31-2009

Conferences PDF

Add feedback

Country:
- Europe > Austria (0.14)

Technology:
- Information Technology > Artificial Intelligence > Machine Learning
  - Reinforcement Learning (1.00)
  - Learning Graphical Models > Undirected Networks
    - Markov Models (0.35)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found