Variational Regret Bounds for Reinforcement Learning
Pratik Gajane, Ronald Ortner, Peter Auer
For reinforcement learning in MDPs with changes in reward function and transition probabilities, we provide an algorithm, UCRL with Restarts, a version of UCRL [Jaksch et al., 2010], which restarts according to a schedule dependent on the variation in the MDP (defined in Section 2 below). For this problem setting, we propose an algorithm and provide performance guarantees for the regret evaluated against the optimal non-stationary policy. We derive a high-probability upper bound on the cumulative regret of our algorithm; the upper bound on the regret is given in terms of the total variation in the MDP. This is the first variational regret bound for the general reinforcement learning setting.
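The core idea of restarting a stationary learner on a schedule tied to the variation budget can be illustrated with a short sketch. Note this is a hypothetical illustration, not the paper's actual schedule: the function `restart_times`, its cube-root phase count, and the assumption that the variation bound is known in advance are all choices made here for exposition.

```python
import math

def restart_times(T, variation):
    """Return the time steps at which a learner (e.g. UCRL) is restarted.

    Hypothetical sketch of a variation-dependent restart schedule:
    the learner's statistics are wiped at each returned time step so
    that stale observations from an earlier MDP are discarded.

    T         -- total number of time steps (horizon)
    variation -- assumed known bound on the total variation of the MDP
    """
    # Illustrative choice: more variation -> more (hence shorter) phases.
    k = max(1, math.ceil((variation * T) ** (1.0 / 3.0)))
    phase_len = max(1, T // k)
    return list(range(phase_len, T, phase_len))
```

For example, with horizon `T = 1000` and variation budget `1`, this schedule splits the horizon into 10 equal phases and restarts the learner every 100 steps. With zero variation it returns no restart times, recovering the stationary setting.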
May-23-2019