Frequentist Regret Bounds for Randomized Least-Squares Value Iteration
Zanette, Andrea, Brandfonbrener, David, Pirotta, Matteo, Lazaric, Alessandro
A key challenge in reinforcement learning (RL) is how to bala nce exploration and exploitation in order to efficiently learn to make good sequences of decisions in a way that is both computationally tractable and statistically efficient. In the tabular case, the exploration-exploitation problem is well-understood for a number of settings (e.g., finite-hori zon, average reward, infinite horizon with discount), explorati on objectives (e.g., regret minimization and probably approximately correct), and for different algorithmic appro aches, where optimism-under-uncertainty [JOA10, FPLO18] and Thompson sampling (TS) [OBPVR16, Rus19] are the most pop ular principles. For instance, in the finite-horizon setting, [AOM17] and [ZB19] recently derived minimax optim al and structure adaptive regret bounds for optimistic exploration algorithms. TSbased algorithms have mainly b een analyzed in tabular MDPs in terms of Bayesian regret [OBPVR16, OR17, OGNJ17], which assumes that the MDP is s ampled from a known prior distribution. These bounds do not hold against a fixed MDP and algorithms with smal l Bayesian regret may still suffer high regret in some hard-to-learn MDPs within the chosen prior. In the tabu lar setting, frequentist (or worst-case) regret analysis h as been developed for TSbased algorithms both in the average r eward [GM15, AJ17] and finite-horizon case [Rus19]. Despite the fact that TSbased approaches have slightly wor se regret bounds compared to optimism-based algorithms, their empirical performance is often superior [CL11, OR17] . Unfortunately, the performance of tabular exploration met hods rapidly degrades with the number of states and actions, thus making them unfeasible in large or continuous MD Ps.
Nov-1-2019