Frequentist Regret Bounds for Randomized Least-Squares Value Iteration

Zanette, Andrea, Brandfonbrener, David, Pirotta, Matteo, Lazaric, Alessandro

arXiv.org Machine Learning 

A key challenge in reinforcement learning (RL) is how to bala nce exploration and exploitation in order to efficiently learn to make good sequences of decisions in a way that is both computationally tractable and statistically efficient. In the tabular case, the exploration-exploitation problem is well-understood for a number of settings (e.g., finite-hori zon, average reward, infinite horizon with discount), explorati on objectives (e.g., regret minimization and probably approximately correct), and for different algorithmic appro aches, where optimism-under-uncertainty [JOA10, FPLO18] and Thompson sampling (TS) [OBPVR16, Rus19] are the most pop ular principles. For instance, in the finite-horizon setting, [AOM17] and [ZB19] recently derived minimax optim al and structure adaptive regret bounds for optimistic exploration algorithms. TSbased algorithms have mainly b een analyzed in tabular MDPs in terms of Bayesian regret [OBPVR16, OR17, OGNJ17], which assumes that the MDP is s ampled from a known prior distribution. These bounds do not hold against a fixed MDP and algorithms with smal l Bayesian regret may still suffer high regret in some hard-to-learn MDPs within the chosen prior. In the tabu lar setting, frequentist (or worst-case) regret analysis h as been developed for TSbased algorithms both in the average r eward [GM15, AJ17] and finite-horizon case [Rus19]. Despite the fact that TSbased approaches have slightly wor se regret bounds compared to optimism-based algorithms, their empirical performance is often superior [CL11, OR17] . Unfortunately, the performance of tabular exploration met hods rapidly degrades with the number of states and actions, thus making them unfeasible in large or continuous MD Ps.

Duplicate Docs Excel Report

Title
None found

Similar Docs  Excel Report  more

TitleSimilaritySource
None found