Reviews: Exploration in Structured Reinforcement Learning

Neural Information Processing Systems 

It provides problem-related (asymptotic) lower and upper bounds on the regret, the latter for an algorithm presented in the paper that builds on Burnetas and Katehakis (1997) and a recent bandit paper by Combes et al (NIPS 2017). The setting assumes that an "MDP structure" \Phi (i.e. a set of possible MDP models) is given. The regret bounds (after T steps) are shown to be of the form K_Phi*log T, where the parameter K_\Phi is the solution to a particular optimization problem. It is shown that if \Phi is the set of all MDPs ("the unstructured case") then K_\Phi is bounded by HSA/\delta, where H is the bias span and \delta the minimal action sub-optimality gap. The second particular class that is considered is the Lipschitz structure that considers embeddings of finite MDPs in Euclidian space such that transition probabilities and rewards are Lipschitz. In this case, the regret bounds are shown to not to depend on the size of state and action space anymore.