Reviews: Context-dependent upper-confidence bounds for directed exploration

Neural Information Processing Systems 

Summary: In this paper, the authors have described how context dependent, i.e. state and action pair dependent, Upper Confidence Bounds can be calculated for the Least Squares Temporal Difference (LSTD) Learning algorithm. They argue that this is a principled way of exploring the interactions with the environment because is specially important for least square methods because other strategies are either inefficient (e-greedy) or cannot be used in a meaningful manner (optimistic initialization). The computation of the upper confidence part is interesting and deviates from the standard proofs for Multi-arm bandit cases because the samples in case of LSTD are not independent of each other. The authors address this issue by imposing a bound using the (weaker) Chebychev's inequality instead of the Bernstein's inequality. In their evaluation, they use a bound on the total compute rather than on the number of episodes.