The goal of our work was easy reproducibility and clearly showing the benefits of learning to explore over the state

Neural Information Processing Systems 

Thank you for the reviews of our paper. We will revise the paper accordingly. We discuss a contextual extension in Section 8. The policies for longer horizons also perform well and outperform TS. This can be seen in the proof in Appendix C, which only requires that γ = 1 /θ [1 /8, 1).