Appendix: Online Learning in Contextual Bandits using Gated Linear Networks Marcus Hutter

Neural Information Processing Systems 

The algorithm operates on a complete binary tree of depth D that maintains a GLN at each non-leaf node. Given a context x, a tree estimating the value of an action, i.e. the GLN parameters for each non-leaf node of the tree (Θ We should note that even though this exponential term might initially seem discouraging, we set D = 3 in our experiments and observe no significant improvements for larger D. This is also consistent with findings from distributional RL [BDM17], where a surprisingly small number of bins/quantiles are sufficient for state of the art performance on Atari games. Below, we provide the proofs for the Asymptotic Convergence section. A for which the pseudocount in signature s is bounded, i.e. We will show that this leads to a contradiction.