Appendix: Online Learning in Contextual Bandits using Gated Linear Networks Marcus Hutter
–Neural Information Processing Systems
The algorithm operates on a complete binary tree of depth D that maintains a GLN at each non-leaf node. Given a context x, a tree estimating the value of an action, i.e. the GLN parameters for each non-leaf node of the tree (Θ We should note that even though this exponential term might initially seem discouraging, we set D = 3 in our experiments and observe no significant improvements for larger D. This is also consistent with findings from distributional RL [BDM17], where a surprisingly small number of bins/quantiles are sufficient for state of the art performance on Atari games. Below, we provide the proofs for the Asymptotic Convergence section. A for which the pseudocount in signature s is bounded, i.e. We will show that this leads to a contradiction.
Neural Information Processing Systems
Feb-7-2025, 08:49:22 GMT
- Genre:
- Research Report > New Finding (0.74)