Partition Tree Weighting for Non-Stationary Stochastic Bandits

Veness, Joel, Hutter, Marcus, Gyorgy, Andras, Grau-Moya, Jordi

arXiv.org Artificial Intelligence 

In contrast to popular decision-making frameworks such as reinforcement learning, which are built upon appealing to decision-theoretic notions such as Maximum Expected Utility, we instead construct an agent by trying to minimise the expected number of bits needed to losslessly describe general agent-environment interactions. The appeal with this approach is that if we can construct a good universal coding scheme for arbitrary agent interactions, one could simply sample from this coding distribution to generate a control policy. However when considering general agents, whose goal is to work well across multiple environments, this question turns out to be surprisingly subtle. Naive approaches which do not discriminate between actions and observations fail, and are subject to the self-delusion problem [Ortega et al., 2021]. In this work, we will adopt a universal source coding perspective to this question, and showcase its efficacy by applying it to the challenging non-stationary stochastic bandit problem. In the passive case, namely, sequential prediction of observations under the logarithmic loss, there is a well developed universal source coding literature for dealing with non-stationary sources under various types of non-stationarity. The most influential idea for modelling piecewise stationary sources is the transition diagram technique of Willems [1996]. This technique performs Bayesian model averaging over all possible partitions of a sequence of data, cleverly exploiting dynamic programming to yield an algorithm which has both quadratic time complexity and provable regret guarantees.