Minimax Policy for Heavy-tailed Multi-armed Bandits
In these contexts, exploration means learning the environment while exploitation means taking empirically computed best actions. When finite-time performance is concerned, i.e., in scenarios in which one cannot learn indefinitely, ensuring a good balance of exploration and exploitation is the key to good performance. MAB and its variations are prototypical models for these problems, and they are widely used in many areas such as network routing, recommendation systems, and resource allocation; see [1, Chapter 1].

The stochastic MAB problem was originally proposed by Robbins [2]. In this problem, at each time, an agent chooses an arm from a set of K arms and receives the associated reward.

In the worst-case setting, the lower and upper bounds are distribution-free. Assuming the rewards are bounded, Audibert and Bubeck [6] establish an Ω(√(KT)) lower bound on the minimax regret. They also studied a modified UCB algorithm called Minimax Optimal Strategy in the Stochastic case (MOSS) and proved that it achieves an order-optimal worst-case regret while maintaining a logarithmic distribution-dependent regret. Degenne and Perchet [7] extend MOSS to an anytime version called MOSS-anytime.

The rewards being bounded or sub-Gaussian is a common assumption that gives the sample mean an exponential convergence rate and simplifies the MAB problem.
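To make the reference to MOSS concrete, the following is a minimal sketch of a MOSS-style index policy for bounded rewards. It is an illustration under assumed notation (horizon T, K arms, per-arm pull counts and sample means), not the policy proposed in this paper, and all function and variable names (moss_index, run_moss, and so on) are hypothetical.

import math
import random

def moss_index(mean_estimate, pulls, horizon, num_arms):
    # MOSS index: sample mean plus the exploration bonus
    # sqrt(max(log(T / (K * n_i)), 0) / n_i), where T is the horizon,
    # K the number of arms, and n_i the pull count of arm i.
    bonus = math.sqrt(max(math.log(horizon / (num_arms * pulls)), 0.0) / pulls)
    return mean_estimate + bonus

def run_moss(arms, horizon):
    # arms: list of callables, each returning a bounded random reward.
    K = len(arms)
    pulls = [0] * K
    means = [0.0] * K
    for t in range(horizon):
        if t < K:
            i = t  # pull each arm once to initialize its sample mean
        else:
            i = max(range(K),
                    key=lambda a: moss_index(means[a], pulls[a], horizon, K))
        reward = arms[i]()
        pulls[i] += 1
        means[i] += (reward - means[i]) / pulls[i]  # incremental sample mean
    return means, pulls

if __name__ == "__main__":
    # Example: two Bernoulli arms with means 0.4 and 0.6.
    arms = [lambda: float(random.random() < 0.4),
            lambda: float(random.random() < 0.6)]
    print(run_moss(arms, horizon=10000))

Compared with standard UCB, the only change in this sketch is the exploration bonus, which uses the known horizon T through log(T/(K n_i)) instead of a log t term; this is what drives the order-optimal worst-case regret attributed to MOSS above.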