Finite-Time Regret of Thompson Sampling Algorithms for Exponential Family Multi-Armed Bandits
Neural Information Processing Systems
We study the regret of Thompson sampling (TS) algorithms for exponential family bandits, where the reward distribution belongs to a one-dimensional exponential family, which covers many common reward distributions including Bernoulli, Gaussian, Gamma, and Exponential. We propose a Thompson sampling algorithm, termed ExpTS, which uses a novel sampling distribution to avoid under-estimation of the optimal arm. We provide a tight regret analysis for ExpTS that simultaneously yields both a finite-time regret bound and the asymptotic regret bound. In particular, for a K-armed bandit with exponential family rewards, ExpTS over a horizon T is sub-UCB (a strong problem-dependent criterion for finite-time regret), minimax optimal up to a factor \sqrt{\log K}, and asymptotically optimal. Moreover, we propose ExpTS^+, which adds a greedy exploitation step on top of the sampling distribution used in ExpTS to avoid over-estimation of sub-optimal arms.
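To fix ideas, the generic Thompson sampling loop the abstract builds on can be sketched as follows. This is a minimal illustration for the Bernoulli special case with Beta posteriors, not the paper's ExpTS algorithm: ExpTS replaces the posterior with a novel sampling distribution designed for general exponential-family rewards, and the function and parameter names below are hypothetical.

```python
import random


def thompson_sampling(means, horizon, seed=0):
    """Vanilla Bernoulli Thompson sampling (illustrative sketch only).

    means   : true success probabilities of the K arms (unknown to the learner)
    horizon : number of rounds T
    Returns the total reward collected over the horizon.
    """
    rng = random.Random(seed)
    k = len(means)
    successes = [0] * k  # observed 1-rewards per arm
    failures = [0] * k   # observed 0-rewards per arm
    total_reward = 0
    for _ in range(horizon):
        # Draw one sample from each arm's Beta(s+1, f+1) posterior
        # and play the arm with the largest sample.
        samples = [rng.betavariate(successes[i] + 1, failures[i] + 1)
                   for i in range(k)]
        arm = max(range(k), key=lambda i: samples[i])
        reward = 1 if rng.random() < means[arm] else 0
        total_reward += reward
        if reward:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return total_reward
```

With a large gap between arms the loop quickly concentrates on the optimal arm; the paper's analysis concerns exactly how fast this concentration happens (finite-time regret) for general exponential-family rewards.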