Finite-Time Regret of Thompson Sampling Algorithms for Exponential Family Multi-Armed Bandits
Neural Information Processing Systems
We study the regret of Thompson sampling (TS) algorithms for exponential family bandits, where the reward distribution belongs to a one-dimensional exponential family, which covers many common reward distributions including Bernoulli, Gaussian, Gamma, and Exponential. We propose a Thompson sampling algorithm, termed ExpTS, which uses a novel sampling distribution to avoid under-estimation of the optimal arm. We provide a tight regret analysis for ExpTS that simultaneously yields both a finite-time regret bound and the asymptotic regret bound. In particular, for a K-armed bandit with exponential family rewards, ExpTS over a horizon T is sub-UCB (a strong problem-dependent criterion for finite-time regret), minimax optimal up to a factor \sqrt{\log K}, and asymptotically optimal. Moreover, we propose ExpTS^+, which adds a greedy exploitation step on top of the sampling distribution used in ExpTS to avoid over-estimation of sub-optimal arms.
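To fix ideas, the generic Thompson sampling loop the abstract builds on can be sketched as follows. This is a minimal illustration for the Bernoulli special case with Beta posteriors, not the paper's ExpTS algorithm: ExpTS replaces the posterior with a novel sampling distribution designed for general exponential-family rewards, and the function and parameter names below are hypothetical.

```python
import random


def thompson_sampling(means, horizon, seed=0):
    """Vanilla Bernoulli Thompson sampling (illustrative sketch only).

    means   : true success probabilities of the K arms (unknown to the learner)
    horizon : number of rounds T
    Returns the total reward collected over the horizon.
    """
    rng = random.Random(seed)
    k = len(means)
    successes = [0] * k  # observed 1-rewards per arm
    failures = [0] * k   # observed 0-rewards per arm
    total_reward = 0
    for _ in range(horizon):
        # Draw one sample from each arm's Beta(s+1, f+1) posterior
        # and play the arm with the largest sample.
        samples = [rng.betavariate(successes[i] + 1, failures[i] + 1)
                   for i in range(k)]
        arm = max(range(k), key=lambda i: samples[i])
        reward = 1 if rng.random() < means[arm] else 0
        total_reward += reward
        if reward:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return total_reward
```

With a large gap between arms the loop quickly concentrates on the optimal arm; the paper's analysis concerns exactly how fast this concentration happens (finite-time regret) for general exponential-family rewards.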