Leveraging priors on distribution functions for multi-arm bandits
Vashishtha, Sumit, Maillard, Odalric-Ambrym
We introduce Dirichlet Process Posterior Sampling (DPPS), a Bayesian non-parametric algorithm for multi-arm bandits based on Dirichlet Process (DP) priors. Like Thompson-sampling, DPPS is a probability-matching algorithm, i.e., it plays an arm based on its posterior-probability of being optimal. Instead of assuming a parametric class for the reward generating distribution of each arm, and then putting a prior on the parameters, in DPPS the reward generating distribution is directly modeled using DP priors. DPPS provides a principled approach to incorporate prior belief about the bandit environment, and in the noninformative limit of the DP posteriors (i.e. Bayesian Bootstrap), we recover Non Parametric Thompson Sampling (NPTS), a popular non-parametric bandit algorithm, as a special case of DPPS. We employ stick-breaking representation of the DP priors, and show excellent empirical performance of DPPS in challenging synthetic and real world bandit environments. Finally, using an information-theoretic analysis, we show non-asymptotic optimality of DPPS in the Bayesian regret setup.
Mar-6-2025
- Country:
- Europe
- United Kingdom > England
- Cambridgeshire > Cambridge (0.04)
- France
- Hauts-de-France > Nord
- Lille (0.04)
- Grand Est > Meurthe-et-Moselle
- Nancy (0.04)
- Hauts-de-France > Nord
- United Kingdom > England
- Asia > Japan
- Honshū > Kansai > Kyoto Prefecture > Kyoto (0.04)
- Europe
- Genre:
- Research Report (1.00)
- Industry:
- Food & Agriculture > Agriculture (1.00)
- Education (0.67)