Bandits with Side Observations: Bounded vs. Logarithmic Regret
Degenne, Rémy; Garcelon, Evrard; Perchet, Vianney
We consider the classical stochastic multi-armed bandit, but where from time to time, roughly with frequency $\epsilon$, the agent gathers an extra observation for free. We prove that, no matter how small $\epsilon$ is, the agent can ensure a regret uniformly bounded in time. More precisely, we construct an algorithm with regret smaller than $\sum_i \frac{\log(1/\epsilon)}{\Delta_i}$, up to a multiplicative constant and loglog terms. We also prove a matching lower bound, stating that no reasonable algorithm can outperform this quantity.
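To give a feel for the bound quoted in the abstract, the sketch below evaluates its leading term $\sum_i \log(1/\epsilon)/\Delta_i$ for a set of gaps and compares it with the familiar $\sum_i \log(T)/\Delta_i$ scale of logarithmic regret without side observations. It is only an illustration under stated assumptions: the function names, the example gaps, and the value of $\epsilon$ are hypothetical, and constants and loglog terms are ignored, so the numbers are orders of magnitude, not the paper's actual regret.

```python
import math


def side_observation_scale(gaps, epsilon):
    """Leading term sum_i log(1/epsilon) / Delta_i over suboptimal arms.

    `gaps` holds the gaps Delta_i; arms with gap 0 (optimal) are skipped.
    Constants and loglog factors from the abstract are NOT included.
    """
    if not 0.0 < epsilon < 1.0:
        raise ValueError("epsilon must lie in (0, 1)")
    return sum(math.log(1.0 / epsilon) / gap for gap in gaps if gap > 0)


def classical_scale(gaps, horizon):
    """Order of the usual logarithmic regret, sum_i log(T) / Delta_i.

    Assumption: used here only as a reference scale for comparison.
    """
    if horizon < 1:
        raise ValueError("horizon must be at least 1")
    return sum(math.log(horizon) / gap for gap in gaps if gap > 0)


if __name__ == "__main__":
    gaps = [0.0, 0.1, 0.2, 0.5]  # hypothetical gaps; the first arm is optimal
    epsilon = 0.01               # hypothetical frequency of free observations
    bounded = side_observation_scale(gaps, epsilon)
    for horizon in (10**2, 10**4, 10**6):
        # The second column grows with the horizon; the third does not.
        print(horizon, round(classical_scale(gaps, horizon), 1), round(bounded, 1))
```

With these two scales, the curves cross at a horizon of about $1/\epsilon$: before that point the side observations change little, and beyond it the horizon-independent term is the smaller one.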
Jul-10-2018
- Country:
- Europe > France > Île-de-France
- Val-de-Marne > Cachan (0.04)
- Paris > Paris (0.04)
- Genre:
- Research Report (0.64)
- Industry:
- Energy (0.47)