Batched Bandits with Crowd Externalities
Laroche, Romain, Safsafi, Othmane, Feraud, Raphael, Broutin, Nicolas
–arXiv.org Artificial Intelligence
In Batched Multi-Armed Bandits (BMAB), the policy is not allowed to be updated at each time step. Usually, the setting asserts a maximum number of allowed policy updates, and the algorithm schedules them so as to minimize the expected regret. In this paper, we describe a novel setting for BMAB with the following twist: the timing of the policy updates is not controlled by the BMAB algorithm; instead, the amount of data received during each batch, called the \textit{crowd}, is influenced by the past selection of arms. We first design a near-optimal policy with approximate knowledge of the parameters, which we prove to have a regret in $\mathcal{O}(\sqrt{\frac{\ln x}{x}}+\epsilon)$, where $x$ is the size of the crowd and $\epsilon$ is the parameter error. Next, we implement a UCB-inspired algorithm that guarantees an additional regret in $\mathcal{O}\left(\max(K\ln T,\sqrt{T\ln T})\right)$, where $K$ is the number of arms and $T$ is the horizon.
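The setting described above can be illustrated with a minimal sketch. This is not the authors' algorithm: it is a generic batched UCB1 loop in which the policy is frozen between updates and the size of each batch (the "crowd") is supplied by a caller-provided function of the past arm selections, standing in for the crowd externality. The `crowd_fn` interface and the toy externality in the example are assumptions for illustration.

```python
import math
import random

def batched_ucb(means, horizon, crowd_fn, seed=0):
    """Batched UCB1 sketch (not the paper's algorithm).

    The arm choice is frozen within each batch; the next batch's size
    ("crowd") is computed by crowd_fn from the pull counts so far,
    mimicking a crowd influenced by past arm selections.
    """
    rng = random.Random(seed)
    K = len(means)
    counts = [0] * K          # pulls per arm
    sums = [0.0] * K          # cumulative reward per arm
    t = 0
    while t < horizon:
        # UCB indices computed only at the batch boundary.
        indices = []
        for a in range(K):
            if counts[a] == 0:
                indices.append(float("inf"))  # force initial exploration
            else:
                bonus = math.sqrt(2 * math.log(max(t, 2)) / counts[a])
                indices.append(sums[a] / counts[a] + bonus)
        arm = max(range(K), key=lambda a: indices[a])
        # Crowd size for this batch depends on past selections.
        batch = min(max(1, crowd_fn(counts)), horizon - t)
        for _ in range(batch):
            reward = 1.0 if rng.random() < means[arm] else 0.0
            counts[arm] += 1
            sums[arm] += reward
            t += 1
    return counts

# Toy externality (assumed for illustration): the crowd grows with
# the number of past pulls of arm 0.
pulls = batched_ucb([0.3, 0.7], horizon=5000,
                    crowd_fn=lambda c: 10 + c[0] // 5)
```

With the fixed seed, the better arm (mean 0.7) accumulates the large majority of the 5000 pulls, while the policy can only react at batch boundaries whose spacing it does not control.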
Sep-29-2021