Batched Bandits with Crowd Externalities
Laroche, Romain, Safsafi, Othmane, Feraud, Raphael, Broutin, Nicolas
–arXiv.org Artificial Intelligence
In Batched Multi-Armed Bandits (BMAB), the policy is not allowed to be updated at each time step. Usually, the setting asserts a maximum number of allowed policy updates, and the algorithm schedules them so as to minimize the expected regret. In this paper, we describe a novel setting for BMAB with the following twist: the timing of the policy updates is not controlled by the BMAB algorithm; instead, the amount of data received during each batch, called the \textit{crowd}, is influenced by the past selection of arms. We first design a near-optimal policy with approximate knowledge of the parameters, which we prove to have a regret in $\mathcal{O}(\sqrt{\frac{\ln x}{x}}+\epsilon)$, where $x$ is the size of the crowd and $\epsilon$ is the parameter error. Next, we implement a UCB-inspired algorithm that guarantees an additional regret in $\mathcal{O}\left(\max(K\ln T,\sqrt{T\ln T})\right)$, where $K$ is the number of arms and $T$ is the horizon.
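The setting described above can be illustrated with a minimal sketch. This is not the authors' algorithm: it is a generic batched UCB1 loop in which the policy is frozen between updates and the size of each batch (the "crowd") is supplied by a caller-provided function of the past arm selections, standing in for the crowd externality. The `crowd_fn` interface and the toy externality in the example are assumptions for illustration.

```python
import math
import random

def batched_ucb(means, horizon, crowd_fn, seed=0):
    """Batched UCB1 sketch (not the paper's algorithm).

    The arm choice is frozen within each batch; the next batch's size
    ("crowd") is computed by crowd_fn from the pull counts so far,
    mimicking a crowd influenced by past arm selections.
    """
    rng = random.Random(seed)
    K = len(means)
    counts = [0] * K          # pulls per arm
    sums = [0.0] * K          # cumulative reward per arm
    t = 0
    while t < horizon:
        # UCB indices computed only at the batch boundary.
        indices = []
        for a in range(K):
            if counts[a] == 0:
                indices.append(float("inf"))  # force initial exploration
            else:
                bonus = math.sqrt(2 * math.log(max(t, 2)) / counts[a])
                indices.append(sums[a] / counts[a] + bonus)
        arm = max(range(K), key=lambda a: indices[a])
        # Crowd size for this batch depends on past selections.
        batch = min(max(1, crowd_fn(counts)), horizon - t)
        for _ in range(batch):
            reward = 1.0 if rng.random() < means[arm] else 0.0
            counts[arm] += 1
            sums[arm] += reward
            t += 1
    return counts

# Toy externality (assumed for illustration): the crowd grows with
# the number of past pulls of arm 0.
pulls = batched_ucb([0.3, 0.7], horizon=5000,
                    crowd_fn=lambda c: 10 + c[0] // 5)
```

With the fixed seed, the better arm (mean 0.7) accumulates the large majority of the 5000 pulls, while the policy can only react at batch boundaries whose spacing it does not control.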
Sep-29-2021