Stochastic Linear Optimization with Adversarial Corruption
Yingkai Li, Edmund Y. Lou, Liren Shan
The multi-armed bandit problem has been extensively studied in computer science, operations research, and economics since the seminal work of Robbins (1952). It is a model for sequential decision-making in which a player chooses, at each time step, among a finite set of available arms and receives a reward for the chosen decision. The player's objective is to minimize the difference, called regret, between the rewards she receives and the rewards accumulated by the best arm (formalized below). In the stochastic multi-armed bandit problem, the reward of each arm is drawn from a probability distribution; in adversarial multi-armed bandit models, by contrast, there is typically no assumption imposed on the sequence of rewards received by the player.

In recent work, Lykouris et al. (2018) introduce a model in which an adversary can corrupt the stochastic reward generated by an arm pull. They provide an algorithm and show that the regret in this "middle ground" scenario degrades smoothly with the amount of corruption injected by the adversary. Gupta et al. (2019) present an alternative algorithm that gives a significant improvement.
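The regret notion above admits a standard formalization; the following display is a minimal sketch assuming $T$ rounds, $K$ arms with mean rewards $\mu_1, \dots, \mu_K$, and $a_t$ denoting the arm pulled at round $t$ (this notation is illustrative, not drawn from the paper):

\[
  \mathrm{Regret}(T) \;=\; T \cdot \max_{1 \le i \le K} \mu_i \;-\; \mathbb{E}\!\Big[\sum_{t=1}^{T} \mu_{a_t}\Big].
\]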
Sep-4-2019