Efficient Action Poisoning Attacks on Linear Contextual Bandits

Liu, Guanlin, Lai, Lifeng

arXiv.org Machine Learning 

Multiple armed bandits (MABs), a popular framework of sequential decision making model, has been widely investigated and has many applicants in a variety of scenarios [1, 2, 3]. The contextual bandits model is an extension of the multi-armed bandits model with contextual information. At each round, the reward is associated with both the arm (a.k.a, action) and the context, while the reward of stochastic MABs is only associated with the arm. Contextual bandits algorithms have a broad range of applications, such as recommender systems [4], wireless networks [5], etc. In the modern industry-scale applications of bandit algorithms, action decisions, reward signal collection, and policy iterations are normally implemented in a distributed network.