Stochastic Graph Bandit Learning with Side-Observations

Xueping Gong, Jiheng Zhang

arXiv.org Artificial Intelligence 

The bandit framework has garnered significant attention from the online learning community due to its wide applicability in fields such as recommendation systems, portfolio selection, and clinical trials [21]. A significant aspect of sequential decision making within this framework is side observations, which may take the form of feedback from multiple sources [25] or contextual knowledge about the environment [1, 2]; these two settings are typically modeled as feedback graphs and contextual bandits, respectively. The multi-armed bandit framework with feedback graphs has matured into a solid theoretical foundation for incorporating additional feedback into the exploration strategy [4, 7, 3]. The contextual bandit problem is another well-established framework for decision-making under uncertainty [20, 11, 1]. Yet, despite the considerable attention given to non-contextual bandits with feedback graphs, contextual bandits with feedback graphs remain comparatively underexplored [32, 30, 28].
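To make the graph-feedback setting concrete, the following is a minimal sketch (not the paper's algorithm): in each round the learner pulls one arm and observes, in addition to that arm's reward, the rewards of its out-neighbors in a feedback graph. The three-arm graph, Bernoulli means, and greedy policy below are illustrative assumptions chosen only to show how side-observations accelerate estimation.

```python
import random

# Hypothetical instance: true Bernoulli means and a feedback graph mapping
# each arm to the out-neighbors whose rewards are observed for free.
MEANS = [0.2, 0.5, 0.8]
GRAPH = {0: [1], 1: [0, 2], 2: [1]}

def play_round(arm, rng):
    """Pull `arm`; return {arm_index: reward} for the arm and its neighbors."""
    observed = {arm: int(rng.random() < MEANS[arm])}
    for nb in GRAPH[arm]:
        observed[nb] = int(rng.random() < MEANS[nb])
    return observed

def run(horizon=2000, seed=0):
    rng = random.Random(seed)
    n = len(MEANS)
    counts = [0] * n          # observations per arm, including side-observations
    sums = [0.0] * n
    for _ in range(horizon):
        # Greedy on empirical means; unobserved arms get priority (inf).
        est = [sums[i] / counts[i] if counts[i] else float("inf")
               for i in range(n)]
        arm = max(range(n), key=lambda i: est[i])
        for i, r in play_round(arm, rng).items():
            counts[i] += 1
            sums[i] += r
    return counts, sums
```

Note that every pull yields at least two observations here, so each arm's estimate improves faster than its pull count alone would allow; in this toy graph the side-observations also let the greedy policy escape a suboptimal arm, since pulling arm 1 keeps refreshing arm 2's estimate.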