PAC-Bayesian Analysis of Contextual Bandits

Neural Information Processing Systems 

We derive an instantaneous (per-round) data-dependent regret bound for stochastic multiarmed bandits with side information (also known as contextual bandits). The scaling of our regret bound with the number of states (contexts) N goes as \sqrt{N I_{\rho_t}(S;A)}, where I_{\rho_t}(S;A) is the mutual information between states and actions (the side information) used by the algorithm at round t . If the algorithm uses all the side information, the regret bound scales as \sqrt{N \ln K}, where K is the number of actions (arms). However, if the side information I_{\rho_t}(S;A) is not fully used, the regret bound is significantly tighter. In the extreme case, when I_{\rho_t}(S;A) 0, the dependence on the number of states reduces from linear to logarithmic.