Reviews: Bandit Learning with Implicit Feedback

Neural Information Processing Systems 

Summary: This work considers learning user preferences using a bandit model. The reward depends not only on the user's judgement of an arm, but also on whether the user examined the arm at all. That is, feedback = examination * judgement. In particular, if a user does not examine an arm, the lack of feedback does not necessarily indicate that the user does not "like" the arm. This work uses a latent variable model for the (unobserved) examination of arms, and posits that the probability of positive (binary) feedback can be expressed as the product of the probability of examination (logistic) and the probability of a positive judgement (logistic). The work proposes a variational inference (VI) approach to estimating the parameters, and then uses Thompson Sampling from the approximate posterior as the policy. This allows the authors to use machinery from Russo and Van Roy to obtain regret bounds.
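To make the summarized model concrete, here is a minimal sketch of the product-of-logistics feedback probability and one Thompson Sampling step. It is an illustration under stated assumptions, not the authors' implementation: the function names, the split into separate examination and judgement feature vectors, the diagonal Gaussian form of the approximate posterior, and the choice to rank arms by sampled click probability are all assumed here for illustration.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def click_probability(theta_judge, x_judge, theta_exam, x_exam):
    """P(positive feedback) = P(examined) * P(positive judgement).

    Both factors are logistic; examination itself is never observed.
    """
    return sigmoid(dot(theta_exam, x_exam)) * sigmoid(dot(theta_judge, x_judge))


def thompson_select(arms, mean_judge, std_judge, mean_exam, std_exam, rng):
    """One Thompson Sampling step (hypothetical helper).

    Assumes the approximate posterior is a diagonal Gaussian over each
    parameter vector. Samples one parameter draw, then returns the index
    of the arm with the highest sampled click probability.
    `arms` is a list of (x_judge, x_exam) feature pairs.
    """
    theta_judge = [rng.gauss(m, s) for m, s in zip(mean_judge, std_judge)]
    theta_exam = [rng.gauss(m, s) for m, s in zip(mean_exam, std_exam)]
    scores = [
        click_probability(theta_judge, xj, theta_exam, xe) for xj, xe in arms
    ]
    return max(range(len(arms)), key=scores.__getitem__)
```

Under this sketch, an arm with a strongly positive judgement score but a very low examination probability still yields almost no positive feedback, which is the reason a missing click cannot be read as a negative judgement.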