Reviews: Bandit Learning with Implicit Feedback
–Neural Information Processing Systems
Summary: This work considers learning user preferences using a bandit model. The reward depends not only on the user's judgement of an arm but also on whether the user examined the arm at all; that is, feedback = examination × judgement. In particular, if a user does not examine an arm, the lack of feedback does not necessarily indicate that the user does not "like" the arm. This work uses a latent-variable model for the (unobserved) examination of arms, and posits that the probability of (binary) positive feedback can be expressed as the product of the probability of examination (logistic) and the probability of a positive judgement (logistic). The work proposes a variational inference (VI) approach to estimating the parameters, and then uses Thompson Sampling from the approximate posterior as the policy. This allows the authors to use machinery from Russo and Van Roy to obtain regret bounds.
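To make the factorization concrete, here is a minimal sketch of the feedback model and one Thompson Sampling step as described in the summary. It is an illustration under assumptions, not the authors' implementation: the names `click_probability`, `thompson_select`, and the `posterior` dictionary layout are hypothetical, the approximate posterior is assumed to be a diagonal Gaussian, and arms are assumed to be ranked by the product of the two sampled logistic terms. The variational update that would fit the posterior is not shown.

```python
import math
import random


def sigmoid(z):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-z))


def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def click_probability(x_judge, x_exam, theta_judge, theta_exam):
    """P(positive feedback) = P(examined) * P(positive judgement).

    Both factors are logistic in their own features, as in the summary.
    """
    p_exam = sigmoid(dot(x_exam, theta_exam))
    p_judge = sigmoid(dot(x_judge, theta_judge))
    return p_exam * p_judge


def sample_diag_gaussian(mean, std, rng):
    """Draw one parameter vector from an assumed diagonal Gaussian posterior."""
    return [rng.gauss(m, s) for m, s in zip(mean, std)]


def thompson_select(arms, posterior, rng):
    """Pick an arm by sampling parameters once and maximizing the sampled score.

    arms: list of (x_judge, x_exam) feature pairs (hypothetical layout).
    posterior: dict with keys judge_mean, judge_std, exam_mean, exam_std.
    """
    theta_judge = sample_diag_gaussian(
        posterior["judge_mean"], posterior["judge_std"], rng
    )
    theta_exam = sample_diag_gaussian(
        posterior["exam_mean"], posterior["exam_std"], rng
    )
    scores = [
        click_probability(x_j, x_e, theta_judge, theta_exam) for x_j, x_e in arms
    ]
    return max(range(len(arms)), key=scores.__getitem__)


if __name__ == "__main__":
    rng = random.Random(0)
    arms = [([1.0], [1.0]), ([-1.0], [1.0])]
    posterior = {
        "judge_mean": [2.0], "judge_std": [0.5],
        "exam_mean": [0.0], "exam_std": [0.5],
    }
    print("selected arm:", thompson_select(arms, posterior, rng))
```

The sketch also shows why missing feedback is ambiguous: an arm with a high judgement probability can still have a near-zero feedback probability when its examination probability is low.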
Oct-8-2024, 05:58:29 GMT