Batched Dueling Bandits
Agarwal, Arpit, Ghuge, Rohan, Nagarajan, Viswanath
The K-armed dueling bandits problem has been widely studied in machine learning due to its applications in search ranking, recommendation systems, sports ranking, etc. [3, 14, 16, 26, 29, 30, 34, 38, 41, 43-46]. It is a variation of the traditional stochastic bandit problem in which feedback is obtained in the form of pairwise preferences. This problem falls under the umbrella of preference learning [39], where the goal is to learn from relative feedback (in our case, given two alternatives, which of the two is preferred). Designing learning algorithms for such relative feedback becomes crucial in domains where qualitative feedback is easily obtained, but real-valued feedback would be arbitrary or not interpretable. We illustrate this using the web-search ranking application. Web-search ranking is an example of a complex information retrieval system, where the goal is to provide a list (usually ranked) of candidate documents to the user of the system in response to a query [25, 27, 33, 42]. Modern day search engines comprise hundreds of parameters which are used to output a ranked list in response to a query. However, manually tuning these parameters can sometimes be infeasible, and online learning frameworks (based on user feedback) have been invaluable in automatically tuning these parameters [31]. These methods do not affect user experience, enable the system to continuously learn about user preferences, and thus continuously adapt to user behavior.
Feb-21-2022