Neural Dueling Bandits

Verma, Arun, Dai, Zhongxiang, Lin, Xiaoqiang, Jaillet, Patrick, Low, Bryan Kian Hsiang

Jul-24-2024–arXiv.org Machine Learning

Contextual dueling bandit is used to model the bandit problems, where a learner's goal is to find the best arm for a given context using observed noisy preference feedback over the selected arms for the past contexts. However, existing algorithms assume the reward function is linear, which can be complex and non-linear in many real-life applications like online recommendations or ranking web search results. To overcome this challenge, we use a neural network to estimate the reward function using preference feedback for the previously selected arms. We propose upper confidence bound- and Thompson sampling-based algorithms with sub-linear regret guarantees that efficiently select arms in each round. We then extend our theoretical results to contextual bandit problems with binary feedback, which is in itself a non-trivial contribution. Experimental results on the problem instances derived from synthetic datasets corroborate our theoretical results.

algorithm, bandit, reward function, (15 more...)

arXiv.org Machine Learning

Jul-24-2024

arXiv.org PDF

Add feedback

Country:
- Asia > Singapore (0.04)
- North America > United States
  - Massachusetts > Middlesex County > Cambridge (0.04)

Genre:
- Research Report (0.82)

Technology:
- Information Technology
  - Artificial Intelligence > Machine Learning
    - Neural Networks (0.68)
  - Data Science > Data Mining
    - Big Data (0.69)