Nearly Minimax Optimal Regret for Multinomial Logistic Bandit

May-27-2025, 15:46:58 GMT–Neural Information Processing Systems

In this paper, we study the contextual multinomial logit (MNL) bandit problem in which a learning agent sequentially selects an assortment based on contextual information, and user feedback follows an MNL choice model.There has been a significant discrepancy between lower and upper regret bounds, particularly regarding the maximum assortment size K . Additionally, the variation in reward structures between these bounds complicates the quest for optimality. Under uniform rewards, where all items have the same expected reward, we establish a regret lower bound of \Omega(d\sqrt{\smash[b]{T/K}}) and propose a constant-time algorithm, OFU-MNL, that achieves a matching upper bound of \tilde{\mathcal{O}}(d\sqrt{\smash[b]{T/K}}) . We also provide instance-dependent minimax regret bounds under uniform rewards.Under non-uniform rewards, we prove a lower bound of \Omega(d\sqrt{T}) and an upper bound of \tilde{\mathcal{O}}(d\sqrt{T}), also achievable by OFU-MNL . Our empirical studies support these theoretical findings.

artificial intelligence, machine learning, multinomial logistic bandit, (9 more...)

Neural Information Processing Systems

May-27-2025, 15:46:58 GMT

Conferences Web Page

Add feedback

Technology:
- Information Technology > Artificial Intelligence
  - Representation & Reasoning > Search (0.69)
  - Machine Learning (0.67)