Nearly Minimax Optimal Regret for Multinomial Logistic Bandit

Neural Information Processing Systems 

In this paper, we study the contextual multinomial logit (MNL) bandit problem in which a learning agent sequentially selects an assortment based on contextual information, and user feedback follows an MNL choice model.There has been a significant discrepancy between lower and upper regret bounds, particularly regarding the maximum assortment size K . Additionally, the variation in reward structures between these bounds complicates the quest for optimality. Under uniform rewards, where all items have the same expected reward, we establish a regret lower bound of \Omega(d\sqrt{\smash[b]{T/K}}) and propose a constant-time algorithm, OFU-MNL, that achieves a matching upper bound of \tilde{\mathcal{O}}(d\sqrt{\smash[b]{T/K}}) . We also provide instance-dependent minimax regret bounds under uniform rewards.Under non-uniform rewards, we prove a lower bound of \Omega(d\sqrt{T}) and an upper bound of \tilde{\mathcal{O}}(d\sqrt{T}), also achievable by OFU-MNL . Our empirical studies support these theoretical findings.