A Bandit Learning Algorithm and Applications to Auction Design
–Neural Information Processing Systems
We consider online bandit learning in which at every time step, an algorithm has to make a decision and then observe only its reward. The goal is to design efficient (polynomial-time) algorithms that achieve a total reward approximately close to that of the best fixed decision in hindsight. In this paper, we introduce a new notion of (λ, µ)-concave functions and present a bandit learning algorithm that achieves a performance guarantee which is characterized as a function of the concavity parameters λ and µ. The algorithm is based on the mirror descent algorithm in which the update directions follow the gradient of the multilinear extensions of the reward functions. The regret bound induced by our algorithm is Õ( T) which is nearly optimal.
Neural Information Processing Systems
Jan-26-2025, 11:45:45 GMT