Context-lumpable stochastic bandits
Neural Information Processing Systems
We consider a contextual bandit problem with S contexts and K actions. In each round t = 1, 2, ..., the learner observes a random context and chooses an action based on its past experience. The learner then observes a random reward whose mean is a function of the context and the action for the round.
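The interaction protocol described above can be sketched as a small simulation. This is only an illustration under stated assumptions, not the method of the paper: contexts are drawn uniformly at random, rewards are Bernoulli with the given means, and the learner (`empirical_greedy`) is a hypothetical placeholder that tries each action once per context and then picks the best empirical mean. The names `run_bandit`, `empirical_greedy`, and `means` do not appear in the source.

```python
import random


def empirical_greedy(context, history, K, rng):
    """Placeholder learner (an assumption, not the paper's algorithm).

    Tries every action once in the given context, then plays the action
    with the highest empirical mean reward observed in that context.
    """
    sums = [0.0] * K
    counts = [0] * K
    for c, a, r in history:
        if c == context:
            sums[a] += r
            counts[a] += 1
    untried = [a for a in range(K) if counts[a] == 0]
    if untried:
        return rng.choice(untried)
    return max(range(K), key=lambda a: sums[a] / counts[a])


def run_bandit(means, T, policy=empirical_greedy, seed=0):
    """Simulate T rounds of a contextual bandit with S contexts, K actions.

    means[s][a] is the mean reward of action a in context s.
    Assumptions: uniform random contexts, Bernoulli rewards.
    Returns the list of (context, action, reward) triples.
    """
    S, K = len(means), len(means[0])
    rng = random.Random(seed)
    history = []
    for t in range(1, T + 1):
        context = rng.randrange(S)                 # learner observes a random context
        action = policy(context, history, K, rng)  # choice depends only on past experience
        reward = 1.0 if rng.random() < means[context][action] else 0.0
        history.append((context, action, reward))
    return history


if __name__ == "__main__":
    means = [[0.2, 0.8], [0.9, 0.1]]  # S = 2 contexts, K = 2 actions
    history = run_bandit(means, T=1000)
    print("average reward:", sum(r for _, _, r in history) / len(history))
```

The policy receives only the current context and the history of past rounds, which mirrors the statement that the learner acts on its past experience and never sees the mean rewards directly.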
May-25-2025, 16:47:56 GMT