Context-lumpable stochastic bandits

Neural Information Processing Systems 

We consider a contextual bandit problem with S contexts and K actions. In each round t = 1, 2, ..., the learner observes a random context and chooses an action based on its past experience. The learner then observes a random reward whose mean is a function of the context and the action chosen in that round.
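The interaction protocol described above can be illustrated with a minimal simulation sketch. Several details here are assumptions made only for illustration and are not taken from the text: contexts are drawn uniformly at random, rewards are Bernoulli with the given means, and the placeholder `uniform_policy` ignores its history. The names `run_protocol`, `uniform_policy`, and `means` are hypothetical.

```python
import random

def run_protocol(mean_reward, n_rounds, policy, seed=0):
    """Simulate the context -> action -> reward loop.

    mean_reward[s][a] is the mean reward of action a in context s
    (an S x K table). Assumptions for illustration only: contexts are
    uniform over {0, ..., S-1} and rewards are Bernoulli.
    """
    rng = random.Random(seed)
    S = len(mean_reward)
    K = len(mean_reward[0])
    history = []  # past (context, action, reward) triples
    for t in range(1, n_rounds + 1):
        # The learner observes a random context.
        context = rng.randrange(S)
        # It chooses an action based on the context and its past experience.
        action = policy(context, history, K, rng)
        # It observes a random reward whose mean depends on (context, action).
        reward = 1.0 if rng.random() < mean_reward[context][action] else 0.0
        history.append((context, action, reward))
    return history

def uniform_policy(context, history, K, rng):
    """Placeholder learner: picks an action uniformly at random."""
    return rng.randrange(K)

# Hypothetical instance with S = 3 contexts and K = 2 actions.
means = [[0.2, 0.8], [0.8, 0.2], [0.2, 0.8]]
history = run_protocol(means, n_rounds=1000, policy=uniform_policy)
```

Any learning rule with the same signature as `uniform_policy` can be substituted to study how the choice of action depends on past observations.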