Reviews: Large Scale Markov Decision Processes with Changing Rewards

Neural Information Processing Systems 

I still feel that the work would be greatly improved by adding numerical experiments. In particular, the authors refer to a specific setting called 'online MDP', where the dynamics, that is, the transition probabilities, are known while the rewards are not. Regret minimization then refers to minimizing the regret given that the rewards may be chosen in an adversarial manner. The authors start with a (rather technical) introduction, review related work, and explain the main ideas based on concise preliminaries. Afterwards, they provide an extension to large state spaces that uses approximate occupancy measures and thereby avoids concrete state mappings.
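
For reference, the regret in this setting is commonly formalized along the following lines. The notation is assumed here for illustration and is not taken from the paper: $T$ is the number of rounds, $r_t$ the reward function revealed in round $t$, $(s_t, a_t)$ the state-action pair visited by the learner, $(s_t^{\pi}, a_t^{\pi})$ the pair visited when following a fixed stationary policy $\pi$ under the same known transition probabilities, and $\Pi$ the set of such comparator policies.

```latex
% Sketch of regret against the best fixed stationary policy in hindsight.
% All symbols are assumed notation, not the paper's.
\[
  R_T \;=\; \max_{\pi \in \Pi}\,
  \mathbb{E}\!\left[\sum_{t=1}^{T} r_t\!\left(s_t^{\pi}, a_t^{\pi}\right)\right]
  \;-\;
  \mathbb{E}\!\left[\sum_{t=1}^{T} r_t\!\left(s_t, a_t\right)\right]
\]
```

Under this reading, an algorithm is usually considered successful when $R_T$ grows sublinearly in $T$, so that the average per-round regret vanishes.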