Online Learning in Markov Decision Processes with Adversarially Chosen Transition Probability Distributions

Neural Information Processing Systems 

We study the problem of online learning Markov Decision Processes (MDPs) when both the transition distributions and loss functions are chosen by an adversary. We present an algorithm that, under a mixing assumption, achieves O( T log |Π| + log |Π|) regret with respect to a comparison set of policies Π. The regret is independent of the size of the state and action spaces. When expectations over sample paths can be computed efficiently and the comparison set Π has polynomial size, this algorithm is efficient. We also consider the episodic adversarial online shortest path problem.