NavigatingtotheBestPolicy inMarkovDecisionProcesses

Neural Information Processing Systems 

Somewhat surprisingly, learning in a Markov Decision Process is most often considered under the performance criteria ofconsistency or regret minimization(see e.g.