Provably Efficient Q-Learning with Low Switching Cost

Yu Bai, Tengyang Xie, Nan Jiang, Yu-Xiang Wang

Neural Information Processing Systems 

We take initial steps in studying PAC-MDP algorithms with limited adaptivity, that is, algorithms that change its exploration policy as infrequently as possible during regret minimization. This is motivated by the difficulty of running fully adaptive algorithms in real-world applications (such as medical domains), and we propose to quantify adaptivity using the notion of local switching cost.