Model-free Posterior Sampling via Learning Rate Randomization
Neural Information Processing Systems
In this paper, we introduce Randomized Q-learning (RandQL), a novel randomized model-free algorithm for regret minimization in episodic Markov Decision Processes (MDPs). To the best of our knowledge, RandQL is the first tractable model-free posterior sampling-based algorithm. We analyze the performance of RandQL in both tabular and non-tabular metric space settings. In tabular MDPs, RandQL achieves a regret bound of order \widetilde{\mathcal{O}}(\sqrt{H^{5}SAT}), where H is the planning horizon, S is the number of states, A is the number of actions, and T is the number of episodes. Notably, RandQL achieves optimistic exploration without using bonuses, relying instead on a novel idea of learning rate randomization.
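To make the idea of learning rate randomization concrete, the following is a minimal sketch of a single Q-value update in which the step size is drawn at random instead of being fixed. It is not the paper's algorithm: the abstract does not specify the sampling distribution, so the Beta distribution, the ensemble of estimates, the `kappa` scaling parameter, and the function name `randomized_q_update` are all assumptions made for illustration. The Beta parameters are chosen so that the mean step size equals (H+1)/(H+1+n), a schedule commonly used in the analysis of optimistic Q-learning.

```python
import numpy as np


def randomized_q_update(q_ensemble, visit_count, reward, next_value,
                        horizon, rng, kappa=1.0):
    """Apply one temporal-difference update with a random step size.

    Hypothetical illustration only. Each entry of `q_ensemble` is a
    separate estimate of the same Q-value and receives its own
    independently sampled learning rate.

    q_ensemble  : 1-D array of current Q estimates for one (state, action)
    visit_count : number of visits to this pair before the update
    reward      : observed reward
    next_value  : value estimate of the next state
    horizon     : planning horizon H
    kappa       : assumed scaling parameter; smaller values concentrate
                  the step size around its mean
    """
    q_ensemble = np.asarray(q_ensemble, dtype=float)
    target = reward + next_value

    if visit_count == 0:
        # First visit: a step size of 1 overwrites the initial value.
        step = np.ones_like(q_ensemble)
    else:
        # Assumed distribution: Beta with mean (H + 1) / (H + 1 + n).
        step = rng.beta((horizon + 1) / kappa,
                        visit_count / kappa,
                        size=q_ensemble.shape)

    # Standard Q-learning interpolation, but with a random step size.
    return (1.0 - step) * q_ensemble + step * target
```

Under these assumptions, an agent could act greedily with respect to the maximum over the ensemble, so that the spread of the randomized estimates plays the role an explicit exploration bonus would otherwise play.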
Jan-20-2025, 01:29:23 GMT