Approximate Thompson Sampling for Learning Linear Quadratic Regulators with $O(\sqrt{T})$ Regret
Kim, Yeoneung, Kim, Gihun, Yang, Insoon
Balancing the exploration-exploitation trade-off is a fundamental dilemma in reinforcement learning (RL). This issue has been systemically addressed in two main approaches, namely optimism in the face of uncertainty (OFU) and Thompson sampling (TS). The methods using OFU first construct confidence sets for the environment or model parameters given the samples observed so far. After finding the reward-maximizing or optimistic parameters within the confidence set, an optimal policy with respect to the parameters is constructed and executed [1]. Various algorithms using OFU are shown to have strong theoretical guarantees in bandits [2]. On the other hand, TS is a Bayesian method in which environment or model parameters are sampled from the posterior that is updated along the process using samples and a prior, and an optimal policy with respect to the sampled parameter is constructed and executed [3].
May-28-2024