Regret-Optimal Model-Free Reinforcement Learning for Discounted MDPs with Short Burn-In Time
A crucial problem in reinforcement learning is learning the optimal policy. We study this problem in tabular infinite-horizon discounted Markov decision processes in the online setting. Existing algorithms either fail to achieve regret optimality or must incur high memory and computational costs. Moreover, existing regret-optimal algorithms all require a long burn-in time: their optimality is not guaranteed unless the sample size surpasses a high threshold. We address both open problems by introducing a model-free algorithm that employs variance reduction together with a novel technique that switches the execution policy in a slow-yet-adaptive manner. This is the first regret-optimal model-free algorithm in the discounted setting, with the additional benefit of a short burn-in time.
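To make the setting concrete, the following is a minimal sketch of plain tabular Q-learning on a small discounted MDP. The two-state MDP, the step-size schedule, and the exploration rate are all illustrative assumptions; this shows only the baseline model-free update, not the paper's variance-reduced, policy-switching algorithm.

```python
import numpy as np

# Hypothetical 2-state, 2-action discounted MDP, used purely for illustration.
rng = np.random.default_rng(0)
n_states, n_actions, gamma = 2, 2, 0.9

# P[s, a] = next-state distribution; R[s, a] = deterministic reward.
P = np.array([[[0.9, 0.1], [0.1, 0.9]],
              [[0.8, 0.2], [0.2, 0.8]]])
R = np.array([[1.0, 0.0],
              [0.0, 1.0]])

Q = np.zeros((n_states, n_actions))       # tabular Q estimates
visits = np.zeros((n_states, n_actions))  # per-pair visit counts for step sizes

s = 0
for _ in range(50_000):
    # epsilon-greedy exploration over the current Q estimates
    a = rng.integers(n_actions) if rng.random() < 0.1 else int(Q[s].argmax())
    s_next = int(rng.choice(n_states, p=P[s, a]))
    visits[s, a] += 1
    lr = 1.0 / visits[s, a] ** 0.6  # polynomial step size, a common choice
    # standard Q-learning update toward r + gamma * max_a' Q(s', a')
    Q[s, a] += lr * (R[s, a] + gamma * Q[s_next].max() - Q[s, a])
    s = s_next
```

In this toy MDP the greedy policy learned by the sketch picks action 0 in state 0 and action 1 in state 1, each of which earns reward 1 and tends to keep the agent in that state. Model-based methods would instead estimate `P` and `R` explicitly, which is where the memory overhead mentioned in the abstract comes from.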
Dec-12-2023