Policy Finetuning: Bridging Sample-Efficient Offline and Online Reinforcement Learning

Neural Information Processing Systems 

However, existing algorithms and theories for learning near-optimal policies in these two settings are rather different and disconnected. Towards bridging this gap, this paper initiates the theoretical study of policy finetuning, that is, online RL where the learner has additional access to a "reference policy" µ close to the optimal policy π⋆.