Policy Finetuning: Bridging Sample-Efficient Offline and Online Reinforcement Learning

Neural Information Processing Systems 

Recent theoretical work studies sample-efficient reinforcement learning (RL) extensively in two settings: learning interactively in the environment (online RL), or learning from an offline dataset (offline RL). However, existing algorithms and theories for learning near-optimal policies in these two settings are rather different and disconnected. Towards bridging this gap, this paper initiates the theoretical study of *policy finetuning*, that is, online RL where the learner has additional access to a "reference policy" \mu close to the optimal policy \pi_\star in a certain sense. We consider the policy finetuning problem in episodic Markov Decision Processes (MDPs) with S states, A actions, and horizon length H. We first design a sharp *offline reduction* algorithm, which simply executes \mu and runs offline policy optimization, that finds an \varepsilon near-optimal policy within \widetilde{O}(H^3 S C^\star / \varepsilon^2) episodes, where C^\star is the single-policy concentrability coefficient between \mu and \pi_\star. This offline result is the first that matches the sample complexity lower bound in this setting, and resolves a recent open question in offline RL. We then establish an \Omega(H^3 S \min\{C^\star, A\} / \varepsilon^2) sample complexity lower bound for *any* policy finetuning algorithm, including those that can adaptively explore the environment.
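For reference, a minimal sketch of the single-policy concentrability coefficient in its standard occupancy-ratio form; the notation d_h^\pi for the state-action occupancy measure of policy \pi at step h is an assumption of this sketch rather than taken from the abstract, and the paper's exact definition may differ in minor details:

```latex
% Single-policy concentrability between the reference policy \mu and
% the optimal policy \pi_\star (standard occupancy-ratio form; the
% 0/0 = 0 convention is an assumption of this sketch):
C^\star \;:=\; \max_{h \in [H],\; s \in \mathcal{S},\; a \in \mathcal{A}}
  \frac{d_h^{\pi_\star}(s, a)}{d_h^{\mu}(s, a)},
\qquad \text{with the convention } 0/0 = 0.
```

Intuitively, C^\star measures how well the reference policy \mu covers the state-action pairs visited by the optimal policy \pi_\star: it equals 1 when \mu = \pi_\star and grows as \mu's coverage degrades.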