Policy Finetuning: Bridging Sample-Efficient Offline and Online Reinforcement Learning

Neural Information Processing Systems 

Recent theoretical work studies sample-efficient reinforcement learning (RL) extensively in two settings: learning interactively in the environment (online RL), or learning from an offline dataset (offline RL). However, existing algorithms and theories for learning near-optimal policies in these two settings are rather different and disconnected. Towards bridging this gap, this paper initiates the theoretical study of *policy finetuning*, that is, online RL where the learner has additional access to a "reference policy" \mu close to the optimal policy \pi_\star in a certain sense. We consider the policy finetuning problem in episodic Markov Decision Processes (MDPs) with S states, A actions, and horizon length H. We first design a sharp *offline reduction* algorithm, which simply executes \mu and runs offline policy optimization, that finds an \varepsilon near-optimal policy within \widetilde{O}(H^3 S C^\star / \varepsilon^2) episodes, where C^\star is the single-policy concentrability coefficient between \mu and \pi_\star. This offline result is the first that matches the sample complexity lower bound in this setting, and resolves a recent open question in offline RL. We then establish an \Omega(H^3 S \min\{C^\star, A\} / \varepsilon^2) sample complexity lower bound for *any* policy finetuning algorithm, including those that can adaptively explore the environment.
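For reference, a minimal sketch of the single-policy concentrability coefficient in its standard occupancy-ratio form; the notation d_h^\pi for the state-action occupancy measure of policy \pi at step h is an assumption of this sketch rather than taken from the abstract, and the paper's exact definition may differ in minor details:

```latex
% Single-policy concentrability between the reference policy \mu and
% the optimal policy \pi_\star (standard occupancy-ratio form; the
% 0/0 = 0 convention is an assumption of this sketch):
C^\star \;:=\; \max_{h \in [H],\; s \in \mathcal{S},\; a \in \mathcal{A}}
  \frac{d_h^{\pi_\star}(s, a)}{d_h^{\mu}(s, a)},
\qquad \text{with the convention } 0/0 = 0.
```

Intuitively, C^\star measures how well the reference policy \mu covers the state-action pairs visited by the optimal policy \pi_\star: it equals 1 when \mu = \pi_\star and grows as \mu's coverage degrades.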