Bi-Level Offline Policy Optimization with Limited Exploration

Dec-26-2025, 13:22:26 GMT–Neural Information Processing Systems

We study offline reinforcement learning (RL) which seeks to learn a good policy based on a fixed, pre-collected dataset. A fundamental challenge behind this task is the distributional shift due to the dataset lacking sufficient exploration, especially under function approximation. To tackle this issue, we propose a bi-level structured policy optimization algorithm that models a hierarchical interaction between the policy (upper-level) and the value function (lower-level). The lower level focuses on constructing a confidence set of value estimates that maintain sufficiently small weighted average Bellman errors, while controlling uncertainty arising from distribution mismatch. Subsequently, at the upper level, the policy aims to maximize a conservative value estimate from the confidence set formed at the lower level.

bi-level offline policy optimization, exploration, name change, (4 more...)

Neural Information Processing Systems

Dec-26-2025, 13:22:26 GMT

Conferences Web Page

Add feedback

Technology:
- Information Technology > Artificial Intelligence
  - Machine Learning (0.76)
  - Representation & Reasoning (0.59)