Mutual Information Regularized Offline Reinforcement Learning
–Neural Information Processing Systems
We show that optimizing this lower bound is equivalent to maximizing the likelihood of a one-step improved policy on the offline dataset. Hence, we constrain the policy improvement direction to lie in the data manifold.
Neural Information Processing Systems
Oct-8-2025, 12:30:26 GMT