Mutual Information Regularized Offline Reinforcement Learning

Neural Information Processing Systems 

We show that optimizing this mutual information lower bound is equivalent to maximizing the likelihood of a one-step improved policy on the offline dataset. Hence, the policy improvement direction is constrained to lie within the data manifold.
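The following is a minimal sketch of one reading of this statement, not the paper's exact algorithm: the "one-step improved" policy is taken to be pi'(a|s) proportional to pi(a|s) * exp(Q(s,a)/tau), and its log-likelihood is maximized on dataset (state, action) pairs, which keeps the improvement direction tied to actions covered by the data. All names here (`PolicyNet`, `QNet`, `tau`, `n_samples`) are illustrative assumptions introduced for this sketch.

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Gaussian policy with a state-dependent mean and a learned, state-independent log-std."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.mean = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(), nn.Linear(hidden, action_dim)
        )
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def dist(self, s):
        return torch.distributions.Normal(self.mean(s), self.log_std.exp())

class QNet(nn.Module):
    """Simple state-action value network."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))

def improved_policy_nll(policy, q_net, states, actions, tau=1.0, n_samples=8):
    """Negative log-likelihood of dataset actions under the one-step improved policy.

    Using pi'(a|s) ∝ pi(a|s) exp(Q(s,a)/tau):
        log pi'(a|s) = log pi(a|s) + Q(s,a)/tau - log Z(s),
    with Z(s) = E_{a~pi}[exp(Q(s,a)/tau)] estimated by sampling from the current policy.
    Minimizing this loss maximizes the improved policy's likelihood on the offline data.
    """
    dist = policy.dist(states)
    log_pi = dist.log_prob(actions).sum(-1)            # log pi(a|s) on dataset actions
    q_data = q_net(states, actions).squeeze(-1) / tau  # Q(s,a)/tau on dataset actions

    # Monte-Carlo estimate of log Z(s) via actions sampled from the current policy
    sampled = dist.sample((n_samples,))                # (n_samples, B, action_dim)
    s_rep = states.unsqueeze(0).expand(n_samples, *states.shape)
    q_samp = q_net(
        s_rep.reshape(-1, states.shape[-1]),
        sampled.reshape(-1, actions.shape[-1]),
    ).view(n_samples, -1) / tau
    log_z = torch.logsumexp(q_samp, dim=0) - torch.log(torch.tensor(float(n_samples)))

    return -(log_pi + q_data - log_z).mean()

# Illustrative usage on random tensors standing in for an offline batch
states, actions = torch.randn(32, 17), torch.randn(32, 6)
policy, q_net = PolicyNet(17, 6), QNet(17, 6)
loss = improved_policy_nll(policy, q_net, states, actions)
```

Under this reading, the regularizer reduces to an advantage-weighted log-likelihood on dataset actions, so the policy is only pushed toward actions the behavior data already supports.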
