RAMBO-RL: Robust Adversarial Model-Based Offline Reinforcement Learning

Neural Information Processing Systems 

The model is trained to minimise the value function while still accurately predicting the transitions in the dataset, which forces the policy to act conservatively in areas not covered by the dataset. To approximately solve this two-player game between the policy and the model, we alternate between optimising the policy and adversarially optimising the model.
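The alternating scheme can be illustrated with a deliberately tiny sketch. It is not the paper's implementation. Everything in it is an assumption made for illustration: a one-step problem with two actions, a hypothetical dataset in which only action 0 was tried (succeeding 80% of the time), a Bernoulli model with one logit per action, a softmax policy, and the names `run_toy_game`, `lam`, `lr_policy`, and `lr_model`. The model loss is the value under the current policy, weighted by `lam`, plus the negative log-likelihood of the dataset.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def run_toy_game(n_iters=500, lam=0.1, lr_policy=1.0, lr_model=0.5):
    # Hypothetical dataset: only action 0 appears, with an 80% success rate.
    data_success_rate = 0.8
    theta = [0.0, 0.0]  # model logits: P(success | action) = sigmoid(theta)
    phi = [0.0, 0.0]    # policy logits: pi = softmax(phi)

    for _ in range(n_iters):
        # Policy step: gradient ascent on the value under the current model.
        p = [sigmoid(t) for t in theta]
        pi = softmax(phi)
        value = sum(w * q for w, q in zip(pi, p))
        phi = [f + lr_policy * w * (q - value) for f, w, q in zip(phi, pi, p)]

        # Model step: gradient descent on lam * value + dataset NLL.
        # The NLL term only constrains action 0, the one covered by the data.
        pi = softmax(phi)
        grad_value = [w * q * (1.0 - q) for w, q in zip(pi, p)]
        grad_nll = [p[0] - data_success_rate, 0.0]
        theta = [
            t - lr_model * (lam * gv + gn)
            for t, gv, gn in zip(theta, grad_value, grad_nll)
        ]

    return softmax(phi), [sigmoid(t) for t in theta]


pi, p = run_toy_game()
print("policy:", pi)
print("model success probabilities:", p)
```

In this toy setting the data term keeps the model's prediction for action 0 close to the observed rate, while nothing stops the adversarial term from lowering the prediction for the unobserved action 1, so the policy drifts towards the action the dataset covers. A small `lam` is chosen here so that the adversarial term does not dominate the fit to the data.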