LobsDICE: OfflineLearningfromObservationvia StationaryDistributionCorrectionEstimation

Neural Information Processing Systems 

We additionally assume that the agent cannot interact with the environment but has access to the action-labeled transition data collected by some agents with unknown qualities.