LAPO: Latent-Variable Advantage-Weighted Policy Optimization for Offline Reinforcement Learning

Neural Information Processing Systems 

In practice, however, it requires querying the behavior policy, which is unknown, and an erroneous approximation of the behavior policy can degrade performance ([39]).
