LAPO: Latent-Variable Advantage-Weighted Policy Optimization for Offline Reinforcement Learning
Neural Information Processing Systems
In practice, however, it requires querying the behavior policy, which is unknown, and an erroneous approximation of the behavior policy can degrade performance ([39]).
Feb-12-2026, 18:26:51 GMT