Offline RL Without Off-Policy Evaluation
–Neural Information Processing Systems
In addition, we hypothesize that the strong performance of the one-step algorithm is due to a combination of favorable structure in the environment and behavior policy.
Feb-7-2026, 22:53:31 GMT