Conservative Dual Policy Optimization for Efficient Model-Based Reinforcement Learning
–Neural Information Processing Systems
Based on the principle of optimism in the face of uncertainty (OFU) [56, 49, 10], OFU-RL achieves global optimality by ensuring that the optimistically biased value is close to the real value in the long run. Based on Thompson Sampling [62], Posterior Sampling RL (PSRL) [57, 42, 43] explores by greedily optimizing the policy in an MDP sampled from the posterior distribution over MDPs.
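To make the posterior-sampling loop concrete, the following is a minimal tabular sketch, not the method of this paper or of the cited works. It assumes a finite MDP with known rewards, a Dirichlet posterior over transition probabilities, and value iteration as the greedy optimizer. All names (`sample_mdp`, `greedy_policy`, `psrl_episode`) and parameters are hypothetical and chosen for illustration only.

```python
import numpy as np

def sample_mdp(counts, rng):
    """Draw one transition model P[s, a, :] from a Dirichlet posterior.

    counts[s, a, s'] holds prior pseudo-counts plus observed transitions
    (assumption: independent Dirichlet posterior per state-action pair).
    """
    n_states, n_actions, _ = counts.shape
    P = np.empty_like(counts, dtype=float)
    for s in range(n_states):
        for a in range(n_actions):
            P[s, a] = rng.dirichlet(counts[s, a])
    return P

def greedy_policy(P, R, gamma=0.95, iters=500):
    """Run value iteration on the given MDP and return its greedy policy."""
    V = np.zeros(P.shape[0])
    for _ in range(iters):
        Q = R + gamma * (P @ V)  # Q[s, a] = R[s, a] + gamma * E[V(s')]
        V = Q.max(axis=1)
    return Q.argmax(axis=1)

def psrl_episode(counts, R, true_P, rng, horizon=20, start_state=0):
    """One PSRL episode: sample an MDP, act greedily in it, update counts."""
    P = sample_mdp(counts, rng)        # posterior sample over MDPs
    policy = greedy_policy(P, R)       # greedy optimization in the sample
    s = start_state
    for _ in range(horizon):
        a = policy[s]
        s_next = rng.choice(true_P.shape[2], p=true_P[s, a])
        counts[s, a, s_next] += 1      # posterior update (in place)
        s = s_next
    return policy

# Hypothetical 2-state, 2-action environment used only as a usage example.
rng = np.random.default_rng(0)
counts = np.ones((2, 2, 2))                  # uniform Dirichlet prior
R = np.array([[0.0, 1.0], [0.0, 1.0]])       # assumed known rewards R[s, a]
true_P = np.full((2, 2, 2), 0.5)             # true dynamics, hidden from the agent
for _ in range(5):
    psrl_episode(counts, R, true_P, rng)
```

Exploration here comes only from the randomness of the sampled model: as `counts` grows, the posterior concentrates and the sampled MDPs, and hence the greedy policies, vary less from episode to episode.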
Feb-11-2026, 02:52:59 GMT