

 psrl





$\cdots \big[\,\|\widetilde{f}_t(\cdot \mid s,a) - f(\cdot \mid s,a)\|_1\big] + \mathbb{E}_{\rho^{\pi_{t-1}}}\big[\,\|\widetilde{f}_t(\cdot \mid s,a) - f(\cdot \mid s,a)\|_1\big]$, (A.1) where $\Delta(t) := \mathbb{E}_{s \sim \zeta}\big[V \cdots$

Neural Information Processing Systems

From the Posterior Sampling Lemma, we know that if ψ is the distribution of f, then for any σ(H_t)-measurable function g, E[g(f) | H_t] = E[g(f_t) | H_t]. We can further know from the construction of the confidence set (cf. …). This lemma is widely adopted in RL; its proof can be found in various previous works, e.g. … Prior work that shares similarities with ours includes DPI [59] and GPS [31, 39], as dual policy optimization procedures are adopted.
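For readability, the probability-matching property invoked above can be written out in its usual display form; the notation may differ slightly from the paper's appendix.

% Standard statement of the Posterior Sampling Lemma (probability matching):
% if the sampled model f_t is drawn from the posterior of the true model f
% given the history H_t, then f and f_t are exchangeable conditioned on H_t.
\[
  f_t \sim \psi(\cdot \mid H_t)
  \;\Longrightarrow\;
  \mathbb{E}\!\left[\, g(f) \mid H_t \,\right]
  = \mathbb{E}\!\left[\, g(f_t) \mid H_t \,\right]
  \qquad \text{for any } \sigma(H_t)\text{-measurable function } g .
\]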


Conservative Dual Policy Optimization for Efficient Model-Based Reinforcement Learning

Neural Information Processing Systems

Based on the principle of optimism in the face of uncertainty (OFU) [56, 49, 10], OFU-RL achieves global optimality by ensuring that the optimistically biased value is close to the real value in the long run. Based on Thompson Sampling [62], Posterior Sampling RL (PSRL) [57, 42, 43] explores by greedily optimizing the policy in an MDP sampled from the posterior distribution over MDPs.
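To make the PSRL scheme described above concrete, here is a minimal sketch of tabular PSRL, assuming a finite MDP with known rewards and a Dirichlet posterior over transitions; the environment interface and all names are illustrative assumptions rather than code from any of the listed papers.

import numpy as np

# Minimal illustrative sketch of tabular PSRL: Dirichlet posterior over
# transitions, known rewards, value iteration as the planner. The env
# interface (reset/step) is assumed, not taken from the papers above.

def value_iteration(P, R, gamma=0.95, iters=200):
    """Return the greedy policy for the sampled MDP (P: [S,A,S], R: [S,A])."""
    V = np.zeros(R.shape[0])
    for _ in range(iters):
        Q = R + gamma * P @ V          # [S, A]
        V = Q.max(axis=1)
    return Q.argmax(axis=1)

def psrl(env, n_states, n_actions, R, episodes=100, horizon=50):
    alpha = np.ones((n_states, n_actions, n_states))     # Dirichlet prior counts
    for _ in range(episodes):
        # Posterior sampling: draw one MDP from the current posterior ...
        P = np.array([[np.random.dirichlet(alpha[s, a]) for a in range(n_actions)]
                      for s in range(n_states)])
        pi = value_iteration(P, R)                        # ... and plan greedily in it.
        s = env.reset()
        for _ in range(horizon):
            a = pi[s]
            s_next, _, done = env.step(a)
            alpha[s, a, s_next] += 1.0                    # posterior (count) update
            s = s_next
            if done:
                break
    return alpha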


On Efficiency in Hierarchical Reinforcement Learning

Neural Information Processing Systems

While this has been demonstrated empirically over time in a variety of tasks, theoretical results quantifying the benefits of such methods are still few and far between. In this paper, we discuss the kind of structure in a Markov decision process which gives rise to efficient HRL methods.


Successor Uncertainties: Exploration and Uncertainty in Temporal Difference Learning

Neural Information Processing Systems

Posterior sampling for reinforcement learning (PSRL) is an effective method for balancing exploration and exploitation in reinforcement learning. Randomised value functions (RVF) can be viewed as a promising approach to scaling PSRL. However, we show that most contemporary algorithms combining RVF with neural network function approximation do not possess the properties which make PSRL effective, and provably fail in sparse reward problems. Moreover, we find that propagation of uncertainty, a property of PSRL previously thought important for exploration, does not preclude this failure. We use these insights to design Successor Uncertainties (SU), a cheap and easy-to-implement RVF algorithm that retains key properties of PSRL. SU is highly effective on hard tabular exploration benchmarks. Furthermore, on the Atari 2600 domain, it surpasses human performance on 38 of 49 games tested (achieving a median human-normalised score of 2.09), and outperforms its closest RVF competitor, Bootstrapped DQN, on 36 of those.
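As a point of reference for the RVF idea the abstract builds on, the following is a minimal, hypothetical sketch of the generic randomised-value-function mechanism: sample one value function per episode from a Bayesian linear Q-model and act greedily under that sample. It is not the Successor Uncertainties algorithm itself, and every name in it is an illustrative assumption.

import numpy as np

# Generic randomised value function (RVF) agent with a Bayesian linear Q-model:
# draw one weight sample per episode, act greedily under the sampled Q.
# Illustrative only; not the Successor Uncertainties implementation.

class LinearRVFAgent:
    def __init__(self, n_features, prior_var=1.0, noise_var=0.01):
        self.precision = np.eye(n_features) / prior_var    # posterior precision
        self.b = np.zeros(n_features)                       # precision-weighted mean
        self.noise_var = noise_var
        self.w_sample = np.zeros(n_features)

    def start_episode(self):
        # One posterior sample of the Q-weights, held fixed for the episode.
        cov = np.linalg.inv(self.precision)
        mean = cov @ self.b
        self.w_sample = np.random.multivariate_normal(mean, cov)

    def act(self, phi_actions):
        # phi_actions: [n_actions, n_features] feature vector of each action.
        return int(np.argmax(phi_actions @ self.w_sample))

    def update(self, phi, q_target):
        # Bayesian linear regression update towards a bootstrapped Q target.
        self.precision += np.outer(phi, phi) / self.noise_var
        self.b += phi * q_target / self.noise_var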


Conservative Dual Policy Optimization for Efficient Model-Based Reinforcement Learning

Neural Information Processing Systems

Provably efficient Model-Based Reinforcement Learning (MBRL) based on optimism or posterior sampling (PSRL) is guaranteed to attain global optimality asymptotically by introducing the complexity measure of the model. However, the complexity might grow exponentially for the simplest nonlinear models, in which case global convergence is impossible within finite iterations. When the model suffers a large generalization error, which is quantitatively measured by the model complexity, the uncertainty can be large. The sampled model that the current policy is greedily optimized upon will thus be unsettled, resulting in aggressive policy updates and over-exploration. In this work, we propose Conservative Dual Policy Optimization (CDPO), which involves a Referential Update and a Conservative Update. The policy is first optimized under a reference model, which imitates the mechanism of PSRL while offering more stability. A conservative range of randomness is guaranteed by maximizing the expectation of model value. Without harmful sampling procedures, CDPO can still achieve the same regret as PSRL. More importantly, CDPO enjoys monotonic policy improvement and global optimality simultaneously.
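As a rough illustration of the two updates named in the abstract, the toy sketch below performs a referential update against a reference model (here simply the posterior mean) and then a conservative update that keeps whichever candidate policy has the highest value in expectation over sampled models. The Gaussian posterior, candidate set, and quadratic value surrogate are hypothetical stand-ins, not the paper's implementation.

import numpy as np

# Toy sketch of the dual update described in the abstract. The Gaussian model
# posterior, candidate-policy set, and quadratic value surrogate are all
# hypothetical stand-ins used only to show the structure of one iteration.

rng = np.random.default_rng(0)

def policy_value(policy, model):
    """Toy surrogate for the value of `policy` under `model` parameters."""
    return -np.sum((policy - model) ** 2)

def cdpo_iteration(posterior_mean, posterior_std, policy, step=0.5):
    # 1) Referential Update: improve the policy against a single reference
    #    model (the posterior mean), mimicking PSRL but with more stability.
    intermediate = policy + step * (posterior_mean - policy)

    # 2) Conservative Update: among candidate policies, keep the one with the
    #    highest value in EXPECTATION over the model posterior, rather than
    #    under one greedily sampled model, which limits aggressive updates.
    models = posterior_mean + posterior_std * rng.standard_normal((16, posterior_mean.size))
    candidates = [policy, 0.5 * (policy + intermediate), intermediate]
    scores = [np.mean([policy_value(c, m) for m in models]) for c in candidates]
    return candidates[int(np.argmax(scores))]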


A Proofs

Neural Information Processing Systems

We lay out the proof in two major steps. From the Performance Difference Lemma B.2, we obtain J(q…). Combining with (A.4) gives us the iterative improvement bound as follows: J(π…). From the Simulation Lemma B.1, we have the bound of … We can further know from the construction of the confidence set (cf. …). Similar to the proof in A.2, we obtain from the Simulation Lemma B.1 that E[…V…]. The claim is thus established. This lemma is widely adopted in RL; its proof can be found in various previous works, e.g. …
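Because the snippet truncates the equations, it may help to recall the standard forms of the two lemmas it invokes; the statements in the paper's Appendix B may differ in notation and constants.

% Performance Difference Lemma: value gap between two policies in a
% discounted MDP, expressed via the advantage under the comparator's occupancy.
\[
  J(\pi') - J(\pi) \;=\; \frac{1}{1-\gamma}\,
  \mathbb{E}_{s \sim d^{\pi'},\, a \sim \pi'(\cdot \mid s)}\big[ A^{\pi}(s,a) \big].
\]
% Simulation Lemma (equal, bounded rewards): value gap induced by using an
% estimated transition model \hat f in place of the true model f.
\[
  \big| V^{\pi}_{f}(s) - V^{\pi}_{\hat f}(s) \big| \;\le\;
  \frac{\gamma R_{\max}}{(1-\gamma)^{2}}\,
  \max_{s,a}\big\| f(\cdot \mid s,a) - \hat f(\cdot \mid s,a) \big\|_{1}.
\]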