Conservative Dual Policy Optimization for Efficient Model-Based Reinforcement Learning

Jan-18-2025, 07:38:41 GMT–Neural Information Processing Systems

Provably efficient Model-Based Reinforcement Learning (MBRL) based on optimism or posterior sampling (PSRL) is guaranteed to attain global optimality asymptotically by introducing a complexity measure of the model. However, this complexity might grow exponentially even for the simplest nonlinear models, in which case global convergence is impossible within a finite number of iterations. When the model suffers a large generalization error, which is quantitatively measured by the model complexity, the uncertainty can be large. The sampled model on which the current policy is greedily optimized will thus be unsettled, resulting in aggressive policy updates and over-exploration. In this work, we propose Conservative Dual Policy Optimization (CDPO), which involves a Referential Update and a Conservative Update.

conservative dual policy optimization, efficient model-based reinforcement learning, model-based reinforcement learning

Conferences Web Page

Technology:
- Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.65)