Safe Policy Improvement by Minimizing Robust Baseline Regret

Mohammad Ghavamzadeh, Marek Petrik, Yinlam Chow

May-1-2026, 05:55:55 GMT–Neural Information Processing Systems

An important problem in sequential decision-making under uncertainty is to use limited data to compute a safe policy, that is, one guaranteed to outperform a given baseline strategy. In this paper, we develop and analyze a new model-based approach that computes a safe policy, given an inaccurate model of the system's dynamics and guarantees on the accuracy of this model. The new robust method uses this model to directly minimize the (negative) regret with respect to the baseline policy. Unlike existing approaches, minimizing the regret allows one to improve the baseline policy in states with accurate dynamics and to seamlessly fall back to the baseline policy otherwise. We show that our formulation is NP-hard and propose a simple approximate algorithm. Our empirical results on several domains further show that even the simple approximate algorithm can outperform standard approaches.
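The idea of minimizing robust regret against a baseline can be illustrated on a toy problem. The sketch below is not the paper's algorithm: it assumes the model uncertainty is a small finite set of candidate transition models, considers deterministic policies only, and enumerates them by brute force, which is feasible only for tiny problems. All names (`policy_return`, `robust_regret_policy`, `P_a`, `P_b`) are hypothetical and chosen for illustration.

```python
import itertools
import numpy as np

def policy_return(P, r, policy, gamma, p0):
    """Discounted return of a deterministic policy under one transition model.
    P: (S, A, S) transition probabilities, r: (S, A) rewards,
    policy: one action per state, p0: initial state distribution."""
    S = r.shape[0]
    idx = np.arange(S)
    P_pi = P[idx, list(policy)]   # (S, S) transitions under the policy
    r_pi = r[idx, list(policy)]   # (S,) rewards under the policy
    v = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
    return float(p0 @ v)

def robust_regret_policy(models, r, baseline, gamma, p0):
    """Pick the policy maximizing the worst-case improvement over the
    baseline across all candidate models (assumed finite here)."""
    S, A = r.shape
    # The baseline itself has improvement 0 in every model, so the
    # result is never worse than the baseline in the worst case.
    best_policy, best_gain = tuple(baseline), 0.0
    for policy in itertools.product(range(A), repeat=S):
        gain = min(
            policy_return(P, r, policy, gamma, p0)
            - policy_return(P, r, baseline, gamma, p0)
            for P in models
        )
        if gain > best_gain:
            best_policy, best_gain = policy, gain
    return best_policy, best_gain

# Toy example: 2 states, 2 actions. State 0 is absorbing with known dynamics.
r = np.array([[0.0, 1.0],    # state 0: action 1 is clearly better
              [0.5, 0.0]])   # state 1: baseline action 0 earns 0.5
P_a = np.zeros((2, 2, 2))
P_a[0, :, 0] = 1.0           # state 0 always stays in state 0
P_a[1, 0, 1] = 1.0           # state 1, action 0 stays in state 1
P_a[1, 1, 0] = 1.0           # model A: action 1 moves to state 0
P_b = P_a.copy()
P_b[1, 1] = [0.0, 1.0]       # model B: action 1 stays in state 1

baseline = (0, 0)
p0 = np.array([0.5, 0.5])
policy, gain = robust_regret_policy([P_a, P_b], r, baseline, 0.9, p0)
print(policy, gain)
```

In this toy setup the dynamics of state 0 agree across both candidate models, while the effect of action 1 in state 1 does not. The selected policy changes the action only in state 0 and keeps the baseline action in state 1, which mirrors the fall-back behaviour the abstract describes.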

artificial intelligence, machine learning, reinforcement learning

Industry:
- Energy (0.68)

Technology:
- Information Technology > Artificial Intelligence
  - Representation & Reasoning > Optimization (0.49)
  - Machine Learning > Reinforcement Learning (0.34)
