Learning Policies with Zero or Bounded Constraint Violation for Constrained MDPs
Neural Information Processing Systems
We address the issue of safety in reinforcement learning. We pose the problem in the episodic framework of a constrained Markov decision process. Existing results show that it is possible to achieve a reward regret of $\tilde{\mathcal{O}}(\sqrt{K})$ while allowing an $\tilde{\mathcal{O}}(\sqrt{K})$ constraint violation over $K$ episodes. A critical question is whether the constraint violation can be kept even smaller. We show that when a strictly safe policy is known, one can confine the system to zero constraint violation with arbitrarily high probability while keeping the reward regret of order $\tilde{\mathcal{O}}(\sqrt{K})$.
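The abstract does not spell out the mechanism, but a common way to exploit a known strictly safe policy in this setting is to play a mixture of an optimistic exploratory policy and the safe baseline, with the mixing weight chosen so that a pessimistic (upper-confidence) estimate of the mixture's cost stays below the constraint threshold. The following is a minimal sketch of that mixing step; all names (`c_hat`, `bonus`, `c_safe`, `tau`) are hypothetical and not taken from the paper.

```python
# Hypothetical sketch (illustrative names, not from the paper): mix an
# optimistic policy with a known strictly safe baseline so that a
# pessimistic estimate of the mixture's expected cost stays below the
# constraint threshold.

def safe_mixing_weight(c_hat: float, bonus: float,
                       c_safe: float, tau: float) -> float:
    """Largest weight on the optimistic policy keeping the mixture safe.

    c_hat  : empirical cost estimate of the optimistic policy
    bonus  : confidence-interval width added for pessimism
    c_safe : known expected cost of the strictly safe policy (c_safe < tau)
    tau    : constraint threshold
    """
    pessimistic_cost = c_hat + bonus
    if pessimistic_cost <= tau:
        # Even the pessimistic estimate satisfies the constraint:
        # no mixing with the safe baseline is needed.
        return 1.0
    # Solve alpha * pessimistic_cost + (1 - alpha) * c_safe = tau for alpha.
    return (tau - c_safe) / (pessimistic_cost - c_safe)

# As cost estimates tighten over episodes (bonus shrinks), the weight on
# the optimistic policy grows toward 1 while the pessimistic cost bound
# of the mixture never exceeds tau.
print(safe_mixing_weight(c_hat=0.9, bonus=0.3, c_safe=0.2, tau=0.5))  # 0.3
```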