Efficient Exploration for Constrained MDPs

Taleghan, Majid Alkaee (Oregon State University) | Dietterich, Thomas G. (Oregon State University)

AAAI Conferences 

Given a Markov Decision Process (MDP) defined by a simulator, a designated starting state $s_0$, and a downside risk constraint defined as the probability of reaching catastrophic states, our goal is to find a stationary deterministic policy $\pi$ that with probability $1-\delta$ achieves a value $V^\pi(s_0)$ that is within $\epsilon$ of the value of the optimal stationary deterministic $\nu$-feasible policy, $V^*(s_0)$, while economizing on the number of calls to the simulator. This paper presents the first {\bf PAC-Safe-RL} algorithm for this purpose. The algorithm extends PAC-RL algorithms for efficient exploration while providing guarantees that the downside constraint is satisfied. Experiments comparing our {\sc ConstrainedDDV} algorithm to baselines show substantial reductions in the number of simulator calls required to find a feasible policy.