Safe Policy Improvement Approaches on Discrete Markov Decision Processes

Scholl, Philipp, Dietrich, Felix, Otte, Clemens, Udluft, Steffen

Jan-28-2022–arXiv.org Artificial Intelligence

Safe Policy Improvement (SPI) aims at provable guarantees that a learned policy is at least approximately as good as a given baseline policy. Building on SPI with Soft Baseline Bootstrapping (Soft-SPIBB) by Nadjahi et al., we identify theoretical issues in their approach, provide a corrected theory, and derive a new algorithm that is provably safe on finite Markov Decision Processes (MDPs). Additionally, we provide a heuristic algorithm that exhibits the best performance among many state-of-the-art SPI algorithms on two different benchmarks. Furthermore, we introduce a taxonomy of SPI algorithms and empirically show an interesting property of two classes of SPI algorithms: while the mean performance of algorithms that incorporate the uncertainty as a penalty on the action-value is higher, actively restricting the set of policies more consistently produces good policies and is, thus, safer.
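The two classes of SPI algorithms contrasted in the abstract can be illustrated with a toy, single-state sketch. This is not one of the paper's algorithms: the function names, the count-based penalty `kappa / sqrt(n)`, and the threshold `n_min` are assumptions chosen for illustration only. The first function subtracts an uncertainty penalty from the estimated action-values; the second restricts the policy so that it copies the baseline on rarely observed actions and only reallocates the remaining probability mass.

```python
import numpy as np

def penalized_greedy(q, counts, kappa=1.0):
    """Uncertainty as a penalty: act greedily on Q minus a count-based penalty.

    The penalty form kappa / sqrt(n) is an illustrative assumption.
    """
    penalty = kappa / np.sqrt(np.maximum(counts, 1))
    policy = np.zeros_like(q, dtype=float)
    policy[np.argmax(q - penalty)] = 1.0
    return policy

def restricted_greedy(q, counts, baseline, n_min=10):
    """Restricting the policy set: keep baseline probabilities on actions
    observed fewer than n_min times (hypothetical threshold)."""
    uncertain = counts < n_min
    # Copy the baseline on uncertain actions, start from zero elsewhere.
    policy = np.where(uncertain, baseline, 0.0)
    if (~uncertain).any():
        # Move the baseline mass of well-observed actions to the best of them.
        free_mass = baseline[~uncertain].sum()
        best = np.argmax(np.where(uncertain, -np.inf, q))
        policy[best] += free_mass
    return policy

# One state with three actions; action 1 looks best but was seen only twice.
q = np.array([1.0, 5.0, 2.0])
counts = np.array([50, 2, 40])
baseline = np.array([0.5, 0.2, 0.3])

pi_pen = penalized_greedy(q, counts, kappa=1.0)
pi_res = restricted_greedy(q, counts, baseline, n_min=10)
```

With these made-up numbers, the penalized policy still commits fully to the rarely observed action, because the penalty is too small to offset its high estimate, while the restricted policy leaves that action at its baseline probability. The toy setup only shows the mechanism; it says nothing about the empirical results reported in the paper.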

algorithm, safe policy improvement approach, state-action pair

Country:
- North America > United States
  - Massachusetts (0.04)
  - California > Los Angeles County
    - Santa Monica (0.04)
- Europe > Germany
  - Bavaria > Upper Bavaria > Munich (0.05)

Genre:
- Research Report (0.50)

Technology:
- Information Technology > Artificial Intelligence
  - Representation & Reasoning > Optimization (0.69)
  - Machine Learning
    - Reinforcement Learning (0.95)
    - Learning Graphical Models > Undirected Networks
      - Markov Models (0.86)