Trusted Approximate Policy Iteration with Bisimulation Metrics

Feb-6-2022, arXiv.org Artificial Intelligence

Bisimulation metrics define a distance measure between states of a Markov decision process (MDP) based on a comparison of reward sequences. Because of this property, they provide theoretical guarantees for value function approximation. In this work, we first prove that bisimulation metrics can be defined via any $p$-Wasserstein metric for $p\geq 1$. We then describe an approximate policy iteration (API) procedure that uses $\epsilon$-aggregation with $\pi$-bisimulation and prove performance bounds for continuous state spaces. We also bound the difference between the $\pi$-bisimulation metrics of two policies in terms of the change in the policies themselves. Based on these theoretical results, we design an API($\alpha$) procedure that employs conservative policy updates and enjoys better performance bounds than the naive API approach. In addition, we propose a novel trust region approach that circumvents the requirement to explicitly solve a constrained optimization problem. Finally, we provide experimental evidence of improved stability compared to non-conservative alternatives in simulated continuous control.
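To make the notion of a $\pi$-bisimulation metric more concrete, the following is a minimal sketch rather than the paper's algorithm. It assumes a small finite MDP whose transitions are deterministic under a fixed policy; in that special case the Wasserstein distance between two next-state distributions (for any $p\geq 1$) reduces to the distance between the two next states, so the metric can be computed by iterating $d(s,t) = |r(s)-r(t)| + \gamma\, d(s', t')$. The function name `pi_bisimulation_metric`, its parameters, and the toy MDP are all hypothetical and chosen only for illustration.

```python
import numpy as np

def pi_bisimulation_metric(rewards, next_state, gamma=0.9, tol=1e-10, max_iters=10_000):
    """Fixed-point iteration for a pi-bisimulation metric.

    Assumption (not from the source): the policy-induced transitions are
    deterministic, so next_state[s] is the single successor of state s and
    the Wasserstein term collapses to d[next_state[s], next_state[t]].
    """
    r = np.asarray(rewards, dtype=float)
    nxt = np.asarray(next_state, dtype=int)
    # |r(s) - r(t)| for every pair of states
    reward_gap = np.abs(r[:, None] - r[None, :])
    d = np.zeros_like(reward_gap)
    for _ in range(max_iters):
        # d_new[s, t] = |r(s) - r(t)| + gamma * d[next(s), next(t)]
        d_new = reward_gap + gamma * d[np.ix_(nxt, nxt)]
        if np.max(np.abs(d_new - d)) < tol:
            return d_new
        d = d_new
    return d

# Toy 3-state chain (hypothetical): every state loops back to itself.
rewards = [1.0, 1.0, 0.0]
next_state = [0, 1, 2]
d = pi_bisimulation_metric(rewards, next_state, gamma=0.5)
```

In this toy example, states 0 and 1 receive identical reward sequences, so their distance is zero, while the distance between states 0 and 2 matches the gap between their discounted returns. This illustrates why aggregating states that are close under such a metric keeps the value function approximation error small.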

algorithm, bisimulation metric, kemerta & aumentado-armstrong, (12 more...)

Country:
- North America
  - United States
    - Virginia > Arlington County
      - Arlington (0.04)
    - Massachusetts > Middlesex County
      - Belmont (0.04)
    - California
      - San Francisco County > San Francisco (0.14)
      - Alameda County > Berkeley (0.04)
  - Canada > Ontario
    - Toronto (0.14)
- Europe > France
  - Hauts-de-France > Nord > Lille (0.04)
- Asia > Middle East
  - Jordan (0.04)

Genre:
- Research Report (0.82)

Technology:
- Information Technology > Artificial Intelligence
  - Representation & Reasoning > Optimization (0.86)
  - Machine Learning > Learning Graphical Models
    - Undirected Networks > Markov Models (0.34)