A nearly Blackwell-optimal policy gradient method

Jun-3-2021, arXiv.org Artificial Intelligence

For continuing environments, reinforcement learning methods commonly maximize a discounted reward criterion with a discount factor close to 1 in order to approximate the steady-state reward (the gain). However, the gain only captures long-run performance and ignores transient behaviour. In this work, we develop a policy gradient method that optimizes the gain first and then the bias, which indicates the transient performance and is important for choosing among policies with equal gain. We derive expressions that enable sampling for the gradient of the bias and for its preconditioning Fisher matrix. We further propose an algorithm that solves the corresponding bi-level optimization using a logarithmic barrier. Experimental results provide insights into the fundamental mechanisms of our proposal.
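To make the distinction between gain and bias concrete, the following is a minimal sketch, not the paper's algorithm. It assumes a hypothetical two-state Markov reward process in which state 1 is absorbing and pays reward 1 per step, and two hypothetical policies that differ only in the reward collected in the transient state 0. The helper name `gain_and_bias`, the transition matrix, and the rewards are all illustrative assumptions. Gain and bias are approximated by truncated sums, which is valid here because the chain is unichain and aperiodic.

```python
def step(dist, P):
    """Push a state distribution one step forward through transition matrix P."""
    n = len(dist)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]


def gain_and_bias(P, r, horizon=200):
    """Approximate the gain and per-state bias of a Markov reward process.

    P is a row-stochastic matrix (list of lists), r the per-state reward.
    Assumes a unichain, aperiodic process so the truncated sums converge.
    """
    n = len(r)

    # Gain: expected reward under the (approximate) stationary distribution.
    dist = [1.0 / n] * n
    for _ in range(horizon):
        dist = step(dist, P)
    gain = sum(d * x for d, x in zip(dist, r))

    # Bias of state s: accumulated expected reward in excess of the gain,
    # starting from s.
    bias = []
    for s in range(n):
        d = [1.0 if i == s else 0.0 for i in range(n)]
        total = 0.0
        for _ in range(horizon):
            total += sum(di * ri for di, ri in zip(d, r)) - gain
            d = step(d, P)
        bias.append(total)
    return gain, bias


# Hypothetical example: state 0 is transient, state 1 is absorbing.
P = [[0.5, 0.5],
     [0.0, 1.0]]
r_a = [0.0, 1.0]  # policy A earns nothing while in the transient state
r_b = [3.0, 1.0]  # policy B earns 3 per step while in the transient state

gain_a, bias_a = gain_and_bias(P, r_a)
gain_b, bias_b = gain_and_bias(P, r_b)
print(gain_a, bias_a)
print(gain_b, bias_b)
```

Both policies have a gain of 1, so a gain-only criterion cannot separate them. Their biases at state 0 are approximately -2 and 4, so optimizing the bias second prefers policy B.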

gradient, optimality, optimization

Country:
- Oceania > Australia
  - Queensland (0.04)
- North America > United States
  - New York > New York County
    - New York City (0.04)
  - Indiana > Hamilton County
    - Fishers (0.04)
- Europe
  - Netherlands (0.04)
  - United Kingdom > England
    - Cambridgeshire > Cambridge (0.04)

Genre:
- Research Report (0.40)

Technology:
- Information Technology > Artificial Intelligence
  - Representation & Reasoning > Optimization (0.93)
  - Machine Learning
    - Reinforcement Learning (0.86)
    - Learning Graphical Models > Undirected Networks
      - Markov Models (0.46)
