Nearly Optimal Regret for Learning Adversarial MDPs with Linear Function Approximation

Feb-17-2021–arXiv.org Machine Learning

We study the reinforcement learning for finite-horizon episodic Markov decision processes with adversarial reward and full information feedback, where the unknown transition probability function is a linear function of a given feature mapping. We propose an optimistic policy optimization algorithm with Bernstein bonus and show that it can achieve $\tilde{O}(dH\sqrt{T})$ regret, where $H$ is the length of the episode, $T$ is the number of interaction with the MDP and $d$ is the dimension of the feature mapping. Furthermore, we also prove a matching lower bound of $\tilde{\Omega}(dH\sqrt{T})$ up to logarithmic factors. To the best of our knowledge, this is the first computationally efficient, nearly minimax optimal algorithm for adversarial Markov decision processes with linear function approximation.

algorithm, inequality hold, lemma 6, (13 more...)

arXiv.org Machine Learning

Feb-17-2021

arXiv.org PDF

Add feedback

Country:
- North America > United States
  - California > Los Angeles County > Los Angeles (0.28)
- Europe > United Kingdom
  - England > Cambridgeshire > Cambridge (0.04)
- Asia > Middle East
  - Jordan (0.04)

Genre:
- Research Report (0.50)

Technology:
- Information Technology > Artificial Intelligence
  - Representation & Reasoning
    - Uncertainty > Fuzzy Logic (0.61)
    - Optimization (0.48)
  - Machine Learning
    - Reinforcement Learning (0.67)
    - Learning Graphical Models > Undirected Networks
      - Markov Models (0.54)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found