Learning Adversarial MDPs with Bandit Feedback and Unknown Transition

Dec-5-2019–arXiv.org Machine Learning

We consider the problem of learning in episodic finite-horizon Markov decision processes with unknown transition function, bandit feedback, and adversarial losses. We propose an efficient algorithm that achieves $\tilde{\mathcal{O}}(L|X|^2\sqrt{|A|T})$ regret with high probability, where $L$ is the horizon, $|X|$ is the number of states, $|A|$ is the number of actions, and $T$ is the number of episodes. To the best of our knowledge, our algorithm is the first one to ensure $\tilde{\mathcal{O}}(\sqrt{T})$ regret in this challenging setting. Our key technical contribution is to introduce an optimistic loss estimator that is inversely weighted by an "upper occupancy bound".
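To make the last sentence concrete, here is a minimal sketch of an inverse-weighted loss estimate. The abstract does not give the formula, so everything here is an assumption: the function name `optimistic_loss_estimate`, the array layout (one entry per state-action pair), and the optional smoothing term `gamma` are all hypothetical. The sketch assumes that `upper_occupancy[x, a]` is an upper bound on the probability of visiting pair `(x, a)` in the episode, and that under bandit feedback the loss is observed only for visited pairs.

```python
import numpy as np


def optimistic_loss_estimate(observed_loss, visited, upper_occupancy, gamma=0.0):
    """Sketch of an inverse-weighted loss estimate (hypothetical helper).

    observed_loss   : loss per (state, action); only visited entries are used
    visited         : boolean mask, True where the pair was visited this episode
    upper_occupancy : assumed upper bound on each pair's visit probability
    gamma           : assumed non-negative smoothing term (not from the abstract)
    """
    observed_loss = np.asarray(observed_loss, dtype=float)
    visited = np.asarray(visited, dtype=bool)
    upper_occupancy = np.asarray(upper_occupancy, dtype=float)

    # Unvisited pairs reveal no loss under bandit feedback, so they stay at zero.
    estimate = np.zeros_like(observed_loss)
    # Visited pairs are divided by the upper bound instead of the unknown
    # true visit probability.
    estimate[visited] = observed_loss[visited] / (upper_occupancy[visited] + gamma)
    return estimate


# Toy episode with 2 states and 2 actions (all numbers are made up).
losses = np.array([[0.5, 0.0], [0.0, 1.0]])
mask = np.array([[True, False], [False, True]])
upper = np.array([[0.5, 0.5], [0.25, 0.8]])
estimate = optimistic_loss_estimate(losses, mask, upper)
```

Because the divisor is at least as large as the true visit probability, the expected value of each entry is no larger than the true loss, which is one way to read "optimistic" here.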
