From Dirichlet to Rubin: Optimistic Exploration in RL without Bonuses

Tiapkin, Daniil, Belomestny, Denis, Moulines, Eric, Naumov, Alexey, Samsonov, Sergey, Tang, Yunhao, Valko, Michal, Menard, Pierre

Jun-22-2022–arXiv.org Machine Learning

We propose the Bayes-UCBVI algorithm for reinforcement learning in tabular, stage-dependent, episodic Markov decision process: a natural extension of the Bayes-UCB algorithm by Kaufmann et al. (2012) for multi-armed bandits. Our method uses the quantile of a Q-value function posterior as upper confidence bound on the optimal Q-value function. For Bayes-UCBVI, we prove a regret bound of order $\widetilde{O}(\sqrt{H^3SAT})$ where $H$ is the length of one episode, $S$ is the number of states, $A$ the number of actions, $T$ the number of episodes, that matches the lower-bound of $\Omega(\sqrt{H^3SAT})$ up to poly-$\log$ terms in $H,S,A,T$ for a large enough $T$. To the best of our knowledge, this is the first algorithm that obtains an optimal dependence on the horizon $H$ (and $S$) without the need for an involved Bernstein-like bonus or noise. Crucial to our analysis is a new fine-grained anti-concentration bound for a weighted Dirichlet sum that can be of independent interest. We then explain how Bayes-UCBVI can be easily extended beyond the tabular setting, exhibiting a strong link between our algorithm and Bayesian bootstrap (Rubin, 1981).

artificial intelligence, machine learning, reinforcement learning, (17 more...)

arXiv.org Machine Learning

Jun-22-2022

arXiv.org PDF

Add feedback

Country:
- North America
  - United States
    - Maryland > Baltimore (0.04)
    - New York > New York County
      - New York City (0.04)
    - Massachusetts > Norfolk County
      - Wellesley (0.04)
    - Louisiana > Orleans Parish
      - New Orleans (0.04)
    - California > San Diego County
      - San Diego (0.04)
  - Canada > British Columbia
    - Metro Vancouver Regional District > Vancouver (0.04)
- Europe
  - Germany > Saxony-Anhalt (0.04)
  - Spain > Canary Islands (0.04)
  - Russia > Central Federal District
    - Moscow Oblast > Moscow (0.04)
  - Hungary > Budapest
    - Budapest (0.04)
- Asia > Middle East
  - Jordan (0.04)

Genre:
- Research Report (0.63)

Industry:
- Leisure & Entertainment (0.67)

Technology:
- Information Technology
  - Data Science (1.00)
  - Artificial Intelligence
    - Representation & Reasoning (1.00)
    - Machine Learning
      - Reinforcement Learning (1.00)
      - Neural Networks (0.92)
      - Learning Graphical Models > Undirected Networks
        Markov Models (0.48)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found