diameter
Optimistic posterior sampling for reinforcement learning: worst-case regret bounds
We present an algorithm based on posterior sampling (aka Thompson sampling) that achieves near-optimal worst-case regret bounds when the underlying Markov Decision Process (MDP) is communicating with a finite, though unknown, diameter. Our main result is a high probability regret upper bound of $\tilde{O}(D\sqrt{SAT})$ for any communicating MDP with $S$ states, $A$ actions and diameter $D$, when $T\ge S^5A$. Here, regret compares the total reward achieved by the algorithm to the total expected reward of an optimal infinite-horizon undiscounted average reward policy, in time horizon $T$. This result improves over the best previously known upper bound of $\tilde{O}(DS\sqrt{AT})$ achieved by any algorithm in this setting, and matches the dependence on $S$ in the established lower bound of $\Omega(\sqrt{DSAT})$ for this problem. Our techniques involve proving some novel results about the anti-concentration of Dirichlet distribution, which may be of independent interest.
- Asia > Singapore (0.04)
- North America > Canada (0.04)
- North America > United States (0.04)
- North America > Canada (0.04)
- Europe > France > Hauts-de-France > Pas-de-Calais (0.04)
- Europe > Austria > Styria > Leoben (0.04)
NearOptimalExploration-Exploitationin Non-CommunicatingMarkovDecisionProcesses
Reinforcement learning (RL) [1] studies the problem of learning in sequential decision-making problems where the dynamics of the environment is unknown, but can be learnt by performing actions andobserving their outcome inanonline fashion. Asample-efficient RLagent must trade off the explorationneeded to collect information about the environment, and theexploitation of the experience gathered so far to gain as much reward as possible.
- North America > United States > Virginia > Arlington County > Arlington (0.04)
- North America > United States > Georgia > Fulton County > Atlanta (0.04)
- North America > Canada > Quebec > Montreal (0.04)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- North America > Canada (0.04)
- Africa > Middle East > Egypt > Cairo Governorate > Cairo (0.04)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (1.00)
- Information Technology > Biomedical Informatics > Translational Bioinformatics (0.68)
- North America > United States > California (0.14)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- North America > Canada (0.04)
- Banking & Finance > Trading (1.00)
- Information Technology > Services > e-Commerce Services (0.47)
- Information Technology > e-Commerce > Financial Technology (1.00)
- Information Technology > Artificial Intelligence (1.00)
- Information Technology > Data Science > Data Mining (0.96)
- Asia > Singapore (0.04)
- North America > United States (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- (2 more...)
- Information Technology > Services > e-Commerce Services (1.00)
- Information Technology > Security & Privacy (1.00)
- Banking & Finance > Trading (1.00)
- Information Technology > e-Commerce > Financial Technology (1.00)
- Information Technology > Data Science > Data Mining (1.00)
- Information Technology > Communications > Networks (1.00)
- (6 more...)
- North America > United States > New Jersey (0.04)
- North America > United States > Massachusetts (0.04)
- North America > Canada > Quebec > Montreal (0.04)