Slowly Changing Adversarial Bandit Algorithms are Efficient for Discounted MDPs
Kash, Ian A., Reyzin, Lev, Yu, Zishun
–arXiv.org Artificial Intelligence
Reinforcement learning generalizes bandit problems with additional difficulties on longer planning horizon and unknown transition kernel. We show that, under some mild assumptions, *any* slowly changing adversarial bandit algorithm enjoys optimal regret in adversarial bandits can achieve optimal (in the dependency of $T$) expected regret in infinite-horizon discounted MDPs, without the presence of Bellman backups. The slowly changing property required by our generalization is mild, which is also marked by the online Markov decision process literature. We also examine the applicability of our reduction to a well-known adversarial bandit algorithm, EXP3.
arXiv.org Artificial Intelligence
Feb-9-2023
- Country:
- North America > United States
- Illinois > Cook County > Chicago (0.04)
- Europe > United Kingdom
- England > Greater London > London (0.04)
- Asia > Middle East
- Jordan (0.04)
- North America > United States
- Genre:
- Research Report (0.50)