Regret Bounds and Reinforcement Learning Exploration of EXP-based Algorithms
arXiv.org Artificial Intelligence
EXP-based algorithms are often used for exploration in multi-armed bandit problems. We revisit the EXP3.P algorithm and establish both lower and upper regret bounds in the Gaussian multi-armed bandit setting, as well as for a more general class of reward distributions. Unlike classical regret analyses, ours do not require bounded rewards. We also extend EXP4 from the multi-armed bandit setting to reinforcement learning to incentivize exploration by multiple agents. The resulting algorithm has been tested on hard-to-explore games and shows improved exploration compared to the state of the art.
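For readers unfamiliar with EXP3.P, the following is a minimal sketch of the commonly cited textbook form of the algorithm, run on Gaussian arms with unbounded rewards. It is not taken from the paper: the function name `exp3p`, the parameter values for `eta`, `gamma`, and `beta`, and the reward noise `sigma` are illustrative assumptions, and the paper's own parameter choices and analysis may differ.

```python
import math
import random


def exp3p(means, horizon, eta=0.05, gamma=0.05, beta=0.01, sigma=1.0, seed=0):
    """Run a textbook-style EXP3.P on Gaussian arms.

    All default parameter values are illustrative assumptions, not tuned
    choices from the paper. Returns (pull counts per arm, total reward).
    """
    rng = random.Random(seed)
    k = len(means)
    gain_est = [0.0] * k  # cumulative biased gain estimates, one per arm
    pulls = [0] * k
    total_reward = 0.0
    for _ in range(horizon):
        # Exponential weights, shifted by the max for numerical stability.
        top = max(gain_est)
        weights = [math.exp(eta * (g - top)) for g in gain_est]
        norm = sum(weights)
        # Mix with the uniform distribution so each arm keeps
        # probability at least gamma / k.
        probs = [(1.0 - gamma) * w / norm + gamma / k for w in weights]
        arm = rng.choices(range(k), weights=probs)[0]
        reward = rng.gauss(means[arm], sigma)  # unbounded Gaussian reward
        pulls[arm] += 1
        total_reward += reward
        # Importance-weighted estimate plus the beta / p bias term that
        # distinguishes EXP3.P from plain EXP3.
        for i in range(k):
            observed = reward if i == arm else 0.0
            gain_est[i] += (observed + beta) / probs[i]
    return pulls, total_reward
```

Calling `exp3p([0.0, 1.0], horizon=3000, sigma=0.5)` returns the per-arm pull counts and the accumulated reward; comparing the pull counts against the arm means gives a quick sanity check that the sampling distribution concentrates on the better arm.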
Sep-20-2020
- Country:
  - North America > United States
    - Pennsylvania > Allegheny County
      - Pittsburgh (0.04)
    - Illinois > Cook County
      - Evanston (0.04)
    - California > Santa Clara County
      - Palo Alto (0.04)
- Genre:
  - Research Report (0.82)
- Technology: