Regret Analysis of a Markov Policy Gradient Algorithm for Multi-arm Bandits

Aug-5-2020–arXiv.org Machine Learning

We consider a policy gradient algorithm applied to a finite-arm bandit problem with Bernoulli rewards. We allow learning rates to depend on the current state of the algorithm, rather than using a deterministic time-decreasing learning rate. The state of the algorithm forms a Markov chain on the probability simplex. We apply Foster-Lyapunov techniques to analyse the stability of this Markov chain. We prove that if learning rates are well chosen then the policy gradient algorithm is a transient Markov chain and the state of the chain converges to the optimal arm with logarithmic or poly-logarithmic regret.
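
The setting in the abstract can be illustrated with a minimal sketch. The abstract does not state the update rule or the learning-rate schedule, so everything specific here is an assumption rather than the paper's method: the reward-inaction style update, the hypothetical `learning_rate` rule (which shrinks the step as the policy concentrates on one arm), and the names `run`, `sample_arm` and `alpha`. The sketch only shows the structure described: the state is a probability vector on the simplex, the step size is a function of that state rather than of time, and the next state depends only on the current one, so the sequence of states is a Markov chain.

```python
import random


def sample_arm(p, rng):
    """Draw an arm index from the probability vector p."""
    u = rng.random()
    cumulative = 0.0
    for i, pi in enumerate(p):
        cumulative += pi
        if u < cumulative:
            return i
    return len(p) - 1  # guard against floating-point round-off


def learning_rate(p, alpha):
    """Hypothetical state-dependent step size (an assumption, not the paper's rule)."""
    return alpha * (1.0 - max(p))


def run(means, steps, alpha=0.5, seed=0):
    """Simulate the sketch on a Bernoulli bandit; return final state and pseudo-regret."""
    rng = random.Random(seed)
    k = len(means)
    p = [1.0 / k] * k  # start at the centre of the simplex
    best = max(means)
    regret = 0.0
    for _ in range(steps):
        arm = sample_arm(p, rng)
        reward = 1 if rng.random() < means[arm] else 0  # Bernoulli reward
        regret += best - means[arm]  # gap in mean reward for the arm played
        # Move towards the played arm only when it pays out. The new state is
        # a convex combination of p and a vertex, so it stays on the simplex.
        rate = learning_rate(p, alpha) * reward
        p = [(1.0 - rate) * pi for pi in p]
        p[arm] += rate
    return p, regret
```

Calling `run([0.2, 0.8], 1000)` returns the final probability vector and the accumulated pseudo-regret for a two-armed instance. The sketch makes no claim about the regret rate of this particular step-size rule; the logarithmic and poly-logarithmic bounds in the abstract apply to the learning rates analysed in the paper.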

artificial intelligence, data mining, machine learning

Country:
- North America > United States
  - California > Alameda County > Hayward (0.04)
- Europe > United Kingdom
  - England > Cambridgeshire > Cambridge (0.04)

Genre:
- Research Report (0.50)

Technology:
- Information Technology
  - Data Science > Data Mining
    - Big Data (0.68)
  - Artificial Intelligence > Machine Learning
    - Learning Graphical Models > Undirected Networks > Markov Models (0.90)
