Regret Analysis of a Markov Policy Gradient Algorithm for Multi-arm Bandits
