A Short Note on Soft-max and Policy Gradients in Bandits Problems

Jul-20-2020–arXiv.org Machine Learning

This is a short communication on a Lyapunov function argument for soft-max in bandit problems. A number of excellent recent papers use differential equations to analyze policy gradient algorithms in reinforcement learning \cite{agarwal2019optimality,bhandari2019global,mei2020global}. We give a short argument that yields a regret bound for the soft-max ordinary differential equation in bandit problems. We derive a similar result for a different policy gradient algorithm, again for bandit problems. For this second algorithm, regret bounds can also be proved in the stochastic case \cite{DW20}. We conclude by summarizing some ideas and open issues in deriving stochastic regret bounds for policy gradients.
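To make the object of study concrete, the following is a minimal numerical sketch of a soft-max policy gradient flow for a bandit with known mean rewards. It is not taken from the paper: it assumes the standard soft-max parameterization p_a proportional to exp(theta_a), for which the gradient of the expected reward is p_a (r_a - sum_b p_b r_b), and it integrates the resulting ODE with a plain Euler scheme. The function name, reward values, horizon and step size are all hypothetical choices made for illustration.

```python
import numpy as np


def softmax(theta):
    """Numerically stable soft-max of a parameter vector."""
    z = np.exp(theta - np.max(theta))
    return z / z.sum()


def softmax_gradient_flow(rewards, t_end=200.0, dt=0.05):
    """Euler-integrate d(theta_a)/dt = p_a * (r_a - sum_b p_b r_b).

    `rewards` are the (assumed known) mean rewards of each arm.
    Returns the final arm probabilities and the accumulated regret,
    i.e. the time integral of (best mean reward - expected reward).
    """
    rewards = np.asarray(rewards, dtype=float)
    theta = np.zeros_like(rewards)  # start from the uniform policy
    best = rewards.max()
    cumulative_regret = 0.0
    n_steps = int(round(t_end / dt))
    for _ in range(n_steps):
        p = softmax(theta)
        expected = p @ rewards
        cumulative_regret += (best - expected) * dt
        # Exact gradient of the expected reward under soft-max.
        theta += dt * p * (rewards - expected)
    return softmax(theta), cumulative_regret


# Hypothetical three-armed bandit; the last arm is the best one.
probs, regret = softmax_gradient_flow([0.2, 0.5, 0.8])
print("final probabilities:", probs)
print("cumulative regret up to t=200:", regret)
```

In this deterministic setting the probability of the best arm increases over time, and doubling the horizon gives a rough sense of how slowly the accumulated regret grows. The sketch says nothing about the stochastic case, where only sampled rewards are observed.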

artificial intelligence, data mining, machine learning

Genre:
- Research Report (0.40)

Technology:
- Information Technology
  - Data Science > Data Mining
    - Big Data (1.00)
  - Artificial Intelligence > Machine Learning
    - Statistical Learning > Gradient Descent (0.33)
