REINFORCE Converges to Optimal Policies with Any Learning Rate

Jun-10-2026, 17:22:29 GMT–Neural Information Processing Systems

We prove that the classic REINFORCE stochastic policy gradient (SPG) method converges to globally optimal policies in finite-horizon Markov Decision Processes (MDPs) with $\textit{any}$ constant learning rate. To avoid the need for small or decaying learning rates, we introduce two key innovations in the stochastic bandit setting, which we then extend to MDPs.

artificial intelligence, machine learning, proceedings, (6 more...)

Neural Information Processing Systems

Jun-10-2026, 17:22:29 GMT

Conferences Web Page

Add feedback

Technology:
- Information Technology > Artificial Intelligence > Machine Learning (1.00)