REINFORCE Converges to Optimal Policies with Any Learning Rate
–Neural Information Processing Systems
We prove that the classic REINFORCE stochastic policy gradient (SPG) method converges to globally optimal policies in finite-horizon Markov Decision Processes (MDPs) with $\textit{any}$ constant learning rate. To avoid the need for small or decaying learning rates, we introduce two key innovations in the stochastic bandit setting, which we then extend to MDPs.
Neural Information Processing Systems
Jun-10-2026, 17:22:29 GMT
- Technology: