A Appendix

Neural Information Processing Systems

For out-of-distribution (OOD) inference, the model should assign higher epistemic uncertainty to OOD regions than to their in-distribution (ID) counterparts.

A.2 Policy Gradient based Reward Maximization for Segmentation Backbone

This approach lets us efficiently reach the optimal solution for reward maximization. We present examples of generated OOD inputs in Figure 1(a); the corresponding results appear in Figure 1(b)-(d). Table 1 reports the results of our uncertainty estimation framework on the Cityscapes dataset.
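As background for the policy-gradient reward maximization mentioned above, here is a minimal, hedged sketch of gradient ascent on expected reward for a softmax policy. The toy three-action problem, reward values, and learning rate are illustrative assumptions, not the paper's segmentation setup; it uses the standard identity that for logits theta, the gradient of E[r] in component k is p_k (r_k - E[r]).

```python
import numpy as np

def softmax(x):
    # numerically stable softmax
    z = np.exp(x - x.max())
    return z / z.sum()

rewards = np.array([0.1, 0.9, 0.4])  # toy per-action rewards (illustrative)
theta = np.zeros(3)                   # policy logits
lr = 0.5

for _ in range(500):
    p = softmax(theta)
    expected_r = p @ rewards
    # exact policy gradient for a softmax policy: p_k * (r_k - E[r])
    theta += lr * p * (rewards - expected_r)

p = softmax(theta)
best = int(np.argmax(p))  # probability mass concentrates on the best action
```

Gradient ascent here drives the policy toward the highest-reward action; in practice (and in methods like the one this appendix describes) the expectation is replaced by sampled score-function (REINFORCE-style) estimates.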






Deep Recurrent Optimal Stopping

Neural Information Processing Systems

Deep neural networks (DNNs) have recently emerged as a powerful paradigm for solving Markovian optimal stopping problems. However, readily extending DNN-based methods to non-Markovian settings requires a significant expansion of the state and parameter spaces, manifesting the curse of dimensionality.
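To make the Markovian baseline concrete, here is a hedged sketch (not the paper's DNN method) of classical backward induction for an optimal stopping problem: the right to stop a binomial price walk and collect the payoff max(K - S, 0), an American-put-style example. All parameter values are illustrative. In non-Markovian settings the value at each node would depend on the whole path, which is exactly the state-space blow-up the abstract refers to.

```python
import numpy as np

# Illustrative parameters: horizon, start price, up/down factors,
# up-probability, strike.
T, S0, u, d, p, K = 4, 100.0, 1.1, 0.9, 0.5, 100.0

# Terminal node prices (index i = number of down moves) and payoff.
prices = S0 * u ** np.arange(T, -1, -1) * d ** np.arange(0, T + 1)
value = np.maximum(K - prices, 0.0)

# Backward induction: value = max(stop now, expected continuation).
for t in range(T - 1, -1, -1):
    prices = S0 * u ** np.arange(t, -1, -1) * d ** np.arange(0, t + 1)
    cont = p * value[:-1] + (1 - p) * value[1:]
    value = np.maximum(np.maximum(K - prices, 0.0), cont)

v0 = float(value[0])  # value of stopping optimally from t = 0
```

Each node's value depends only on the current price (the Markov property), so the table stays linear in the tree depth; a path-dependent payoff would instead force one value per path.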