AITopics | policy parametrization

Policy Gradient Methods Converge Globally in Imperfect-Information Extensive-Form Games

Neural Information Processing Systems | Jun-15-2026, 20:28:56 GMT

Multi-agent reinforcement learning (MARL) has long been seen as inseparable from Markov games (Littman, 1994). Yet, the most remarkable achievements of practical MARL have arguably been in extensive-form games (EFGs), spanning games like Poker, Stratego, and Hanabi. At the same time, little is known about provable equilibrium convergence for MARL algorithms applied to EFGs, as they stumble upon the inherent nonconvexity of the optimization landscape and the failure of the value-iteration subroutine in EFGs. To this end, we utilize contemporary advances in nonconvex optimization theory to prove that regularized alternating policy gradient with (i) direct policy parametrization, (ii) softmax policy parametrization, and (iii) softmax policy parametrization with natural policy gradient updates converges to an approximate Nash equilibrium (NE) in the last iterate in imperfect-information perfect-recall zero-sum EFGs. Namely, we observe that since the individual utilities are concave with respect to the sequence-form strategy, they satisfy gradient dominance with respect to the behavioral strategy (or policy, in reinforcement learning terms). We exploit this structure to further prove that the regularized utility satisfies the much stronger proximal Polyak-Łojasiewicz condition. In turn, we show that the different flavors of alternating policy gradient methods converge to an ϵ-approximate NE with a number of iterations and trajectory samples that are polynomial in 1/ϵ and the natural parameters of the game. Our work is a preliminary, yet principled, attempt at bridging the conceptual gap between the theory of Markov and imperfect-information EFGs, and it aspires to stimulate a deeper dialogue between them.
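To make the idea of regularized alternating policy gradient with a softmax parametrization concrete, here is a minimal toy sketch in Python. It is not the paper's algorithm: it assumes a one-shot zero-sum matrix game (the simplest imperfect-information game) instead of a general EFG, exact gradients instead of trajectory samples, and entropy as the regularizer. The function names, the step size `eta`, and the regularization weight `tau` are all hypothetical choices made for illustration.

```python
import math

def softmax(theta):
    """Map logits to a probability vector (numerically stable)."""
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]

def softmax_pg_step(theta, payoff, tau, eta):
    """One gradient-ascent step on <pi, payoff> + tau * entropy(pi),
    taken with respect to the softmax logits theta."""
    pi = softmax(theta)
    # Regularized per-action value; the constant from d(entropy)/d(pi)
    # cancels once the baseline is subtracted.
    q = [payoff[i] - tau * math.log(pi[i]) for i in range(len(pi))]
    baseline = sum(p * qi for p, qi in zip(pi, q))
    return [t + eta * p * (qi - baseline) for t, p, qi in zip(theta, pi, q)]

def alternating_regularized_pg(A, tau=0.2, eta=0.1, iters=20000):
    """Row player maximizes x^T A y, column player minimizes it.
    Players update in turn (alternating), each against the other's
    most recent policy. Initial logits are arbitrary (assumption)."""
    n, m = len(A), len(A[0])
    theta_x = [1.0] + [0.0] * (n - 1)
    theta_y = [0.0] * (m - 1) + [0.5]
    for _ in range(iters):
        y = softmax(theta_y)
        payoff_x = [sum(A[i][j] * y[j] for j in range(m)) for i in range(n)]
        theta_x = softmax_pg_step(theta_x, payoff_x, tau, eta)
        x = softmax(theta_x)  # column player sees the updated row policy
        payoff_y = [-sum(A[i][j] * x[i] for i in range(n)) for j in range(m)]
        theta_y = softmax_pg_step(theta_y, payoff_y, tau, eta)
    return softmax(theta_x), softmax(theta_y)

# Matching pennies: the unique Nash equilibrium is uniform play.
x, y = alternating_regularized_pg([[1.0, -1.0], [-1.0, 1.0]])
print(x, y)
```

On matching pennies the last iterates should approach the uniform policy for both players. Without the entropy term, plain gradient play on this game tends to cycle, which is one way to see why regularization matters for last-iterate convergence.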

artificial intelligence, machine learning, reinforcement learning, (17 more...)

Country: North America > United States (0.45)

Genre: Research Report > Experimental Study (1.00)

Industry: Leisure & Entertainment > Games (1.00)

Technology:

Information Technology > Game Theory (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.48)

56577889b3c1cd083b6d7b32d32f99d5-Supplemental.pdf

Neural Information Processing Systems | Feb-8-2026, 11:46:52 GMT

convergence, sample complexity, trajectory, (13 more...)

Country:

North America > United States > California > Los Angeles County > Los Angeles (0.14)
North America > United States > Illinois (0.04)
North America > Canada (0.04)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.46)
Information Technology > Artificial Intelligence > Representation & Reasoning > Mathematical & Statistical Methods (0.46)

56577889b3c1cd083b6d7b32d32f99d5-AuthorFeedback.pdf

Neural Information Processing Systems | Feb-8-2026, 11:46:34 GMT

convergence, global convergence, numerical result, (17 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.30)

Natural Policy Gradient Primal-Dual Method for Constrained Markov Decision Processes

Neural Information Processing Systems | Dec-24-2025, 02:27:49 GMT

We study sequential decision-making problems in which each agent aims to maximize the expected total reward while satisfying a constraint on the expected total utility. We employ the natural policy gradient method to solve the discounted infinite-horizon Constrained Markov Decision Processes (CMDPs) problem. Specifically, we propose a new Natural Policy Gradient Primal-Dual (NPG-PD) method for CMDPs which updates the primal variable via natural policy gradient ascent and the dual variable via projected sub-gradient descent. Even though the underlying maximization involves a nonconcave objective function and a nonconvex constraint set under the softmax policy parametrization, we prove that our method achieves global convergence with sublinear rates regarding both the optimality gap and the constraint violation. Such a convergence is independent of the size of the state-action space, i.e., it is dimension-free. Furthermore, for the general smooth policy class, we establish sublinear rates of convergence regarding both the optimality gap and the constraint violation, up to a function approximation error caused by restricted policy parametrization. Finally, we show that two sample-based NPG-PD algorithms inherit such non-asymptotic convergence properties and provide finite-sample complexity guarantees. To the best of our knowledge, our work is the first to establish non-asymptotic convergence guarantees of policy-based primal-dual methods for solving infinite-horizon discounted CMDPs. We also provide computational results to demonstrate the merits of our approach.
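The primal-dual structure described above (natural policy gradient ascent on the policy, projected sub-gradient descent on the multiplier) can be sketched on a toy problem. The Python below is an illustration under strong simplifying assumptions, not the paper's NPG-PD algorithm: it uses a single-state, one-step problem (a constrained bandit) instead of a discounted CMDP, exact values instead of samples, and it lets the dual step see the freshly updated policy. The function name, step sizes, and the cap `lam_max` are hypothetical.

```python
import math

def softmax(theta):
    """Map logits to a probability vector (numerically stable)."""
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]

def npg_primal_dual_bandit(r, g, b, eta_theta=0.1, eta_lambda=0.1,
                           lam_max=10.0, iters=20000):
    """Maximize E[r] subject to E[g] >= b over softmax policies.
    Returns the reward and utility averaged over the iterates."""
    theta = [0.0] * len(r)
    lam = 0.0
    sum_r = 0.0
    sum_g = 0.0
    for _ in range(iters):
        # Primal step: with a tabular softmax policy, the natural gradient
        # direction is the advantage, here of the Lagrangian r + lam * g.
        pi = softmax(theta)
        lagr = [ri + lam * gi for ri, gi in zip(r, g)]
        base = sum(p * l for p, l in zip(pi, lagr))
        theta = [t + eta_theta * (l - base) for t, l in zip(theta, lagr)]
        # Dual step: sub-gradient descent on lam, projected onto [0, lam_max].
        pi = softmax(theta)
        v_r = sum(p * ri for p, ri in zip(pi, r))
        v_g = sum(p * gi for p, gi in zip(pi, g))
        lam = min(lam_max, max(0.0, lam - eta_lambda * (v_g - b)))
        sum_r += v_r
        sum_g += v_g
    return sum_r / iters, sum_g / iters

# Action 0 pays reward, action 1 pays utility; the constraint asks for
# expected utility of at least 0.5, so the best feasible reward is 0.5.
avg_reward, avg_utility = npg_primal_dual_bandit(
    r=[1.0, 0.0], g=[0.0, 1.0], b=0.5)
print(avg_reward, avg_utility)
```

In this toy the individual iterates oscillate around the constrained optimum, so the sketch reports averages over iterates. That is consistent with guarantees stated in terms of the optimality gap and constraint violation rather than convergence of a single policy.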

artificial intelligence, machine learning, proceedings, (8 more...)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.59)

An Improved Analysis of (Variance-Reduced) Policy Gradient and Natural Policy Gradient Methods

Neural Information Processing Systems | Dec-24-2025, 01:42:50 GMT

In this paper, we revisit and improve the convergence of policy gradient (PG), natural PG (NPG) methods, and their variance-reduced variants, under general smooth policy parametrizations. More specifically, with the Fisher information matrix of the policy being positive definite: i) we show that a state-of-the-art variance-reduced PG method, which has only been shown to converge to stationary points, converges to the globally optimal value up to some inherent function approximation error due to policy parametrization; ii) we show that NPG enjoys a lower sample complexity; iii) we propose SRVR-NPG, which incorporates variance-reduction into the NPG update. Our improvements follow from an observation that the convergence of (variance-reduced) PG and NPG methods can improve each other: the stationary convergence analysis of PG can be applied on NPG as well, and the global convergence analysis of NPG can help to establish the global convergence of (variance-reduced) PG methods.
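As a small illustration of how the NPG direction differs from the plain PG direction when the Fisher information matrix is positive definite, the Python sketch below computes both exactly for a three-action softmax policy with two free logits. It is only a toy under stated assumptions: a one-step problem with known rewards, no sampling, and no variance reduction, so it does not implement SRVR-PG or SRVR-NPG. All names and numbers are hypothetical.

```python
import math

def policy(theta):
    """Softmax over three actions; the third logit is pinned to 0 so that
    the two-parameter Fisher matrix is positive definite."""
    logits = [theta[0], theta[1], 0.0]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def score(theta, a):
    """Gradient of log pi(a) with respect to the two free logits."""
    pi = policy(theta)
    return [(1.0 if a == k else 0.0) - pi[k] for k in range(2)]

def pg_and_npg_directions(theta, rewards):
    """Return (vanilla gradient, natural gradient) of expected reward."""
    pi = policy(theta)
    grad = [0.0, 0.0]
    fisher = [[0.0, 0.0], [0.0, 0.0]]
    for a in range(3):
        s = score(theta, a)
        for i in range(2):
            grad[i] += pi[a] * rewards[a] * s[i]
            for j in range(2):
                fisher[i][j] += pi[a] * s[i] * s[j]
    # Solve fisher @ w = grad for the 2x2 case with Cramer's rule.
    det = fisher[0][0] * fisher[1][1] - fisher[0][1] * fisher[1][0]
    w = [(fisher[1][1] * grad[0] - fisher[0][1] * grad[1]) / det,
         (fisher[0][0] * grad[1] - fisher[1][0] * grad[0]) / det]
    return grad, w

theta = [0.3, -0.2]
rewards = [1.0, 0.0, 2.0]
grad, w = pg_and_npg_directions(theta, rewards)
print(grad, w)
```

In this setting the natural gradient `w` is the least-squares fit of the advantage by the score features, and because the fit is exact here it equals the reward differences against the pinned action, roughly `[-1.0, -2.0]`, independent of `theta`. The vanilla gradient, by contrast, is scaled by the current action probabilities.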

artificial intelligence, name change, proceedings, (7 more...)

Technology: Information Technology > Artificial Intelligence (0.40)

56577889b3c1cd083b6d7b32d32f99d5-Supplemental.pdf

Neural Information Processing Systems | Oct-2-2025, 23:18:39 GMT

56577889b3c1cd083b6d7b32d32f99d5-Paper.pdf

Neural Information Processing Systems | Oct-2-2025, 23:18:32 GMT

convergence of several policy gradient methods, whose novelty is summarized in Lines 210-212 and further explained

Neural Information Processing Systems | Oct-2-2025, 23:18:20 GMT

R1.1 ...these analysis mainly come from the existing work...the novelty is very limited. Our proposed SRVR-NPG has a better complexity than SRVR-PG (Remark 4.13). We believe our theoretical contribution already has archival value. R1.3 Reproducibility: We believe that all of our theoretical claims have been proved. Please refer to [34] for a detailed proof.

artificial intelligence, convergence, machine learning, (18 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.30)

Natural Policy Gradient Primal-Dual Method for Constrained Markov Decision Processes

Neural Information Processing Systems | Oct-10-2024, 08:00:10 GMT

We study sequential decision-making problems in which each agent aims to maximize the expected total reward while satisfying a constraint on the expected total utility. We employ the natural policy gradient method to solve the discounted infinite-horizon Constrained Markov Decision Processes (CMDPs) problem. Specifically, we propose a new Natural Policy Gradient Primal-Dual (NPG-PD) method for CMDPs which updates the primal variable via natural policy gradient ascent and the dual variable via projected sub-gradient descent. Even though the underlying maximization involves a nonconcave objective function and a nonconvex constraint set under the softmax policy parametrization, we prove that our method achieves global convergence with sublinear rates regarding both the optimality gap and the constraint violation. Such a convergence is independent of the size of the state-action space, i.e., it is dimension-free.

constrained markov decision process, convergence, natural policy gradient primal-dual method, (5 more...)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.64)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.61)

An Improved Analysis of (Variance-Reduced) Policy Gradient and Natural Policy Gradient Methods

Neural Information Processing Systems | Oct-10-2024, 05:56:47 GMT

In this paper, we revisit and improve the convergence of policy gradient (PG), natural PG (NPG) methods, and their variance-reduced variants, under general smooth policy parametrizations. More specifically, with the Fisher information matrix of the policy being positive definite: i) we show that a state-of-the-art variance-reduced PG method, which has only been shown to converge to stationary points, converges to the globally optimal value up to some inherent function approximation error due to policy parametrization; ii) we show that NPG enjoys a lower sample complexity; iii) we propose SRVR-NPG, which incorporates variance-reduction into the NPG update. Our improvements follow from an observation that the convergence of (variance-reduced) PG and NPG methods can improve each other: the stationary convergence analysis of PG can be applied on NPG as well, and the global convergence analysis of NPG can help to establish the global convergence of (variance-reduced) PG methods. Thanks to this improvement, we have also made variance-reduction for NPG possible for the first time, with both global convergence and an efficient finite-sample complexity.

improved analysis, natural policy gradient method, variance-reduced, (6 more...)

Technology: Information Technology > Artificial Intelligence (0.43)
