AITopics | pg method

Collaborating Authors

pg method

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

On the Global Optimality of Policy Gradient Methods in General Utility Reinforcement Learning

Neural Information Processing SystemsJun-12-2026, 18:47:41 GMT

Reinforcement learning with general utilities (RLGU) offers a unifying framework to capture several problems beyond standard expected returns, including imitation learning, pure exploration, and safe RL. Despite recent fundamental advances in the theoretical analysis of policy gradient (PG) methods for standard RL and recent efforts in RLGU, the understanding of these PG algorithms and their scope of application in RLGU still remain limited. In this work, we establish global optimality guarantees of PG methods for RLGU in which the objective is a general concave utility function of the state-action occupancy measure. In the tabular setting, we provide global optimality results using a new proof technique building on recent theoretical developments on the convergence of PG methods for standard RL using gradient domination. Our proof technique opens avenues for analyzing policy parameterizations beyond the direct policy parameterization for RLGU. In addition, we provide global optimality results for large state-action space settings beyond prior work which has mostly focused on the tabular setting. In this large scale setting, we adapt PG methods by approximating occupancy measures within a function approximation class using maximum likelihood estimation. Our sample complexity only scales with the dimension induced by our approximation class instead of the size of the state-action space.

artificial intelligence, machine learning, proceedings, (10 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.59)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.59)

Add feedback

Credit-assigned Policy Gradient for Early Stage Retrieval in Two-stage Ranking

Kiyohara, Haruka, Curmei, Mihaela, Evnine, Ariel, Kalyanaraman, Shankar, Nir, Israel, Pop, Ana-Roxana, Razin, Nitzan, Dean, Sarah, Joachims, Thorsten, Weinsberg, Udi

arXiv.org Machine LearningMay-27-2026

Large-scale search, recommendation, and retrieval-augmented generation (RAG) systems typically employ a two-stage architecture: an early-stage ranker (ESR) generates a candidate set, which is subsequently re-ranked by a late-stage ranker (LSR). While there are many reinforcement learning (RL) methods for training the LSR, end-to-end training of the ESR has proven challenging. In particular, naive application of "vanilla" policy gradient (V-PG) is not scalable for candidate-set sizes relevant for practical use due to exploding variance. This issue arises because V-PG propagates the gradient to the joint probability of the candidate sets, ignoring the contribution of each specific item in the candidate set to the reward. To mitigate this issue, we propose a novel "credit-assigned" policy gradient (CA-PG), which computes gradients with respect to the probability that the target item is chosen in any candidate set, i.e. marginalizing over all candidate sets that contain it. Our theoretical analysis reveals that CA-PG significantly reduces the variance of V-PG by marginalizing over the specific composition of the candidate set, while preserving the ability to learn the correct ranking of items under a reasonably aligned LSR policy. Experiments on both synthetic and real-world data demonstrate that CA-PG improves the convergence speed and training stability for ESRs utilizing the canonical Plackett-Luce model, especially when the candidate-set size is large.

large language model, machine learning, reinforcement learning, (17 more...)

arXiv.org Machine Learning

2605.26385

Country: Asia (0.46)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.66)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.48)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)

Add feedback

main

Neural Information Processing SystemsApr-24-2026, 21:32:21 GMT

We then discuss, in 2.2, the challenges one confronts when attempting to address the above two problems directly using derivative-free PG methods by sampling system trajectories. Fortunately, solving zero-sum LQ (stochastic) dynamic games, a benchmark setting in MARL, via derivative-free PG methods by sampling system trajectories provides a workaround to address these problems all in a unified way, due to the well-known equivalence relationships between zero-sum LQ dynamic games and the two aforementioned classes of problems [25], which we will also discuss in A.3.3. A.3.1 Linear Exponential Quadratic Gaussian We first consider a fundamental setting of risk-sensitive optimal control, known as the LEQG problem [22, 27, 28], in the finite-horizon setting. The time-varying (linear) systems dynamics are described by: xt+1 =Atxt +Btut +wt,t 2{0,,N 1}, where xt 2Rm represents the system state; ut 2Rd is the control input; wt 2Rm is an independent (across time) Gaussian random noise drawn from wt N (0,W) for some W> 0; the initial state x0 N (0,X0) is a Gaussian random vector for some X0 >0, independent of the sequence {wt};and At, Bt are time-varying system matrices with appropriate dimensions.

artificial intelligence, machine learning, probability, (17 more...)

Neural Information Processing Systems

Genre: Research Report (0.45)

Technology:

Information Technology > Game Theory (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)

Add feedback

main_final

Neural Information Processing SystemsApr-24-2026, 21:32:17 GMT

Direct policy search serves as one of the workhorses in modern reinforcement learning (RL), and its applications in continuous control tasks have recently attracted increasing attention. In this work, we investigate the convergence theory of policy gradient (PG) methods for learning the linear risk-sensitive and robust controller. In particular, we develop PG methods that can be implemented in a derivative-free fashion by sampling system trajectories, and establish both global convergence and sample complexity results in the solutions of two fundamental settings in risk-sensitive and robust control: the finite-horizon linear exponential quadratic Gaussian, and the finite-horizon linear-quadratic disturbance attenuation problems. As a by-product, our results also provide the first sample complexity for the global convergence of PG methods on solving zero-sum linear-quadratic dynamic games, a nonconvex-nonconcave minimax optimization problem that serves as a baseline setting in multi-agent reinforcement learning (MARL) with continuous spaces. One feature of our algorithms is that during the learning phase, a certain level of robustness/risk-sensitivity of the controller is preserved, which we termed as the implicit regularization property, and is an essential requirement in safety-critical control systems.

arxiv preprint arxiv, machine learning, reinforcement learning, (14 more...)

Neural Information Processing Systems

Country: North America > United States (0.67)

Genre: Research Report > New Finding (0.34)

Industry: Government (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Add feedback

Ordering-based Conditions for Global Convergence of Policy Gradient Methods

Neural Information Processing SystemsFeb-12-2026, 16:41:21 GMT

The conditions on the representation that imply global convergence are different between these two algorithms.

artificial intelligence, machine learning, optimization problem, (16 more...)

Neural Information Processing Systems

Country: North America > Canada > Alberta (0.14)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.77)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)

Add feedback

An operator view of policy gradient methods

Neural Information Processing SystemsFeb-7-2026, 19:34:34 GMT

These techniques mainly fall in one of two categories: value-based methods [e.g.,

artificial intelligence, machine learning, reinforcement learning, (15 more...)

Neural Information Processing Systems

Country:

North America > United States (0.04)
North America > Canada > Quebec > Montreal (0.04)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Add feedback

main_final

Neural Information Processing SystemsFeb-7-2026, 15:54:44 GMT

arxiv preprint arxiv, disturbance attenuation problem, matrix, (11 more...)

Neural Information Processing Systems

Country:

Asia > Middle East > Jordan (0.04)
North America > United States > Illinois (0.04)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report (0.46)

Industry: Government (0.67)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.67)

Add feedback

An Improved Analysis of (Variance-Reduced) Policy Gradient and Natural Policy Gradient Methods

Neural Information Processing SystemsDec-24-2025, 01:42:50 GMT

In this paper, we revisit and improve the convergence of policy gradient (PG), natural PG (NPG) methods, and their variance-reduced variants, under general smooth policy parametrizations. More specifically, with the Fisher information matrix of the policy being positive definite: i) we show that a state-of-the-art variance-reduced PG method, which has only been shown to converge to stationary points, converges to the globally optimal value up to some inherent function approximation error due to policy parametrization; ii) we show that NPG enjoys a lower sample complexity; iii) we propose SRVR-NPG, which incorporates variance-reduction into the NPG update. Our improvements follow from an observation that the convergence of (variance-reduced) PG and NPG methods can improve each other: the stationary convergence analysis of PG can be applied on NPG as well, and the global convergence analysis of NPG can help to establish the global convergence of (variance-reduced) PG methods.

artificial intelligence, name change, proceedings, (7 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence (0.40)

Add feedback

Filters

Collaborating Authors

pg method

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

On the Global Optimality of Policy Gradient Methods in General Utility Reinforcement Learning

Credit-assigned Policy Gradient for Early Stage Retrieval in Two-stage Ranking

main

main_final

024677efb8e4aee2eaeef17b54695bbe-Paper.pdf

61c00c07e6d27285e4b952e96cc65666-Supplemental-Conference.pdf

Ordering-based Conditions for Global Convergence of Policy Gradient Methods

An operator view of policy gradient methods

main_final

An Improved Analysis of (Variance-Reduced) Policy Gradient and Natural Policy Gradient Methods