AITopics | policy gradient algorithm

Globally Optimal Policy Gradient Algorithms for Reinforcement Learning with PID Control Policies

Neural Information Processing Systems | Jun-14-2026, 06:21:02 GMT

RL enables learning control policies through direct interaction with a system, without explicit model knowledge that is typically assumed in classical control. The PID policy architecture offers built-in structural advantages, such as superior tracking performance, elimination of steady-state errors, and robustness to model error that have made it a widely adopted paradigm in practice. Despite these advantages, the PID parameterization has received limited attention in the RL literature, and PID control designs continue to rely on heuristic tuning rules without theoretical guarantees. We address this gap by rigorously integrating PID control with RL, offering theoretical guarantees while maintaining the practical advantages that have made PID control ubiquitous in practice. Specifically, we first formulate PID control design as an optimization problem with a control policy that is parameterized by proportional, integral, and derivative components. We derive exact expressions for policy gradients in these parameters, and leverage them to develop both model-based and model-free policy gradient algorithms for PID policies. We then establish gradient dominance properties of the PID policy optimization problem, and provide theoretical guarantees on convergence and global optimality in this setting.
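
The PID parameterization described in this abstract can be illustrated with a small, hedged sketch. It is not the authors' algorithm: the scalar plant `x' = a*x + b*u`, the quadratic tracking cost, the initial gains, and all function names are assumptions made for illustration, and central finite differences stand in for the exact policy gradient expressions the paper derives. The loop only queries rollout costs, which is the sense in which it is "model-free".

```python
import math

def rollout_cost(gains, a=0.9, b=0.5, ref=1.0, horizon=50, u_weight=0.01):
    """Finite-horizon tracking cost of a PID policy on a toy scalar plant (assumed)."""
    kp, ki, kd = gains
    x, integral, prev_err = 0.0, 0.0, ref  # prev_err = first error, so D-term is 0 at t=0
    cost = 0.0
    for _ in range(horizon):
        err = ref - x
        integral += err
        u = kp * err + ki * integral + kd * (err - prev_err)  # PID control law
        cost += err * err + u_weight * u * u
        x = a * x + b * u
        prev_err = err
    return cost

def fd_gradient(f, theta, delta=1e-4):
    """Central finite-difference gradient estimate; uses only cost evaluations."""
    grad = []
    for i in range(len(theta)):
        plus, minus = list(theta), list(theta)
        plus[i] += delta
        minus[i] -= delta
        grad.append((f(plus) - f(minus)) / (2.0 * delta))
    return grad

def tune_pid(theta0, iters=200, step0=0.5):
    """Normalized gradient descent with backtracking; accepts only cost-decreasing steps."""
    theta = list(theta0)
    cost = rollout_cost(theta)
    for _ in range(iters):
        g = fd_gradient(rollout_cost, theta)
        norm = math.sqrt(sum(gi * gi for gi in g))
        if norm < 1e-8:
            break
        step = step0
        while step > 1e-8:
            cand = [t - step * gi / norm for t, gi in zip(theta, g)]
            c = rollout_cost(cand)
            if c < cost:  # also rejects inf/nan costs from destabilizing gains
                theta, cost = cand, c
                break
            step *= 0.5
        else:
            break  # no improving step found
    return theta, cost

if __name__ == "__main__":
    start = [0.1, 0.0, 0.0]
    gains, final_cost = tune_pid(start)
    print("initial cost:", rollout_cost(start))
    print("tuned gains (Kp, Ki, Kd):", gains, "cost:", final_cost)
```

The backtracking step is a design choice for this sketch: it keeps every accepted iterate at a finite, lower cost, so the search never settles on destabilizing gains. It says nothing about the gradient dominance or global optimality results claimed in the abstract.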

artificial intelligence, name change, proceedings, (6 more...)

Technology: Information Technology > Artificial Intelligence (0.82)

Solving Zero-Sum Markov Games with Continuous State via Spectral Dynamic Embedding

Neural Information Processing Systems | Mar-21-2026, 07:25:05 GMT

In this paper, we propose a provably efficient natural policy gradient algorithm called Spectral Dynamic Embedding Policy Optimization (SDEPO) for two-player zero-sum stochastic Markov games with continuous state space and finite action space. In the policy evaluation procedure of our algorithm, a novel kernel embedding method is employed to construct finite-dimensional linear approximations to the state-action value function. We explicitly analyze the approximation error in policy evaluation, and show that SDEPO achieves an $\tilde{O}(\frac{1}{(1-\gamma)^3\epsilon})$ last-iterate convergence to the $\epsilon$-optimal Nash equilibrium, which is independent of the cardinality of the state space. The complexity result matches the best-known results for global convergence of policy gradient algorithms in the single-agent setting. Moreover, we also propose a practical variant of SDEPO to deal with continuous action space, and empirical results demonstrate the practical superiority of the proposed method.

artificial intelligence, name change, proceedings, (6 more...)

Technology: Information Technology > Artificial Intelligence (0.64)

69bfa2aa2b7b139ff581a806abf0a886-Paper.pdf

Neural Information Processing Systems | Feb-19-2026, 02:43:30 GMT

algorithm, episodic learning process, terminal state, (12 more...)

Country:

Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
North America > United States > Arizona > Maricopa County > Phoenix (0.04)
North America > Canada (0.04)
(2 more...)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

7d4c0094ae32530494c71468558ab5b1-Paper-Conference.pdf

Neural Information Processing Systems | Feb-15-2026, 11:07:22 GMT

artificial intelligence, constraint, machine learning, (17 more...)

Country:

Asia > Singapore (0.04)
Europe > Austria (0.04)
South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
(4 more...)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

A Block Coordinate Ascent Algorithm for Mean-Variance Optimization

Tengyang Xie, Bo Liu, Yangyang Xu, Mohammad Ghavamzadeh, Yinlam Chow, Daoming Lyu, Daesub Yoon

Neural Information Processing Systems | Feb-12-2026, 19:18:25 GMT

Risk management in dynamic decision problems is a primary concern in many fields, including financial investment, autonomous driving, and healthcare.

artificial intelligence, machine learning, reinforcement learning, (18 more...)

Country:

North America > United States (0.04)
North America > Canada (0.04)
Asia > Middle East > Jordan (0.04)

Industry: Information Technology (0.54)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.94)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.68)

Deep Recurrent Optimal Stopping

Neural Information Processing Systems | Dec-24-2025, 07:10:08 GMT

Deep neural networks (DNNs) have recently emerged as a powerful paradigm for solving Markovian optimal stopping problems. However, a ready extension of DNN-based methods to non-Markovian settings requires significant state and parameter space expansion, manifesting the curse of dimensionality. Further, efficient state-space transformations permitting Markovian approximations, such as those afforded by recurrent neural networks (RNNs), are either structurally infeasible or are confounded by the curse of non-Markovianity. Considering these issues, we introduce, for the first time, an optimal stopping policy gradient algorithm (OSPG) that can leverage RNNs effectively in non-Markovian settings by implicitly optimizing value functions without recursion, mitigating the curse of non-Markovianity. The OSPG algorithm is derived from an inference procedure on a novel Bayesian network representation of discrete-time non-Markovian optimal stopping trajectories and, as a consequence, yields an offline policy gradient algorithm that eliminates expensive Monte Carlo policy rollouts.

deep recurrent optimal stopping, name change, policy gradient algorithm, (3 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.61)

Why Policy Gradient Algorithms Work for Undiscounted Total-Reward MDPs

Jongmin Lee, Ernest K. Ryu

arXiv.org Artificial Intelligence | Oct-22-2025

The classical policy gradient method is the theoretical and conceptual foundation of modern policy-based reinforcement learning (RL) algorithms. Most rigorous analyses of such methods, particularly those establishing convergence guarantees, assume a discount factor $\gamma < 1$. In contrast, a recent line of work on policy-based RL for large language models uses the undiscounted total-reward setting with $\gamma = 1$, rendering much of the existing theory inapplicable. In this paper, we provide analyses of the policy gradient method for undiscounted expected total-reward infinite-horizon MDPs based on two key insights: (i) the classification of the MDP states into recurrent and transient states is invariant over the set of policies that assign strictly positive probability to every action (as is typical in deep RL models employing a softmax output layer) and (ii) the classical state visitation measure (which may be ill-defined when $\gamma = 1$) can be replaced with a new object that we call the transient visitation measure.

artificial intelligence, machine learning, reinforcement learning, (19 more...)

2510.1834

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.49)

7d4c0094ae32530494c71468558ab5b1-Paper-Conference.pdf

Neural Information Processing Systems | Oct-8-2025, 23:25:35 GMT

artificial intelligence, constraint, machine learning, (17 more...)

Country:

North America > United States > New York > New York County > New York City (0.14)
Asia > Singapore (0.04)
Europe > Austria (0.04)
(9 more...)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

69bfa2aa2b7b139ff581a806abf0a886-Paper.pdf

Neural Information Processing Systems | Oct-3-2025, 03:47:50 GMT

episodic learning process, machine learning, reinforcement learning, (14 more...)

Country:

North America (0.28)
Asia > Japan (0.28)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Export Reviews, Discussions, Author Feedback and Meta-Reviews

Neural Information Processing Systems | Oct-2-2025, 19:52:35 GMT

First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. This paper derives policy gradient algorithms for risk-sensitive MDPs under the CVaR criterion, a recent and popular risk measure. First, the authors derive gradients for the objective based on a Lagrangian relaxation of the constrained optimization. This naturally turns into a policy gradient algorithm where the expected return that appears in the gradient is estimated from full trajectories (REINFORCE-like). They then propose a scheme to obtain incremental actor-critic versions, where the critic computes the value (and other quantities) of an augmented MDP convenient for gradient estimation.
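
For readers unfamiliar with the CVaR criterion mentioned in this review, the following is a minimal sketch of how it can be estimated from sampled trajectory costs. It uses the standard Rockafellar-Uryasev variational form, CVaR_alpha(Z) = min over nu of nu + E[(Z - nu)^+] / (1 - alpha), and assumes Z is a loss (larger is worse). The function name and the sample data are hypothetical, and this is only the risk measure itself, not the reviewed paper's gradient estimator.

```python
def empirical_cvar(losses, alpha):
    """Empirical CVaR at level alpha via the Rockafellar-Uryasev formula.

    Minimizes nu + mean(max(z - nu, 0)) / (1 - alpha) over nu. The objective is
    convex and piecewise linear in nu, so a minimizer lies at a sample point.
    """
    if not losses:
        raise ValueError("losses must be non-empty")
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must lie in [0, 1)")
    n = len(losses)

    def objective(nu):
        excess = sum(max(z - nu, 0.0) for z in losses) / n
        return nu + excess / (1.0 - alpha)

    return min(objective(nu) for nu in losses)

if __name__ == "__main__":
    # Hypothetical per-trajectory costs 1, 2, ..., 10.
    costs = [float(i) for i in range(1, 11)]
    print(empirical_cvar(costs, 0.8))  # average of the worst 20% of costs, about 9.5
```

The auxiliary variable `nu` plays the role of the value-at-risk threshold. Treating it as an extra optimization variable alongside the policy parameters is what makes gradient-based methods applicable to this objective.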

algorithm, contribution, experiment, (11 more...)

Country: North America > Canada > Quebec > Montreal (0.04)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (0.97)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.35)
