Regularized MDPs
Semi-gradient DICE for Offline Constrained Reinforcement Learning
Woosung Kim, JunHo Seo, Jongmin Lee, Byung-Jun Lee
Stationary Distribution Correction Estimation (DICE) addresses the mismatch between the stationary distribution induced by a policy and the target distribution required for reliable off-policy evaluation (OPE) and policy optimization. DICE-based offline constrained RL particularly benefits from the flexibility of DICE, as it simultaneously maximizes return while estimating costs in offline settings. However, we have observed that recent approaches designed to enhance the offline RL performance of the DICE framework inadvertently undermine its ability to perform OPE, making them unsuitable for constrained RL scenarios. In this paper, we identify the root cause of this limitation: their reliance on a semi-gradient optimization, which solves a fundamentally different optimization problem and results in failures in cost estimation. Building on these insights, we propose a novel method to enable OPE and constrained RL through semi-gradient DICE. Our method ensures accurate cost estimation and achieves state-of-the-art performance on the offline constrained RL benchmark, DSRL.
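The distinction between semi-gradient and full-gradient optimization that the abstract appeals to can already be seen in linear TD learning. Below is a minimal, hypothetical sketch (not the paper's DICE objective): the semi-gradient update stops the gradient through the bootstrapped target, so it optimizes a different objective than the full Bellman-residual gradient.

    import numpy as np

    # Hypothetical toy example: contrast a full (residual) gradient on the squared
    # Bellman error with a semi-gradient update that treats the bootstrapped target
    # as a constant. This illustrates the generic distinction, not the paper's method.
    gamma, lr, r = 0.99, 0.1, 1.0
    phi_s = np.array([1.0, 0.0])       # feature vector of state s
    phi_next = np.array([0.0, 1.0])    # feature vector of successor state s'
    w = np.zeros(2)                    # linear value function V(s) = w @ phi(s)

    for _ in range(100):
        delta = r + gamma * w @ phi_next - w @ phi_s          # TD error / Bellman residual

        grad_full = -2 * delta * (phi_s - gamma * phi_next)   # differentiates the target too
        grad_semi = -2 * delta * phi_s                        # stops gradient at the target

        w -= lr * grad_semi   # swap in grad_full to see that a different problem is solved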
Reviews: Surrogate Objectives for Batch Policy Optimization in One-step Decision Making
Summary: The main points of the paper are: (i) the expected reward objective has exponentially many local maxima; (ii) a smoothed risk, and hence the new loss L(q, r, x), can be used instead; both are calibrated, and L is strongly convex, implying a unique global optimum. Originality: The work is original. Clarity: The paper is clear to read, except for some details in the experimental section on page 4, where the meaning of the risk R(\pi) is not described clearly. Significance and comments: First, for the new objective for contextual bandits, the authors mention that this objective is not the same as the trust-region or proximal objectives used in RL (line 237), but how does it compare with the maximum-entropy RL objectives (for example, Haarnoja et al., Soft Q-Learning and Soft Actor-Critic) that use the same policy and value-function/reward models? In these max-ent RL formulations, an estimator similar to Eqn. 12 on page 5 is optimized.
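For reference, the maximum-entropy RL objective the reviewer alludes to (as used in Soft Q-Learning and Soft Actor-Critic) is usually written as follows; this is standard background, not a formula from the reviewed paper:

$$ J(\pi) = \sum_t \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\big[ r(s_t, a_t) + \alpha \, \mathcal{H}(\pi(\cdot \mid s_t)) \big], $$

where $\alpha$ is a temperature trading off expected reward against policy entropy.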
Twice regularized MDPs and the equivalence between robustness and regularization
Robust Markov decision processes (MDPs) aim to handle changing or partially known system dynamics. To solve them, one typically resorts to robust optimization methods. However, this significantly increases computational complexity and limits scalability in both learning and planning. Regularized MDPs, on the other hand, offer more stable policy learning without increasing computational complexity, yet they generally do not encompass uncertainty in the model dynamics. In this work, we aim to learn robust MDPs using regularization.
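As background for the robustness-regularization connection (standard material, not quoted from the paper), a policy-regularized Bellman operator replaces the worst case over model perturbations with an explicit penalty $\Omega$ on the policy:

$$ (T_\Omega v)(s) = \max_{\pi_s \in \Delta_A} \Big\{ \sum_a \pi_s(a) \Big( r(s,a) + \gamma \sum_{s'} P(s' \mid s, a)\, v(s') \Big) - \Omega(\pi_s) \Big\}, $$

and equivalence results of the kind the title refers to show that, for suitable uncertainty sets, robust planning can be carried out with such regularized updates at essentially the cost of a standard MDP.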
Regularization Guarantees Generalization in Bayesian Reinforcement Learning through Algorithmic Stability
Aviv Tamar, Daniel Soudry, Ev Zisselman
In the Bayesian reinforcement learning (RL) setting, a prior distribution over the unknown problem parameters -- the rewards and transitions -- is assumed, and a policy that optimizes the (posterior) expected return is sought. A common approximation, which has been recently popularized as meta-RL, is to train the agent on a sample of $N$ problem instances from the prior, with the hope that for large enough $N$, good generalization behavior to an unseen test instance will be obtained. In this work, we study generalization in Bayesian RL under the probably approximately correct (PAC) framework, using the method of algorithmic stability. Our main contribution is showing that by adding regularization, the optimal policy becomes stable in an appropriate sense. Most stability results in the literature build on strong convexity of the regularized loss -- an approach that is not suitable for RL as Markov decision processes (MDPs) are not convex. Instead, building on recent results of fast convergence rates for mirror descent in regularized MDPs, we show that regularized MDPs satisfy a certain quadratic growth criterion, which is sufficient to establish stability. This result, which may be of independent interest, allows us to study the effect of regularization on generalization in the Bayesian RL setting.
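For illustration, the quadratic growth condition invoked in place of strong convexity can be stated in its standard optimization form (general background, not quoted from the paper): the objective grows at least quadratically with the distance to the solution set,

$$ f(x) - \min_{x'} f(x') \;\ge\; \frac{\mu}{2}\, \mathrm{dist}(x, \mathcal{X}^*)^2 \quad \text{for all } x, $$

where $\mathcal{X}^*$ is the set of minimizers and $\mu > 0$. Unlike strong convexity, this does not require $f$ to be convex, which is what makes it usable for MDP objectives.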
Finding the Near Optimal Policy via Adaptive Reduced Regularization in MDPs
Wenhao Yang, Xiang Li, Guangzeng Xie, Zhihua Zhang
Reinforcement learning (RL) has achieved great empirical success, especially when the policy and value function are parameterized by neural networks. Many studies [16, 21, 24, 11] have shown that RL can reach or surpass human-level performance. Dynamic programming [19, 20, 10, 3] and policy gradient methods [31, 26, 13] are the most frequently used optimization tools in these studies. However, when policy gradient methods are applied, the theoretical understanding of RL's success is still limited, whether the policy is searched over the simplex or over a parameterized space. One line of recent work [6, 1, 5] studies the convergence of policy gradient methods for MDPs without parameterization, while another line [15, 7, 30, 8] focuses on MDPs with parameterization. In addition, when learning MDPs, it is often observed that the obtained policy can be quite deterministic even though the environment has not been fully explored. Some prior works [2, 17, 9, 28] propose adding a Shannon entropy term to the reward to make the policy stochastic, so that the agent explores the environment instead of getting trapped in a local region. Intuitively and empirically, entropy regularization softens the learning process and encourages exploration, which may also accelerate convergence.
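To make the softening effect of Shannon-entropy regularization concrete, here is a minimal, hypothetical sketch of entropy-regularized ("soft") value iteration on a random toy MDP; it is an illustration of the general technique, not code from the paper.

    import numpy as np

    # Entropy-regularized ("soft") value iteration on a random toy MDP.
    n_states, n_actions, gamma, tau = 3, 2, 0.9, 0.5   # tau: entropy temperature

    rng = np.random.default_rng(0)
    P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
    R = rng.uniform(size=(n_states, n_actions))                        # R[s, a]

    V = np.zeros(n_states)
    for _ in range(200):
        Q = R + gamma * P @ V                              # Q[s, a]
        V = tau * np.log(np.exp(Q / tau).sum(axis=1))      # soft (log-sum-exp) backup

    Q = R + gamma * P @ V
    pi = np.exp((Q - V[:, None]) / tau)   # softmax policy: stochastic, never fully greedy

As the temperature tau goes to zero the log-sum-exp backup approaches the hard max and the policy becomes deterministic, which is exactly the behaviour entropy regularization is meant to soften.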
Adaptive Trust Region Policy Optimization: Global Convergence and Faster Rates for Regularized MDPs
Lior Shani, Yonathan Efroni, Shie Mannor
Trust region policy optimization (TRPO) is a popular and empirically successful policy search algorithm in Reinforcement Learning (RL) in which a surrogate problem, that restricts consecutive policies to be `close' to one another, is iteratively solved. Nevertheless, TRPO has been considered a heuristic algorithm inspired by Conservative Policy Iteration (CPI). We show that the adaptive scaling mechanism used in TRPO is in fact the natural "RL version" of traditional trust-region methods from convex analysis. We first analyze TRPO in the planning setting, in which we have access to the model and the entire state space. Then, we consider sample-based TRPO and establish $\tilde O(1/\sqrt{N})$ convergence rate to the global optimum. Importantly, the adaptive scaling mechanism allows us to analyze TRPO in {\em regularized MDPs} for which we prove fast rates of $\tilde O(1/N)$, much like results in convex optimization. This is the first result in RL of better rates when regularizing the instantaneous cost or reward.
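For context, the surrogate problem that restricts consecutive policies to be close to one another has the standard trust-region form (general background, not a formula taken from this paper):

$$ \pi_{k+1} \in \arg\max_{\pi} \ \mathbb{E}_{s \sim d^{\pi_k},\, a \sim \pi_k(\cdot \mid s)}\!\left[ \frac{\pi(a \mid s)}{\pi_k(a \mid s)}\, A^{\pi_k}(s, a) \right] \quad \text{s.t.} \quad \mathbb{E}_{s \sim d^{\pi_k}}\!\big[ D_{\mathrm{KL}}\big(\pi_k(\cdot \mid s)\,\|\,\pi(\cdot \mid s)\big) \big] \le \delta, $$

where, roughly, the adaptive scaling mentioned above corresponds to choosing the step size (equivalently, the coefficient of the KL term in the penalized form) adaptively across iterations rather than keeping it fixed.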