reward uncertainty
Appendix A Pseudocode of DRE-MARL
The pseudocode for DRE-MARL training is shown in Algorithm 20. The reward in this environment is collaborative: the scenario contains two agents and three landmarks, and in Navigation and Reference the target landmark of each agent is known only to its partner. We use the abbreviation REF to denote this environment.
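As a rough illustration only (the actual Algorithm 20 is not reproduced here), the core distributional-reward-estimation step that DRE-MARL is built around can be sketched as estimating a reward distribution per action branch and aggregating across branches with policy weights. Every array name and size below is a hypothetical stand-in, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
N_AGENTS, N_ACTIONS, N_BINS = 2, 5, 11        # hypothetical sizes (REF has 2 agents)
reward_support = np.linspace(-1.0, 1.0, N_BINS)

# Per-agent, per-action logits of an estimated reward distribution;
# these stand in for the output of a learned reward-estimation network.
reward_logits = rng.normal(size=(N_AGENTS, N_ACTIONS, N_BINS))

def softmax(x, axis=-1):
    z = np.exp(x - x.max(axis=axis, keepdims=True))
    return z / z.sum(axis=axis, keepdims=True)

# Each agent's policy over its action branches (also a stand-in).
policy = softmax(rng.normal(size=(N_AGENTS, N_ACTIONS)))

# 1) Turn logits into per-action-branch reward distributions.
reward_dist = softmax(reward_logits)                  # (agents, actions, bins)
# 2) Expected reward of each action branch under its distribution.
branch_reward = reward_dist @ reward_support          # (agents, actions)
# 3) Policy-weighted aggregation across action branches.
agg_reward = (policy * branch_reward).sum(axis=1)     # (agents,)
```

The aggregated reward would then feed the usual actor-critic update in place of the raw environment reward.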
Overcoming Reward Overoptimization via Adversarial Policy Optimization with Lightweight Uncertainty Estimation
Zhang, Xiaoying, Ton, Jean-Francois, Shen, Wei, Wang, Hongning, Liu, Yang
We introduce Adversarial Policy Optimization (AdvPO), a novel solution to the pervasive issue of reward over-optimization in Reinforcement Learning from Human Feedback (RLHF) for Large Language Models (LLMs). Over-optimization occurs when a reward model serves as an imperfect proxy for human preference, and RL-driven policy optimization erroneously exploits reward inaccuracies. In this paper, we begin by introducing a lightweight way to quantify uncertainties in rewards, relying solely on the last layer embeddings of the reward model, without the need for computationally expensive reward ensembles. AdvPO then addresses a distributionally robust optimization problem centred around the confidence interval of the reward model's predictions for policy improvement. Through comprehensive experiments on the Anthropic HH and TL;DR summarization datasets, we illustrate the efficacy of AdvPO in mitigating the over-optimization issue, consequently resulting in enhanced performance as evaluated through human-assisted evaluation.
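A lightweight last-layer uncertainty of the kind the abstract describes is commonly computed as a ridge/bandit-style confidence width over the reward model's final embeddings. A minimal sketch under that assumption; the embedding matrix and all names here are hypothetical stand-ins, not AdvPO's actual code:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 16, 200                      # embedding dim, number of training responses
Phi = rng.normal(size=(n, d))       # last-layer reward-model embeddings (stand-in)

# Ridge-regularized second-moment matrix of the embeddings seen in training.
lam = 1.0
Sigma_inv = np.linalg.inv(Phi.T @ Phi + lam * np.eye(d))

def reward_uncertainty(phi):
    """Confidence width sqrt(phi^T Sigma^-1 phi): large for embeddings far
    from the training distribution, small for well-covered ones."""
    return float(np.sqrt(phi @ Sigma_inv @ phi))

u_in = reward_uncertainty(Phi[0])                      # seen during training
u_out = reward_uncertainty(10.0 * rng.normal(size=d))  # far out-of-distribution
```

The width can then define a confidence interval around the predicted reward, which the robust policy objective pessimizes over, with no ensemble required.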
Roping in Uncertainty: Robustness and Regularization in Markov Games
McMahan, Jeremy, Artiglio, Giovanni, Xie, Qiaomin
We study robust Markov games (RMG) with $s$-rectangular uncertainty. We show a general equivalence between computing a robust Nash equilibrium (RNE) of an $s$-rectangular RMG and computing a Nash equilibrium (NE) of an appropriately constructed regularized MG. The equivalence result yields a planning algorithm for solving $s$-rectangular RMGs, as well as provable robustness guarantees for policies computed using regularized methods. However, we show that even for just reward-uncertain two-player zero-sum matrix games, computing an RNE is PPAD-hard. Consequently, we derive a special uncertainty structure called efficient player-decomposability and show that RNE for two-player zero-sum RMGs in this class can be provably solved in polynomial time. This class includes commonly used uncertainty sets such as $L_1$ and $L_\infty$ ball uncertainty sets.
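The robustness-regularization equivalence is easiest to see in the simplest reward-uncertain case: under an entrywise $L_\infty$ ball of radius $\alpha$ on the payoff matrix, the adversary subtracts $\alpha$ from every entry, so the worst-case payoff of any strategy pair equals the nominal payoff minus a constant regularization term. A minimal numerical check with a hypothetical payoff matrix (this is only the simplest instance, not the paper's general construction):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(3, 3))        # nominal payoff matrix for the max player
alpha = 0.5                        # L_inf radius of the reward uncertainty set
x = np.array([0.2, 0.5, 0.3])      # arbitrary mixed strategy of the max player
y = np.array([0.6, 0.1, 0.3])      # arbitrary mixed strategy of the min player

# Worst case over all entrywise perturbations |Delta|_inf <= alpha:
# min_Delta x^T (A + Delta) y = x^T A y - alpha * sum_ij x_i y_j
# and sum_ij x_i y_j = 1, so the shift is just the constant alpha.
worst_case = x @ (A - alpha) @ y
regularized = x @ A @ y - alpha    # nominal payoff with a constant "regularizer"
```

Since the worst-case payoff differs from the nominal one by a strategy-independent constant, the RNE of this robust game coincides with the NE of the nominal game.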
Tractable Objectives for Robust Policy Optimization
Robust policy optimization acknowledges that risk-aversion plays a vital role in real-world decision-making. When faced with uncertainty about the effects of actions, the policy that maximizes expected utility over the unknown parameters of the system may also carry with it a risk of intolerably poor performance. One might prefer to accept lower utility in expectation in order to avoid, or reduce the likelihood of, unacceptable levels of utility under harmful parameter realizations. In this paper, we take a Bayesian approach to parameter uncertainty, but unlike other methods avoid making any distributional assumptions about the form of this uncertainty. Instead we focus on identifying optimization objectives for which solutions can be efficiently approximated. We introduce percentile measures: a very general class of objectives for robust policy optimization, which encompasses most existing approaches, including ones known to be intractable. We then introduce a broad subclass of this family for which robust policies can be approximated efficiently. Finally, we frame these objectives in the context of a two-player, zero-sum, extensive-form game and employ a no-regret algorithm to approximate an optimal policy, with computation only polynomial in the number of states and actions of the MDP.
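To make the percentile idea concrete, here is a small illustration (all distributions are hypothetical, not from the paper) of how one policy can dominate in expectation over posterior parameter samples yet lose to a safer policy at a low percentile:

```python
import numpy as np

rng = np.random.default_rng(3)
n_samples = 1000                   # posterior samples of the unknown parameters

# Hypothetical utilities of two policies under each sampled parameter:
# policy A has higher mean but much wider spread; policy B is safer.
u_a = rng.normal(loc=1.0, scale=2.0, size=n_samples)
u_b = rng.normal(loc=0.6, scale=0.3, size=n_samples)

def percentile_objective(utilities, q):
    """q-th percentile of utility over posterior samples: a robust
    objective that trades expected utility for downside protection."""
    return np.percentile(utilities, q)

mean_gap = u_a.mean() - u_b.mean()                 # A wins in expectation
robust_gap = (percentile_objective(u_a, 10)
              - percentile_objective(u_b, 10))     # B wins at the 10th percentile
```

A risk-averse decision maker optimizing the 10th-percentile objective would therefore prefer policy B despite its lower expected utility.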
Solving Non-Rectangular Reward-Robust MDPs via Frequency Regularization
Gadot, Uri, Derman, Esther, Kumar, Navdeep, Elfatihi, Maxence Mohamed, Levy, Kfir, Mannor, Shie
In robust Markov decision processes (RMDPs), it is assumed that the reward and the transition dynamics lie in a given uncertainty set. By targeting maximal return under the most adversarial model from that set, RMDPs address performance sensitivity to misspecified environments. Yet, to preserve computational tractability, the uncertainty set is traditionally independently structured for each state. This so-called rectangularity condition is solely motivated by computational concerns. As a result, it lacks a practical incentive and may lead to overly conservative behavior. In this work, we study coupled reward RMDPs where the transition kernel is fixed, but the reward function lies within an $\alpha$-radius from a nominal one. We draw a direct connection between this type of non-rectangular reward-RMDPs and applying policy visitation frequency regularization. We introduce a policy-gradient method, and prove its convergence. Numerical experiments illustrate the learned policy's robustness and its less conservative behavior when compared to rectangular uncertainty.
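The connection to visitation-frequency regularization can be checked directly in a simple case: for an $L_2$ ball of radius $\alpha$ around the nominal reward $r_0$, the worst-case return of a policy with occupancy vector $d_\pi$ is $d_\pi^\top r_0 - \alpha \|d_\pi\|_2$, since by Cauchy–Schwarz the adversary's best move is $\alpha$ along $-d_\pi/\|d_\pi\|_2$. A minimal numerical check with hypothetical values (a sketch of the identity, not the paper's algorithm):

```python
import numpy as np

rng = np.random.default_rng(4)
n_sa = 6                           # number of state-action pairs
r0 = rng.normal(size=n_sa)         # nominal reward vector
d = rng.random(size=n_sa)          # visitation frequencies of some policy
d /= d.sum()                       # normalize to an occupancy distribution
alpha = 0.3                        # radius of the L2 reward-uncertainty ball

# The adversary's optimal reward moves alpha along -d/||d||_2, so the
# worst-case return equals the nominal return minus alpha * ||d||_2.
worst_r = r0 - alpha * d / np.linalg.norm(d)
worst_case = d @ worst_r
regularized = d @ r0 - alpha * np.linalg.norm(d)
```

The penalty $\alpha\|d_\pi\|_2$ is exactly a regularizer on the policy's visitation frequencies, which is what makes a policy-gradient treatment of this non-rectangular set tractable.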