
Collaborating Authors: overoptimization





Mitigating Reward Overoptimization via Lightweight Uncertainty Estimation

Neural Information Processing Systems

Reinforcement Learning from Human Feedback (RLHF) has been pivotal in aligning Large Language Models with human values but often suffers from overoptimization due to its reliance on a proxy reward model.
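The title points to lightweight uncertainty estimation over the proxy reward, but the abstract above stops at the motivation. The sketch below therefore only illustrates the general idea of penalizing a proxy reward by an ensemble-disagreement uncertainty estimate; the head structure, the penalty coefficient beta, and the function name are illustrative assumptions, not the paper's exact method.

```python
import torch

def pessimistic_reward(reward_heads, hidden_state, beta=0.5):
    """Penalize the proxy reward by a lightweight uncertainty estimate (sketch).

    reward_heads: small linear heads sharing a frozen LM backbone (assumed setup)
    hidden_state: final-layer embedding of the (prompt, response) pair
    beta: uncertainty penalty coefficient (illustrative value)
    """
    # Score the same embedding with every head.
    rewards = torch.stack([head(hidden_state) for head in reward_heads], dim=0)
    mean_reward = rewards.mean(dim=0)
    uncertainty = rewards.std(dim=0)          # head disagreement as an uncertainty proxy
    return mean_reward - beta * uncertainty   # pessimistic reward handed to the RL step
```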



Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer

Neural Information Processing Systems

Aligning generative models with human preference via RLHF typically suffers from overoptimization, where an imperfectly learned reward model can misguide the generative model to output even undesired responses. We investigate this problem in a principled manner by identifying the source of the issue as the distributional shift and uncertainty of human preference in the dataset. To mitigate overoptimization, we first propose a theoretical algorithm which optimizes the policy against an adversarially chosen reward model, one that simultaneously minimizes its MLE loss and a reward penalty term. The penalty pessimistically biases the uncertain rewards so as to prevent the policy from choosing actions with spuriously high proxy rewards, resulting in provable sample efficiency of the algorithm under a partial coverage style condition. Moving from theory to practice, the proposed algorithm further enjoys an equivalent but surprisingly easy-to-implement form. With a clever usage of the equivalence between reward models and the corresponding optimal policy, the algorithm features a simple objective that combines (i) a preference optimization loss that directly aligns the policy with human preference, and (ii) a supervised learning loss which explicitly imitates a baseline distribution. In the context of aligning large language models (LLMs), this objective fuses the direct preference optimization (DPO) loss with the supervised fine-tuning (SFT) loss to help mitigate the overoptimization towards undesired responses, for which we name the algorithm Regularized Preference Optimization (RPO). Experiments on aligning LLMs demonstrate the improved performance of our method when compared with DPO baselines.
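The abstract states that RPO fuses the DPO preference loss with an SFT-style imitation loss on a baseline distribution. A minimal PyTorch-style sketch of such a combined objective follows; the weighting sft_weight, the hyperparameter values, and the function name are assumptions for illustration, not taken from the paper.

```python
import torch.nn.functional as F

def rpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps,
             beta=0.1, sft_weight=1.0):
    """DPO preference loss plus an SFT (imitation) loss on the chosen responses.

    *_logps: summed log-probabilities of full responses under the policy / reference model.
    beta, sft_weight: illustrative hyperparameters.
    """
    # Standard DPO loss on the implicit reward margin.
    margin = beta * ((policy_chosen_logps - ref_chosen_logps)
                     - (policy_rejected_logps - ref_rejected_logps))
    dpo_loss = -F.logsigmoid(margin).mean()

    # SFT loss: negative log-likelihood of the preferred (baseline) responses.
    sft_loss = -policy_chosen_logps.mean()

    return dpo_loss + sft_weight * sft_loss
```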


Provably Mitigating Corruption, Overoptimization, and Verbosity Simultaneously in Offline and Online RLHF/DPO Alignment

Chen, Ziyi, Li, Junyi, Yu, Peiran, Huang, Heng

arXiv.org Artificial Intelligence

Reinforcement learning from human feedback (RLHF) and direct preference optimization (DPO) are important techniques to align large language models (LLMs) with human preference. However, the quality of RLHF and DPO training is seriously compromised by Corrupted preference, reward Overoptimization, and bias towards Verbosity. To our knowledge, most existing works tackle only one of these important issues, and the few other works require substantial computation to estimate multiple reward models and lack a theoretical guarantee of generalization ability. In this work, we propose RLHF-COV and DPO-COV algorithms that can simultaneously mitigate these three issues, in both offline and online settings. This ability is theoretically demonstrated by obtaining length-regularized generalization error rates for our DPO-COV algorithms trained on corrupted data, which match the best-known rates for simpler cases with clean data and without length regularization. Moreover, our DPO-COV algorithm is simple to implement without reward estimation, and is proved to be equivalent to our RLHF-COV algorithm, which directly implies the equivalence between the vanilla RLHF and DPO algorithms. Experiments demonstrate the effectiveness of our DPO-COV algorithms under both offline and online settings.
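The abstract describes DPO-COV as reward-free and length-regularized but does not spell out the loss. The sketch below shows only a generic length-regularized DPO objective to illustrate how verbosity can be penalized without estimating a reward model; the corruption-robustness component is not sketched, and the penalty form and the coefficient alpha are assumptions rather than the paper's algorithm.

```python
import torch.nn.functional as F

def length_regularized_dpo_loss(policy_chosen_logps, policy_rejected_logps,
                                ref_chosen_logps, ref_rejected_logps,
                                chosen_lengths, rejected_lengths,
                                beta=0.1, alpha=0.01):
    """Generic length-regularized DPO objective (illustrative sketch).

    Each implicit reward is penalized by the response length (in tokens)
    so the policy is not pushed towards verbosity. alpha is an assumed
    length-penalty coefficient.
    """
    chosen_margin = beta * (policy_chosen_logps - ref_chosen_logps) - alpha * chosen_lengths
    rejected_margin = beta * (policy_rejected_logps - ref_rejected_logps) - alpha * rejected_lengths
    return -F.logsigmoid(chosen_margin - rejected_margin).mean()
```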



InfoRM: Mitigating Reward Hacking in RLHF via Information-Theoretic Reward Modeling

Yuchun Miao

Neural Information Processing Systems

With the advent of large language models (LLMs), reinforcement learning from human feedback (RLHF) has emerged as a pivotal technological paradigm to align models' behaviors with human values.