AITopics | goodhart

goodhart

The inevitable weakness of metrics

MIT Technology Review | Jun-19-2026, 09:00:00 GMT

Quantifying our lives is easier than it's ever been. But a philosopher of games warns that external metrics and data can never capture what's truly important. There are plenty of useful things a metric can reveal. There are even more it can obscure or corrupt. It took me well over a decade of tracking my own life in ever greater detail to fully appreciate this duality, which probably reveals something about both me and the nature of measurement. Like a lot of people bitten by the self-quantifying bug, I initially started gathering personal data to pursue a nebulous collection of goals and desires.

artificial intelligence, metric, social media, (15 more...)

MIT Technology Review

Country: North America > United States (0.28)

Genre: Summary/Review (0.50)

Industry:

Health & Medicine > Therapeutic Area (1.00)
Health & Medicine > Consumer Health (0.94)
Law > Statutes (0.69)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence (1.00)

Catastrophic Goodhart: regularizing RLHF with KL divergence does not mitigate heavy-tailed reward misspecification

Neural Information Processing Systems | Mar-18-2026, 15:31:30 GMT

When applying reinforcement learning from human feedback (RLHF), the reward is learned from data and, therefore, always has some error. It is common to mitigate this by regularizing the policy with KL divergence from a base model, with the hope that balancing reward with regularization will achieve desirable outcomes despite this reward misspecification. We show that when the reward function has light-tailed error, optimal policies under less restrictive KL penalties achieve arbitrarily high utility. However, if error is heavy-tailed, some policies obtain arbitrarily high reward despite achieving no more utility than the base model, a phenomenon we call catastrophic Goodhart. We adapt a discrete optimization method to measure the tails of reward models, finding that they are consistent with light-tailed error. However, the pervasiveness of heavy-tailed distributions in many real-world applications indicates that future sources of RL reward could have heavy-tailed error, increasing the likelihood of reward hacking even with KL regularization.
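The light-tailed versus heavy-tailed contrast in this abstract can be illustrated with a small simulation. The sketch below is not the paper's method: it swaps KL-regularized RL for plain best-of-n selection as a stand-in for optimization pressure, and every distribution and parameter in it (standard normal utility, normal versus symmetric Pareto error with tail index 1.1, n = 200) is an assumption chosen for illustration.

```python
import random

def best_of_n(rng, n, draw_error):
    """Draw n candidates; return (proxy, utility) of the one with the highest proxy reward."""
    best = None
    for _ in range(n):
        utility = rng.gauss(0.0, 1.0)        # true utility (assumed light-tailed)
        proxy = utility + draw_error(rng)    # learned reward = utility + error
        if best is None or proxy > best[0]:
            best = (proxy, utility)
    return best

def mean_selected(n, draw_error, trials=500, seed=0):
    """Average proxy reward and true utility of the best-of-n pick over many trials."""
    rng = random.Random(seed)
    picks = [best_of_n(rng, n, draw_error) for _ in range(trials)]
    mean_proxy = sum(p for p, _ in picks) / trials
    mean_utility = sum(u for _, u in picks) / trials
    return mean_proxy, mean_utility

def light_error(rng):
    # Assumed light-tailed reward error: standard normal.
    return rng.gauss(0.0, 1.0)

def heavy_error(rng):
    # Assumed heavy-tailed reward error: symmetric Pareto with tail index 1.1.
    return rng.choice((-1.0, 1.0)) * rng.paretovariate(1.1)

light_proxy, light_utility = mean_selected(200, light_error)
heavy_proxy, heavy_utility = mean_selected(200, heavy_error)
print(f"light-tailed error: proxy={light_proxy:.2f} utility={light_utility:.2f}")
print(f"heavy-tailed error: proxy={heavy_proxy:.2f} utility={heavy_utility:.2f}")
```

Under these assumptions, selecting on the proxy raises true utility when the error is light-tailed, whereas with heavy-tailed error the selected candidate tends to be the one with the largest error, so proxy reward is high while utility stays near the base average.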

artificial intelligence, machine learning, reinforcement learning, (11 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.98)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.64)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.60)

Catastrophic Goodhart: regularizing RLHF with KL divergence does not mitigate heavy-tailed reward misspecification

Neural Information Processing Systems | Feb-8-2026, 18:43:00 GMT

However, if error is heavy-tailed, some policies obtain arbitrarily high reward despite achieving no more utility than the base model, a phenomenon we call catastrophic Goodhart. We adapt a discrete optimization method to measure the tails of reward models, finding that they are consistent with light-tailed error.

kl divergence, machine learning, natural language, (19 more...)

Neural Information Processing Systems

Country:

North America > United States (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report > Experimental Study (0.93)

Industry:

Information Technology (0.46)
Energy (0.46)
Banking & Finance (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.86)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.84)

On the Value of Out-of-Distribution Testing: An Example of Goodhart's Law

Neural Information Processing Systems | Dec-23-2025, 17:15:23 GMT

Out-of-distribution (OOD) testing is increasingly popular for evaluating a machine learning system's ability to generalize beyond the biases of a training set. OOD benchmarks are designed to present a different joint distribution of data and labels between training and test time. VQA-CP has become the standard OOD benchmark for visual question answering, but we discovered three troubling practices in its current use. First, most published methods rely on explicit knowledge of the construction of the OOD splits. They often rely on "inverting" the distribution of labels, e.g. answering mostly "yes" when the common training answer was "no".

goodhart, name change, out-of-distribution testing, (7 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Catastrophic Goodhart: regularizing RLHF with KL divergence does not mitigate heavy-tailed reward misspecification

Neural Information Processing Systems | Oct-9-2025, 19:56:40 GMT

goodhart, kl divergence, optimization, (15 more...)

Neural Information Processing Systems

Country:

North America > United States (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report > Experimental Study (0.93)

Industry:

Information Technology (0.46)
Energy (0.46)
Banking & Finance (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.86)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.84)

Take Goodhart Seriously: Principled Limit on General-Purpose AI Optimization

Maier, Antoine, Maier, Aude, David, Tom

arXiv.org Artificial Intelligence | Oct-6-2025

A common but rarely examined assumption in machine learning is that training yields models that actually satisfy their specified objective function. We call this the Objective Satisfaction Assumption (OSA). Although deviations from OSA are acknowledged, their implications are overlooked. We argue, in a learning-paradigm-agnostic framework, that OSA fails in realistic conditions: approximation, estimation, and optimization errors guarantee systematic deviations from the intended objective, regardless of the quality of its specification. Beyond these technical limitations, perfectly capturing and translating the developer's intent, such as alignment with human preferences, into a formal objective is practically impossible, making misspecification inevitable. Building on recent mathematical results, we argue that, absent a mathematical characterization of these gaps, they are indistinguishable from those that collapse into Goodhart's law failure modes under strong optimization pressure. Because the Goodhart breaking point cannot be located ex ante, a principled limit on the optimization of General-Purpose AI systems is necessary. Absent such a limit, continued optimization is liable to push systems into predictable and irreversible loss of control.
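The three error sources named in this abstract line up with the standard excess-risk decomposition from statistical learning theory. The block below is a sketch of that textbook decomposition, not the paper's own formalism. All symbols are assumptions introduced here: $R$ is the risk under the specified objective, $f^{*}$ its minimizer over all functions, $f_{\mathcal{F}}$ the best model in the hypothesis class $\mathcal{F}$, $f_{n}$ the empirical risk minimizer on $n$ samples, and $\tilde{f}_{n}$ the model the optimizer actually returns.

```latex
\[
R(\tilde{f}_{n}) - R(f^{*})
  = \underbrace{R(f_{\mathcal{F}}) - R(f^{*})}_{\text{approximation error}}
  + \underbrace{R(f_{n}) - R(f_{\mathcal{F}})}_{\text{estimation error}}
  + \underbrace{R(\tilde{f}_{n}) - R(f_{n})}_{\text{optimization error}}
\]
```

The sum telescopes, so the identity holds by construction. In this reading, the trained model satisfies the specified objective exactly only if all three terms vanish together.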

large language model, machine learning, objective function, (14 more...)

arXiv.org Artificial Intelligence

2510.0284

Country:

North America > United States (0.67)
Europe > United Kingdom > England (0.46)

Genre: Research Report (0.66)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.68)

The Strong, Weak and Benign Goodhart's law. An independence-free and paradigm-agnostic formalisation

Majka, Adrien, El-Mhamdi, El-Mahdi

arXiv.org Machine Learning | May-30-2025

Goodhart's law is a famous adage in policy-making that states: "When a measure becomes a target, it ceases to be a good measure". As machine learning models and the optimisation capacity to train them grow, mounting empirical evidence has reinforced belief in the validity of this law, although the law itself has not been formalised. Recently, a few attempts were made to formalise Goodhart's law, either by categorising variants of it, or by looking at how optimising a proxy metric affects the optimisation of an intended goal. In this work, we relax the simplifying independence assumption made in previous works, and the assumption on the learning paradigm made in most of them, to study the effect of the coupling between the proxy metric and the intended goal on Goodhart's law. Our results show that in the case of a light-tailed goal and light-tailed discrepancy, dependence does not change the nature of Goodhart's effect. However, in the light-tailed goal and heavy-tailed discrepancy case, we exhibit an example where over-optimisation occurs at a rate inversely proportional to the heavy-tailedness of the discrepancy between the goal and the metric.

artificial intelligence, exp, machine learning, (18 more...)

arXiv.org Machine Learning

2505.23445

Country:

North America > United States (0.14)
Europe > France (0.04)

Genre: Research Report > New Finding (0.54)

Industry:

Government (0.46)
Law (0.34)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Review for NeurIPS paper: On the Value of Out-of-Distribution Testing: An Example of Goodhart's Law

Neural Information Processing Systems | Jan-21-2025, 04:54:02 GMT

Summary and Contributions: This paper provides an investigation of out-of-distribution generalization in visual question answering, as benchmarked by prior works on the VQA-CP dataset. The VQA-CP dataset by Agrawal et al. has different distributions in training and test, intentionally constructed so as to encourage models to truly perform reasoning and generalize better, instead of naively picking up on question-only biases in the dataset. However, the authors demonstrate how several prior works on VQA-CP have (inadvertently) gamed this evaluation dataset without necessarily making progress, due to a number of issues: 1) exploiting knowledge of how the train/test splits were constructed to build models that a) are conditioned on the question prefix (and so will only work well on VQA-CP test and not generalize beyond), or b) poorly fit the training set. Next, the authors provide a few naive baselines that exploit the aforementioned issues (and, as the authors acknowledge, are not useful for any practical purposes) and perform well on VQA-CP test: 1) a random predictions model that inverts the predicted answer distribution from training to test, and 2) a learned BUTD model that artificially ignores the top-predicted answer on VQA-CP test. The fact that a random predictions inverted model performs better on number and yes/no questions, the question set that constitutes the largest fraction of performance, is alarming, and provides a necessary and timely check on prior works on VQA-CP.
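The "inverted" baseline described in this review can be illustrated on a toy yes/no split. This is a simplified, deterministic sketch rather than the paper's random-predictions baseline, and the 80/20 answer proportions are hypothetical numbers chosen so that the test distribution is the mirror image of the training distribution.

```python
from collections import Counter

def most_common_answer(train_answers):
    """Conventional prior baseline: always predict the majority training answer."""
    return Counter(train_answers).most_common(1)[0][0]

def least_common_answer(train_answers):
    """Inverted baseline: always predict the rarest training answer."""
    counts = Counter(train_answers)
    return min(counts, key=counts.get)

def accuracy(prediction, answers):
    """Accuracy of a constant prediction against a list of gold answers."""
    return sum(a == prediction for a in answers) / len(answers)

# Hypothetical split built so that the label prior flips between train and test.
train = ["yes"] * 80 + ["no"] * 20
test = ["yes"] * 20 + ["no"] * 80

majority = most_common_answer(train)
inverted = least_common_answer(train)
majority_acc = accuracy(majority, test)
inverted_acc = accuracy(inverted, test)
print(majority, majority_acc)
print(inverted, inverted_acc)
```

The inverted baseline scores well here only because it encodes how the split was built. It uses no image or question content, which is the kind of benchmark gaming the review describes.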

dataset, out-of-distribution testing, vqa-cp test, (7 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.78)

Rethinking Reward Model Evaluation: Are We Barking up the Wrong Tree?

Wen, Xueru, Lou, Jie, Lu, Yaojie, Lin, Hongyu, Yu, Xing, Lu, Xinyu, He, Ben, Han, Xianpei, Zhang, Debing, Sun, Le

arXiv.org Artificial Intelligence | Dec-9-2024

Reward Models (RMs) are crucial for aligning language models with human preferences. Currently, the evaluation of RMs depends on measuring accuracy against a validation set of manually annotated preference data. Although this method is straightforward and widely adopted, the relationship between RM accuracy and downstream policy performance remains under-explored. In this work, we conduct experiments in a synthetic setting to investigate how differences in RM measured by accuracy translate into gaps in optimized policy performance. Our findings reveal that while there is a weak positive correlation between accuracy and downstream performance, policies optimized towards RMs with similar accuracy can exhibit quite different performance. Moreover, we discover that the way of measuring accuracy significantly impacts its ability to predict the final policy performance. Through the lens of the Regressional Goodhart effect, we recognize that accuracy, when used for measuring RM quality, can fail to fully capture the potential RM overoptimization. This underscores the inadequacy of relying solely on accuracy to reflect their impact on policy optimization.
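The accuracy metric discussed in this abstract is commonly computed as the fraction of annotated preference pairs that the reward model orders correctly. The sketch below uses entirely hypothetical toy data (four responses, two invented reward models) to show how two reward models with identical accuracy can still steer a greedy policy toward responses of different quality. It does not reproduce the paper's experimental setup.

```python
def pairwise_accuracy(scores, preference_pairs):
    """Fraction of (chosen, rejected) pairs where the chosen response scores higher."""
    correct = sum(
        1 for chosen, rejected in preference_pairs
        if scores[chosen] > scores[rejected]
    )
    return correct / len(preference_pairs)

# Hypothetical ground-truth quality of four candidate responses.
true_quality = {"a": 0.9, "b": 0.6, "c": 0.3, "d": 0.1}

# Annotated preference pairs (chosen, rejected), consistent with true_quality.
pairs = [("a", "b"), ("b", "c"), ("c", "d"), ("a", "d")]

# Two invented reward models, each wrong on exactly one pair.
rm_one = {"a": 2.0, "b": 1.0, "c": 0.5, "d": 0.7}  # misorders c vs d (low end)
rm_two = {"a": 1.0, "b": 1.5, "c": 0.5, "d": 0.2}  # misorders a vs b (top end)

acc_one = pairwise_accuracy(rm_one, pairs)
acc_two = pairwise_accuracy(rm_two, pairs)

# A policy that fully optimizes each reward model picks its top-scored response.
best_one = max(rm_one, key=rm_one.get)
best_two = max(rm_two, key=rm_two.get)

print(acc_one, true_quality[best_one])
print(acc_two, true_quality[best_two])
```

Both toy models reach the same accuracy, yet the error at the top of the ranking is the one that changes what the optimized policy produces. Errors among low-scoring responses do not affect the greedy choice.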

correlation, large language model, machine learning, (21 more...)

arXiv.org Artificial Intelligence

2410.05584

Country: North America > United States > California > San Francisco County > San Francisco (0.14)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.94)

On Goodhart's law, with an application to value alignment

El-Mhamdi, El-Mahdi, Hoang, Lê-Nguyên

arXiv.org Machine Learning | Oct-12-2024

"When a measure becomes a target, it ceases to be a good measure": this adage is known as Goodhart's law. In this paper, we formally investigate this law and prove that it critically depends on the tail distribution of the discrepancy between the true goal and the measure that is optimized. Discrepancies with long-tail distributions favor a Goodhart's law, that is, the optimization of the measure can have a counter-productive effect on the goal. We provide a formal setting to assess Goodhart's law by studying the asymptotic behavior of the correlation between the goal and the measure, as the measure is optimized. Moreover, we introduce a distinction between a weak Goodhart's law, when over-optimizing the metric is useless for the true goal, and a strong Goodhart's law, when over-optimizing the metric is harmful for the true goal, a distinction which we prove to depend on the tail distribution. We stress the implications of this result to large-scale decision making and policies that are (and have to be) based on metrics, and propose numerous research directions to better assess the safety of such policies in general, and to the particularly concerning case where these policies are automated with algorithms.
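The idea of tracking the goal-measure correlation as the measure is optimized can be explored empirically. The sketch below is an illustration under stated assumptions, not the paper's formal setting: the measure is modeled as goal plus an independent discrepancy, "optimizing" is approximated by keeping only the top 1% of samples by measure, and the distributions (standard normal goal, normal versus symmetric Pareto discrepancy with tail index 1.1) are arbitrary choices.

```python
import math
import random

def pearson(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def top_fraction_correlation(draw_discrepancy, fraction, samples=100_000, seed=1):
    """Correlation between goal and measure among the top `fraction` of samples by measure."""
    rng = random.Random(seed)
    points = []
    for _ in range(samples):
        goal = rng.gauss(0.0, 1.0)                # true goal (assumed light-tailed)
        measure = goal + draw_discrepancy(rng)    # measure = goal + discrepancy
        points.append((measure, goal))
    points.sort(reverse=True)                     # highest measure first
    top = points[: max(2, int(samples * fraction))]
    return pearson([g for _, g in top], [m for m, _ in top])

def light_discrepancy(rng):
    # Assumed light-tailed discrepancy: standard normal.
    return rng.gauss(0.0, 1.0)

def heavy_discrepancy(rng):
    # Assumed heavy-tailed discrepancy: symmetric Pareto with tail index 1.1.
    return rng.choice((-1.0, 1.0)) * rng.paretovariate(1.1)

light_corr = top_fraction_correlation(light_discrepancy, 0.01)
heavy_corr = top_fraction_correlation(heavy_discrepancy, 0.01)
print(f"top 1% correlation, light-tailed discrepancy: {light_corr:.3f}")
print(f"top 1% correlation, heavy-tailed discrepancy: {heavy_corr:.3f}")
```

With these choices, the measure still carries information about the goal in the selected region when the discrepancy is light-tailed, while in the heavy-tailed case the top of the measure distribution is dominated by the discrepancy and the correlation is close to zero.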

algorithm, goodhart, tail distribution, (15 more...)

arXiv.org Machine Learning

2410.09638

Country:

North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
Europe > Switzerland (0.04)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
(5 more...)

Genre: Research Report (1.00)

Industry:

Government (0.67)
Media (0.67)
Health & Medicine (0.67)
(2 more...)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.68)
Information Technology > Artificial Intelligence > Representation & Reasoning > Personal Assistant Systems (0.45)