AITopics | human feedback

human feedback

A Markov Chain Approach to Preference Alignment

Koriyama, Takuya, Liang, Tengyuan

arXiv.org Machine Learning | Jun-23-2026

We propose Markov Chain from Human Feedback (MCHF), an elementary approach for aligning generative models from pairwise human preferences. Unlike Reinforcement Learning from Human Feedback (RLHF), which reduces comparisons to a scalar reward, and Nash Learning from Human Feedback (NLHF), which preserves pairwise utilities through a KL-regularized minimax optimization, MCHF uses pairwise preferences directly to define a transition mechanism over model outputs. Given a pairwise utility $U(x,y)$, which quantifies human preference for $y$ over $x$, and a reference probability distribution $\mu_{\mathsf{ref}}$, we define a Markov kernel $\mathsf{P}(x, dy)\propto \exp(U(x,y))\mu_{\mathsf{ref}}(dy)$, and take the Markov chain starting from $\mu_{\mathsf{ref}}$ as an iterative alignment procedure. We show that MCHF converges geometrically fast to the stationary distribution, with a convergence rate governed by the seminorm $\|U\|_\oplus=\inf_{g,f\in L^\infty(\mu_{\mathsf{ref}})}\|U-g\oplus f\|_\infty$, which quantifies the non-transitive structure of the pairwise utility. We further show that a mirror-descent algorithm for NLHF satisfies an analogous structure-adaptive convergence guarantee. Finally, through a perturbation analysis, we prove that when $\|U\|_\oplus$ is small, MCHF and NLHF agree up to first order around an RLHF solution, which yields a unified view of reward-based, game-theoretic, and Markovian approaches to alignment. In particular, for two natural algorithms that converge to the MCHF/NLHF equilibria, we show that the first step of MCHF and NLHF recovers the RLHF solution based on the column-sum reward $\hat{f}(y)=\int \mu_{\mathsf{ref}}(dx) U(x, y)$, and starting from the second iteration, both algorithms incorporate the same linear functional of the residual $U-(-\hat f)\oplus \hat f$, which captures the non-transitive structure of the pairwise utility $U$.
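The kernel defined in this abstract can be illustrated on a finite output space. The following is a minimal sketch, not the authors' implementation: it assumes three outputs, and the function names (`mchf_kernel`, `run_mchf`) and the toy utilities are made up for illustration. It also checks one consequence that follows directly from the definition: when $U(x,y)=g(x)+f(y)$ is purely additive, the kernel no longer depends on $x$, so a single step already yields the tilted distribution proportional to $\exp(f(y))\mu_{\mathsf{ref}}(y)$.

```python
import math

def mchf_kernel(U, mu_ref):
    """Row-stochastic kernel with P[x][y] proportional to exp(U[x][y]) * mu_ref[y]."""
    P = []
    for row in U:
        weights = [math.exp(u) * m for u, m in zip(row, mu_ref)]
        total = sum(weights)
        P.append([w / total for w in weights])
    return P

def step(mu, P):
    """One chain step: mu_next[y] = sum_x mu[x] * P[x][y]."""
    n = len(P)
    return [sum(mu[x] * P[x][y] for x in range(n)) for y in range(n)]

def run_mchf(U, mu_ref, n_steps):
    """Start from mu_ref and apply the kernel n_steps times."""
    P = mchf_kernel(U, mu_ref)
    mu = list(mu_ref)
    for _ in range(n_steps):
        mu = step(mu, P)
    return mu

mu_ref = [0.5, 0.3, 0.2]  # toy reference distribution (assumption)

# Toy cyclic (non-transitive) utility: U[x][y] is the preference for y over x.
U_cyclic = [[0.0, 0.5, -0.5],
            [-0.5, 0.0, 0.5],
            [0.5, -0.5, 0.0]]
mu_50 = run_mchf(U_cyclic, mu_ref, 50)
mu_51 = run_mchf(U_cyclic, mu_ref, 51)  # nearly identical to mu_50 once converged

# Purely additive utility U = g (+) f: one step gives the exp(f)-tilted reference.
f = [0.2, -0.1, 0.4]
g = [-0.2, 0.1, -0.4]
U_additive = [[g[x] + f[y] for y in range(3)] for x in range(3)]
mu_1 = run_mchf(U_additive, mu_ref, 1)
tilted = [m * math.exp(fy) for m, fy in zip(mu_ref, f)]
tilted = [t / sum(tilted) for t in tilted]
```

In this toy setting `mu_50` and `mu_51` agree to numerical precision, and `mu_1` matches `tilted`, which is the reward-based solution for the reward `f`.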

artificial intelligence, deep learning, machine learning, (13 more...)

arXiv.org Machine Learning

2606.22652

Country: North America > United States > New York (0.28)

Genre:

Research Report (0.50)
Workflow (0.47)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Clean First Align Later Preference Data Cleaning for Reliable

Neural Information Processing Systems | Jun-22-2026, 02:48:38 GMT

Human feedback plays a pivotal role in aligning large language models (LLMs) with human preferences. However, such feedback is often noisy or inconsistent, which can degrade the quality of reward models and hinder alignment. While various automated data cleaning methods have been proposed to mitigate this issue, a systematic evaluation of their effectiveness and generalizability remains lacking. To bridge this gap, we introduce PrefCleanBench, the first comprehensive benchmark for evaluating 13 preference data cleaning methods in the context of LLM alignment. PrefCleanBench offers a standardized protocol to assess cleaning strategies in terms of alignment performance and generalizability across diverse datasets, model architectures, and optimization algorithms. By unifying disparate methods and rigorously comparing them, we uncover key factors that determine the success of data cleaning in alignment tasks. This benchmark lays the groundwork for principled and reproducible approaches to improving LLM alignment through better data quality, highlighting the crucial but underexplored role of data preprocessing in responsible AI development.

large language model, machine learning, natural language, (17 more...)

Neural Information Processing Systems

Country:

Asia (0.28)
North America > United States (0.28)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)
Overview (0.93)

Industry:

Health & Medicine > Therapeutic Area (0.46)
Health & Medicine > Consumer Health (0.46)
Government (0.46)

Technology:

Information Technology > Data Science > Data Quality (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Ask a Strong LLM Judge when Your Reward Model is Uncertain

Neural Information Processing Systems | Jun-18-2026, 04:48:24 GMT

The reward model (RM) plays a pivotal role in reinforcement learning from human feedback (RLHF) for aligning large language models (LLMs). However, classical RMs trained on human preferences are vulnerable to reward hacking and generalize poorly to out-of-distribution (OOD) inputs. By contrast, strong LLM judges equipped with reasoning capabilities demonstrate superior generalization, even without additional training, but incur significantly higher inference costs, limiting their applicability in online RLHF. In this work, we propose an uncertainty-based routing framework that efficiently complements a fast RM with a strong but costly LLM judge. Our approach formulates advantage estimation in policy gradient (PG) methods as pairwise preference classification, enabling principled uncertainty quantification to guide routing. Uncertain pairs are forwarded to the LLM judge, while confident ones are evaluated by the RM. Experiments on RM benchmarks demonstrate that our uncertainty-based routing strategy significantly outperforms random judge calling at the same cost, and downstream alignment results showcase its effectiveness in improving online RLHF.
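The routing rule in this abstract (confident pairs go to the RM, uncertain pairs go to the judge) can be sketched as follows. This is an illustration under stated assumptions, not the paper's method: the abstract does not specify its uncertainty measure, so the sketch uses the distance of a Bradley-Terry style preference probability from 0.5, and `route_pair`, `threshold`, and `fake_judge` are hypothetical names.

```python
import math

def preference_prob(r_a, r_b):
    """Bradley-Terry style probability that response a beats response b."""
    return 1.0 / (1.0 + math.exp(-(r_a - r_b)))

def route_pair(r_a, r_b, judge, threshold=0.2):
    """Return (a_preferred, source).

    Assumption: a pair counts as uncertain when the RM's preference
    probability lies within `threshold` of 0.5. Only then is the
    (expensive) judge callable invoked.
    """
    p = preference_prob(r_a, r_b)
    if abs(p - 0.5) < threshold:
        return judge(), "judge"
    return p > 0.5, "rm"

judge_calls = []

def fake_judge():
    """Stand-in for a costly LLM judge; always prefers response a."""
    judge_calls.append(1)
    return True

# (RM score for a, RM score for b) for three hypothetical response pairs.
pairs = [(2.0, -1.0), (0.1, 0.0), (-3.0, 0.5)]
results = [route_pair(a, b, fake_judge) for a, b in pairs]
```

Only the middle pair, whose RM scores are nearly tied, reaches the judge; the other two are decided by the RM alone.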

Direct Alignment with Heterogeneous Preferences

Neural Information Processing Systems | Jun-17-2026, 22:10:18 GMT

Alignment with human preferences is commonly framed using a universal reward function, even though human preferences are inherently heterogeneous. We formalize this heterogeneity by introducing user types and examine the limits of the homogeneity assumption. We show that aligning to heterogeneous preferences with a single policy is best achieved using the average reward across user types. However, this requires additional information about annotators. We examine improvements under different information settings, focusing on direct alignment methods. We find that minimal information can yield first-order improvements, while full feedback from each user type leads to consistent learning of the optimal policy. Surprisingly, however, no sample-efficient consistent direct loss exists in this latter setting. These results reveal a fundamental tension between consistency and sample efficiency in direct policy alignment.
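The claim that a single policy should target the average reward across user types can be made concrete with a small sketch. It assumes a KL-regularized objective with its standard closed-form solution, $\pi(y)\propto\pi_{\mathrm{ref}}(y)\exp(\bar r(y)/\beta)$, which the abstract does not state explicitly; the user types, rewards, and function names are all hypothetical.

```python
import math

def average_reward(rewards_by_type, type_weights):
    """Population-weighted average reward for each candidate response."""
    n = len(next(iter(rewards_by_type.values())))
    return [sum(type_weights[t] * rewards_by_type[t][y] for t in rewards_by_type)
            for y in range(n)]

def kl_regularized_policy(reward, pi_ref, beta=1.0):
    """Closed-form maximizer of E[reward] - beta * KL(pi || pi_ref)."""
    weights = [p * math.exp(r / beta) for p, r in zip(pi_ref, reward)]
    z = sum(weights)
    return [w / z for w in weights]

# Two hypothetical user types that disagree on responses 0 and 1
# but both find response 2 acceptable.
rewards_by_type = {"type_a": [1.0, 0.0, 0.6], "type_b": [0.0, 1.0, 0.6]}
type_weights = {"type_a": 0.5, "type_b": 0.5}

r_bar = average_reward(rewards_by_type, type_weights)
policy = kl_regularized_policy(r_bar, pi_ref=[1 / 3, 1 / 3, 1 / 3], beta=0.1)
```

Here the averaged reward favors the compromise response 2, even though neither user type ranks it first.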

arxiv preprint arxiv, machine learning, natural language, (19 more...)

Neural Information Processing Systems

Country:

North America > United States (0.45)
Europe (0.45)

Genre:

Research Report > Experimental Study (1.00)
Research Report > New Finding (0.67)
Questionnaire & Opinion Survey (0.67)

Industry: Government (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Cognitive Science (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Diverse annotators, soft pairwise labels, distribution over rewards, distribution over policies

Neural Information Processing Systems | Jun-17-2026, 08:16:25 GMT

However, human preferences often diverge across users, contexts, and cultures. As a result, disagreement collapses into the majority signal and minority perspectives are discounted. To address this, we propose reflecting diverse human preferences through a distribution over multiple reward functions, each inducing a distinct aligned policy. The distribution is learned directly from pairwise preference without annotator identifiers or predefined groups. Instead, annotator disagreements are treated as informative soft labels. Our central criterion is pairwise calibration: for every pair of candidate responses, the proportion of reward functions preferring one response matches the fraction of annotators with that preference. We prove that even a small outlier-free ensemble can accurately represent diverse preference distributions. Empirically, we introduce and validate a practical training heuristic to learn such ensembles, and demonstrate its effectiveness through improved calibration, implying a more faithful representation of pluralistic values.
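The pairwise calibration criterion stated above can be checked numerically for a small ensemble. The sketch below is illustrative only: the responses, rewards, annotator fractions, and function names are hypothetical, ties between responses are ignored, and the mean absolute gap is just one possible way to summarize miscalibration.

```python
def ensemble_preference_fraction(ensemble, weights, a, b):
    """Weighted share of reward functions scoring response a above response b."""
    return sum(w for r, w in zip(ensemble, weights) if r[a] > r[b])

def pairwise_calibration_error(ensemble, weights, annotator_fraction):
    """Mean absolute gap between ensemble and annotator preference fractions."""
    gaps = [abs(ensemble_preference_fraction(ensemble, weights, a, b) - frac)
            for (a, b), frac in annotator_fraction.items()]
    return sum(gaps) / len(gaps)

# Two hypothetical reward functions over responses A, B, C.
ensemble = [
    {"A": 1.0, "B": 0.0, "C": 0.5},
    {"A": 0.0, "B": 1.0, "C": 0.5},
]

# Fraction of annotators preferring the first response of each pair (made up).
annotator_fraction = {("A", "B"): 0.7, ("A", "C"): 0.7, ("B", "C"): 0.3}

calibrated = pairwise_calibration_error(ensemble, [0.7, 0.3], annotator_fraction)
uniform = pairwise_calibration_error(ensemble, [0.5, 0.5], annotator_fraction)
```

With weights `[0.7, 0.3]` the ensemble reproduces every annotator fraction, so the error is zero; with uniform weights each pair is off by 0.2.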

large language model, machine learning, natural language, (19 more...)

Neural Information Processing Systems

Genre:

Research Report > Experimental Study (1.00)
Research Report > New Finding (0.67)

Industry:

Leisure & Entertainment > Sports (1.00)
Banking & Finance > Economy (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Can Diverse Human Values Scaling Law

Neural Information Processing Systems | Jun-15-2026, 20:41:29 GMT

Large language models (LLMs) have demonstrated remarkable capabilities but often struggle to align with human preferences, leading to harmful or undesirable outputs. Preference learning, which trains models to distinguish between preferred and non-preferred responses based on human feedback, has become a crucial component for ensuring that LLMs align with human values. An essential part of ensuring that LLMs are aligned for all people is accounting for a diverse set of values. This paper introduces a new theoretical framework to analyze how generalization scales with value diversity and sample quantity in models trained with direct preference optimization. Our framework rigorously assesses how well models generalize after a finite number of gradient steps, reflecting real-world LLM training practices. By analyzing the reward margin associated with each sample and its trajectory throughout training, we provide a bound on the generalization error that demonstrates the challenges of effectively learning a wide set of concepts or values. These insights are empirically validated on contemporary LLMs, underscoring the practical relevance of our theory.
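The per-sample reward margin mentioned in this abstract can be written down using the standard direct preference optimization (DPO) implicit reward, $\beta\log\frac{\pi_\theta(y\mid x)}{\pi_{\mathrm{ref}}(y\mid x)}$. The sketch below uses that standard form and may differ from the paper's exact definition; the log-probabilities and function names are made-up placeholders.

```python
import math

def dpo_reward_margin(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Implicit reward of the preferred response minus that of the rejected one."""
    return beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))

def dpo_loss(margin):
    """-log(sigmoid(margin)), written to avoid overflow for large |margin|."""
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# At initialization the policy equals the reference, so the margin is zero.
initial_margin = dpo_reward_margin(-11.0, -11.0, -11.0, -11.0)
initial_loss = dpo_loss(initial_margin)

# After some training the preferred response gains log-probability
# and the rejected response loses it (hypothetical values).
later_margin = dpo_reward_margin(-10.0, -12.0, -11.0, -11.0)
later_loss = dpo_loss(later_margin)
```

A positive margin means the sample is classified correctly by the implicit reward, and the loss decreases as the margin grows.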

arxiv preprint arxiv, large language model, machine learning, (17 more...)

Neural Information Processing Systems

Country: North America > United States (0.28)

Genre:

Research Report > Experimental Study (1.00)
Overview (0.67)

Industry: Government (0.92)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Explaining and Preventing Alignment Collapse in Iterative RLHF

Gauthier, Etienne, Bach, Francis, Jordan, Michael I.

arXiv.org Machine Learning | May-7-2026

Reinforcement learning from human feedback (RLHF) typically assumes a static or non-strategic reward model (RM). In iterative deployment, however, the policy generates the data on which the RM is retrained, creating a feedback loop. Building on the Stackelberg game formulation of this interaction, we derive an analytical decomposition of the policy's true optimization gradient into a standard policy gradient and a parameter-steering term that captures the policy's influence on the RM's future parameters. We show that standard iterative RLHF, which drops this steering term entirely, suffers from alignment collapse: the policy systematically exploits the RM's blind spots, producing low-quality, high-reward outputs whose feedback reinforces the very errors it exploits. To mitigate this, we propose foresighted policy optimization (FPO), a mechanism-design intervention that restores the missing steering term by regularizing the policy's parameter-steering effect on RM updates. We instantiate FPO via a scalable first-order approximation and demonstrate that it prevents alignment collapse on both controlled environments and an LLM alignment pipeline using Llama-3.2-1B.
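The two-term decomposition described in this abstract has the shape of a total derivative. As a hedged illustration only, with notation that is assumed rather than taken from the paper: write $\theta$ for the policy parameters, $\phi(\theta)$ for the RM parameters obtained after retraining on data generated by the policy, and $J(\theta,\phi)$ for the policy objective under a fixed RM. The chain rule then gives:

```latex
\nabla_\theta\, J\bigl(\theta, \phi(\theta)\bigr)
  = \underbrace{\partial_\theta J(\theta, \phi)\Big|_{\phi=\phi(\theta)}}_{\text{standard policy gradient}}
  \;+\;
  \underbrace{\Bigl(\frac{\partial \phi(\theta)}{\partial \theta}\Bigr)^{\!\top}
              \partial_\phi J(\theta, \phi)\Big|_{\phi=\phi(\theta)}}_{\text{parameter-steering term}}
```

Treating the RM as fixed corresponds to keeping only the first term, which matches the abstract's description of standard iterative RLHF as dropping the steering term.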

artificial intelligence, machine learning, natural language, (14 more...)

arXiv.org Machine Learning

2605.04266

Country:

Europe > United Kingdom (0.68)
North America > United States (0.67)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

5fc47800ee5b30b8777fdd30abcaaf3b-Supplemental-Conference.pdf

Neural Information Processing Systems | May-1-2026, 03:37:50 GMT

Having defined and validated the pairwise feedback simulator and evaluations in AlpacaFarm, we now turn our attention to studying methods that learn from pairwise feedback on AlpacaFarm. Unfortunately, the lack of existing benchmarks for learning from pairwise feedback for instruction following means that there has not been any open study of these methods in the instruction-following setting. In the remainder of this section, we will introduce our reference methods, which fall into two categories based on whether they fit a surrogate reward model as part of the learning process.

FeedME is a method proposed by OpenAI [45] that incorporates human feedback with supervised fine-tuning on model generations that are rated 7/7 by human labelers. We adapt this approach to the pairwise feedback setting and call this baseline binary FeedME. This approach fine-tunes the SFT model on the chosen response in each preference pair with supervised learning.

Motivated by controllable generation through conditioning [27, 34, 29, 21], we propose binary reward conditioning, a baseline method that fine-tunes the SFT model with the feedback data $D_{\text{pairwise}}$ by conditioning instances with either a positive or negative control token. Specifically, for each instance $(x, y_0, y_1, z) \in D_{\text{pairwise}}$, the string concatenation of instruction $x$ and response $y_z$, denoted as $[x, y_z]$, is prepended with the positive token and used in supervised fine-tuning (similarly, $[x, y_{1-z}]$ is prepended with the negative token). This process creates a modified demonstration dataset that is double the size of $D_{\text{pairwise}}$. At test time, we draw samples from the fine-tuned model conditioned on the positive token.

A.2 Methods that optimize a surrogate reward function

We now describe methods that incorporate feedback by first building a surrogate reward model with pairwise feedback data. To start, we describe the step of training the surrogate reward model. While this can be a powerful approach, we will see that it can also lead to over-optimization [19] where models learn to exploit the reward model rather than achieve high true reward. We now describe 4 methods that leverage the surrogate reward model.
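The data construction behind binary FeedME and binary reward conditioning, as described in the excerpt above, can be sketched in a few lines. This is an illustration rather than the AlpacaFarm code: the control token strings and the single-space separator are assumptions, and each instance is a tuple `(x, y0, y1, z)` where `z` is the index of the preferred response.

```python
POS_TOKEN = "<|good|>"  # hypothetical positive control token
NEG_TOKEN = "<|bad|>"   # hypothetical negative control token

def binary_feedme_dataset(pairwise):
    """Keep only (instruction, chosen response) pairs for supervised fine-tuning."""
    return [(x, (y0, y1)[z]) for x, y0, y1, z in pairwise]

def reward_conditioning_dataset(pairwise):
    """Emit two control-token-prefixed demonstrations per preference pair."""
    examples = []
    for x, y0, y1, z in pairwise:
        chosen, rejected = (y0, y1)[z], (y0, y1)[1 - z]
        # Assumption: token, instruction, and response are joined by single spaces.
        examples.append(f"{POS_TOKEN} {x} {chosen}")
        examples.append(f"{NEG_TOKEN} {x} {rejected}")
    return examples

pairwise = [
    ("Say hi.", "Hello!", "Go away.", 0),
    ("Add 2 and 2.", "5", "4", 1),
]
feedme = binary_feedme_dataset(pairwise)
conditioned = reward_conditioning_dataset(pairwise)
```

The conditioned dataset has twice as many examples as the pairwise dataset, consistent with the description above; at test time, generation would be prompted with the positive token.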

annotator, large language model, machine learning, (22 more...)

Neural Information Processing Systems

Country: Europe (0.28)

Genre: Research Report > New Finding (0.67)

Industry: Government (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.71)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.70)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.52)

Transfer Q*: Principled Decoding for LLM Alignment

Neural Information Processing Systems | Apr-30-2026, 01:19:06 GMT

Aligning foundation models is essential for their safe and trustworthy deployment. However, traditional fine-tuning methods are computationally intensive and require updating billions of model parameters. A promising alternative, alignment via decoding, adjusts the response distribution directly without model updates to maximize a target reward $r$, thus providing a lightweight and adaptable framework for alignment. However, principled decoding methods rely on oracle access to an optimal Q-function ($Q^*$), which is often unavailable in practice. Hence, prior SoTA methods either approximate this $Q^*$ using $Q^{\pi_{\text{sft}}}$ (derived from the reference SFT model) or rely on short-term rewards, resulting in sub-optimal decoding performance. In this work, we propose Transfer $Q^*$, which implicitly estimates the optimal value function for a target reward $r$ through a baseline model $\rho_{\text{BL}}$ aligned with a baseline reward $r_{\text{BL}}$ (which can be different from the target reward $r$). Theoretical analyses of Transfer $Q^*$ provide a rigorous characterization of its optimality, deriving an upper bound on the sub-optimality gap and identifying a hyperparameter to control the deviation from the pre-trained reference SFT model based on user needs. Our approach significantly reduces the sub-optimality gap observed in prior SoTA methods and demonstrates superior empirical performance across key metrics such as coherence, diversity, and quality in extensive tests on several synthetic and real datasets.
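Alignment via decoding, as described in this abstract, reweights the reference model's next-token distribution with a value estimate instead of updating parameters. The sketch below shows the commonly used exponential tilting, $\pi(z)\propto\pi_{\text{sft}}(z)\exp(Q(z)/\alpha)$, with the Q-values simply given as inputs. How the paper estimates them through the baseline model is not reproduced here, and `alpha`, the token names, and the numbers are assumptions.

```python
import math

def reweighted_next_token_probs(ref_probs, q_values, alpha=1.0):
    """Tilt reference next-token probabilities by exp(Q / alpha) and renormalize.

    Assumption: alpha plays the role of a deviation-control hyperparameter;
    a large alpha stays close to the reference model, a small alpha follows Q.
    """
    weights = {z: p * math.exp(q_values[z] / alpha) for z, p in ref_probs.items()}
    total = sum(weights.values())
    return {z: w / total for z, w in weights.items()}

# Hypothetical two-token vocabulary with made-up Q estimates.
ref_probs = {"yes": 0.6, "no": 0.4}
q_values = {"yes": 0.0, "no": 1.0}

aligned = reweighted_next_token_probs(ref_probs, q_values, alpha=0.5)
near_reference = reweighted_next_token_probs(ref_probs, q_values, alpha=1e6)
```

With `alpha=0.5` the higher-valued token "no" becomes the most likely choice, while a very large `alpha` leaves the reference distribution essentially unchanged.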

large language model, machine learning, natural language, (18 more...)

Neural Information Processing Systems

Country:

North America > United States (0.67)
Asia (0.46)

Genre: Research Report > Experimental Study (0.93)

Industry:

Education (0.46)
Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.95)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)
Information Technology > Communications (0.92)

Reinforcement Learning from Human Feedback: A Statistical Perspective

Liu, Pangpang, Shi, Chengchun, Sun, Will Wei

arXiv.org Machine Learning | Apr-6-2026

Reinforcement learning from human feedback (RLHF) has emerged as a central framework for aligning large language models (LLMs) with human preferences. Despite its practical success, RLHF raises fundamental statistical questions because it relies on noisy, subjective, and often heterogeneous feedback to learn reward models and optimize policies. This survey provides a statistical perspective on RLHF, focusing primarily on the LLM alignment setting. We introduce the main components of RLHF, including supervised fine-tuning, reward modeling, and policy optimization, and relate them to familiar statistical ideas such as the Bradley-Terry-Luce (BTL) model, latent utility estimation, active learning, experimental design, and uncertainty quantification. We review methods for learning reward functions from pairwise preference data and for optimizing policies through both two-stage RLHF pipelines and emerging one-stage approaches such as direct preference optimization. We further discuss recent extensions including reinforcement learning from AI feedback, inference-time algorithms, and reinforcement learning from verifiable rewards, as well as benchmark datasets, evaluation protocols, and open-source frameworks that support RLHF research. We conclude by highlighting open challenges in RLHF. An accompanying GitHub demo https://github.com/Pangpang-Liu/RLHF_demo illustrates key components of the RLHF pipeline.
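The Bradley-Terry-Luce model mentioned in this abstract assigns each item a latent reward and models the probability that item $w$ beats item $l$ as $\sigma(r_w - r_l)$. The following is a minimal sketch of fitting such rewards by gradient descent on the negative log-likelihood; it is unrelated to the linked demo repository, and the comparison data, learning rate, and function names are made up for illustration.

```python
import math

def btl_prob(r_w, r_l):
    """BTL probability that the item with reward r_w beats the item with r_l."""
    return 1.0 / (1.0 + math.exp(-(r_w - r_l)))

def btl_nll(rewards, comparisons):
    """Average negative log-likelihood over (winner, loser) index pairs."""
    total = sum(math.log(btl_prob(rewards[w], rewards[l])) for w, l in comparisons)
    return -total / len(comparisons)

def fit_btl(n_items, comparisons, lr=0.5, n_steps=500):
    """Plain gradient descent on the average NLL, starting from zero rewards."""
    rewards = [0.0] * n_items
    for _ in range(n_steps):
        grad = [0.0] * n_items
        for w, l in comparisons:
            p = btl_prob(rewards[w], rewards[l])
            grad[w] -= 1.0 - p  # d(-log p)/d r_w
            grad[l] += 1.0 - p  # d(-log p)/d r_l
        rewards = [r - lr * g / len(comparisons) for r, g in zip(rewards, grad)]
    return rewards

# Noisy toy comparisons among three items: 0 usually beats 1, 1 usually beats 2.
comparisons = [(0, 1), (0, 1), (1, 0),
               (1, 2), (1, 2), (2, 1),
               (0, 2), (0, 2), (0, 2), (2, 0)]
rewards = fit_btl(3, comparisons)
```

BTL rewards are identified only up to an additive constant; starting from zeros, these updates keep the rewards summing to zero, which fixes that constant.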

artificial intelligence, arxiv preprint arxiv, machine learning, (16 more...)

arXiv.org Machine Learning

2604.02507

Country: