AITopics

Technology: Information Technology > Artificial Intelligence (0.55)

arXiv.org Machine LearningMar-20-2026

CausalRM: Causal-Theoretic Reward Modeling for RLHF from Observational User Feedbacks

Wang, Hao, Pan, Licheng, Chen, Zhichao, Zheng, Chunyuan, Chu, Zhixuan, Li, Xiaoxi, Lu, Yuan, Liu, Xinggao, Li, Haoxuan, Lin, Zhouchen

Despite the success of reinforcement learning from human feedback (RLHF) in aligning language models, current reward modeling heavily relies on experimental feedback data collected from human annotators under controlled and costly conditions. In this work, we introduce observational reward modeling -- learning reward models with observational user feedback (e.g., clicks, copies, and upvotes) -- as a scalable and cost-effective alternative. We identify two fundamental challenges in this setting: (1) observational feedback is noisy due to annotation errors, which deviates it from true user preference; (2) observational feedback is biased by user preference, where users preferentially provide feedback on responses they feel strongly about, which creats a distribution shift between training and inference data. To address these challenges, we propose CausalRM, a causal-theoretic reward modeling framework that aims to learn unbiased reward models from observational feedback. To tackle challenge (1), CausalRM introduces a noise-aware surrogate loss term that is provably equivalent to the primal loss under noise-free conditions by explicitly modeling the annotation error generation process. To tackle challenge (2), CausalRM uses propensity scores -- the probability of a user providing feedback for a given response -- to reweight training samples, yielding a loss function that eliminates user preference bias. Extensive experiments across diverse LLM backbones and benchmark datasets validate that CausalRM effectively learns accurate reward signals from noisy and biased observational feedback and delivers substantial performance improvements on downstream RLHF tasks -- including a 49.2% gain on WildGuardMix and a 32.7% improvement on HarmBench. Code is available on our project website.

large language model, machine learning, natural language, (21 more...)

arXiv.org Machine Learning

2603.18736

Country: Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)

Genre: Research Report (0.50)

Industry: Law (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Neural Information Processing SystemsFeb-9-2026, 13:07:31 GMT

9cd013fe250ebffc853b386569ab18c0-Paper.pdf

evaluation, exposure mechanism, recommender system, (14 more...)

Country:

North America > United States > California > Santa Clara County > Sunnyvale (0.04)
North America > Canada (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Asia > Middle East > Jordan (0.04)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Personal Assistant Systems (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Neural Information Processing SystemsDec-24-2025, 08:57:21 GMT

Adversarial Counterfactual Learning and Evaluation for Recommender System

The feedback data of recommender systems are often subject to what was exposed to the users; however, most learning and evaluation methods do not account for the underlying exposure mechanism. We first show in theory that applying supervised learning to detect user preferences may end up with inconsistent results in the absence of exposure information. The counterfactual propensity-weighting approach from causal inference can account for the exposure mechanism; nevertheless, the partial-observation nature of the feedback data can cause identifiability issues. We propose a principled solution by introducing a minimax empirical risk formulation. We show that the relaxation of the dual problem can be converted to an adversarial game between two recommendation models, where the opponent of the candidate model characterizes the underlying exposure mechanism. We provide learning bounds and conduct extensive simulation studies to illustrate and justify the proposed approach over a broad range of recommendation settings, which shed insights on the various benefits of the proposed approach.

adversarial counterfactual learning and evaluation, name change, recommender system, (4 more...)

Technology: Information Technology > Artificial Intelligence > Representation & Reasoning > Personal Assistant Systems (0.64)

Neural Information Processing SystemsAug-15-2025, 09:34:23 GMT

9cd013fe250ebffc853b386569ab18c0-Paper.pdf

evaluation, exposure mechanism, recommender system, (14 more...)

Country:

North America > United States > California > Santa Clara County > Sunnyvale (0.04)
North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Asia > Middle East > Jordan (0.04)

Technology:

Information Technology > Data Science (0.95)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.93)
Information Technology > Artificial Intelligence > Representation & Reasoning > Personal Assistant Systems (0.71)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Neural Information Processing SystemsOct-10-2024, 21:59:34 GMT

Adversarial Counterfactual Learning and Evaluation for Recommender System

adversarial counterfactual learning and evaluation, exposure mechanism, recommender system, (1 more...)

Technology: Information Technology > Artificial Intelligence > Representation & Reasoning > Personal Assistant Systems (1.00)

arXiv.org Artificial IntelligenceOct-4-2024

Margin Matching Preference Optimization: Enhanced Model Alignment with Granular Feedback

Kim, Kyuyoung, Seo, Ah Jeong, Liu, Hao, Shin, Jinwoo, Lee, Kimin

Large language models (LLMs) fine-tuned with alignment techniques, such as reinforcement learning from human feedback, have been instrumental in developing some of the most capable AI systems to date. Despite their success, existing methods typically rely on simple binary labels, such as those indicating preferred outputs in pairwise preferences, which fail to capture the subtle differences in relative quality between pairs. To address this limitation, we introduce an approach called Margin Matching Preference Optimization (MMPO), which incorporates relative quality margins into optimization, leading to improved LLM policies and reward models. Specifically, given quality margins in pairwise preferences, we design soft target probabilities based on the Bradley-Terry model, which are then used to train models with the standard cross-entropy objective. Experiments with both human and AI feedback data demonstrate that MMPO consistently outperforms baseline methods, often by a substantial margin, on popular benchmarks including MT-bench and RewardBench. Notably, the 7B model trained with MMPO achieves state-of-the-art performance on RewardBench as of June 2024, outperforming other models of the same scale. Our analysis also shows that MMPO is more robust to overfitting, leading to better-calibrated models.

preprint arxiv, probability, score difference, (15 more...)

2410.03145

Genre: Research Report (0.82)

Industry: Government > Military (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Bansal, Hritik, Dang, John, Grover, Aditya

Peering Through Preferences: Unraveling Feedback Acquisition for Aligning Large Language Models

arXiv.org Artificial IntelligenceFeb-5-2024

Aligning large language models (LLMs) with human values and intents critically involves the use of human or AI feedback. While dense feedback annotations are expensive to acquire and integrate, sparse feedback presents a structural design choice between ratings (e.g., score Response A on a scale of 1-7) and rankings (e.g., is Response A better than Response B?). In this work, we analyze the effect of this design choice for the alignment and evaluation of LLMs. We uncover an inconsistency problem wherein the preferences inferred from ratings and rankings significantly disagree 60% for both human and AI annotators. Our subsequent analysis identifies various facets of annotator biases that explain this phenomena, such as human annotators would rate denser responses higher while preferring accuracy during pairwise judgments. To our surprise, we also observe that the choice of feedback protocol also has a significant effect on the evaluation of aligned LLMs. In particular, we find that LLMs that leverage rankings data for alignment (say model X) are preferred over those that leverage ratings data (say model Y), with a rank-based evaluation protocol (is X/Y's response better than reference response?) but not with a rating-based evaluation protocol (score Rank X/Y's response on a scale of 1-7). Our findings thus shed light on critical gaps in methods for evaluating the real-world utility of language models and their strong dependence on the feedback protocol used for alignment. Our code and data are available at https://github.com/Hritikbansal/sparse_feedback.

instruction, protocol, reward model, (15 more...)

2308.15812

Country:

North America > United States > California > Los Angeles County > Los Angeles (0.14)
Europe (0.04)
Asia (0.04)
Africa (0.04)

Genre: Research Report > New Finding (0.66)

Industry:

Leisure & Entertainment > Games > Computer Games (0.93)
Leisure & Entertainment > Sports (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.96)

arXiv.org Artificial IntelligenceNov-3-2023

Continually Improving Extractive QA via Human Feedback

Gao, Ge, Chen, Hung-Ting, Artzi, Yoav, Choi, Eunsol

We study continually improving an extractive question answering (QA) system via human user feedback. We design and deploy an iterative approach, where information-seeking users ask questions, receive model-predicted answers, and provide feedback. We conduct experiments involving thousands of user interactions under diverse setups to broaden the understanding of learning from feedback over time. Our experiments show effective improvement from user feedback of extractive QA models over time across different data regimes, including significant potential for domain adaptation.

interaction, paragraph, user feedback, (17 more...)

2305.12473

Country:

North America > United States > Illinois > Cook County > Chicago (0.05)
Asia > Japan > Kyūshū & Okinawa > Okinawa (0.04)
Asia > Middle East > Jordan (0.04)
(3 more...)

Genre: Research Report > New Finding (0.46)

Industry: Leisure & Entertainment > Sports > Baseball (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Inductive Learning (0.46)
Information Technology > Artificial Intelligence > Natural Language > Question Answering (0.35)

Xu, Hangtong, Xu, Yuanbo, Yang, Yongjian

Causal Structure Representation Learning of Confounders in Latent Space for Recommendation

arXiv.org Artificial IntelligenceNov-2-2023

Inferring user preferences from the historical feedback of users is a valuable problem in recommender systems. Conventional approaches often rely on the assumption that user preferences in the feedback data are equivalent to the real user preferences without additional noise, which simplifies the problem modeling. However, there are various confounders during user-item interactions, such as weather and even the recommendation system itself. Therefore, neglecting the influence of confounders will result in inaccurate user preferences and suboptimal performance of the model. Furthermore, the unobservability of confounders poses a challenge in further addressing the problem. To address these issues, we refine the problem and propose a more rational solution. Specifically, we consider the influence of confounders, disentangle them from user preferences in the latent space, and employ causal graphs to model their interdependencies without specific labels. By cleverly combining local and global causal graphs, we capture the user-specificity of confounders on user preferences. We theoretically demonstrate the identifiability of the obtained causal graph. Finally, we propose our model based on Variational Autoencoders, named Causal Structure representation learning of Confounders in latent space (CSC). We conducted extensive experiments on one synthetic dataset and five real-world datasets, demonstrating the superiority of our model. Furthermore, we demonstrate that the learned causal representations of confounders are controllable, potentially offering users fine-grained control over the objectives of their recommendation lists with the learned causal graphs.

causal graph, confounder, graph, (13 more...)

2311.03382

Country:

Asia > China > Jilin Province > Changchun (0.04)
North America > United States > New York > New York County > New York City (0.04)

Genre: Research Report (1.00)

Technology: Information Technology > Artificial Intelligence > Representation & Reasoning > Personal Assistant Systems (1.00)