Preference Learning for AI Alignment: a Causal Perspective

Kobalczyk, Katarzyna, van der Schaar, Mihaela

Jun-9-2025–arXiv.org Machine Learning

Reward modelling from preference data is a crucial step in aligning large language models (LLMs) with human values, requiring robust generalisation to novel prompt-response pairs. In this work, we propose to frame this problem in a causal paradigm, providing the rich toolbox of causality to identify the persistent challenges, such as causal misidentification, preference heterogeneity, and confounding due to user-specific factors. Inheriting from the literature of causal inference, we identify key assumptions necessary for reliable generalisation and contrast them with common data collection practices. We illustrate failure modes of naive reward models and demonstrate how causally-inspired approaches can improve model robustness. Finally, we outline desiderata for future research and practices, advocating targeted interventions to address inherent limitations of observational data.

large language model, machine learning, natural language, (19 more...)

arXiv.org Machine Learning

Jun-9-2025

arXiv.org PDF

Add feedback

Country:
- South America > Colombia
  - Meta Department > Villavicencio (0.04)
- North America
  - Canada (0.04)
  - United States
    - Massachusetts > Middlesex County
      - Cambridge (0.04)
    - Louisiana > Orleans Parish
      - New Orleans (0.04)
    - Florida > Miami-Dade County
      - Miami (0.04)
  - Mexico > Mexico City
    - Mexico City (0.04)
- Europe
  - Austria > Vienna (0.14)
  - Germany > Berlin (0.04)
  - United Kingdom > England
    - Cambridgeshire > Cambridge (0.14)
    - Oxfordshire > Oxford (0.14)
  - Italy > Calabria
    - Catanzaro Province > Catanzaro (0.04)
  - Ireland > Leinster
    - County Dublin > Dublin (0.04)
- Asia
  - Singapore (0.04)
  - Indonesia > Bali (0.04)
  - Thailand > Bangkok
    - Bangkok (0.04)

Genre:
- Research Report
  - New Finding (0.67)
  - Experimental Study (0.46)

Technology:
- Information Technology > Artificial Intelligence
  - Representation & Reasoning (1.00)
  - Natural Language > Large Language Model (0.88)
  - Machine Learning > Neural Networks
    - Deep Learning (0.46)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found