Value Drifts: Tracing Value Alignment During LLM Post-Training
Bhatia, Mehar; Nayak, Shravan; Kamath, Gaurav; Mosbach, Marius; Stańczak, Karolina; Shwartz, Vered; Reddy, Siva
arXiv.org Artificial Intelligence
As LLMs occupy an increasingly important role in society, they more often face questions that require them not only to draw on their general knowledge but also to align with certain human value systems. Therefore, studying the alignment of LLMs with human values has become a crucial field of inquiry. Prior work, however, mostly focuses on evaluating the alignment of fully trained models, overlooking the training dynamics by which models learn to express human values. In this work, we investigate how and at which stage value alignment arises during the course of a model's post-training. Our analysis disentangles the effects of post-training algorithms and datasets, measuring both the magnitude and the timing of value drifts during training. Experimenting with Llama-3 and Qwen-3 models of different sizes and popular supervised fine-tuning (SFT) and preference optimization datasets and algorithms, we find that the SFT phase generally establishes a model's values, and subsequent preference optimization rarely re-aligns these values. Furthermore, using a synthetic preference dataset that enables controlled manipulation of values, we find that different preference optimization algorithms lead to different value alignment outcomes, even when the preference data is held constant. Our findings provide actionable insights into how values are learned during post-training and help to inform data curation, as well as the selection of models and algorithms for preference optimization to improve model alignment to human values.
Oct-31-2025
- Country:
- Asia > China > Hong Kong (0.04)
- Asia > Middle East > Jordan (0.04)
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)
- Asia > Thailand > Bangkok > Bangkok (0.04)
- Europe > Austria > Vienna (0.14)
- Europe > Latvia > Lubāna Municipality > Lubāna (0.04)
- Europe > Switzerland > Zürich > Zürich (0.04)
- North America > Canada > British Columbia (0.04)
- North America > Canada > Ontario > Toronto (0.04)
- North America > Canada > Quebec > Montreal (0.04)
- North America > United States > California > Los Angeles County > Los Angeles (0.14)
- North America > United States > California > San Diego County > San Diego (0.04)
- North America > United States > Florida > Miami-Dade County > Miami (0.04)
- Genre:
- Research Report > New Finding (1.00)
- Industry:
- Banking & Finance > Economy (0.93)
- Education (1.00)
- Government > Immigration & Customs (1.00)
- Government > Regional Government (0.71)
- Health & Medicine > Therapeutic Area > Obstetrics/Gynecology (0.49)
- Law > Civil Rights & Constitutional Law (1.00)
- Law Enforcement & Public Safety > Crime Prevention & Enforcement (1.00)
- Technology: