Why Alignment Must Precede Distillation: A Minimal Working Explanation
arXiv.org Artificial Intelligence
For efficiency, preference alignment is often performed on compact, knowledge-distilled (KD) models. We argue that this common practice introduces a significant limitation by overlooking a key property of the alignment's reference model: its distributional recall. We show that the standard KD→Align workflow diminishes the model's capacity to align rare yet desirable behaviors, even under strong preference signals. We instead demonstrate that reversing the pipeline (i.e., Align→KD) is essential: alignment must first be performed on a high-recall reference before distillation. First, we provide a minimal working explanation of how the reference model constrains preference-alignment objectives at a fundamental level. Second, we validate this theory in a controllable Mixture-of-Gaussians experiment, where low-recall anchoring consistently results in suboptimal model performance. Finally, we demonstrate that the same phenomenon holds in LLM alignment with the SmolLM2 family: models aligned after KD fail to align target behaviors effectively, resulting in substantially lower reward and target precision. In contrast, our proposed Align→KD pipeline robustly aligns these behaviors, yielding models with superior target-oriented metrics and lower variance. Together, these results establish reference-model recall as a first-order design choice in alignment, offering a clear principle: alignment must precede distillation.

The alignment of large language models (LLMs) with human preferences has emerged as a central challenge in modern AI research. Building on pretrained models with vast general knowledge, algorithms such as Reinforcement Learning from Human Feedback (RLHF; Ziegler et al., 2019; Stiennon et al., 2020; Ouyang et al., 2022) via PPO (Schulman et al., 2017) and Direct Preference Optimization (DPO; Rafailov et al., 2023) have become standard methods.
RLHF generally formulates alignment as reward maximization under a Kullback-Leibler (KL) penalty to a fixed reference model, while DPO reparameterizes preference learning into a pairwise loss that still anchors to the same reference.
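The KL anchor is what makes reference recall decisive: the KL-regularized objective admits the well-known closed-form optimum pi*(y|x) proportional to pi_ref(y|x) * exp(r(x,y)/beta), so any behavior to which pi_ref assigns near-zero probability remains near-zero after alignment, regardless of how large its reward is. The following is a minimal numerical sketch of this effect in the spirit of the Mixture-of-Gaussians experiment; it is our own toy construction, and the mixture weights, reward shape, and beta are illustrative assumptions rather than the paper's exact setup.

```python
import numpy as np

# Closed form of the KL-regularized optimum: pi*(y) ∝ pi_ref(y) * exp(r(y) / beta).
# If the reference (e.g. a distilled, low-recall model) has collapsed a rare mode,
# exp(r/beta) multiplies near-zero mass and the aligned policy cannot recover it.

def gauss(y, mu, sigma):
    """Gaussian density evaluated on a grid."""
    return np.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

y = np.linspace(-6.0, 8.0, 4001)
dy = y[1] - y[0]
beta = 0.5
reward = 2.0 * gauss(y, 4.0, 0.5)  # reward concentrated on a rare target mode at y = 4

# High-recall reference keeps 5% mass on the target mode; the low-recall
# (post-distillation) reference has collapsed that mode entirely.
ref_high = 0.95 * gauss(y, 0.0, 1.0) + 0.05 * gauss(y, 4.0, 0.5)
ref_low = gauss(y, 0.0, 1.0)

def align(ref):
    """KL-regularized optimal policy on the grid, renormalized numerically."""
    p = ref * np.exp(reward / beta)
    return p / (p.sum() * dy)

target = y > 2.0  # region around the rewarded rare mode
mass_high = align(ref_high)[target].sum() * dy
mass_low = align(ref_low)[target].sum() * dy
print(f"target mass with high-recall reference: {mass_high:.3f}")
print(f"target mass with low-recall reference:  {mass_low:.3f}")
```

Under these assumed parameters the high-recall reference lets alignment move substantial probability onto the target mode, while the low-recall reference leaves it almost untouched, mirroring the low-recall-anchoring failure described above.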
Sep-30-2025