The Anatomy of Alignment: Decomposing Preference Optimization by Steering Sparse Features
Jeremias Ferrao, Matthijs van der Lende, Ilija Lichkovski, Clement Neo
–arXiv.org Artificial Intelligence
Prevailing alignment methods induce opaque parameter changes, obscuring what models truly learn. To address this, we introduce Feature Steering with Reinforcement Learning (FSRL), a framework that trains a lightweight adapter to steer model behavior by modulating interpretable sparse features. First, we theoretically demonstrate that this mechanism is expressive enough to approximate the behavioral shifts of post-training processes. We then apply FSRL to preference optimization and perform a causal analysis of the learned policy. Our analysis reveals a crucial insight: the model learns to reward stylistic presentation as a proxy for quality, relying disproportionately on features related to style and formatting rather than on those tied to alignment concepts such as honesty. Because FSRL effectively optimizes the preference objective, it serves as a transparent proxy for observing the alignment process. Overall, FSRL offers an interpretable control interface and a practical way to diagnose how preference-optimization pressures manifest at the feature level.
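The abstract describes FSRL only at a high level. The sketch below illustrates one plausible realization, assuming a frozen sparse autoencoder (SAE) exposing interpretable features of a hidden state, with a small trainable adapter that outputs per-feature modulation coefficients whose decoded effect is added back to the residual stream. The class names (`FrozenSAE`, `SteeringAdapter`), the dimensions, and the additive steering rule are hypothetical choices for illustration, not the authors' implementation.

```python
# Minimal sketch of FSRL-style feature steering (an assumption-based
# illustration, not the paper's released code). A frozen SAE exposes
# sparse features; a lightweight adapter modulates them, and only the
# decoded *change* is injected back into the residual stream.

import torch
import torch.nn as nn


class FrozenSAE(nn.Module):
    """Stand-in for a pretrained sparse autoencoder (weights frozen)."""

    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)
        for p in self.parameters():
            p.requires_grad_(False)

    def encode(self, h: torch.Tensor) -> torch.Tensor:
        # ReLU keeps feature activations sparse and non-negative.
        return torch.relu(self.encoder(h))

    def decode(self, f: torch.Tensor) -> torch.Tensor:
        return self.decoder(f)


class SteeringAdapter(nn.Module):
    """Lightweight trainable adapter producing feature modulations."""

    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.proj = nn.Linear(d_model, d_features)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # One steering coefficient per SAE feature, conditioned on h.
        return self.proj(h)


def steer(h: torch.Tensor, sae: FrozenSAE, adapter: SteeringAdapter) -> torch.Tensor:
    """Apply a feature-space edit to the residual stream."""
    feats = sae.encode(h)                # interpretable feature activations
    delta = adapter(h)                   # learned per-feature modulation
    steered = torch.relu(feats + delta)  # modulated, still non-negative
    # Inject only the change so the base computation is preserved
    # when the adapter outputs zeros.
    return h + sae.decode(steered - feats)


if __name__ == "__main__":
    d_model, d_features = 512, 4096
    sae = FrozenSAE(d_model, d_features)
    adapter = SteeringAdapter(d_model, d_features)
    h = torch.randn(2, 8, d_model)        # (batch, seq, d_model)
    print(steer(h, sae, adapter).shape)   # torch.Size([2, 8, 512])
```

Injecting only the decoded difference keeps the base model untouched when the adapter is inactive, which is one simple way to make the learned policy legible at the level of individual sparse features, as the causal analysis in the abstract requires.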
Dec-2-2025