The Anatomy of Alignment: Decomposing Preference Optimization by Steering Sparse Features
