Alignment is Localized: A Causal Probe into Preference Layers
