On the Limitations of Steering in Language Model Alignment
