Activation Scaling for Steering and Interpreting Language Models
