Towards Best Practices of Activation Patching in Language Models: Metrics and Methods
