The Hydra Effect: Emergent Self-repair in Language Model Computations
