Refusal in Language Models Is Mediated by a Single Direction

Open in new window