Refusal in Language Models Is Mediated by a Single Direction
Andy Arditi
