Targeting Alignment: Extracting Safety Classifiers of Aligned LLMs
