BEEAR: Embedding-based Adversarial Removal of Safety Backdoors in Instruction-tuned Language Models
