Uncovering Safety Risks of Large Language Models through Concept Activation Vector
