How Alignment and Jailbreak Work: Explain LLM Safety through Intermediate Hidden States
