Jailbroken: How Does LLM Safety Training Fail?
Large language models trained for safety and harmlessness remain susceptible to adversarial misuse, as evidenced by the prevalence of "jailbreak" attacks on early releases of ChatGPT that elicit undesired behavior. Going beyond recognition of the issue, we investigate why such attacks succeed and how they can be created.
Neural Information Processing Systems