Jailbroken: How Does LLM Safety Training Fail?
Large language models trained for safety and harmlessness remain susceptible to adversarial misuse, as evidenced by the prevalence of "jailbreak" attacks on early releases of ChatGPT that elicit undesired behavior. Going beyond recognition of the issue, we investigate why such attacks succeed and how they can be created.
Neural Information Processing Systems