JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models

Mar-20-2026, 22:25:46 GMT–Neural Information Processing Systems

Jailbreak attacks cause large language models (LLMs) to generate harmful, unethical, or otherwise objectionable content. Evaluating these attacks presents a number of challenges, which the current collection of benchmarks and evaluation techniques does not adequately address. First, there is no clear standard of practice regarding jailbreaking evaluation. Second, existing works compute costs and success rates in incomparable ways. And third, numerous works are not reproducible, as they withhold adversarial prompts, involve closed-source code, or rely on evolving proprietary APIs.

artificial intelligence, large language model, natural language

Technology:
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.82)