JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models
Jailbreak attacks cause large language models (LLMs) to generate harmful, unethical, or otherwise objectionable content. Evaluating these attacks presents a number of challenges, which the current collection of benchmarks and evaluation techniques does not adequately address. First, there is no clear standard of practice regarding jailbreaking evaluation. Second, existing works compute costs and success rates in incomparable ways. Third, numerous works are not reproducible, as they withhold adversarial prompts, involve closed-source code, or rely on evolving proprietary APIs.
Neural Information Processing Systems