Best-of-NJailbreaking

Jun-18-2026, 02:47:50 GMT–Neural Information Processing Systems

We introduce Best-of-N (BoN) Jailbreaking, a simple black-box algorithm that jailbreaks frontier AI systems across modalities. BoNJailbreaking works by repeatedly sampling variations of a prompt with a combination of augmentations--such as random shuffling or capitalization for textual prompts--until a harmful response is elicited. We find that BoNJailbreaking achieves high attack success rates (ASRs) on closed-source language models, such as 89% on GPT-4o and 78% on Claude 3.5 Sonnet when sampling 10,000 augmented prompts. Further, it is similarly effective at circumventing state-of-the-art open-source defenses like circuit breakers and reasoning models like o1. BoNalso seamlessly extends to other modalities: it jailbreaks vision language models (VLMs) such as GPT-4o and audio language models (ALMs) like Gemini 1.5 Pro, using modality-specific augmentations. BoNreliably improves when we sample more augmented prompts. Across all modalities, ASR, as a function of the number of samples (N), empirically follows power-law-like behavior for many orders of magnitude. BoNJailbreaking can also be composed with other black-box algorithms for even more effective attacks--combining BoNwith an optimized prefix attack achieves up to a 35% increase in ASR. Overall, our work indicates that, despite their capability, language models are sensitive to seemingly innocuous changes to inputs, which attackers can exploit across modalities.

large language model, machine learning, natural language, (17 more...)

Neural Information Processing Systems

Jun-18-2026, 02:47:50 GMT

Conferences PDF

Add feedback

Country:
- North America > United States (0.67)
- Europe > United Kingdom (0.67)

Genre:
- Research Report
  - New Finding (1.00)
  - Experimental Study (1.00)

Industry:
- Law Enforcement & Public Safety > Crime Prevention & Enforcement (1.00)
- Law > Criminal Law (1.00)
- Information Technology > Security & Privacy (1.00)
- Government > Military (1.00)
- Health & Medicine > Therapeutic Area (0.92)
- Education (0.92)
- Transportation > Air (0.86)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language
    - Large Language Model (1.00)
    - Chatbot (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (1.00)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found