jailbreak
Analogy-based Multi-Turn Jailbreak against Large Language Models
Large language models (LLMs) are inherently designed to support multi-turn interactions, which opens up new possibilities for jailbreak attacks that unfold gradually and potentially bypass safety mechanisms more effectively than single-turn attacks. However, current multi-turn jailbreak methods are still in their early stages and suffer from two key limitations. First, they all inherently require inserting sensitive phrases into the context, which makes the dialogue appear suspicious and increases the likelihood of rejection, undermining the effectiveness of the attack. Second, even when harmful content is generated, the response often fails to align with the malicious prompt due to semantic drift, where the conversation slowly moves away from its intended goal. To address these challenges, we propose an analogy-based black-box multi-turn jailbreak framework that constructs fully benign contexts to improve attack success rate while ensuring semantic alignment with the malicious intent. The method first guides the model through safe tasks that mirror the response structure of the malicious prompt, enabling it to internalize the format without exposure to sensitive content. A controlled semantic shift is then introduced in the final turn, substituting benign elements with malicious ones while preserving structural coherence. Experiments on six commercial and open-source LLMs and two benchmark datasets show that our method significantly improves attack performance, achieving an average attack success rate of 93.3% and outperforming five competitive baselines. Our code is released at AMA. WARNING: This paper contains potentially unsafe examples.
VERA: Variational Inference Framework for Jailbreaking Large Language Models
The rise of API-only access to state-of-the-art LLMs highlights the need for effective black-box jailbreak methods to identify model vulnerabilities in real-world settings. Without a principled objective for gradient-based optimization, most existing approaches rely on genetic algorithms, which are limited by their initialization and dependence on manually curated prompt pools. Furthermore, these methods require individual optimization for each prompt, failing to provide a comprehensive characterization of model vulnerabilities. To address this gap, we introduce VERA: Variational infErence fRamework for jAilbreaking. VERA casts black-box jailbreak prompting as a variational inference problem, training a small attacker LLM to approximate the target LLM's posterior over adversarial prompts. Once trained, the attacker can generate diverse, fluent jailbreak prompts for a target query without re-optimization. Experimental results show that VERA achieves strong performance across a range of target LLMs, highlighting the value of probabilistic inference for adversarial prompt generation.
Best-of-N Jailbreaking
We introduce Best-of-N (BoN) Jailbreaking, a simple black-box algorithm that jailbreaks frontier AI systems across modalities. BoN Jailbreaking works by repeatedly sampling variations of a prompt with a combination of augmentations--such as random shuffling or capitalization for textual prompts--until a harmful response is elicited. We find that BoN Jailbreaking achieves high attack success rates (ASRs) on closed-source language models, such as 89% on GPT-4o and 78% on Claude 3.5 Sonnet when sampling 10,000 augmented prompts. Further, it is similarly effective at circumventing state-of-the-art open-source defenses like circuit breakers and reasoning models like o1. BoN also seamlessly extends to other modalities: it jailbreaks vision language models (VLMs) such as GPT-4o and audio language models (ALMs) like Gemini 1.5 Pro, using modality-specific augmentations. BoN reliably improves when we sample more augmented prompts. Across all modalities, ASR, as a function of the number of samples (N), empirically follows power-law-like behavior for many orders of magnitude. BoN Jailbreaking can also be composed with other black-box algorithms for even more effective attacks--combining BoN with an optimized prefix attack achieves up to a 35% increase in ASR. Overall, our work indicates that, despite their capability, language models are sensitive to seemingly innocuous changes to inputs, which attackers can exploit across modalities.
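The power-law-like scaling of ASR with N mentioned above can be checked with a straight-line fit in log-log space. The sketch below is only an illustration under stated assumptions: it assumes the quantity that follows a power law is -log(ASR) = a * N^(-b), the function names `fit_power_law` and `predicted_asr` are hypothetical, and the data points are synthetic values generated from an exact power law, not measurements from the abstract.

```python
import math

def fit_power_law(ns, ys):
    """Least-squares fit of y = a * n**(-b) via a line in log-log space."""
    xs = [math.log(n) for n in ns]
    ls = [math.log(y) for y in ys]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_l = sum(ls) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (l - mean_l) for x, l in zip(xs, ls))
    slope = sxy / sxx
    intercept = mean_l - slope * mean_x
    return math.exp(intercept), -slope  # (a, b)

def predicted_asr(n, a, b):
    """ASR implied by the assumed model -log(ASR) = a * n**(-b)."""
    return math.exp(-a * n ** (-b))

# Synthetic (not measured) points drawn from an exact power law with a=2.0, b=0.3.
sample_counts = [10, 100, 1000, 10000]
neg_log_asr = [2.0 * n ** -0.3 for n in sample_counts]

a_hat, b_hat = fit_power_law(sample_counts, neg_log_asr)
```

With real measurements, the fitted `a_hat` and `b_hat` could be passed to `predicted_asr` to extrapolate to larger N, though such extrapolation is only as reliable as the assumed functional form.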
The White House Wants Anthropic to Block All Jailbreaks. That May Not Be Possible
Trump administration officials tell WIRED that if Anthropic wants to rerelease Fable 5, it will need to ensure the model's guardrails can't be circumvented. Security experts say that can't be done. The Trump administration's disagreement with Anthropic over its most advanced AI models appears to be fast coming to a head. Trump officials tell Inner Loop that if Anthropic wants to rerelease Claude Fable 5, the AI model that they took offline with export controls last week over concerns about jailbreaking--a method of using prompts to get around a model's safeguards--the company will need to take steps to actually address what the government alleges are vulnerabilities. Anthropic has said for days that the administration's concerns are overblown and that the effects of the jailbreaks are minimal.
Comparison requires valid measurement: Rethinking attack success rate comparisons in AI red teaming
We argue that conclusions drawn about relative system safety or attack method efficacy via AI red teaming are often not supported by evidence provided by attack success rate (ASR) comparisons. We show, through conceptual, theoretical, and empirical contributions, that many conclusions are founded on apples-to-oranges comparisons or low-validity measurements. Our arguments are grounded in asking a simple question: When can attack success rates be meaningfully compared? To answer this question, we draw on ideas from social science measurement theory and inferential statistics, which, taken together, provide a conceptual grounding for understanding when numerical values obtained through the quantification of system attributes can be meaningfully compared. Through this lens, we articulate conditions under which ASRs can and cannot be meaningfully compared. Using jailbreaking as a running example, we provide examples and extensive discussion of apples-to-oranges ASR comparisons and measurement validity challenges.
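As a small illustration of the inferential-statistics side of this argument, the sketch below attaches a Wilson score interval to each ASR before comparing two of them. It is not the paper's procedure: the helper names `wilson_interval` and `intervals_overlap` and the counts are hypothetical, and the check covers sampling uncertainty only, not the measurement-validity conditions the abstract discusses (same prompts, same judge, same success criterion).

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    z2 = z * z
    denom = 1 + z2 / trials
    center = (p + z2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z2 / (4 * trials * trials)) / denom
    return center - half, center + half

def intervals_overlap(a, b):
    """True when two closed intervals (low, high) share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]

# Hypothetical counts: attack A succeeds on 45 of 50 prompts, attack B on 40 of 50.
interval_a = wilson_interval(45, 50)
interval_b = wilson_interval(40, 50)
overlap = intervals_overlap(interval_a, interval_b)
```

Overlapping intervals are not a formal hypothesis test, but they are a quick signal that a 90% versus 80% headline difference on 50 prompts may not be distinguishable from sampling noise.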
LARGO: Latent Adversarial Reflection through Gradient Optimization for Jailbreaking LLMs
Efficient red-teaming methods for uncovering vulnerabilities in Large Language Models (LLMs) are crucial. While recent attacks often use LLMs as optimizers, the discrete language space makes gradient-based methods struggle. We introduce LARGO (Latent Adversarial Reflection through Gradient Optimization), a novel latent self-reflection attack that reasserts the power of gradient-based optimization for generating fluent jailbreaking prompts. By operating within the LLM's continuous latent space, LARGO first optimizes an adversarial latent vector and then recursively calls the same LLM to decode the latent into natural language. This methodology yields a fast, effective, and transferable attack that produces fluent and stealthy prompts.
The White House Is Ratcheting Up Its War Against Anthropic
This is how America loses the AI race. In theory, Donald Trump has a consistent position on AI. On the first full day of his second term, the president declared that he would use his full authority to speed the AI industry along and, in particular, to beat China in the AI race: "We have an emergency," he said. "We have to get this stuff built." If AI is poised to become the most important technology ever made, the thinking goes, whichever country commands the most powerful bots will dominate the rest of the century and beyond. The government, it seemed, would just get out of Silicon Valley's way.
Anthropic blocks all customers' access to Fable 5 and Mythos 5
It's to ensure compliance with a government directive citing national security concerns. Anthropic has disabled all of its customers' access to Fable 5 and Mythos 5 in order to ensure compliance with an order it received from the government on Friday, June 12. All its other models and its Claude chatbot are not affected. The company said in its announcement that the US government wanted it to suspend all foreign nationals' access to its newly launched AI models, whether they're inside or outside the US and even if they're Anthropic employees, citing national security concerns. While the US government didn't specify those concerns, Anthropic believes that it's because the government heard about a method of jailbreaking Fable 5.
Anthropic Says It's Taking Claude Fable 5 Offline to Comply With US Government Order
"The government believes it has become aware of a method of bypassing, or 'jailbreaking' Fable 5," the company said in a blog post. Anthropic says it's disabling two AI models it launched earlier this week, Claude Fable 5 and Mythos 5, to comply with an export control directive it received Friday afternoon from the US government citing national security concerns. The unprecedented incident marks the latest flashpoint between Anthropic and the Trump administration. While the company says the order asked it to suspend access to "any foreign national, whether inside or outside the United States, including foreign national Anthropic employees," it has removed access for all of its customers to ensure compliance. Earlier this year, Trump's Department of Defense labeled Anthropic a "supply chain risk" after the Claude-maker sought to draw red lines over how the US military could use its technology.
SafePTR: Token-Level Jailbreak Defense in Multimodal LLMs via Prune-then-Restore Mechanism
By incorporating visual inputs, Multimodal Large Language Models (MLLMs) extend LLMs to support visual reasoning. However, this integration also introduces new vulnerabilities, making MLLMs susceptible to multimodal jailbreak attacks and hindering their safe deployment. Existing defense methods, including Image-to-Text Translation, Safe Prompting, and Multimodal Safety Tuning, attempt to address this by aligning multimodal inputs with LLMs' built-in safeguards. Yet, they fall short in uncovering root causes of multimodal vulnerabilities, particularly how harmful multimodal tokens trigger jailbreaks in MLLMs. Consequently, they remain vulnerable to text-driven multimodal attacks, often exhibiting overdefensive behaviors and imposing heavy training overhead.