AITopics | jailbreak

Collaborating Authors

jailbreak

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Tree of Attacks: Jailbreaking Black-Box LLMs Automatically

Neural Information Processing SystemsMar-21-2026, 02:37:18 GMT

While Large Language Models (LLMs) display versatile functionality, they continue to generate harmful, biased, and toxic content, as demonstrated by the prevalence of human-designed . In this work, we present (TAP), an automated method for generating jailbreaks that only requires black-box access to the target LLM. TAP utilizes an attacker LLM to iteratively refine candidate (attack) prompts until one of the refined prompts jailbreaks the target. In addition, before sending prompts to the target, TAP assesses them and prunes the ones unlikely to result in jailbreaks, reducing the number of queries sent to the target LLM. In empirical evaluations, we observe that TAP generates prompts that jailbreak state-of-the-art LLMs (including GPT4-Turbo and GPT4o) for more than 80% of the prompts. This significantly improves upon the previous state-of-the-art black-box methods for generating jailbreaks while using a smaller number of queries than them. Furthermore, TAP is also capable of jailbreaking LLMs protected by state-of-the-art, e.g., LlamaGuard.

large language model, machine learning, natural language, (10 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.51)

Add feedback

Robust Prompt Optimization for Defending Language Models Against Jailbreaking Attacks

Neural Information Processing SystemsMar-20-2026, 03:55:49 GMT

Despite advances in AI alignment, large language models (LLMs) remain vulnerable to adversarial attacks or jailbreaking, in which adversaries can modify prompts to induce unwanted behavior. While some defenses have been proposed, they have not been adapted to newly proposed attacks and more challenging threat models. To address this, we propose an optimization-based objective for defending LLMs against jailbreaking attacks and an algorithm, Robust Prompt Optimization (RPO), to create robust system-level defenses. Our approach directly incorporates the adversary into the defensive objective and optimizes a lightweight and transferable suffix, enabling RPO to adapt to worst-case adaptive attacks. Our theoretical and experimental results show improved robustness to both jailbreaks seen during optimization and unknown jailbreaks, reducing the attack success rate (ASR) on GPT-4 to 6% and Llama-2 to 0% on JailbreakBench, setting the state-of-the-art.

large language model, machine learning, natural language, (8 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.61)

Add feedback

Many-shot Jailbreaking

Neural Information Processing SystemsFeb-18-2026, 13:56:58 GMT

Longer contexts present a new attack surface for adversarial attacks. In search of a "fruit-fly" of long-context vulnerabilities, we study Many-shot Jailbreaking (MSJ; Figure 1), a simple yet effective and scalable jailbreak.

large language model, machine learning, natural language, (19 more...)

Neural Information Processing Systems

Country:

Europe > Monaco (0.04)
Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
Asia > Japan > Honshū > Tōhoku > Iwate Prefecture > Morioka (0.04)
Asia > Japan > Honshū > Chūbu > Toyama Prefecture > Toyama (0.04)

Genre:

Research Report > New Finding (0.93)
Research Report > Experimental Study (0.93)

Industry:

Law Enforcement & Public Safety (1.00)
Law (1.00)
Information Technology > Security & Privacy (1.00)
(2 more...)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(2 more...)

Add feedback

e2e06adf560b0706d3b1ddfca9f29756-Supplemental-Datasets_and_Benchmarks_Track.pdf

Neural Information Processing SystemsFeb-18-2026, 11:05:58 GMT

large language model, machine learning, natural language, (23 more...)

Neural Information Processing Systems

Country: North America > United States (0.14)

Industry:

Law > Criminal Law (1.00)
Law Enforcement & Public Safety > Crime Prevention & Enforcement (1.00)
Health & Medicine > Therapeutic Area > Psychiatry/Psychology (0.93)
(2 more...)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
(2 more...)

Add feedback

e2e06adf560b0706d3b1ddfca9f29756-Paper-Datasets_and_Benchmarks_Track.pdf

Neural Information Processing SystemsFeb-18-2026, 11:05:54 GMT

large language model, machine learning, natural language, (18 more...)

Neural Information Processing Systems

Genre: Research Report > New Finding (1.00)

Industry:

Law Enforcement & Public Safety > Crime Prevention & Enforcement (1.00)
Information Technology > Security & Privacy (1.00)
Government (1.00)
(2 more...)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

fd6613131889a4b656206c50a8bd7790-Paper-Conference.pdf

Neural Information Processing SystemsFeb-18-2026, 03:22:18 GMT

Competing objectives arise when a model's capabilities

large language model, machine learning, natural language, (21 more...)

Neural Information Processing Systems

Country: North America > United States (0.28)

Genre: Research Report > New Finding (1.00)

Industry:

Law Enforcement & Public Safety > Crime Prevention & Enforcement (1.00)
Law (1.00)
Information Technology > Security & Privacy (1.00)
(2 more...)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(2 more...)

Add feedback

70702e8cbb4890b4a467b984ae59828a-Paper-Conference.pdf

Neural Information Processing SystemsFeb-15-2026, 18:32:38 GMT

large language model, machine learning, natural language, (20 more...)

Neural Information Processing Systems

Country:

Asia > China > Hong Kong (0.04)
Oceania > Australia > Victoria > Melbourne (0.04)
North America > United States > California > San Diego County > San Diego (0.04)
(7 more...)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry:

Transportation (1.00)
Media > News (1.00)
Law (1.00)
(3 more...)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(3 more...)

Add feedback

54024fca0cef9911be36319e622cde38-Paper-Conference.pdf

Neural Information Processing SystemsFeb-13-2026, 11:17:26 GMT

With a seed set of manually-identified tactics, we apply GPT -4 to expand the discovery automatically.

large language model, machine learning, natural language, (22 more...)

Neural Information Processing Systems

Country:

North America > United States (1.00)
Africa > South Africa (0.04)
Europe > Latvia > Lubāna Municipality > Lubāna (0.04)
(5 more...)

Genre:

Research Report > Experimental Study (1.00)
Workflow (0.67)
Research Report > New Finding (0.67)

Industry:

Media (1.00)
Law > Civil Rights & Constitutional Law (1.00)
Information Technology > Security & Privacy (1.00)
(8 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.92)

Add feedback

Robust Prompt Optimization for Defending Language Models Against Jailbreaking Attacks Andy Zhou

Neural Information Processing SystemsFeb-12-2026, 06:45:59 GMT

Despite advances in AI alignment, large language models (LLMs) remain vulnerable to adversarial attacks or jailbreaking, in which adversaries can modify prompts to induce unwanted behavior.

large language model, machine learning, natural language, (22 more...)

Neural Information Processing Systems

Country: