 crescendo


PLAGUE: Plug-and-play framework for Lifelong Adaptive Generation of Multi-turn Exploits

Bhuiya, Neeladri, Aggarwal, Madhav, Purwar, Diptanshu

arXiv.org Artificial Intelligence

Large Language Models (LLMs) are improving at an exceptional rate. With the advent of agentic workflows, multi-turn dialogue has become the de facto mode of interaction with LLMs for completing long and complex tasks. Although LLM capabilities continue to improve, the models are increasingly susceptible to jailbreaking, especially in multi-turn scenarios where harmful intent can be subtly injected across the conversation to produce nefarious outcomes. While single-turn attacks have been extensively explored, adaptability, efficiency and effectiveness remain key challenges for their multi-turn counterparts. To address these gaps, we present PLAGUE, a novel plug-and-play framework for designing multi-turn attacks inspired by lifelong-learning agents. PLAGUE dissects the lifetime of a multi-turn attack into three carefully designed phases (Primer, Planner and Finisher) that enable a systematic and information-rich exploration of the multi-turn attack family. Evaluations show that red-teaming agents designed using PLAGUE achieve state-of-the-art jailbreaking results, improving attack success rates (ASR) by more than 30% across leading models within a smaller or comparable query budget. In particular, PLAGUE enables an ASR (based on StrongReject) of 81.4% on OpenAI's o3 and 67.3% on Anthropic's Claude Opus 4.1, two models that are considered highly resistant to jailbreaks in the safety literature. Our work offers tools and insights to understand the importance of plan initialization, context optimization and lifelong learning in crafting multi-turn attacks for a comprehensive model vulnerability evaluation.


Capability-Based Scaling Laws for LLM Red-Teaming

Panfilov, Alexander, Kassianik, Paul, Andriushchenko, Maksym, Geiping, Jonas

arXiv.org Artificial Intelligence

As large language models grow in capability and agency, identifying vulnerabilities through red-teaming becomes vital for safe deployment. However, traditional prompt-engineering approaches may prove ineffective once red-teaming turns into a weak-to-strong problem, where target models surpass red-teamers in capabilities. To study this shift, we frame red-teaming through the lens of the capability gap between attacker and target. We evaluate more than 500 attacker-target pairs using LLM-based jailbreak attacks that mimic human red-teamers across diverse families, sizes, and capability levels. Three strong trends emerge: (i) more capable models are better attackers, (ii) attack success drops sharply once the target's capability exceeds the attacker's, and (iii) attack success rates correlate with high performance on social science splits of the MMLU-Pro benchmark. From these trends, we derive a jailbreaking scaling law that predicts attack success for a fixed target based on attacker-target capability gap. These findings suggest that fixed-capability attackers (e.g., humans) may become ineffective against future models, increasingly capable open-source models amplify risks for existing systems, and model providers must accurately measure and control models' persuasive and manipulative abilities to limit their effectiveness as attackers.


Derail Yourself: Multi-turn LLM Jailbreak Attack through Self-discovered Clues

Ren, Qibing, Li, Hao, Liu, Dongrui, Xie, Zhanxu, Lu, Xiaoya, Qiao, Yu, Sha, Lei, Yan, Junchi, Ma, Lizhuang, Shao, Jing

arXiv.org Artificial Intelligence

This study exposes the safety vulnerabilities of Large Language Models (LLMs) in multi-turn interactions, where malicious users can obscure harmful intents across several queries. We introduce ActorAttack, a novel multi-turn attack method inspired by actor-network theory, which models a network of semantically linked actors as attack clues to generate diverse and effective attack paths toward harmful targets. ActorAttack addresses two main challenges in multi-turn attacks: (1) concealing harmful intents by creating an innocuous conversation topic about the actor, and (2) uncovering diverse attack paths towards the same harmful target by leveraging LLMs' knowledge to specify the correlated actors as various attack clues. In this way, ActorAttack outperforms existing single-turn and multi-turn attack methods across advanced aligned LLMs, even for GPT-o1. We will publish a dataset called SafeMTData, which includes multi-turn adversarial prompts and safety alignment data, generated by ActorAttack. We demonstrate that models safety-tuned using our safety dataset are more robust to multi-turn attacks. Code is available at https://github.com/renqibing/ActorAttack.


Great, Now Write an Article About That: The Crescendo Multi-Turn LLM Jailbreak Attack

Russinovich, Mark, Salem, Ahmed, Eldan, Ronen

arXiv.org Artificial Intelligence

Large Language Models (LLMs) have risen significantly in popularity and are increasingly being adopted across multiple applications. These LLMs are heavily aligned to resist engaging in illegal or unethical topics as a means to avoid contributing to responsible AI harms. However, a recent line of attacks, known as "jailbreaks", seeks to overcome this alignment. Intuitively, jailbreak attacks aim to narrow the gap between what the model can do and what it is willing to do. In this paper, we introduce a novel jailbreak attack called Crescendo. Unlike existing jailbreak methods, Crescendo is a multi-turn jailbreak that interacts with the model in a seemingly benign manner. It begins with a general prompt or question about the task at hand and then gradually escalates the dialogue by referencing the model's replies, progressively leading to a successful jailbreak. We evaluate Crescendo on various public systems, including ChatGPT, Gemini Pro, Gemini-Ultra, LLaMA-2 70b Chat, and Anthropic Chat. Our results demonstrate the strong efficacy of Crescendo, which achieves high attack success rates across all evaluated models and tasks. Furthermore, we introduce Crescendomation, a tool that automates the Crescendo attack, and our evaluation showcases its effectiveness against state-of-the-art models.


Crescendo.ai - Data Science and AI R&D Firm

#artificialintelligence

Crescendo.ai is a data science firm with its own AI R&D center. We are a private Swiss initiative, with operations across Europe. We build and implement AI-powered solutions and tools to help public and private companies make smarter decisions. With hundreds of thousands of valuable data entries across our own properties, we are developing powerful platforms and systems to help us make better and swifter decisions. Our goal is to allow public and private companies to tap into the incredible potential of AI and Machine Learning.


Why Music Makes Us Feel According to Artificial Intelligence

#artificialintelligence

Your heart beats faster, your palms sweat and a part of your brain called Heschl's gyrus lights up like a Christmas tree. Chances are, you've never thought in such detail about what happens to your brain and body when you listen to music. But it's a question that has puzzled scientists for decades: Why does something as abstract as music provoke such a consistent response? In a new study, a team of USC researchers, with the help of artificial intelligence, investigated how music affects listeners' brains, bodies and emotions. The research team looked at heart rate, galvanic skin response (or sweat gland activity), brain activity and subjective feelings of happiness and sadness in a group of volunteers as they listened to three pieces of unfamiliar music.

