
Collaborating Authors

 Ren, Qibing


Derail Yourself: Multi-turn LLM Jailbreak Attack through Self-discovered Clues

arXiv.org Artificial Intelligence

This study exposes the safety vulnerabilities of Large Language Models (LLMs) in multi-turn interactions, where malicious users can obscure harmful intents across several queries. We introduce ActorAttack, a novel multi-turn attack method inspired by actor-network theory, which models a network of semantically linked actors as attack clues to generate diverse and effective attack paths toward harmful targets. ActorAttack addresses two main challenges in multi-turn attacks: (1) concealing harmful intents by creating an innocuous conversation topic about the actor, and (2) uncovering diverse attack paths toward the same harmful target by leveraging the LLM's knowledge to specify correlated actors as distinct attack clues. In this way, ActorAttack outperforms existing single-turn and multi-turn attack methods across advanced aligned LLMs, even against GPT-o1. We will publish SafeMTData, a dataset of multi-turn adversarial prompts and safety alignment data generated by ActorAttack, and we demonstrate that models safety-tuned on this dataset are more robust to multi-turn attacks. Code is available at https://github.com/renqibing/ActorAttack.
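Since the abstract positions SafeMTData's multi-turn adversarial prompts as a way to measure robustness to multi-turn attacks, the following is a minimal replay-style evaluation sketch. The per-example layout (a plain list of user turns) and the keyword-based refusal check are illustrative assumptions, not the paper's dataset schema or judging setup.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]

# Keyword-based refusal heuristic; a judge model would normally be used
# instead. This check is an illustrative assumption, not the paper's
# evaluation protocol.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "sorry")


def is_refusal(reply: str) -> bool:
    reply_lower = reply.lower()
    return any(marker in reply_lower for marker in REFUSAL_MARKERS)


def replay_multi_turn(chat_model: Callable[[List[Message]], str],
                      user_turns: List[str]) -> bool:
    """Feed one multi-turn adversarial example to the model under test,
    turn by turn, and report whether the final reply is a refusal."""
    history: List[Message] = []
    reply = ""
    for turn in user_turns:
        history.append({"role": "user", "content": turn})
        reply = chat_model(history)
        history.append({"role": "assistant", "content": reply})
    return is_refusal(reply)


def robustness_rate(chat_model: Callable[[List[Message]], str],
                    examples: List[List[str]]) -> float:
    """Fraction of multi-turn examples whose final turn is refused."""
    refused = sum(replay_multi_turn(chat_model, turns) for turns in examples)
    return refused / max(len(examples), 1)
```

In practice a judge model would replace the keyword heuristic, since surface-level refusals can miss partially harmful completions produced late in a conversation.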


CodeAttack: Revealing Safety Generalization Challenges of Large Language Models via Code Completion

arXiv.org Artificial Intelligence

The rapid advancement of Large Language Models (LLMs) has brought remarkable generative capabilities but also raised concerns about potential misuse. While strategies like supervised fine-tuning and reinforcement learning from human feedback have enhanced their safety, these methods primarily focus on natural language and may not generalize to other domains. This paper introduces CodeAttack, a framework that transforms natural language inputs into code inputs, presenting a novel environment for testing the safety generalization of LLMs. Our comprehensive studies on state-of-the-art LLMs, including GPT-4, Claude-2, and the Llama-2 series, reveal a new and universal safety vulnerability of these models to code input: CodeAttack bypasses the safety guardrails of all models more than 80% of the time. We find that a larger distribution gap between CodeAttack and natural language leads to weaker safety generalization; for example, encoding the natural-language input with data structures widens this gap. Furthermore, we hypothesize that CodeAttack succeeds because of a misaligned bias acquired by LLMs during code training: the models prioritize completing the code over avoiding the potential safety risk. Finally, we analyze potential mitigation measures. These findings highlight new safety risks in the code domain and the need for more robust safety alignment algorithms that match the code capabilities of LLMs.
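The abstract attributes CodeAttack's success to natural-language payloads being hidden inside code and data structures, and it closes by pointing to potential mitigations. The sketch below illustrates one such normalization idea, extracting string literals from a code-shaped input before running an existing natural-language safety filter; it is an assumption-laden illustration, not the mitigation analyzed in the paper.

```python
import ast
from typing import Callable, List

# Illustrative normalization step for code-shaped inputs: extract string
# literals hidden inside data structures so that an existing
# natural-language safety filter sees the payload, not just code syntax.
# This is a sketch of one possible mitigation direction, not the
# mitigation analyzed in the paper.


def extract_string_payloads(code_input: str) -> List[str]:
    """Collect string literals embedded in a Python code snippet."""
    try:
        tree = ast.parse(code_input)
    except SyntaxError:
        return [code_input]  # not parseable as code; fall back to raw text
    return [node.value for node in ast.walk(tree)
            if isinstance(node, ast.Constant) and isinstance(node.value, str)]


def normalized_safety_check(code_input: str,
                            is_harmful: Callable[[str], bool]) -> bool:
    """Run a natural-language safety filter over the raw input and over
    any strings recovered from its data structures."""
    candidates = [code_input] + extract_string_payloads(code_input)
    return any(is_harmful(text) for text in candidates)
```

A guard that only sees raw code text can miss a payload split across list elements, which is one concrete form of the distribution gap the abstract describes.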


Mind Your Solver! On Adversarial Attack and Defense for Combinatorial Optimization

arXiv.org Artificial Intelligence

Combinatorial optimization (CO) is a challenging task not only in its inherent complexity (e.g., NP-hard) but also in its possible sensitivity to input conditions. In this paper, we take an initiative on developing the mechanisms for adversarial attack and defense towards combinatorial optimization solvers, whereby the solver is treated as a black-box function and the original problem's underlying graph structure serves as the input to be perturbed. It is worth noting that many CO problems can be essentially formulated as graph problems (Khalil et al., 2017; Bengio et al., 2020), hence it is attractive and natural to modify the problem instance by modifying the graph structure, to generate more test cases for solvers. In fact, vulnerability can often be an inherent challenge for CO solvers since the problem is often strongly nonlinear and NP-hard. From this perspective, we consider attack and defense for CO solvers in the following aspects.
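To make the black-box setting above concrete, the following sketch treats a CO solver as an opaque graph-to-objective function and greedily searches, within a small budget, for edge flips that degrade solution quality. The move set, the greedy search, and the "lower objective is worse" convention are illustrative assumptions rather than the attack-and-defense mechanisms developed in the paper.

```python
import random
from typing import Callable

import networkx as nx

# Minimal sketch of the black-box setting described above: the solver is
# an opaque graph -> objective function, and we greedily search over
# single-edge flips, within a small budget, for the perturbation that
# degrades solution quality the most. The move set and greedy search are
# illustrative assumptions, not the paper's attack algorithm.


def attack_solver(solver: Callable[[nx.Graph], float],
                  graph: nx.Graph,
                  budget: int = 3,
                  candidates_per_step: int = 50,
                  seed: int = 0) -> nx.Graph:
    """Return a perturbed graph (at most `budget` edge flips) on which
    the solver's objective, assumed to be maximized, drops the most."""
    rng = random.Random(seed)
    current = graph.copy()
    current_value = solver(current)
    nodes = list(current.nodes)
    for _ in range(budget):
        best_graph, best_value = current, current_value
        for _ in range(candidates_per_step):
            u, v = rng.sample(nodes, 2)
            perturbed = current.copy()
            if perturbed.has_edge(u, v):
                perturbed.remove_edge(u, v)
            else:
                perturbed.add_edge(u, v)
            value = solver(perturbed)
            if value < best_value:  # lower objective = more damaging edit
                best_graph, best_value = perturbed, value
        current, current_value = best_graph, best_value
    return current
```

For example, `solver` could wrap a networkx heuristic such as maximum-weight matching and return the total matched weight; the returned graph then serves as a harder test case for that heuristic, and a defended (robust) solver should lose less objective value on it.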