SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks
Robey, Alexander, Wong, Eric, Hassani, Hamed, Pappas, George J.
Over the last year, large language models (LLMs) have emerged as a groundbreaking technology that has the potential to fundamentally reshape how people interact with AI. Central to the fervor surrounding these models is the credibility and authenticity of the text they generate, which is largely attributable to the fact that LLMs are trained on vast text corpora sourced directly from the Internet. And while this practice exposes LLMs to a wealth of knowledge, such corpora tend to engender a double-edged sword, as they often contain objectionable content including hate speech, malware, and false information [1]. Indeed, the propensity of LLMs to reproduce this objectionable content has invigorated the field of AI alignment [2-4], wherein various mechanisms are used to "align" the output text generated by LLMs with ethical and legal standards [5-7]. At face value, efforts to align LLMs have reduced the propagation of toxic content: Publicly-available chatbots will now rarely output text that is clearly objectionable [8]. Yet, despite this encouraging progress, in recent months a burgeoning literature has identified numerous failure modes--commonly referred to as jailbreaks--that bypass the alignment mechanisms and safety guardrails implemented on modern LLMs [9, 10]. The pernicious nature of such jailbreaks, which are often difficult to detect or mitigate [11, 12], pose a significant barrier to the widespread deployment of LLMs, given that the text generated by these models may influence educational policy [13], medical diagnoses [14, 15], and business decisions [16]. Among the jailbreaks discovered so far, a notable category concerns adversarial prompting, wherein an attacker fools a targeted LLM into outputting objectionable content by modifying prompts passed as input to that LLM [17, 18]. Of particular concern is the recent work of [19], which shows that highly-performant LLMs, including GPT, Claude, and PaLM, can be jailbroken by appending adversarially-chosen characters onto various prompts.
Nov-29-2023
- Country:
- Genre:
- Instructional Material (1.00)
- Research Report > New Finding (0.93)
- Industry:
- Information Technology > Security & Privacy (1.00)
- Technology: