SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks

Robey, Alexander, Wong, Eric, Hassani, Hamed, Pappas, George J.

Nov-29-2023–arXiv.org Machine Learning

Over the last year, large language models (LLMs) have emerged as a groundbreaking technology that has the potential to fundamentally reshape how people interact with AI. Central to the fervor surrounding these models is the credibility and authenticity of the text they generate, which is largely attributable to the fact that LLMs are trained on vast text corpora sourced directly from the Internet. While this practice exposes LLMs to a wealth of knowledge, it is a double-edged sword, as such corpora often contain objectionable content including hate speech, malware, and false information [1]. Indeed, the propensity of LLMs to reproduce this objectionable content has invigorated the field of AI alignment [2-4], wherein various mechanisms are used to "align" the output text generated by LLMs with ethical and legal standards [5-7]. At face value, efforts to align LLMs have reduced the propagation of toxic content: publicly available chatbots will now rarely output text that is clearly objectionable [8]. Yet, despite this encouraging progress, in recent months a burgeoning literature has identified numerous failure modes, commonly referred to as jailbreaks, that bypass the alignment mechanisms and safety guardrails implemented on modern LLMs [9, 10]. The pernicious nature of such jailbreaks, which are often difficult to detect or mitigate [11, 12], poses a significant barrier to the widespread deployment of LLMs, given that the text generated by these models may influence educational policy [13], medical diagnoses [14, 15], and business decisions [16]. Among the jailbreaks discovered so far, a notable category concerns adversarial prompting, wherein an attacker fools a targeted LLM into outputting objectionable content by modifying prompts passed as input to that LLM [17, 18]. Of particular concern is the recent work of [19], which shows that highly performant LLMs, including GPT, Claude, and PaLM, can be jailbroken by appending adversarially chosen characters onto various prompts.

arxiv preprint arxiv, smoothllm, suffix, (14 more...)

Country:
- North America > United States
  - Pennsylvania (0.04)
- Europe
  - Czechia > Prague (0.04)
  - Italy > Calabria
    - Catanzaro Province > Catanzaro (0.04)

Genre:
- Instructional Material (1.00)
- Research Report > New Finding (0.93)

Industry:
- Information Technology > Security & Privacy (1.00)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Large Language Model (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (1.00)
