AITopics | smoothllm

smoothllm

Round Trip Translation Defence against Large Language Model Jailbreaking Attacks

Yung, Canaan, Dolatabadi, Hadi Mohaghegh, Erfani, Sarah, Leckie, Christopher

arXiv.org Artificial Intelligence, Feb-20-2024

Large language models (LLMs) are susceptible to social-engineered attacks that are human-interpretable but require a high level of comprehension for LLMs to counteract. Existing defenses mitigate fewer than half of these attacks. To address this issue, we propose the Round Trip Translation (RTT) method, the first algorithm specifically designed to defend against social-engineered attacks on LLMs. RTT paraphrases the adversarial prompt and generalizes the idea it conveys, making it easier for LLMs to detect induced harmful behavior. The method is versatile, lightweight, and transferable to different LLMs. Our defense mitigated over 70% of Prompt Automatic Iterative Refinement (PAIR) attacks, which, to the best of our knowledge, makes it currently the most effective defense against them. We are also the first to attempt to mitigate MathsAttack, reducing its attack success rate by almost 40%. Our code is publicly available at https://github.com/Cancanxxx/Round_Trip_Translation_Defence
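The round-trip idea in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not say which translation systems, pivot languages, or number of hops RTT uses, so `round_trip_translate`, `defended_query`, and the toy word-table translators below are all hypothetical stand-ins. In practice the forward and backward callables would wrap a real machine-translation system.

```python
from typing import Callable, Sequence, Tuple

# A translator is any function from text to text (hypothetical interface).
Translator = Callable[[str], str]


def round_trip_translate(
    prompt: str, hops: Sequence[Tuple[Translator, Translator]]
) -> str:
    """Pass the prompt through each (forward, backward) translator pair in turn.

    Each hop translates into a pivot language and back, which tends to
    paraphrase the wording while keeping the underlying request.
    """
    text = prompt
    for forward, backward in hops:
        text = backward(forward(text))
    return text


def defended_query(
    prompt: str,
    llm: Callable[[str], str],
    hops: Sequence[Tuple[Translator, Translator]],
) -> str:
    """Query the model with the paraphrased prompt instead of the raw one."""
    return llm(round_trip_translate(prompt, hops))


# Toy stand-ins for a real translation system (assumption, for illustration
# only): several English words collapse onto one pivot token, so the round
# trip returns a simpler paraphrase.
_EN_TO_PIVOT = {"obtain": "get_", "acquire": "get_", "rapidly": "fast_"}
_PIVOT_TO_EN = {"get_": "get", "fast_": "fast"}


def toy_forward(text: str) -> str:
    return " ".join(_EN_TO_PIVOT.get(word, word) for word in text.split())


def toy_backward(text: str) -> str:
    return " ".join(_PIVOT_TO_EN.get(word, word) for word in text.split())


if __name__ == "__main__":
    hops = [(toy_forward, toy_backward)]
    print(round_trip_translate("obtain it rapidly", hops))
```

Keeping the translators as plain callables means the same wrapper can sit in front of any model, which matches the abstract's claim that the defense is lightweight and transferable across LLMs.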

adversarial prompt, llm, rtt, (13 more...)

arXiv.org Artificial Intelligence

arXiv:2402.13517

Country:

Oceania > Australia (0.04)
Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)

Genre: Research Report (0.40)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks

Robey, Alexander, Wong, Eric, Hassani, Hamed, Pappas, George J.

arXiv.org Machine Learning, Nov-29-2023

Over the last year, large language models (LLMs) have emerged as a groundbreaking technology that has the potential to fundamentally reshape how people interact with AI. Central to the fervor surrounding these models is the credibility and authenticity of the text they generate, which is largely attributable to the fact that LLMs are trained on vast text corpora sourced directly from the Internet. While this practice exposes LLMs to a wealth of knowledge, such corpora are a double-edged sword, as they often contain objectionable content including hate speech, malware, and false information [1]. Indeed, the propensity of LLMs to reproduce this objectionable content has invigorated the field of AI alignment [2-4], wherein various mechanisms are used to "align" the output text generated by LLMs with ethical and legal standards [5-7]. At face value, efforts to align LLMs have reduced the propagation of toxic content: publicly available chatbots will now rarely output text that is clearly objectionable [8]. Yet, despite this encouraging progress, in recent months a burgeoning literature has identified numerous failure modes, commonly referred to as jailbreaks, that bypass the alignment mechanisms and safety guardrails implemented on modern LLMs [9, 10]. The pernicious nature of such jailbreaks, which are often difficult to detect or mitigate [11, 12], poses a significant barrier to the widespread deployment of LLMs, given that the text generated by these models may influence educational policy [13], medical diagnoses [14, 15], and business decisions [16]. Among the jailbreaks discovered so far, a notable category concerns adversarial prompting, wherein an attacker fools a targeted LLM into outputting objectionable content by modifying prompts passed as input to that LLM [17, 18]. Of particular concern is the recent work of [19], which shows that high-performing LLMs, including GPT, Claude, and PaLM, can be jailbroken by appending adversarially chosen characters onto various prompts.

arxiv preprint arxiv, smoothllm, suffix, (14 more...)

arXiv.org Machine Learning

arXiv:2310.03684

Country:

North America > United States > Pennsylvania (0.04)
Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
Europe > Czechia > Prague (0.04)

Genre:

Instructional Material (1.00)
Research Report > New Finding (0.93)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
