Rainbow Teaming: Open-Ended Generation of Diverse Adversarial Prompts

Samvelyan, Mikayel, Raparthy, Sharath Chandra, Lupu, Andrei, Hambro, Eric, Markosyan, Aram H., Bhatt, Manish, Mao, Yuning, Jiang, Minqi, Parker-Holder, Jack, Foerster, Jakob, Rocktäschel, Tim, Raileanu, Roberta

Feb-26-2024–arXiv.org Artificial Intelligence

Large language models (LLMs) have recently experienced remarkable growth in both their capabilities (OpenAI, 2023; Gemini Team et al., 2023; Touvron et al., 2023) and their applications in various fields (NLLB Team et al., 2022; Thirunavukarasu et al., 2023; Schick et al., 2023; Bubeck et al., 2023). As LLMs become increasingly complex and are deployed in safety-critical environments (Singhal et al., 2022; Li et al., 2023; Maddela et al., 2023), it is essential to thoroughly understand their robustness to different inputs. Indeed, the susceptibility of LLMs to user inputs and adversarial prompts -- prompts crafted to mislead the model or exploit its weaknesses, potentially leading to unsafe, biased, or incorrect outputs -- poses a significant challenge (Perez et al., 2022; Wei et al., 2023; Zou et al., 2023). Identifying these vulnerabilities and subsequently mitigating such risks is therefore vital to ensure the safe and reliable operation of LLMs in the real world. Current methods for identifying adversarial prompts aimed at "attacking" LLMs and eliciting undesirable outputs are limited by several factors.

large language model, machine learning, natural language, (18 more...)


Country:
- Europe > France (0.14)
- North America
  - Canada (0.14)
  - United States (0.14)

Genre:
- Research Report (1.00)

Industry:
- Government > Military (1.00)
- Information Technology > Security & Privacy (1.00)
- Law (0.93)
- Law Enforcement & Public Safety (0.68)

Technology:
- Information Technology > Artificial Intelligence
  - Machine Learning > Neural Networks
    - Deep Learning (0.92)
  - Natural Language > Large Language Model (1.00)
