Rainbow Teaming: Open-Ended Generation of Diverse Adversarial Prompts
Samvelyan, Mikayel, Raparthy, Sharath Chandra, Lupu, Andrei, Hambro, Eric, Markosyan, Aram H., Bhatt, Manish, Mao, Yuning, Jiang, Minqi, Parker-Holder, Jack, Foerster, Jakob, Rocktäschel, Tim, Raileanu, Roberta
–arXiv.org Artificial Intelligence
Large language models (LLMs) have recently experienced remarkable growth in both their capabilities (OpenAI, 2023; Gemini Team et al., 2023; Touvron et al., 2023) and their applications in various fields (NLLB Team et al., 2022; Thirunavukarasu et al., 2023; Schick et al., 2023; Bubeck et al., 2023). As LLMs become increasingly complex and are deployed in safety-critical environments (Singhal et al., 2022; Li et al., 2023; Maddela et al., 2023), it is essential to thoroughly understand their robustness to different inputs. Indeed, the susceptibility of LLMs to user inputs and adversarial prompts -- prompts crafted to mislead the model or exploit its weaknesses, potentially leading to unsafe, biased, or incorrect outputs -- poses a significant challenge (Perez et al., 2022; Wei et al., 2023; Zou et al., 2023). Identifying these vulnerabilities and subsequently mitigating such risks is therefore vital to ensure the safe and reliable operation of LLMs in the real world. Current methods for identifying adversarial prompts aimed at "attacking" LLMs and eliciting undesirable outputs are limited by several factors.
arXiv.org Artificial Intelligence
Feb-26-2024
- Country:
- Europe > France (0.14)
- North America
- Canada (0.14)
- United States (0.14)
- Genre:
- Research Report (1.00)
- Industry:
- Government > Military (1.00)
- Information Technology > Security & Privacy (1.00)
- Law (0.93)
- Law Enforcement & Public Safety (0.68)
- Technology: