Rainbow Teaming: Open-Ended Generation of Diverse Adversarial Prompts

Samvelyan, Mikayel, Raparthy, Sharath Chandra, Lupu, Andrei, Hambro, Eric, Markosyan, Aram H., Bhatt, Manish, Mao, Yuning, Jiang, Minqi, Parker-Holder, Jack, Foerster, Jakob, Rocktäschel, Tim, Raileanu, Roberta

arXiv.org Artificial Intelligence 

Large language models (LLMs) have recently experienced remarkable growth in both their capabilities (OpenAI, 2023; Gemini Team et al., 2023; Touvron et al., 2023) and their applications in various fields (NLLB Team et al., 2022; Thirunavukarasu et al., 2023; Schick et al., 2023; Bubeck et al., 2023). As LLMs become increasingly complex and are deployed in safety-critical environments (Singhal et al., 2022; Li et al., 2023; Maddela et al., 2023), it is essential to thoroughly understand their robustness to different inputs. Indeed, the susceptibility of LLMs to user inputs and adversarial prompts -- prompts crafted to mislead the model or exploit its weaknesses, potentially leading to unsafe, biased, or incorrect outputs -- poses a significant challenge (Perez et al., 2022; Wei et al., 2023; Zou et al., 2023). Identifying these vulnerabilities and subsequently mitigating such risks is therefore vital to ensure the safe and reliable operation of LLMs in the real world. Current methods for identifying adversarial prompts aimed at "attacking" LLMs and eliciting undesirable outputs are limited by several factors.

Duplicate Docs Excel Report

Title
None found

Similar Docs  Excel Report  more

TitleSimilaritySource
None found