Red Teaming GPT-4V: Are GPT-4V Safe Against Uni/Multi-Modal Jailbreak Attacks?
Chen, Shuo, Han, Zhen, He, Bailan, Ding, Zifeng, Yu, Wenqian, Torr, Philip, Tresp, Volker, Gu, Jindong
–arXiv.org Artificial Intelligence
Various jailbreak attacks have been proposed to red-team Large Language Models (LLMs) and revealed the vulnerable safeguards of LLMs. Besides, some methods are not limited to the textual modality and extend the jailbreak attack to Multimodal Large Language Models (MLLMs) by perturbing the visual input. However, the absence of a universal evaluation benchmark complicates the performance reproduction and fair comparison. Besides, there is a lack of comprehensive evaluation of closed-source state-of-the-art (SOTA) models, especially MLLMs, such as GPT-4V. To address these issues, this work first builds a comprehensive jailbreak evaluation dataset with 1445 harmful questions covering 11 different safety policies. Based on this dataset, extensive red-teaming experiments are conducted on 11 different LLMs and MLLMs, including both SOTA proprietary models and open-source models. We then conduct a deep analysis of the evaluated results and find that (1) GPT4 and GPT-4V demonstrate better robustness against jailbreak attacks compared to open-source LLMs and MLLMs. Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) have shown superior performance in text generation. To avoid generating unobjectionable content learned from the web-scale training corpus, stringent safety regulations have been applied during the safety alignment (Ouyang et al., 2022; Touvron et al., 2023). However, many jailbreak attacks have been proven to be able to bypass these safeguards and successfully elicit harmful generations.
arXiv.org Artificial Intelligence
Apr-4-2024