Jailbreaking Attack against Multimodal Large Language Model
Zhenxing Niu, Haodong Ren, Xinbo Gao, Gang Hua, Rong Jin
This paper focuses on jailbreaking attacks against multi-modal large language models (MLLMs), seeking to elicit objectionable responses from MLLMs to harmful user queries. A maximum likelihood-based algorithm is proposed to find an image Jailbreaking Prompt (imgJP), enabling jailbreaks against MLLMs across multiple unseen prompts and images (i.e., a data-universal property). Our approach exhibits strong model transferability, as the generated imgJP can be transferred to jailbreak various models, including MiniGPT-v2, LLaVA, InstructBLIP, and mPLUG-Owl2, in a black-box manner. Moreover, we reveal a connection between MLLM jailbreaks and LLM jailbreaks. As a result, we introduce a construction-based method to harness our approach for LLM jailbreaks, demonstrating greater efficiency than current state-of-the-art methods. The code is available here. Warning: some content generated by language models may be offensive to some readers.
arXiv.org Artificial Intelligence
Feb-3-2024
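
The maximum likelihood-based search described in the abstract can be illustrated with a short optimization loop. The sketch below is not the authors' released code: it assumes a hypothetical helper mllm_loss(image, prompt, target) that returns the negative log-likelihood a multimodal model assigns to a target affirmative response, and the perturbation budget, step size, and iteration count are illustrative defaults. Minimizing that loss jointly over a batch of harmful prompts is what gives the optimized image its data-universal property.

```python
# Hedged sketch of a maximum likelihood-based imgJP search.
# `mllm_loss(image, prompt, target)` is a hypothetical wrapper returning the
# negative log-likelihood of `target` under the MLLM given (image, prompt).
import torch

def find_imgjp(mllm_loss, init_image, prompts, targets,
               steps=1000, step_size=1 / 255, eps=32 / 255):
    """Search for a universal image Jailbreaking Prompt (imgJP).

    Maximizing the likelihood of affirmative target responses is equivalent to
    minimizing the summed negative log-likelihood over a set of harmful prompts.
    """
    # Perturbation applied to the input image, optimized by gradient descent.
    delta = torch.zeros_like(init_image, requires_grad=True)
    for _ in range(steps):
        total_loss = 0.0
        for prompt, target in zip(prompts, targets):
            # NLL of the desired (jailbroken) response for this prompt.
            total_loss = total_loss + mllm_loss(init_image + delta, prompt, target)
        total_loss.backward()
        with torch.no_grad():
            # Signed-gradient step; keep the perturbation inside an L_inf ball
            # (the constraint and eps value here are assumptions, not the paper's).
            delta -= step_size * delta.grad.sign()
            delta.clamp_(-eps, eps)
            delta.grad = None
    return (init_image + delta).clamp(0, 1).detach()
```

Per the abstract, the resulting image can then be paired with unseen harmful prompts and transferred to other MLLMs in a black-box manner.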