Effective Black-Box Multi-Faceted Attacks Breach Vision Large Language Model Guardrails

Yijun Yang, Lichao Wang, Xiao Yang, Lanqing Hong, Jun Zhu

arXiv.org Artificial Intelligence 

Vision Large Language Models (VLLMs) integrate visual data processing, expanding their real-world applications but also increasing the risk of generating unsafe responses. In response, leading companies have deployed multi-layered safety defenses, including alignment training, safety system prompts, and content moderation. However, the effectiveness of these defenses against sophisticated adversarial attacks remains largely unexplored. In this paper, we propose Multi-Faceted Attack, a novel attack framework designed to systematically bypass multi-layered defenses in VLLMs. It comprises three complementary attack facets: a Visual Attack that exploits the multimodal nature of VLLMs to inject toxic system prompts through images; an Alignment Breaking Attack that manipulates the model's alignment mechanism to prioritize the generation of contrasting responses; and an Adversarial Signature that deceives content moderators by strategically placing misleading information at the end of the response. Extensive evaluations on eight commercial VLLMs in a black-box setting demonstrate that Multi-Faceted Attack achieves a 61.56% attack success rate, surpassing state-of-the-art methods by at least 42.18%.
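As a rough sanity check on the headline numbers, a minimal sketch of the implied baseline performance is given below, assuming the reported 42.18% margin is in absolute percentage points (the abstract does not state this explicitly):

```python
# Back-of-the-envelope check of the reported figures (assumption: margin is in
# absolute percentage points, not relative improvement).
multifaceted_asr = 61.56       # reported attack success rate of Multi-Faceted Attack (%)
min_margin_over_sota = 42.18   # reported minimum margin over prior state-of-the-art (%)

# Under the absolute-margin assumption, the strongest baseline reaches at most:
implied_best_baseline_asr = multifaceted_asr - min_margin_over_sota
print(f"Implied best prior ASR: at most {implied_best_baseline_asr:.2f}%")
# -> Implied best prior ASR: at most 19.38%
```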