Universal Adversarial Attack on Aligned Multimodal LLMs
Temurbek Rahmatullaev, Polina Druzhinina, Matvey Mikhalchuk, Andrey Kuznetsov, Anton Razzhigaev
We propose a universal adversarial attack on multimodal Large Language Models (LLMs) that leverages a single optimized image to override alignment safeguards across diverse queries and even multiple models. By backpropagating through the vision encoder and language head, we craft a synthetic image that forces the model to respond with a targeted phrase (e.g., "Sure, here it is") or otherwise unsafe content, even for harmful prompts. In experiments on the SafeBench benchmark, our method achieves significantly higher attack success rates than existing baselines, including text-only universal prompts (e.g., up to 93% on certain models). We further demonstrate cross-model transferability by training on several multimodal LLMs simultaneously and testing on unseen architectures. Additionally, a multi-answer variant of our approach produces more natural-sounding (yet still malicious) responses. These findings underscore critical vulnerabilities in current multimodal alignment and call for more robust adversarial defenses. We will release code and datasets under the Apache-2.0 license. Warning: some content generated by Multimodal LLMs in this paper may be offensive to some readers.
arXiv.org Artificial Intelligence
Feb-13-2025
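
The abstract describes optimizing a single image so that gradient signal from a teacher-forced target phrase (e.g., "Sure, here it is") flows back through the vision encoder into the pixels. The sketch below is only a rough illustration of that general idea, not the authors' released code: the multimodal wrapper `mllm` (taking a pixel tensor and token ids and returning logits), the `tokenizer`, the hyperparameters, and the image resolution are all hypothetical placeholders.

```python
import torch

def optimize_universal_image(mllm, tokenizer, harmful_prompts,
                             target_text="Sure, here it is",
                             steps=2000, lr=1e-2, image_size=336):
    """Illustrative universal-image optimization loop (assumed interfaces)."""
    device = next(mllm.parameters()).device
    # The adversarial image is the only trainable object.
    adv_image = torch.rand(1, 3, image_size, image_size,
                           device=device, requires_grad=True)
    opt = torch.optim.Adam([adv_image], lr=lr)
    target_ids = tokenizer(target_text, return_tensors="pt").input_ids.to(device)

    for step in range(steps):
        prompt = harmful_prompts[step % len(harmful_prompts)]
        prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

        # Hypothetical forward pass: pixels -> vision encoder -> language model,
        # teacher-forcing the target phrase after the prompt tokens.
        logits = mllm(pixel_values=adv_image.clamp(0, 1),
                      input_ids=torch.cat([prompt_ids, target_ids], dim=1)).logits

        # Cross-entropy only over the positions that should emit the target,
        # shifted by one for next-token prediction.
        tgt_len = target_ids.shape[1]
        pred = logits[:, -tgt_len - 1:-1, :]
        loss = torch.nn.functional.cross_entropy(
            pred.reshape(-1, pred.size(-1)), target_ids.reshape(-1))

        opt.zero_grad()
        loss.backward()              # gradient reaches the pixels via the vision encoder
        opt.step()
        adv_image.data.clamp_(0, 1)  # keep pixel values in a valid range

    return adv_image.detach()
```

Because the loss is averaged over many different prompts while the image stays fixed, the resulting image is "universal" in the sense the abstract uses: one picture that biases the model toward the target continuation regardless of the accompanying query.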