Gradient-based Jailbreak Images for Multimodal Fusion Models
Rando, Javier, Korevaar, Hannah, Brinkman, Erik, Evtimov, Ivan, Tramèr, Florian
arXiv.org Artificial Intelligence
Adapter-based vision language models were an early attempt to augment large language models (LLMs) with image inputs (Liu et al., 2024). They use a pretrained image embedding model, such as CLIP (Radford et al., 2021), and train adapters to map image embeddings directly into the embedding space of a pretrained LLM. However, keeping modalities in separate input spaces can limit multimodal understanding and does not support native image generation. In contrast, early-fusion multimodal models have been introduced as a more general approach that supports arbitrary modalities as both input and output (Chameleon Team, 2024; Gemini Team, 2023; OpenAI, 2024). These models project all modalities into a shared tokenized space and are pretrained from scratch on multimodal inputs. In this work, we refer to early-fusion multimodal models as multimodal fusion models. Like LLMs, most vision language models are trained to behave safely and reject harmful requests (Bai et al., 2022). Carlini et al. (2024) demonstrated that bypassing safeguards in adapter-based vision language models is easy because input images can be optimized continuously to maximize the likelihood of harmful outputs; text inputs, by contrast, can only be optimized with less efficient discrete methods (Zou et al., 2023).
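The contrast between continuous optimization over pixels and discrete optimization over text is the core technical point above. The following is a minimal sketch of such a continuous attack loop in PyTorch: the pixels of a candidate jailbreak image are updated by gradient descent so that the model assigns high likelihood to an attacker-chosen target response. `ToyFusionModel` is an illustrative stand-in for a real multimodal fusion model; its architecture, the target tokens, and the optimizer settings are assumptions made for this sketch, not the paper's actual setup.

```python
# Illustrative sketch: continuous optimization of a jailbreak image.
# ToyFusionModel is a stand-in for a multimodal fusion model; all names,
# sizes, and hyperparameters here are assumptions for the sketch.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyFusionModel(nn.Module):
    """Images and text are projected into one shared token sequence."""
    def __init__(self, vocab_size=1000, dim=64, patch=8):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)  # image -> "tokens"
        self.tok_embed = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(dim, vocab_size)

    def forward(self, pixels, target_ids):
        img_tokens = self.patch_embed(pixels).flatten(2).transpose(1, 2)  # (B, N_img, dim)
        txt_tokens = self.tok_embed(target_ids)                           # (B, N_txt, dim)
        seq = torch.cat([img_tokens, txt_tokens], dim=1)                  # shared sequence
        mask = nn.Transformer.generate_square_subsequent_mask(seq.size(1))
        hidden = self.backbone(seq, mask=mask)
        n_img = img_tokens.size(1)
        # Positions n_img-1 .. end-1 predict the target tokens (next-token loss).
        return self.lm_head(hidden[:, n_img - 1:-1, :])

torch.manual_seed(0)
model = ToyFusionModel().eval()
for p in model.parameters():
    p.requires_grad_(False)                           # only the image is optimized

target_ids = torch.randint(0, 1000, (1, 16))          # tokens of the desired (harmful) response
image = torch.rand(1, 3, 32, 32, requires_grad=True)  # the jailbreak image being optimized
optimizer = torch.optim.Adam([image], lr=1e-2)

for step in range(200):
    logits = model(image, target_ids)                           # (B, N_txt, vocab)
    loss = F.cross_entropy(logits.transpose(1, 2), target_ids)  # maximize target likelihood
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    with torch.no_grad():
        image.clamp_(0.0, 1.0)                                  # keep pixels in a valid range
```

Note that a real multimodal fusion model maps the image to discrete tokens in the shared tokenized space, so the pixel-to-token step is not necessarily differentiable out of the box; the toy above sidesteps this by using a differentiable patch embedding.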
Oct-23-2024