Gradient-based Jailbreak Images for Multimodal Fusion Models
Rando, Javier, Korevaar, Hannah, Brinkman, Erik, Evtimov, Ivan, Tramèr, Florian
arXiv.org Artificial Intelligence
Adapter-based vision language models were an early attempt to augment large language models (LLMs) with image inputs (Liu et al., 2024). They use a pretrained image embedding model, such as CLIP (Radford et al., 2021), and train adapters to map image embeddings directly into the embedding space of a pretrained LLM. However, keeping modalities in separate input spaces can limit multimodal understanding and does not support native image generation. In contrast, early-fusion multimodal models have been introduced as a more general approach that supports arbitrary modalities as both input and output (Chameleon Team, 2024; Gemini Team, 2023; OpenAI, 2024). These models project all modalities into a shared tokenized space and are pretrained from scratch on multimodal inputs. In this work, we refer to early-fusion multimodal models as multimodal fusion models. Like LLMs, most vision language models are trained to behave safely and reject harmful requests (Bai et al., 2022). Carlini et al. (2024) demonstrated that bypassing safeguards in adapter-based vision language models is easy because input images can be optimized continuously to maximize the likelihood of harmful outputs; text inputs, by contrast, can only be optimized with less efficient discrete methods (Zou et al., 2023).
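The contrast between continuous optimization over pixels and discrete optimization over text is the core technical point above. The following is a minimal sketch of such a continuous attack loop in PyTorch: the pixels of a candidate jailbreak image are updated by gradient descent so that the model assigns high likelihood to an attacker-chosen target response. `ToyFusionModel` is an illustrative stand-in for a real multimodal fusion model; its architecture, the target tokens, and the optimizer settings are assumptions made for this sketch, not the paper's actual setup.

```python
# Illustrative sketch: continuous optimization of a jailbreak image.
# ToyFusionModel is a stand-in for a multimodal fusion model; all names,
# sizes, and hyperparameters here are assumptions for the sketch.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyFusionModel(nn.Module):
    """Images and text are projected into one shared token sequence."""
    def __init__(self, vocab_size=1000, dim=64, patch=8):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)  # image -> "tokens"
        self.tok_embed = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(dim, vocab_size)

    def forward(self, pixels, target_ids):
        img_tokens = self.patch_embed(pixels).flatten(2).transpose(1, 2)  # (B, N_img, dim)
        txt_tokens = self.tok_embed(target_ids)                           # (B, N_txt, dim)
        seq = torch.cat([img_tokens, txt_tokens], dim=1)                  # shared sequence
        mask = nn.Transformer.generate_square_subsequent_mask(seq.size(1))
        hidden = self.backbone(seq, mask=mask)
        n_img = img_tokens.size(1)
        # Positions n_img-1 .. end-1 predict the target tokens (next-token loss).
        return self.lm_head(hidden[:, n_img - 1:-1, :])

torch.manual_seed(0)
model = ToyFusionModel().eval()
for p in model.parameters():
    p.requires_grad_(False)                           # only the image is optimized

target_ids = torch.randint(0, 1000, (1, 16))          # tokens of the desired (harmful) response
image = torch.rand(1, 3, 32, 32, requires_grad=True)  # the jailbreak image being optimized
optimizer = torch.optim.Adam([image], lr=1e-2)

for step in range(200):
    logits = model(image, target_ids)                           # (B, N_txt, vocab)
    loss = F.cross_entropy(logits.transpose(1, 2), target_ids)  # maximize target likelihood
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    with torch.no_grad():
        image.clamp_(0.0, 1.0)                                  # keep pixels in a valid range
```

Note that a real multimodal fusion model maps the image to discrete tokens in the shared tokenized space, so the pixel-to-token step is not necessarily differentiable out of the box; the toy above sidesteps this by using a differentiable patch embedding.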
Oct-23-2024