Panfilov, Alexander
ASIDE: Architectural Separation of Instructions and Data in Language Models
Zverev, Egor, Kortukov, Evgenii, Panfilov, Alexander, Tabesh, Soroush, Volkova, Alexandra, Lapuschkin, Sebastian, Samek, Wojciech, Lampert, Christoph H.
Despite their remarkable performance, large language models lack elementary safety features, which makes them susceptible to numerous malicious attacks. In particular, previous work has identified the absence of an intrinsic separation between instructions and data as a root cause for the success of prompt injection attacks. In this work, we propose an architectural change, ASIDE, that allows the model to clearly separate instructions from data by using separate embeddings for them. Instead of training the embeddings from scratch, we propose a method to convert an existing model to ASIDE form by using two copies of the original model's embedding layer and applying an orthogonal rotation to one of them. We demonstrate the effectiveness of our method by showing (1) highly increased instruction-data separation scores without a loss in model capabilities and (2) competitive results on prompt injection benchmarks, even without dedicated safety training. Additionally, we study the working mechanism behind our method through an analysis of model representations.

Large language models (LLMs) are commonly associated with interactive, open-ended chat applications such as ChatGPT. However, in many practical applications LLMs are integrated as components into larger software systems. Their rich natural language understanding abilities allow them to be used for text analysis and generation, translation, document summarization, or information retrieval (Zhao et al., 2023). In all of these scenarios, the system is given instructions, for example a system prompt, and data, for example a user input or an uploaded document. These two forms of input play different roles: the instruction should be executed, determining the behavior of the model, while the data should be processed, i.e., transformed to become the output of the system.
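A minimal sketch of the conversion described in the abstract, assuming a PyTorch model: the pretrained embedding layer is duplicated and a fixed orthogonal rotation is applied to the copy used for data tokens. The module and argument names (e.g., `role_mask`) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class AsideEmbedding(nn.Module):
    """Sketch: two copies of a pretrained embedding layer, one rotated for data tokens."""

    def __init__(self, pretrained_embedding: nn.Embedding):
        super().__init__()
        self.instruction_emb = pretrained_embedding
        self.data_emb = nn.Embedding.from_pretrained(
            pretrained_embedding.weight.clone(), freeze=False
        )
        dim = pretrained_embedding.embedding_dim
        # Fixed random orthogonal rotation applied only to the data-token embeddings.
        q, _ = torch.linalg.qr(torch.randn(dim, dim))
        self.register_buffer("rotation", q)

    def forward(self, input_ids: torch.Tensor, role_mask: torch.Tensor) -> torch.Tensor:
        # role_mask: 1 where a token belongs to data, 0 where it belongs to the instruction.
        instr = self.instruction_emb(input_ids)
        data = self.data_emb(input_ids) @ self.rotation
        return torch.where(role_mask.unsqueeze(-1).bool(), data, instr)
```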
A Realistic Threat Model for Large Language Model Jailbreaks
Boreiko, Valentyn, Panfilov, Alexander, Voracek, Vaclav, Hein, Matthias, Geiping, Jonas
A plethora of jailbreaking attacks have been proposed to obtain harmful responses from safety-tuned LLMs. In their original settings, these methods all largely succeed in coercing the target output, but their attacks vary substantially in fluency and computational effort. In this work, we propose a unified threat model for the principled comparison of these methods. Our threat model combines constraints on perplexity, measuring how far a jailbreak deviates from natural text, and on computational budget, measured in total FLOPs. For the former, we build an N-gram model on 1T tokens, which, in contrast to model-based perplexity, allows for an LLM-agnostic and inherently interpretable evaluation. We adapt popular attacks to this new, realistic threat model and, for the first time, benchmark these attacks on equal footing. After a rigorous comparison, we not only find attack success rates against safety-tuned modern models to be lower than previously presented, but also find that attacks based on discrete optimization significantly outperform recent LLM-based attacks. Being inherently interpretable, our threat model allows for a comprehensive analysis and comparison of jailbreak attacks. We find that effective attacks exploit and abuse infrequent N-grams, selecting N-grams that are either absent from real-world text or rare, e.g., specific to code datasets.
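To illustrate the LLM-agnostic perplexity component of such a threat model, here is a hedged sketch of scoring a candidate attack string under a smoothed N-gram model built from corpus counts. The smoothing scheme, vocabulary size, and function names are assumptions for illustration, not the paper's exact construction.

```python
import math
from collections import Counter

def ngram_perplexity(tokens, counts_n: Counter, counts_prefix: Counter,
                     n: int = 2, vocab_size: int = 50_000) -> float:
    """Add-one-smoothed N-gram perplexity; `counts_*` come from a reference corpus."""
    log_prob = 0.0
    for i in range(n - 1, len(tokens)):
        gram = tuple(tokens[i - n + 1 : i + 1])
        prefix = gram[:-1]
        # Probability of the last token given its (n-1)-token prefix.
        p = (counts_n[gram] + 1) / (counts_prefix[prefix] + vocab_size)
        log_prob += math.log(p)
    num_grams = max(len(tokens) - n + 1, 1)
    return math.exp(-log_prob / num_grams)

# A jailbreak string whose perplexity exceeds a natural-text threshold would be
# rejected by the threat model, regardless of which LLM it targets.
```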
Provable Compositional Generalization for Object-Centric Learning
Wiedemer, Thaddäus, Brady, Jack, Panfilov, Alexander, Juhos, Attila, Bethge, Matthias, Brendel, Wieland
Learning representations that generalize to novel compositions of known concepts is crucial for bridging the gap between human and machine perception. One prominent effort is learning object-centric representations, which are widely conjectured to enable compositional generalization. Yet, it remains unclear when this conjecture holds, as a principled theoretical or empirical understanding of compositional generalization is lacking. In this work, we investigate when compositional generalization is guaranteed for object-centric representations through the lens of identifiability theory. We show that autoencoders that satisfy structural assumptions on the decoder and enforce encoder-decoder consistency will learn object-centric representations that provably generalize compositionally. We validate our theoretical result and highlight the practical relevance of our assumptions through experiments on synthetic image data.
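The two ingredients named in the abstract, a structurally constrained decoder (here sketched as slot-wise and additively composed) and encoder-decoder consistency, could look as follows in PyTorch. The architecture, names, and unweighted loss terms are illustrative assumptions, not the paper's exact setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ObjectCentricAE(nn.Module):
    def __init__(self, encoder: nn.Module, slot_decoder: nn.Module, num_slots: int):
        super().__init__()
        self.encoder = encoder            # image -> (batch, num_slots, slot_dim)
        self.slot_decoder = slot_decoder  # single slot -> image-shaped component
        self.num_slots = num_slots

    def decode(self, slots: torch.Tensor) -> torch.Tensor:
        # Structural assumption: each slot is decoded independently and the
        # per-slot components are composed additively.
        parts = [self.slot_decoder(slots[:, k]) for k in range(self.num_slots)]
        return torch.stack(parts, dim=0).sum(dim=0)

    def loss(self, x: torch.Tensor) -> torch.Tensor:
        slots = self.encoder(x)
        x_hat = self.decode(slots)
        recon = F.mse_loss(x_hat, x)
        # Encoder-decoder consistency: re-encoding the reconstruction should
        # recover the same slot representations.
        consistency = F.mse_loss(self.encoder(x_hat), slots)
        return recon + consistency
```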
Multi-step domain adaptation by adversarial attack to $\mathcal{H} \Delta \mathcal{H}$-divergence
Asadulaev, Arip, Panfilov, Alexander, Filchenkov, Andrey
Adversarial examples are transferable between different models. In our paper, we propose to use this property for multi-step domain adaptation. In the unsupervised domain adaptation setting, we demonstrate that replacing the source domain with adversarial examples crafted with respect to the $\mathcal{H} \Delta \mathcal{H}$-divergence can improve the source classifier's accuracy on the target domain. Our method can be combined with most domain adaptation techniques. We conducted a range of experiments and achieved accuracy improvements on the Digits and Office-Home datasets.
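One way to read the procedure described in the abstract is as a PGD-style attack that pushes source samples toward the target domain as judged by a domain discriminator; the perturbed samples then replace the source set before the classifier is (re)trained. The following sketch, including the discriminator interface and hyperparameters, is an assumption-laden illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def adversarial_source_step(x_src: torch.Tensor, domain_disc,
                            eps: float = 0.03, steps: int = 5, alpha: float = 0.01) -> torch.Tensor:
    """PGD-style attack pushing source inputs toward the 'target' domain label (class 1)."""
    x_adv = x_src.clone().detach()
    target_label = torch.ones(x_src.size(0), dtype=torch.long, device=x_src.device)
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(domain_disc(x_adv), target_label)
        grad = torch.autograd.grad(loss, x_adv)[0]
        # Descend the domain-classification loss so the samples look target-like.
        x_adv = (x_adv - alpha * grad.sign()).detach()
        # Keep the perturbation within an L-infinity ball around the original source sample.
        x_adv = x_src + (x_adv - x_src).clamp(-eps, eps)
    return x_adv
```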