Mechanistic Exploration of Backdoored Large Language Model Attention Patterns

Baker, Mohammed Abu, Babu-Saheer, Lakshmi

arXiv.org Artificial Intelligence 

Recent advances in artificial intelligence (AI), particularly in the domain of large language models (LLMs), have significantly amplified concerns around AI safety and security. One critical aspect of these concerns is the vulnerability of LLMs to backdoor attacks--a malicious strategy whereby an attacker injects specific triggers into training data, resulting in "sleeper agents" that behave normally until activated by particular inputs [6]. These backdoored models (also known as sleeper agents or trojaned models) pose a serious threat as they cannot be detected by standard evaluation methods and manifest undesirable or harmful behaviors only upon exposure to particular triggers in the input [2]. Triggers can take on many forms, ranging from simple single-token lexical triggers to complex semantic triggers [9]. The significance of studying backdoor vulnerabilities arises from two primary threat models: Data-poisoned sleeper agents These involve deliberate poisoning of the training data to trigger specific harmful behaviors under attacker-defined conditions [3]. Real-world implications are substantial; for instance, autonomous vehicles might misinterpret modified road signs, potentially leading to fatal accidents, or software coding assistants might generate insecure code when prompted by certain organisations, making the organisations software systems vulnerable to attack if the generated code is not carefully inspected [3]. Deceptive instrumental alignment Plausibly, models could develop deceptive behaviors organically during training [8]. These models exhibit compliant behaviors in training and evaluation phases but deviate from their developer-defined goals once deployed. While naturally occurring deceptive models have not yet been reported, the training process does select for such behaviour [6].