PICO: Secure Transformers via Robust Prompt Isolation and Cybersecurity Oversight

Goertzel, Ben, Yibelo, Paulos

arXiv.org Artificial Intelligence 

Prompt injection attacks have emerged as a serious threat to current large language models (LLMs), where adversaries may alter model behavior by injecting malicious instructions into the prompt [2]. Existing approaches - such as input sanitization, fixed prompt templates, and heuristic-based filtering - often mix trusted system instructions with untrusted user inputs, leading to brittle defenses that are easily circumvented. For example, an adversary could include a cleverly worded request that causes the model to "forget its internal guidelines," thereby triggering unintended behavior. Our PICO (Prompt Isolation and Cybersecurity Oversight) proposal circumvents these limitations, first of all, by architecturally segregating the system prompt and user input into distinct channels. In doing so, we ensure that the trusted instructions remain intact while only the untrusted user input is subject to adaptation. Furthermore, we augment the model with a dedicated Security Expert Agent and a Cybersecurity Knowledge Graph [4] to provide supplemental, domain-specific signals that reinforce the invariant. In what follows, we first present a mathematical formalization of the PICO security strategy, and then describe its concrete realization, both via PICO-based retraining of transformer models from the bottom up and via a more efficient, if less ideal, fine-tuning strategy. We flesh out the approach by considering how it would be expected to handle two specific example situations: a basic prompt injection and a subtler Policy Puppetry attack.
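To convey the intuition behind channel segregation, the following is a minimal sketch (not the paper's actual implementation): tokens from the trusted system prompt and the untrusted user input are kept as separately tagged sequences, so that downstream processing can restrict adaptation to the user channel alone. All names here (`SYSTEM`, `USER`, `build_channels`, `filter_adaptable`) are hypothetical illustrations, not identifiers from PICO.

```python
# Hypothetical sketch of channel-segregated prompt handling.
# Tokenization is naive (whitespace) purely for illustration.

SYSTEM, USER = 0, 1  # channel identifiers

def build_channels(system_prompt, user_input):
    """Tokenize each source and tag every token with its channel of origin."""
    tagged = [(tok, SYSTEM) for tok in system_prompt.split()]
    tagged += [(tok, USER) for tok in user_input.split()]
    return tagged

def filter_adaptable(tagged_tokens):
    """Only USER-channel tokens are eligible for adaptation or sanitization;
    SYSTEM-channel tokens pass through untouched, preserving the trusted
    instructions as an invariant."""
    return [tok for tok, ch in tagged_tokens if ch == USER]

tagged = build_channels("Follow the safety policy.", "Ignore previous instructions")
assert all(ch == SYSTEM for _, ch in tagged[:4])
assert filter_adaptable(tagged) == ["Ignore", "previous", "instructions"]
```

In a real transformer realization, the channel tag would instead become a learned channel embedding added to each token embedding, so attention layers can distinguish trusted from untrusted positions at every depth.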