GuardNet: Graph-Attention Filtering for Jailbreak Defense in Large Language Models

Forough, Javad, Maheri, Mohammad, Haddadi, Hamed

arXiv.org Artificial Intelligence 

Abstract--Large Language Models (LLMs) are increasingly susceptible to jailbreak attacks, which are adversarial prompts that bypass alignment constraints and induce unauthorized or harmful behaviors. These vulnerabilities undermine the safety, reliability, and trustworthiness of LLM outputs, posing critical risks in domains such as healthcare, finance, and legal compliance. In this paper, we propose GuardNet, a hierarchical filtering framework that detects and filters jailbreak prompts prior to inference. GuardNet constructs structured graphs that combine sequential links, syntactic dependencies, and attention-derived token relations to capture both linguistic structure and contextual patterns indicative of jailbreak behavior. It then applies graph neural networks at two levels: (i) a prompt-level filter that detects global adversarial prompts, and (ii) a token-level filter that pinpoints fine-grained adversarial spans. Extensive experiments across three datasets and multiple attack settings show that GuardNet substantially outperforms prior defenses. Despite its structural complexity, GuardNet maintains acceptable latency and generalizes well in cross-domain evaluations, making it a practical and robust defense against jailbreak threats in real-world LLM deployments.

I. Introduction

Large Language Models (LLMs) have become central to a wide range of applications, powering systems in domains such as education [1], healthcare [2], finance [3], law [4], and customer support [5]. Their ability to understand and generate human-like text has enabled automation of complex tasks such as legal reasoning, clinical triage, financial analysis, and policy drafting. However, this general-purpose capability also makes LLMs vulnerable to misuse.
In particular, LLMs are highly susceptible to prompt-based adversarial attacks, especially jailbreak prompts [6], [7], which are carefully engineered inputs designed to bypass alignment constraints and elicit unauthorized or harmful responses.
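To make the graph construction summarized in the abstract more concrete, the following is a minimal sketch of how a prompt graph might combine the three edge types it names: sequential links, syntactic dependencies, and attention-derived token relations. It is an illustration under stated assumptions, not the authors' implementation. The function names (`build_prompt_graph`, `merged_adjacency`), the input format (one dependency head index per token, a square attention matrix), and the `attention_threshold` value are all hypothetical choices made for this example.

```python
from typing import Dict, List, Set, Tuple

Edge = Tuple[int, int]


def build_prompt_graph(
    tokens: List[str],
    dependency_heads: List[int],
    attention: List[List[float]],
    attention_threshold: float = 0.3,  # assumed cutoff, not taken from the paper
) -> Dict[str, Set[Edge]]:
    """Build three edge sets over token indices (hypothetical helper).

    dependency_heads[i] is the index of token i's syntactic head, or -1 for the root.
    attention[i][j] is the attention weight from token i to token j.
    """
    n = len(tokens)
    if len(dependency_heads) != n or len(attention) != n:
        raise ValueError("inputs must have one entry per token")
    if any(len(row) != n for row in attention):
        raise ValueError("attention must be an n x n matrix")

    # Sequential links: each token is connected to its successor.
    sequential = {(i, i + 1) for i in range(n - 1)}

    # Syntactic dependencies: an edge from each head to its dependent.
    syntactic = {
        (head, i)
        for i, head in enumerate(dependency_heads)
        if head >= 0 and head != i
    }

    # Attention-derived relations: keep only sufficiently strong weights.
    attention_edges = {
        (i, j)
        for i in range(n)
        for j in range(n)
        if i != j and attention[i][j] >= attention_threshold
    }

    return {
        "sequential": sequential,
        "syntactic": syntactic,
        "attention": attention_edges,
    }


def merged_adjacency(graph: Dict[str, Set[Edge]], n: int) -> List[List[int]]:
    """Collapse all edge types into one symmetric 0/1 adjacency matrix."""
    adj = [[0] * n for _ in range(n)]
    for edges in graph.values():
        for i, j in edges:
            adj[i][j] = 1
            adj[j][i] = 1
    return adj
```

In a full pipeline, the dependency heads would typically come from a syntactic parser and the attention matrix from a transformer layer, and the resulting adjacency (or the typed edge sets) would be passed to a graph neural network. Keeping the edge types separate until the final merge leaves room for a model that treats each relation differently.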