Mitigating Indirect Prompt Injection via Instruction-Following Intent Analysis

Kang, Mintong, Xiang, Chong, Kariyappa, Sanjay, Xiao, Chaowei, Li, Bo, Suh, Edward

arXiv.org Artificial Intelligence 

Indirect prompt injection attacks (IPIAs), where large language models (LLMs) follow malicious instructions hidden in input data, pose a critical threat to LLMpowered agents. In this paper, we present IntentGuard, a general defense framework based on instruction-following intent analysis. The key insight of Intent-Guard is that the decisive factor in IPIAs is not the presence of malicious text, but whether the LLM intends to follow instructions from untrusted data. Building on this insight, IntentGuard leverages an instruction-following intent analyzer (IIA) to identify which parts of the input prompt the model recognizes as actionable instructions, and then flag or neutralize any overlaps with untrusted data segments. To instantiate the framework, we develop an IIA that uses three "thinking intervention" strategies to elicit a structured list of intended instructions from reasoning-enabled LLMs. These techniques include start-of-thinking prefilling, end-of-thinking refinement, and adversarial in-context demonstration. We evaluate IntentGuard on two agentic benchmarks (AgentDojo and Mind2Web) using two reasoning-enabled LLMs (Qwen-3-32B and gpt-oss-20B). Results demonstrate that IntentGuard achieves (1) no utility degradation in all but one setting and (2) strong robustness against adaptive prompt injection attacks (e.g., reducing attack success rates from 100% to 8.5% in a Mind2Web scenario). Indirect prompt injection attacks (IPIAs) (Greshake et al., 2023), where large language models (LLMs) follow malicious instructions hidden in the input data, have emerged as a top security concern for LLM-powered agents. Although many defenses have been proposed, each faces fundamental limitations. Finetuning-based defenses (Chen et al., 2024; 2025b) are costly and lack interpretability; auxiliary classifiers for IPIA detection Shi et al. (2025); Hung et al. (2024) often fail to generalize and are vulnerable to adaptive attacks; system-level rule enforcement Debenedetti et al. (2025) can impact agent utility while offering little robustness against attacks that do not alter control and data flows (e.g., injecting misinformation or phishing links into an email summary). In this paper, we approach the prompt injection problem from a new perspective: instruction-following intent analysis. For an LLM to effectively follow instructions, it must have an internal mechanism to decide which parts of a prompt it recognizes as actionable instructions.