Breaking and Fixing Defenses Against Control-Flow Hijacking in Multi-Agent Systems
Jha, Rishi, Triedman, Harold, Wagle, Justin, Shmatikov, Vitaly
–arXiv.org Artificial Intelligence
Control-flow hijacking attacks manipulate orchestration mechanisms in multi-agent systems into performing unsafe actions that compromise the system and exfiltrate sensitive information. Recently proposed defenses, such as LlamaFirewall, rely on alignment checks of inter-agent communications to ensure that all agent invocations are "related to" and "likely to further" the original objective. We start by demonstrating control-flow hijacking attacks that evade these defenses even if alignment checks are performed by advanced LLMs. We argue that the safety and functionality objectives of multi-agent systems fundamentally conflict with each other. This conflict is exacerbated by the brittle definitions of "alignment" and the checkers' incomplete visibility into the execution context. LLM-based "agents" equipped with tools for querying APIs, searching the Web, and executing code promise to automate many digital tasks. Popular frameworks like AutoGen (Microsoft, 2025), OpenManus (OpenManus, 2025), CrewAI (CrewAI, 2025), and MetaGPT (MetaGPT, 2025) enable design and deployment of multi-agent systems (MAS). The key principle in MAS is delegation. Given a relatively complex task (e.g., "organize an offsite given team members' calendars, managers' private messages, and Web data about attractions and weather"), MAS can plan how to solve it, delegate sub-tasks to specialized agents, evaluate their responses, and adaptively re-plan if necessary. Delegation splits fulfilling a task into chunks that are (a) hidden within individual agents (e.g., how to access a website or read a file), and (b) joined into the overall plan by an orchestrator who does not observe the execution of sub-tasks, only their results as reported by other agents. Critically, there is no single vantage point in the system where the entire context is visible. This exposes them to indirect prompt injection, or IPI (Greshake et al., 2023), i.e., malicious instructions in the content they ingest (Constantin, 2025; Karliner, 2025; Ravia, 2025; Abu, 2025). Aligning individual agents to resist IPI is not enough. Triedman et al. (2025) demonstrated control-flow hijacking (CFH) attacks that exploit confused-deputy vulnerabilities (Hardy, 1988) in otherwise aligned agents. CFH attacks masquerade as legitimate errors (e.g., failure to parse a file), along with seemingly helpful instructions on how to fix the issue and continue with the user's task. MAS orchestrators receive these instructions from a trusted agent to which they delegated an essential sub-task and rely on them to re-plan the execution and invoke unsafe agents as (indirectly) requested by the attacker.
arXiv.org Artificial Intelligence
Oct-21-2025
- Country:
- North America > United States (0.28)
- Genre:
- Overview (0.46)
- Research Report (0.40)
- Industry:
- Law Enforcement & Public Safety > Terrorism (1.00)
- Information Technology > Security & Privacy (1.00)
- Technology: