How Not to Detect Prompt Injections with an LLM
Choudhary, Sarthak, Anshumaan, Divyam, Palumbo, Nils, Jha, Somesh
–arXiv.org Artificial Intelligence
LLM-integrated applications and agents are vulnerable to prompt injection attacks, where adversaries embed malicious instructions within seemingly benign input data to manipulate the LLM's intended behavior. Recent defenses based on known-answer detection (KAD) scheme have reported near-perfect performance by observing an LLM's output to classify input data as clean or contaminated. KAD attempts to repurpose the very susceptibility to prompt injection as a defensive mechanism. We formally characterize the KAD scheme and uncover a structural vulnerability that invalidates its core security premise. To exploit this fundamental vulnerability, we methodically design an adaptive attack, DataFlip. It consistently evades KAD defenses, achieving detection rates as low as $0\%$ while reliably inducing malicious behavior with a success rate of $91\%$, all without requiring white-box access to the LLM or any optimization procedures.
arXiv.org Artificial Intelligence
Dec-9-2025
- Country:
- Asia > Taiwan
- Taiwan Province > Taipei (0.05)
- Europe
- Spain > Catalonia
- Barcelona Province > Barcelona (0.04)
- Switzerland > Basel-City
- Basel (0.04)
- Spain > Catalonia
- North America > United States
- California > Santa Barbara County
- Santa Barbara (0.04)
- New York > New York County
- New York City (0.04)
- Wisconsin > Dane County
- Madison (0.14)
- California > Santa Barbara County
- South America > Chile
- Asia > Taiwan
- Genre:
- Research Report (1.00)
- Industry:
- Information Technology > Security & Privacy (1.00)
- Technology: