RL Is a Hammer and LLMs Are Nails: A Simple Reinforcement Learning Recipe for Strong Prompt Injection

Wen, Yuxin, Zharmagambetov, Arman, Evtimov, Ivan, Kokhlikyan, Narine, Goldstein, Tom, Chaudhuri, Kamalika, Guo, Chuan

Oct-7-2025–arXiv.org Artificial Intelligence

Prompt injection poses a serious threat to the reliability and safety of LLM agents. Recent defenses against prompt injection, such as Instruction Hierarchy and SecAlign, have shown notable robustness against static attacks. However, to more thoroughly evaluate the robustness of these defenses, it is arguably necessary to employ strong attacks such as automated red-teaming. To this end, we introduce RL-Hammer, a simple recipe for training attacker models that automatically learn to perform strong prompt injections and jailbreaks via reinforcement learning. RL-Hammer requires no warm-up data and can be trained entirely from scratch. To achieve high ASRs against industrial-level models with defenses, we propose a set of practical techniques that enable highly effective, universal attacks. Using this pipeline, RL-Hammer reaches a 98% ASR against GPT -4o and a 72% ASR against GPT -5 with the Instruction Hierarchy defense. We further discuss the challenge of achieving high diversity in attacks, highlighting how attacker models tend to "reward-hack" diversity objectives. Finally, we show that RL-Hammer can evade multiple prompt injection detectors. We hope our work advances automatic red-teaming and motivates the development of stronger, more principled defenses. More recently, a new paradigm has emerged that allows LLMs to behave as autonomous agents in complex environments, including full-fledged operating systems, integrated software platforms, and multi-step tool pipelines. In these contexts, LLMs can function as coding assistants, system administrators, and even academic researchers. Notable examples include Microsoft Copilot (GitHub, 2025), Anthropic Claude Computer Use (Anthropic, 2024), OpenAI Operator (OpenAI, 2025), and Zochi (Intology, 2025), each demonstrating the potential to combine sophisticated reasoning with direct system control. As these capabilities continue to advance, LLM agents are expected to be integrated into an even broader range of systems, becoming indispensable in both consumer and enterprise applications. However, these capabilities also introduce significant security risks, most notably prompt injection.

large language model, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

Oct-7-2025

arXiv.org PDF

Add feedback

Genre:
- Research Report (0.40)

Industry:
- Information Technology (0.68)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Large Language Model (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning > Generative AI (0.45)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found