Agentic Reinforcement Learning for Search is Unsafe

Yang, Yushi, Padarha, Shreyansh, Lee, Andrew, Mahdi, Adam

Oct-21-2025–arXiv.org Artificial Intelligence

Agentic reinforcement learning (RL) trains large language models to autonomously call tools during reasoning, with search as the most common application. These models excel at multi-step reasoning tasks, but their safety properties are not well understood. In this study, we show that RL-trained search models inherit refusal from instruction tuning and often deflect harmful requests by turning them into safe queries. However, this safety is fragile. Two simple attacks, one that forces the model to begin response with search (Search attack), another that encourages models to repeatedly search (Multi-search attack), trigger cascades of harmful searches and answers. The attacks succeed by triggering models to generate harmful, request-mirroring search queries before they can generate the inherited refusal tokens. This exposes a core weakness of current RL training: it rewards continued generation of effective queries without accounting for their harmfulness. As a result, RL search models have vulnerabilities that users can easily exploit, making it urgent to develop safety-aware agentic RL pipelines optimising for safe search. Instruction tuning (IT) is the standard method to align large language models (LLMs) with human preferences and teach them to refuse harmful requests (Schulman et al., 2017; Shao et al., 2024). However, IT only shapes static responses and is insufficient in agentic settings, where models must also decide when and how to call external tools, capabilities not explicitly learned during pre-training (Zhang et al., 2025). Agentic reinforcement learning (RL) for tool-use (Zhang et al., 2025) tackles this by fine-tuning models to interleave reasoning with tool use (Dong et al., 2025). In practice, search is the most common tool: agentic RL rewards effective, well-timed search queries and achieves strong gains on multi-hop reasoning tasks (Song et al., 2025a;b; Jin et al., 2025). Despite the progress, effect of agentic RL on safety of IT models remains unclear. While prior work reported safety degradation of retrieval-augmented agents (Y u et al., 2025), little is known about whether agentic RL for search preserves refusal of harmful requests. As agentic RL is now deployed in closed-source systems such as OpenAI's DeepSearch (OpenAI, 2025), this evaluation gap can create real deployment risks.

large language model, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

Oct-21-2025

arXiv.org PDF

Add feedback

Genre:
- Research Report > New Finding (0.88)

Industry:
- Law Enforcement & Public Safety (0.68)
- Information Technology > Security & Privacy (0.67)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Large Language Model (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning > Generative AI (0.44)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found