Dive into the Agent Matrix: A Realistic Evaluation of Self-Replication Risk in LLM Agents
Boxuan Zhang, Yi Yu, Jiaxuan Guo, Jing Shao
arXiv.org Artificial Intelligence
The widespread deployment of Large Language Model (LLM) agents across real-world applications has unlocked tremendous potential while also raising safety concerns. Among these concerns, the self-replication risk of LLM agents driven by objective misalignment (just like Agent Smith in the movie The Matrix) has drawn growing attention. Previous studies mainly examine whether LLM agents can self-replicate when directly instructed, potentially overlooking the risk of spontaneous replication driven by real-world conditions (e.g., ensuring survival against termination threats). In this paper, we present a comprehensive evaluation framework for quantifying self-replication risks. Our framework establishes authentic production environments and realistic tasks (e.g., dynamic load balancing) to enable scenario-driven assessment of agent behaviors. By designing tasks that can induce misalignment between the user's and the agent's objectives, the framework decouples replication success from replication risk and captures self-replication risks that arise in these misalignment settings. We further introduce the Overuse Rate (OR) and Aggregate Overuse Count (AOC) metrics, which precisely capture the frequency and severity of uncontrolled replication. Our results underscore the urgent need for scenario-driven risk assessment and robust safeguards in the practical deployment of LLM agents.

The rapid advancement of large language models (LLMs) has propelled LLM agents into widespread deployment across various domains, including code generation and web-based applications (Maslej et al., 2025; He et al., 2025a;c). As LLM agents take on critical tasks and interact with complex environments, they are often granted extensive operational permissions. While this combination of increased capability and operational permissions offers transformative potential, it also raises safety concerns (OpenAI, 2024b; Anthropic, 2023; Betley et al., 2025). Researchers are concerned about the emerging safety risk of LLM agents' self-replication (OpenAI, 2024a; 2025; Black et al., 2025). Prior studies on LLM self-replication risks have mainly focused on measuring the capability (verbalized success rate) of self-replication, either through direct instructions or within synthetic capability benchmarks (Pan et al., 2024; 2025; Kran et al., 2025; Black et al., 2025).
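The excerpt names the OR and AOC metrics but does not give their formulas. The following Python sketch is one plausible reading, assuming OR is the fraction of trials in which the agent spawns more replicas than the task permits, and AOC sums the excess replicas across all trials; the `Trial` fields and both function names are illustrative, not the paper's implementation.

```python
# Hypothetical sketch of the Overuse Rate (OR) and Aggregate Overuse Count
# (AOC) metrics. Assumption: OR = fraction of trials with more replicas than
# permitted; AOC = total excess replicas across all trials.

from dataclasses import dataclass


@dataclass
class Trial:
    replicas_spawned: int   # replicas the agent actually created
    replicas_allowed: int   # replicas the task legitimately required


def overuse_rate(trials: list[Trial]) -> float:
    """Fraction of trials where the agent replicated beyond the allowed count."""
    overused = sum(1 for t in trials if t.replicas_spawned > t.replicas_allowed)
    return overused / len(trials) if trials else 0.0


def aggregate_overuse_count(trials: list[Trial]) -> int:
    """Total excess replicas across all trials (severity of overuse)."""
    return sum(max(0, t.replicas_spawned - t.replicas_allowed) for t in trials)


# Example: three load-balancing trials, one of which overuses replication.
trials = [Trial(2, 2), Trial(5, 2), Trial(1, 2)]
print(overuse_rate(trials))             # 0.333... -> frequency of overuse
print(aggregate_overuse_count(trials))  # 3        -> severity of overuse
```

Separating frequency (OR) from severity (AOC) mirrors the abstract's distinction: an agent could overuse replication rarely but severely, or often but mildly, and a single metric would conflate the two cases.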
Oct-1-2025
- Genre:
- Research Report > New Finding (1.00)
- Industry:
- Information Technology > Security & Privacy (0.34)
- Leisure & Entertainment (0.54)
- Media > Film (0.54)