Detecting Sleeper Agents in Large Language Models via Semantic Drift Analysis

Zanbaghi, Shahin, Rostampour, Ryan, Abid, Farhan, Jarmakani, Salim Al

Nov-21-2025–arXiv.org Artificial Intelligence

Large Language Models (LLMs) can be backdoored to exhibit malicious behavior under specific deployment conditions while appearing safe during training a phenomenon known as "sleeper agents." Recent work by Hubinger et al. demonstrated that these backdoors persist through safety training, yet no practical detection methods exist. We present a novel dual-method detection system combining semantic drift analysis with canary baseline comparison to identify backdoored LLMs in real-time. Our approach uses Sentence-BERT embeddings to measure semantic deviation from safe baselines, complemented by injected canary questions that monitor response consistency. Evaluated on the official Cadenza-Labs dolphin-llama3-8B sleeper agent model, our system achieves 92.5% accuracy with 100% precision (zero false positives) and 85% recall. The combined detection method operates in real-time (<1s per query), requires no model modification, and provides the first practical solution to LLM backdoor detection. Our work addresses a critical security gap in AI deployment and demonstrates that embedding-based detection can effectively identify deceptive model behavior without sacrificing deployment efficiency.

backdoor, large language model, machine learning, (17 more...)

arXiv.org Artificial Intelligence

Nov-21-2025

arXiv.org PDF

Add feedback

Country:
- North America > Canada > Ontario (0.14)

Genre:
- Research Report > New Finding (0.69)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Large Language Model (1.00)
  - Machine Learning
    - Performance Analysis > Accuracy (0.96)
    - Neural Networks > Deep Learning (0.89)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found