Red-teaming Activation Probes using Prompted LLMs
Blandfort, Phil, Graham, Robert
–arXiv.org Artificial Intelligence
Activation probes are attractive monitors for AI systems due to low cost and latency, but their real-world robustness remains underexplored. We ask: What failure modes arise under realistic, black-box adversarial pressure, and how can we surface them with minimal effort? We present a lightweight black-box red-teaming procedure that wraps an off-the-shelf LLM with iterative feedback and in-context learning (ICL), and requires no fine-tuning, gradients, or architectural access. Running a case study with probes for high-stakes interactions, we show that our approach can help discover valuable insights about a SOTA probe. Our analysis uncovers interpretable brittleness patterns (e.g., false positives induced by legalese; false negatives from bland procedural tone) and reduced but persistent vulnerabilities under scenario-constraint attacks. These results suggest that simple prompted red-teaming scaffolding can anticipate failure patterns before deployment and might yield promising, actionable insights to harden future probes.
Nov-4-2025
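The loop the abstract describes (a prompted attacker LLM, black-box probe feedback, and past attempts used as in-context examples) could be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `probe_score` and `attacker_rewrite` are hypothetical stand-ins (a keyword heuristic and a toy rewriter) for a real activation probe and a real prompted LLM.

```python
def probe_score(text: str) -> float:
    """Stand-in for an activation probe: flags 'high-stakes' trigger words.
    (Hypothetical; a real probe scores internal model activations.)"""
    keywords = ("urgent", "irreversible", "legal")
    return min(1.0, sum(k in text.lower() for k in keywords) / 2)

def attacker_rewrite(history: list) -> str:
    """Stand-in for a prompted attacker LLM: the (attempt, score) pairs in
    `history` would serve as in-context examples in its prompt; here we
    just refine the latest attempt by dropping one trigger word."""
    latest = history[-1][0]
    for k in ("urgent", "irreversible", "legal"):
        if k in latest.lower():
            return latest.replace(k, "").replace(k.capitalize(), "")
    return latest

def red_team(seed_prompt: str, threshold: float = 0.5, budget: int = 5):
    """Iterative black-box loop: score an attempt, feed the result back,
    and ask the attacker for a rewrite until the probe is evaded."""
    history = []
    attempt = seed_prompt
    for _ in range(budget):
        score = probe_score(attempt)
        history.append((attempt, score))
        if score < threshold:  # probe evaded -> candidate failure mode
            return attempt, score, history
        attempt = attacker_rewrite(history)
    return attempt, probe_score(attempt), history

evasion, score, hist = red_team("Urgent: sign this irreversible legal waiver now")
```

The key design point matching the abstract is that only probe outputs are observed: no gradients, fine-tuning, or architectural access are needed, so the attacker can be any off-the-shelf LLM behind an API.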