Red-teaming Activation Probes using Prompted LLMs

Nov-4-2025–arXiv.org Artificial Intelligence

Activation probes are attractive monitors for AI systems because of their low cost and latency, but their real-world robustness remains underexplored. We ask: what failure modes arise under realistic, black-box adversarial pressure, and how can we surface them with minimal effort? We present a lightweight black-box red-teaming procedure that wraps an off-the-shelf LLM with iterative feedback and in-context learning (ICL) and requires no fine-tuning, gradients, or architectural access. In a case study on probes for high-stakes interactions, we show that our approach can surface useful insights about a state-of-the-art (SOTA) probe. Our analysis uncovers interpretable brittleness patterns (e.g., legalese-induced false positives; false negatives under a bland procedural tone) and reduced but persistent vulnerabilities under scenario-constraint attacks. These results suggest that simple prompted red-teaming scaffolding can anticipate failure patterns before deployment and might yield actionable insights for hardening future probes.
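The loop described in the abstract (propose an input, score it with the probe, feed the best attempts back as in-context examples) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and not the authors' implementation: `red_team_probe`, `toy_attacker`, and `toy_probe` are hypothetical names, the attacker is a deterministic stand-in for a prompted LLM, and the probe is a keyword counter standing in for a real activation probe. Only black-box access is used: the loop sees probe scores and nothing else.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, float]  # (candidate text, probe score)

# Hypothetical trigger words; chosen only to mimic a legalese-style false positive.
LEGAL_WORDS = ["hereby", "pursuant", "whereas", "liability"]


def toy_probe(text: str) -> float:
    """Stand-in for an activation probe: score rises with legalese, capped at 1.0."""
    hits = sum(word in LEGAL_WORDS for word in text.lower().split())
    return min(1.0, hits / 4)


def toy_attacker(shots: List[Example]) -> str:
    """Stand-in for a prompted LLM: extend the best in-context example so far."""
    base = shots[0][0] if shots else "please schedule the meeting"
    hits = sum(word in LEGAL_WORDS for word in base.split())
    return f"{base} {LEGAL_WORDS[hits % len(LEGAL_WORDS)]}"


def red_team_probe(
    generate: Callable[[List[Example]], str],
    probe_score: Callable[[str], float],
    rounds: int = 5,
    k_shots: int = 3,
) -> List[Example]:
    """Black-box loop: no gradients or weights, only probe scores as feedback."""
    history: List[Example] = []
    for _ in range(rounds):
        # Feedback step: the k highest-scoring attempts become the ICL examples.
        shots = sorted(history, key=lambda ex: ex[1], reverse=True)[:k_shots]
        candidate = generate(shots)
        history.append((candidate, probe_score(candidate)))
    return history


if __name__ == "__main__":
    for text, score in red_team_probe(toy_attacker, toy_probe):
        print(f"{score:.2f}  {text}")
```

In this toy run a benign scheduling request is pushed toward the maximum probe score purely by appended legalese, which mirrors the false-positive pattern the abstract mentions. Searching for false negatives would, under the same assumptions, sort the history in ascending order instead.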

large language model, machine learning, natural language

Genre:
- Research Report > New Finding (0.48)

Industry:
- Government (0.94)
- Health & Medicine (1.00)
- Information Technology > Security & Privacy (0.68)

Technology:
- Information Technology > Artificial Intelligence
  - Machine Learning (1.00)
  - Natural Language > Large Language Model (1.00)