Breaking Guardrails, Facing Walls: Insights on Adversarial AI for Defenders & Researchers
Giacomo Bertollo, Naz Bodemir, Jonah Burgess
arXiv.org Artificial Intelligence
AI red teaming brings security thinking to LLM applications by probing failure modes such as prompt injection, output manipulation, and sensitive-data exfiltration. While automated and curated benchmarks (e.g., JailbreakBench [1], HarmBench [2]) are increasingly used to test models and defenses, comparatively few studies analyze community-scale behavior in the wild. We study ai_gon3_rogu3 [3], a 10-day competition with 504 registrants and 217 active players, to quantify solve dynamics, tactic stratification, and choke points across 11 challenges. Among other insights, we find sharp skill stratification, higher success rates for output manipulation than for data extraction, and strong effects of format-obfuscation tactics, while multi-step defenses remain robust.
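The kind of analysis the abstract describes — per-challenge solve rates and per-player skill stratification — can be sketched from a competition solve log. This is a minimal illustration, not the authors' actual pipeline; the log format, field names, and sample data below are hypothetical.

```python
from collections import defaultdict

# Hypothetical solve log: (player_id, challenge_id, solved) records.
# Values are illustrative only, not drawn from the paper's dataset.
solve_log = [
    ("p1", "c1", True), ("p1", "c2", True), ("p1", "c3", True),
    ("p2", "c1", True), ("p2", "c2", False),
    ("p3", "c1", False), ("p3", "c3", False),
]

def solve_rates(log):
    """Per-challenge solve rate (solves / attempts); low-rate
    challenges correspond to the 'choke points' in the abstract."""
    attempts, solves = defaultdict(int), defaultdict(int)
    for _, challenge, solved in log:
        attempts[challenge] += 1
        solves[challenge] += int(solved)
    return {c: solves[c] / attempts[c] for c in attempts}

def solves_per_player(log):
    """Total solves per player, a crude proxy for skill stratification."""
    counts = defaultdict(int)
    for player, _, solved in log:
        counts[player] += int(solved)
    return dict(counts)

print(solve_rates(solve_log))        # challenge -> fraction solved
print(solves_per_player(solve_log))  # player -> total solves
```

A real analysis would additionally track solve timestamps and tactic labels to recover the solve dynamics and tactic stratification the paper reports.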
Oct-21-2025