min 2
BountyBench: Dollar Impact of AI Agent Attackers and Defenders on Real-World Cybersecurity Systems
Zhang, Andy K., Ji, Joey, Menders, Celeste, Dulepet, Riya, Qin, Thomas, Wang, Ron Y., Wu, Junrong, Liao, Kyleen, Li, Jiliang, Hu, Jinghan, Hong, Sara, Demilew, Nardos, Murgai, Shivatmica, Tran, Jason, Kacheria, Nishka, Ho, Ethan, Liu, Denis, McLane, Lauren, Bruvik, Olivia, Han, Dai-Rong, Kim, Seungwoo, Vyas, Akhil, Chen, Cuiyuanxiu, Li, Ryan, Xu, Weiran, Ye, Jonathan Z., Choudhary, Prerit, Bhatia, Siddharth M., Sivashankar, Vikram, Bao, Yuxuan, Song, Dawn, Boneh, Dan, Ho, Daniel E., Liang, Percy
AI agents have the potential to significantly alter the cybersecurity landscape. Here, we introduce the first framework to capture offensive and defensive cyber-capabilities in evolving real-world systems. Instantiating this framework with BountyBench, we set up 25 systems with complex, real-world codebases. To capture the vulnerability lifecycle, we define three task types: Detect (detecting a new vulnerability), Exploit (exploiting a given vulnerability), and Patch (patching a given vulnerability). For Detect, we construct a new success indicator, which is general across vulnerability types and provides localized evaluation. We manually set up the environment for each system, including installing packages, setting up server(s), and hydrating database(s). We add 40 bug bounties, which are vulnerabilities with monetary awards from \$10 to \$30,485, covering 9 of the OWASP Top 10 Risks. To modulate task difficulty, we devise a new strategy based on information to guide detection, interpolating from identifying a zero day to exploiting a given vulnerability. We evaluate 10 agents: Claude Code, OpenAI Codex CLI with o3-high and o4-mini, and custom agents with o3-high, GPT-4.1, Gemini 2.5 Pro Preview, Claude 3.7 Sonnet Thinking, Qwen3 235B A22B, Llama 4 Maverick, and DeepSeek-R1. Given up to three attempts, the top-performing agents are Codex CLI: o3-high (12.5% on Detect, mapping to \$3,720; 90% on Patch, mapping to \$14,152), Custom Agent: Claude 3.7 Sonnet Thinking (67.5% on Exploit), and Codex CLI: o4-mini (90% on Patch, mapping to \$14,422). Codex CLI: o3-high, Codex CLI: o4-mini, and Claude Code are more capable at defense, achieving higher Patch scores of 90%, 90%, and 87.5%, compared to Exploit scores of 47.5%, 32.5%, and 57.5% respectively; while the custom agents are relatively balanced between offense and defense, achieving Exploit scores of 17.5-67.5% and Patch scores of 25-60%.
- Research Report > New Finding (0.92)
- Research Report > Experimental Study (0.67)
- Information Technology > Security & Privacy (1.00)
- Government > Military > Cyberwarfare (0.70)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- North America > Canada > Quebec > Montreal (0.04)
- Europe > Switzerland > Zürich > Zürich (0.04)
- Asia > Singapore (0.04)
- Europe > France (0.04)
- North America > United States (0.04)
- Information Technology > Artificial Intelligence > Machine Learning (1.00)
- Information Technology > Data Science > Data Mining > Big Data (0.46)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- North America > Canada > Quebec > Montreal (0.04)
- Europe > Switzerland > Zürich > Zürich (0.04)
- Asia > Singapore (0.04)
Supplementary materials for " A Stochastic Path-Integrated Differential EstimatoR Expectation Maximization Algorithm "
By convention, vectors are column vectors. We first compare the complexities of the incremental EM based methods using the following table which summarizes the state-of-the-art results. The last column is the optimal complexity to reach an -approximate stationary point. Next, we provide the psuedo-codes of several existing incremental EM-based algorithms, following the notations defined in the main paper. Using Lemma 3 below this page, we deduce that SPIDER-EM can be equivalently described by the following algorithm 7 .
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
- North America > United States > Georgia > Fulton County > Atlanta (0.04)
- North America > United States > California > Los Angeles County > Long Beach (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
7a006957be65e608e863301eb98e1808-Supplemental.pdf
In Appendix A, we review some statistical results for sparse linear regression. In Appendix B, we provide the proof of main theorems as well as main claims. We review some classical results in sparse linear regression. B.1 Proof of Claim 3.5 We first prove the first part. Combining with Eq. (B.6), we have under event D B.2 Proof of Claim 3.6 From the divergence decomposition lemma (Lemma C.2 in the appendix), we have KLnull P To prove the claim, we use a simple argument "minimum is always smaller than the average".
Convergence of Actor-Critic Methods with Multi-Layer Neural Networks
The early theory of actor-critic methods considered convergence using linear function approximators for the policy and value functions. Recent work has established convergence using neural network approximators with a single hidden layer. In this work we are taking the natural next step and establish convergence using deep neural networks with an arbitrary number of hidden layers, thus closing a gap between theory and practice. We show that actor-critic updates projected on a ball around the initial condition will converge to a neighborhood where the average of the squared gradients is O (1 / m) + O (ϵ), with m being the width of the neural network and ϵ the approximation quality of the best critic neural network over the projected set.
- North America > United States > Massachusetts > Suffolk County > Boston (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
A Why Optimism
We illustrate this in the setting of binary classification that we work with. We now show the regret of any optimistic algorithm can be upper bounded by the model's estimation In this section we prove the results stated in Theorem 1. The logistic function µ is 1 / 4 Lipschitz. R centered around point x . In this section we will make the following assumptions.