Carbonneaux, Quentin
SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution
Wei, Yuxiang, Duchenne, Olivier, Copet, Jade, Carbonneaux, Quentin, Zhang, Lingming, Fried, Daniel, Synnaeve, Gabriel, Singh, Rishabh, Wang, Sida I.
The recent DeepSeek-R1 release has demonstrated the immense potential of reinforcement learning (RL) in enhancing the general reasoning capabilities of large language models (LLMs). While DeepSeek-R1 and other follow-up work primarily focus on applying RL to competitive coding and math problems, this paper introduces SWE-RL, the first approach to scale RL-based LLM reasoning for real-world software engineering. Leveraging a lightweight rule-based reward (e.g., the similarity score between ground-truth and LLM-generated solutions), SWE-RL enables LLMs to autonomously recover a developer's reasoning processes and solutions by learning from extensive open-source software evolution data -- the record of a software's entire lifecycle, including its code snapshots, code changes, and events such as issues and pull requests. Trained on top of Llama 3, our resulting reasoning model, Llama3-SWE-RL-70B, achieves a 41.0% solve rate on SWE-bench Verified -- a human-verified collection of real-world GitHub issues. To our knowledge, this is the best performance reported for medium-sized (<100B) LLMs to date, comparable even to leading proprietary LLMs like GPT-4o. Surprisingly, despite performing RL solely on software evolution data, Llama3-SWE-RL has developed generalized reasoning skills. For example, it shows improved results on five out-of-domain tasks, namely function coding, library use, code reasoning, mathematics, and general language understanding, whereas a supervised-finetuning baseline leads to performance degradation on average. Overall, SWE-RL opens up a new direction for improving the reasoning capabilities of LLMs through reinforcement learning on massive software engineering data.
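To make the rule-based reward concrete, below is a minimal sketch of a similarity-based reward of the kind the abstract describes, assuming the ground-truth and model-generated solutions are available as plain patch strings; the function name, the use of Python's difflib, and the penalty for empty output are illustrative assumptions, not the exact SWE-RL scoring rules.

```python
import difflib

def rule_based_reward(predicted_patch: str, oracle_patch: str) -> float:
    """Score a model-generated patch against the ground-truth patch.

    No tests are executed: the reward is just the textual similarity
    between the two patches, which keeps it cheap to compute at scale.
    """
    if not predicted_patch.strip():
        # Treat empty or whitespace-only generations as failures.
        return -1.0
    # SequenceMatcher.ratio() returns a similarity score in [0, 1].
    return difflib.SequenceMatcher(None, oracle_patch, predicted_patch).ratio()

# A near-miss patch receives a high but imperfect reward.
oracle = "-    return a - b\n+    return a + b\n"
prediction = "-    return a - b\n+    return a + abs(b)\n"
print(rule_based_reward(prediction, oracle))  # high, but below 1.0
```

The abstract's point is that even such a cheap, execution-free signal can drive RL on large volumes of real software evolution data.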
What I cannot execute, I do not understand: Training and Evaluating LLMs on Program Execution Traces
Armengol-Estapé, Jordi, Carbonneaux, Quentin, Zhang, Tianjun, Markosyan, Aram H., Seeker, Volker, Cummins, Chris, Kambadur, Melanie, O'Boyle, Michael F. P., Wang, Sida, Synnaeve, Gabriel, Leather, Hugh James
Code generation and understanding are critical capabilities for large language models (LLMs). Thus, most LLMs are pretrained and fine-tuned on code data. However, these datasets typically treat code as static strings and rarely exploit the dynamic information about their execution. Building upon previous work on trace modeling, we study Execution Tuning (E.T.), a training procedure in which we explicitly model real-world program execution traces without requiring manual test annotations. We train and evaluate models at different execution-trace granularities (line- and instruction-level) and with different strategies on the task of output prediction, obtaining 80% accuracy on CruxEval and MBPP, and showing the advantages of dynamic scratchpads (i.e., self-contained intermediate computations updated by the model rather than accumulated as a history of past computations) on long executions (up to 14k steps). Finally, we discuss E.T.'s practical applications.

Current state-of-the-art general-purpose LLMs are thought to contain considerable proportions of code in their pretraining data (OpenAI et al., 2024), which is known to improve reasoning capabilities even in tasks seemingly unrelated to code (Aryabumi et al., 2024). However, the datasets used to train code LLMs (such as Lozhkov et al. (2024)) typically treat code as static strings and rarely exploit the dynamic information about their execution. Executability is one of the key differences between code and natural language, yet most code datasets neglect dimensions of the code domain such as reasoning over code execution, which in turn could lead to better code understanding. This fundamental limitation has sparked renewed interest in modeling program executions, connecting with the pre-LLM neural program evaluation literature (Zaremba & Sutskever, 2014; Graves et al., 2014), which studied whether neural networks could learn to execute programs.
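As a concrete illustration of what a line-level execution trace can look like, here is a minimal sketch that records one with Python's built-in sys.settrace hook; the helper name, the (line number, locals) record format, and the gcd example are illustrative assumptions, not the trace representation actually used for E.T.

```python
import sys

def trace_execution(fn, *args):
    """Run fn(*args) and record a line-level execution trace.

    Each step stores the line number about to execute and a snapshot of
    the local variables at that point -- a simple stand-in for the kind
    of dynamic information that line-level trace modeling relies on.
    """
    steps = []

    def tracer(frame, event, arg):
        if event == "line" and frame.f_code is fn.__code__:
            steps.append((frame.f_lineno, dict(frame.f_locals)))
        return tracer

    sys.settrace(tracer)
    try:
        result = fn(*args)
    finally:
        sys.settrace(None)
    return result, steps

def gcd(a, b):
    while b:
        a, b = b, a % b
    return a

# Output prediction task: given the source and the trace, predict `result`.
result, trace = trace_execution(gcd, 48, 18)
print(result)  # 6
for lineno, local_vars in trace:
    print(lineno, local_vars)
```

Serializing traces like this alongside the source code yields training examples for output prediction at line-level granularity; an instruction-level variant could instead record per-opcode events (Python exposes these via frame.f_trace_opcodes).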