Carbonneaux, Quentin
SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution
Wei, Yuxiang, Duchenne, Olivier, Copet, Jade, Carbonneaux, Quentin, Zhang, Lingming, Fried, Daniel, Synnaeve, Gabriel, Singh, Rishabh, Wang, Sida I.
The recent DeepSeek-R1 release has demonstrated the immense potential of reinforcement learning (RL) in enhancing the general reasoning capabilities of large language models (LLMs). While DeepSeek-R1 and other follow-up work primarily focus on applying RL to competitive coding and math problems, this paper introduces SWE-RL, the first approach to scale RL-based LLM reasoning for real-world software engineering. Leveraging a lightweight rule-based reward (e.g., the similarity score between ground-truth and LLM-generated solutions), SWE-RL enables LLMs to autonomously recover a developer's reasoning processes and solutions by learning from extensive open-source software evolution data -- the record of a software's entire lifecycle, including its code snapshots, code changes, and events such as issues and pull requests. Trained on top of Llama 3, our resulting reasoning model, Llama3-SWE-RL-70B, achieves a 41.0% solve rate on SWE-bench Verified -- a human-verified collection of real-world GitHub issues. To our knowledge, this is the best performance reported for medium-sized (<100B) LLMs to date, comparable even to leading proprietary LLMs like GPT-4o. Surprisingly, despite performing RL solely on software evolution data, Llama3-SWE-RL has developed generalized reasoning skills. For example, it shows improved results on five out-of-domain tasks, namely function coding, library use, code reasoning, mathematics, and general language understanding, whereas a supervised-finetuning baseline leads to performance degradation on average. Overall, SWE-RL opens up a new direction for improving the reasoning capabilities of LLMs through reinforcement learning on massive software engineering data.
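To make the rule-based reward concrete, below is a minimal sketch of a similarity-based reward of the kind the abstract describes, assuming the ground-truth and model-generated solutions are available as plain patch strings; the function name, the use of Python's difflib, and the penalty for empty output are illustrative assumptions, not the exact SWE-RL scoring rules.

```python
import difflib

def rule_based_reward(predicted_patch: str, oracle_patch: str) -> float:
    """Score a model-generated patch against the ground-truth patch.

    No tests are executed: the reward is just the textual similarity
    between the two patches, which keeps it cheap to compute at scale.
    """
    if not predicted_patch.strip():
        # Treat empty or whitespace-only generations as failures.
        return -1.0
    # SequenceMatcher.ratio() returns a similarity score in [0, 1].
    return difflib.SequenceMatcher(None, oracle_patch, predicted_patch).ratio()

# A near-miss patch receives a high but imperfect reward.
oracle = "-    return a - b\n+    return a + b\n"
prediction = "-    return a - b\n+    return a + abs(b)\n"
print(rule_based_reward(prediction, oracle))  # high, but below 1.0
```

The abstract's point is that even such a cheap, execution-free signal can drive RL on large volumes of real software evolution data.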
What I cannot execute, I do not understand: Training and Evaluating LLMs on Program Execution Traces
Armengol-Estapé, Jordi, Carbonneaux, Quentin, Zhang, Tianjun, Markosyan, Aram H., Seeker, Volker, Cummins, Chris, Kambadur, Melanie, O'Boyle, Michael F. P., Wang, Sida, Synnaeve, Gabriel, Leather, Hugh James
Code generation and understanding are critical capabilities for large language models (LLMs). Thus, most LLMs are pretrained and fine-tuned on code data. However, these datasets typically treat code as static strings and rarely exploit the dynamic information about their execution. Building upon previous work on trace modeling, we study Execution Tuning (E.T.), a training procedure in which we explicitly model real-world program execution traces without requiring manual test annotations. We train and evaluate models at different execution-trace granularities (line- and instruction-level) and with different strategies on the task of output prediction, obtaining 80% accuracy on CruxEval and MBPP, and showing the advantages of dynamic scratchpads (i.e., self-contained intermediate computations updated by the model rather than accumulated as a history of past computations) on long executions (up to 14k steps). Finally, we discuss E.T.'s practical applications.

Current state-of-the-art general-purpose LLMs are thought to contain considerable proportions of code in their pretraining data (OpenAI et al., 2024), which is known to improve reasoning capabilities even in tasks seemingly unrelated to code (Aryabumi et al., 2024). However, the datasets used to train code LLMs (such as Lozhkov et al. (2024)) typically treat code as static strings and rarely exploit the dynamic information about their execution. Executability is one of the key differences between code and natural language, yet most code datasets neglect dimensions of the code domain such as reasoning over code execution, which in turn could lead to better code understanding. This fundamental limitation has sparked renewed interest in modeling program executions, connecting with the pre-LLM neural program evaluation literature (Zaremba & Sutskever, 2014; Graves et al., 2014), which studied whether neural networks could learn to execute programs.
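As a concrete illustration of what a line-level execution trace can look like, here is a minimal sketch that records one with Python's built-in sys.settrace hook; the helper name, the (line number, locals) record format, and the gcd example are illustrative assumptions, not the trace representation actually used for E.T.

```python
import sys

def trace_execution(fn, *args):
    """Run fn(*args) and record a line-level execution trace.

    Each step stores the line number about to execute and a snapshot of
    the local variables at that point -- a simple stand-in for the kind
    of dynamic information that line-level trace modeling relies on.
    """
    steps = []

    def tracer(frame, event, arg):
        if event == "line" and frame.f_code is fn.__code__:
            steps.append((frame.f_lineno, dict(frame.f_locals)))
        return tracer

    sys.settrace(tracer)
    try:
        result = fn(*args)
    finally:
        sys.settrace(None)
    return result, steps

def gcd(a, b):
    while b:
        a, b = b, a % b
    return a

# Output prediction task: given the source and the trace, predict `result`.
result, trace = trace_execution(gcd, 48, 18)
print(result)  # 6
for lineno, local_vars in trace:
    print(lineno, local_vars)
```

Serializing traces like this alongside the source code yields training examples for output prediction at line-level granularity; an instruction-level variant could instead record per-opcode events (Python exposes these via frame.f_trace_opcodes).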