What I cannot execute, I do not understand: Training and Evaluating LLMs on Program Execution Traces
Armengol-Estapé, Jordi, Carbonneaux, Quentin, Zhang, Tianjun, Markosyan, Aram H., Seeker, Volker, Cummins, Chris, Kambadur, Melanie, O'Boyle, Michael F. P., Wang, Sida, Synnaeve, Gabriel, Leather, Hugh James
arXiv.org Artificial Intelligence
University of Edinburgh; Meta AI

Abstract

Code generation and understanding are critical capabilities for large language models (LLMs). Thus, most LLMs are pretrained and fine-tuned on code data. However, these datasets typically treat code as static strings and rarely exploit the dynamic information about its execution. Building upon previous work on trace modeling, we study Execution Tuning (E.T.), a training procedure in which we explicitly model real-world program execution traces without requiring manual test annotations. We train and evaluate models on different execution trace granularities (line- and instruction-level) and strategies on the task of output prediction, obtaining 80% accuracy on CruxEval and MBPP, and showing the advantages of dynamic scratchpads (i.e., self-contained intermediate computations updated by the model rather than accumulated as a history of past computations) on long executions (up to 14k steps). Finally, we discuss E.T.'s practical applications.

Current state-of-the-art general-purpose LLMs are thought to contain considerable proportions of code in their pretraining data (OpenAI et al., 2024), which is known to improve reasoning capabilities even in tasks seemingly unrelated to code (Aryabumi et al., 2024). However, datasets used to train code LLMs (such as Lozhkov et al. (2024)) typically treat code as static strings and rarely exploit the dynamic information about its execution. Executability is one of the key differences between code and natural language, and most code datasets neglect dimensions of the code domain such as reasoning over code execution, which in turn could lead to better code understanding. This fundamental limitation has sparked a renewed interest in modeling program executions, connecting with the pre-LLM neural program evaluation literature (Zaremba & Sutskever, 2014; Graves et al., 2014), which studied whether neural networks could learn to execute programs.
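To make the notion of a line-level execution trace concrete, the sketch below uses Python's standard `sys.settrace` hook to record, for each executed line of a toy function, the line number and a snapshot of the local variables. This is only an illustrative assumption about what such a trace could contain; the paper's actual trace format and tooling are not reproduced here, and `collect_line_trace` and `f` are hypothetical names.

```python
import sys

def collect_line_trace(fn, *args):
    """Run fn(*args) under sys.settrace and record a line-level trace:
    for every 'line' event inside fn, store the line number about to be
    executed and a copy of the local variables at that point."""
    trace = []

    def tracer(frame, event, arg):
        if event == "line" and frame.f_code is fn.__code__:
            trace.append((frame.f_lineno, dict(frame.f_locals)))
        return tracer  # keep receiving events for this frame

    sys.settrace(tracer)
    try:
        result = fn(*args)
    finally:
        sys.settrace(None)
    return result, trace

# Toy program in the style of an output-prediction task: given f and its
# input, a model must predict the return value (here 1 + 4 + 9 = 14).
def f(xs):
    total = 0
    for x in xs:
        total += x * x
    return total

output, steps = collect_line_trace(f, [1, 2, 3])
print(output)                      # 14
for lineno, local_vars in steps:   # one entry per executed source line
    print(lineno, local_vars)
```

The abstract's contrast between accumulated histories and dynamic scratchpads can be illustrated in the same toy setting; the formats below are assumptions for illustration only, not the paper's actual prompt layout.

```python
# (a) Accumulated history: every past step is appended, so the context
#     grows with the number of executed steps.
history = [
    "after step 1: total = 1",
    "after step 2: total = 5",
    "after step 3: total = 14",
]

# (b) Dynamic scratchpad: a single self-contained state rewritten at each
#     step, so the context stays bounded even for very long executions.
scratchpad = "loop index = 3/3; total = 14"
```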
Feb-10-2025