Themisto: Jupyter-Based Runtime Benchmark

Konstantin Grotov, Sergey Titov

arXiv.org Artificial Intelligence 

ABSTRACT

In this work, we present a benchmark that consists of Jupyter notebook development trajectories and measures how well large language models (LLMs) can leverage runtime information for predicting code output and for code generation. We demonstrate that the current generation of LLMs performs poorly on these tasks, and we argue that incorporating runtime context into code-based models is a significantly understudied direction.

1 INTRODUCTION

Recent developments in code completion and generation have been significant. Over the past several years, the field has progressed from generating relatively simple programs (Chen et al., 2021) to solving real-world issues within software repositories (Jimenez et al., 2023). However, most studies in this area are based on static snapshots of code (Jiang et al., 2024), with only a small body of research exploring the potential of leveraging dynamic code properties, such as runtime information and memory state, for code generation (Chen et al., 2024). A key reason for this limitation is that common programming environments rarely allow code generation during execution, which is when runtime information can be gathered.
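To make the notion of "runtime information" concrete, the sketch below shows one way such context could be gathered in an interactive session: snapshotting the interpreter's live variables so a model could condition on them when predicting the output of the next cell. This is a minimal illustration only, not the benchmark's actual collection pipeline; the helper `snapshot_runtime` and the simulated session are hypothetical.

```python
# Minimal sketch (not the paper's implementation) of gathering runtime
# context during execution: summarize live variables as (type, repr) pairs
# that could be serialized into a prompt for an LLM.

def snapshot_runtime(namespace):
    """Summarize live variables as {name: (type_name, truncated_repr)}."""
    summary = {}
    for name, value in namespace.items():
        if name.startswith("_"):
            continue  # skip interpreter internals such as __builtins__
        summary[name] = (type(value).__name__, repr(value)[:80])
    return summary

# Simulate a notebook session: cells executed so far define the kernel state.
state = {}
exec("import math\nradius = 2.0", state)
exec("area = math.pi * radius ** 2", state)

context = snapshot_runtime(state)
print(context["radius"])   # ('float', '2.0')
print(context["area"][0])  # 'float'
```

In a real Jupyter kernel, the same idea would apply to the user namespace after each cell execution; a static-snapshot model, by contrast, sees only the source code and never these concrete values.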