Themisto: Jupyter-Based Runtime Benchmark

Konstantin Grotov, Sergey Titov

arXiv.org Artificial Intelligence 

ABSTRACT

In this work, we present a benchmark that consists of Jupyter notebook development trajectories and measures how well large language models (LLMs) can leverage runtime information for predicting code output and for code generation. We demonstrate that the current generation of LLMs performs poorly on these tasks, and we argue that incorporating runtime context into code-based models is a significantly understudied direction.

1 INTRODUCTION

Recent developments in code completion and generation have been significant. Over the past several years, the field has progressed from generating relatively simple programs (Chen et al., 2021) to solving real-world issues within software repositories (Jimenez et al., 2023). However, most studies in this area are based on static snapshots of code (Jiang et al., 2024), with only a small body of research exploring the potential of leveraging dynamic code properties, such as runtime information and memory state, for code generation (Chen et al., 2024). A key reason for this limitation is that common programming environments rarely allow code generation during execution, which is when runtime information can be gathered.
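To make the notion of "runtime information" concrete, the sketch below shows one way such context could be gathered in an interactive session: snapshotting the interpreter's live variables so a model could condition on them when predicting the output of the next cell. This is a minimal illustration only, not the benchmark's actual collection pipeline; the helper `snapshot_runtime` and the simulated session are hypothetical.

```python
# Minimal sketch (not the paper's implementation) of gathering runtime
# context during execution: summarize live variables as (type, repr) pairs
# that could be serialized into a prompt for an LLM.

def snapshot_runtime(namespace):
    """Summarize live variables as {name: (type_name, truncated_repr)}."""
    summary = {}
    for name, value in namespace.items():
        if name.startswith("_"):
            continue  # skip interpreter internals such as __builtins__
        summary[name] = (type(value).__name__, repr(value)[:80])
    return summary

# Simulate a notebook session: cells executed so far define the kernel state.
state = {}
exec("import math\nradius = 2.0", state)
exec("area = math.pi * radius ** 2", state)

context = snapshot_runtime(state)
print(context["radius"])   # ('float', '2.0')
print(context["area"][0])  # 'float'
```

In a real Jupyter kernel, the same idea would apply to the user namespace after each cell execution; a static-snapshot model, by contrast, sees only the source code and never these concrete values.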