EnvBench: A Benchmark for Automated Environment Setup
Aleksandra Eliseeva, Alexander Kovrigin, Ilia Kholkin, Egor Bogomolov, Yaroslav Zharov
Recent advances in Large Language Models (LLMs) have enabled researchers to focus on practical repository-level tasks in the software engineering domain. In this work, we consider a cornerstone task for automating work with software repositories: environment setup, i.e., the task of configuring a repository-specific development environment on a system. Existing studies on environment setup introduce innovative agentic strategies, but their evaluation is often based on small datasets that may not capture the full range of configuration challenges encountered in practice. To address this gap, we introduce EnvBench, a benchmark for automated environment setup. To enable further benchmark extension and usage for model tuning, we implement two automatic metrics: a static analysis check for missing imports in Python and a compilation check for JVM languages. We demonstrate the applicability of our benchmark by evaluating three environment setup approaches, including a simple zero-shot baseline and two agentic workflows, which we test with two powerful LLM backbones, GPT-4o and GPT-4o-mini. The best approach successfully configures 6.69% of Python repositories and 29.47% of JVM repositories, suggesting that EnvBench remains challenging for current approaches. The dataset and experiment trajectories are available at https://jb.gg/envbench.

Recent advances in Large Language Models (LLMs) have enabled their application across many domains, including software engineering (Fan et al., 2023). In this work, we focus on a repository-level task that programmers face regularly: environment setup, i.e., configuring the system to work with an arbitrary software project, for instance, a freshly cloned GitHub repository. It usually entails installing the dependencies but might include arbitrary project-specific steps, such as installing additional system packages, setting the correct environment variables, and more. A well-maintained project should be straightforward to set up; in practice, however, this is not always the case. For instance, setting up a repository is perceived to be the most challenging part of reproducing Natural Language Processing (NLP) research results (Storks et al., 2023) and may take up to several hours.
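To make the two automatic metrics described above concrete, here is a minimal sketch of a missing-import check for Python. The abstract does not name the exact tooling, so this illustration relies only on the standard library: it parses every .py file with ast, collects top-level imported names, and flags those that resolve neither to an installed package nor to a module inside the repository itself. The function name missing_imports and the local-module heuristic are illustrative assumptions; a real checker would have to run under the interpreter of the environment being evaluated.

```python
import ast
import importlib.util
import sys
from pathlib import Path

def missing_imports(repo_root: str) -> set[str]:
    """Hypothetical sketch: collect top-level imports that resolve neither
    to an installed package nor to a module shipped inside the repository.
    Run this with the interpreter of the environment under evaluation,
    since importlib.util.find_spec checks the current interpreter."""
    root = Path(repo_root)
    missing: set[str] = set()
    for path in root.rglob("*.py"):
        try:
            tree = ast.parse(path.read_text(encoding="utf-8"))
        except (SyntaxError, UnicodeDecodeError):
            continue  # skip files that do not parse
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                names = [alias.name for alias in node.names]
            elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
                names = [node.module]  # relative imports are intra-repo, skip them
            else:
                continue
            for name in names:
                top = name.split(".")[0]
                # Crude heuristic for modules defined in the repo itself.
                if (root / top).exists() or (root / f"{top}.py").exists():
                    continue
                try:
                    if importlib.util.find_spec(top) is None:
                        missing.add(top)
                except (ImportError, ValueError):
                    missing.add(top)
    return missing

if __name__ == "__main__":
    print(sorted(missing_imports(sys.argv[1])))
```

An empty result would count the repository as successfully configured under such a metric; any unresolved name points to a dependency the setup script failed to install.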
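Similarly, the compilation check for JVM languages can be approximated by invoking the repository's build tool and inspecting its exit code. In the sketch below, the helper jvm_compiles (a hypothetical name) prefers the Gradle wrapper and falls back to Maven; the actual benchmark harness and the exact build tasks it invokes may differ.

```python
import subprocess
from pathlib import Path

def jvm_compiles(repo_root: str, timeout: int = 1800) -> bool:
    """Hypothetical sketch: return True if the main sources compile.
    Prefers the Gradle wrapper, then Maven; real harnesses may differ."""
    root = Path(repo_root)
    if (root / "gradlew").exists():
        # 'classes' compiles main sources in typical Java/Kotlin builds.
        cmd = ["./gradlew", "classes", "--no-daemon"]
    elif (root / "pom.xml").exists():
        cmd = ["mvn", "-q", "compile"]
    else:
        return False  # no recognized build configuration
    try:
        result = subprocess.run(cmd, cwd=root, timeout=timeout,
                                capture_output=True, text=True)
    except (subprocess.TimeoutExpired, FileNotFoundError):
        return False
    return result.returncode == 0
```

A pass/fail signal of this kind is what makes such a metric automatic: it scales to hundreds of repositories without manual inspection of each configured environment.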
arXiv.org Artificial Intelligence
Mar-18-2025