Commit0: Library Generation from Scratch

Wenting Zhao, Nan Jiang, Celine Lee, Justin T. Chiu, Claire Cardie, Matthias Gallé, Alexander M. Rush

arXiv.org Artificial Intelligence 

Commit0 is a benchmark that challenges AI agents to generate software libraries from scratch. Agents are provided with a specification document outlining the library's API as well as a suite of interactive unit tests, with the goal of producing an implementation of this API accordingly. The implementation is validated by running these unit tests. Our experiments demonstrate that while current agents can pass some unit tests, none can yet reproduce a full library. Results also show that interactive feedback is quite useful for models to generate code that passes more unit tests, validating a benchmark design that facilitates its use.

AI agents have been improving rapidly in ability, particularly in domains such as problem solving, math, and coding. Tasks related to software development have been an especially promising area due to both their clarity of evaluation and their economic value. This has motivated the release of several coding benchmarks in recent years (Hendrycks et al., 2021a; Chen et al., 2021; Zhuo et al., 2024). A notable example is SWE-bench (Jimenez et al., 2024), which assesses the ability of agents to generate patches that resolve real-world GitHub issues. While critical, these tasks generally remain within the skill set of an experienced software engineer; if LLM systems continue to improve at current rates, they will become completely solvable. We are interested in benchmarks that lie beyond the frontier of both expert human ability and current model ability.
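To make the evaluation protocol concrete, the following is a minimal sketch of an evaluate-and-revise loop of the kind described above: the agent's generated repository is checked by running its unit-test suite, and the raw failure output is fed back so the agent can revise the implementation. This is an illustration only, not the Commit0 harness; the `agent` object, its `revise` method, the `run_unit_tests` helper, and the assumed `tests/` directory layout are all hypothetical.

```python
# Illustrative sketch (not the official Commit0 harness): run a generated
# library's unit tests, and feed failures back to the agent interactively.
import subprocess
from pathlib import Path


def run_unit_tests(repo_dir: Path) -> tuple[bool, str]:
    """Run the repository's test suite and return (passed, captured output)."""
    result = subprocess.run(
        ["python", "-m", "pytest", "-q", str(repo_dir / "tests")],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0, result.stdout + result.stderr


def interactive_generation(agent, repo_dir: Path, max_rounds: int = 3) -> bool:
    """Let a (hypothetical) agent revise its implementation using test feedback."""
    for _ in range(max_rounds):
        passed, feedback = run_unit_tests(repo_dir)
        if passed:
            return True
        # Hand the raw test failures back to the agent as interactive feedback.
        agent.revise(repo_dir, feedback)
    return run_unit_tests(repo_dir)[0]
```

In this sketch the pass/fail signal doubles as the evaluation metric and as the interactive feedback channel, mirroring the paper's observation that access to test output helps models pass more unit tests.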