CodeDistiller: Automatically Generating Code Libraries for Scientific Coding Agents
Jansen, Peter; Hassan, Samiah; Narasimha, Pragnya
arXiv.org Artificial Intelligence
Automated Scientific Discovery (ASD) systems can help automatically generate and run code-based experiments, but their capabilities are limited by the code they can reliably generate from parametric knowledge alone. As a result, current systems either mutate a small number of manually crafted experiment examples or operate solely from parametric knowledge, which limits both the quality and the reach of their experiments. We introduce CodeDistiller, a system that automatically distills large collections of scientific GitHub repositories into a vetted library of working domain-specific code examples, allowing ASD agents to expand their capabilities without manual effort. Using a combination of automatic and domain-expert evaluation on 250 materials science repositories, we find that the best model is capable of producing functional examples for 74% of repositories. Our downstream evaluation shows that an ASD agent augmented with a CodeDistiller-generated library produces more accurate, complete, and scientifically sound experiments than an agent with only general materials-science code examples.
Dec-2-2025