python
SWE-smith: Scaling Data for Software Engineering Agents
Despite recent progress in Language Models (LMs) for software engineering, collecting training data remains a significant pain point. Existing datasets are small, with at most 1,000s of training instances from 11 or fewer GitHub repositories. The procedures to curate such datasets are often complex, necessitating hundreds of hours of human labor; companion execution environments also take up several terabytes of storage, severely limiting their scalability and usability. To address this pain point, we introduce SWE-smith, a novel pipeline for generating software engineering training data at scale. Given any Python codebase, SWE-smith constructs a corresponding execution environment, then automatically synthesizes 100s to 1,000s of task instances that break existing test(s) in the codebase. Using SWE-smith, we create a dataset of 50k instances sourced from 128 GitHub repositories, an order of magnitude larger than all previous works. We train SWE-agent-LM-32B, achieving 40.2% Pass@1 resolve rate on the SWE-bench Verified benchmark, state of the art among open source models. We open source SWE-smith (collection procedure, task instances, trajectories, models) to lower the barrier of entry for research in LM systems for automated software engineering. All assets are available at https://swesmith.com.
Multi-SWE-bench: AMultilingual Benchmark for Issue Resolving
The task of issue resolving aims to modify a codebase to generate a patch that addresses a given issue. However, most existing benchmarks focus almost exclusively on Python, making them insufficient for evaluating Large Language Models (LLMs) across different programming languages. To bridge this gap, we introduce a multilingual issue-resolving benchmark, called Multi-SWE-bench, covering 8 widely used programming languages: Python, Java, TypeScript, JavaScript, Go, Rust, C, and C++. In particular, this benchmark includes a total of 2,132 highquality instances, carefully curated by 68 expert annotators, ensuring a reliable and accurate evaluation of LLMs on the issue-resolving task. Based on humanannotated results, the issues are further classified into three difficulty levels. We evaluate a series of state-of-the-art models on Multi-SWE-bench, utilizing both procedural and agent-based frameworks for issue resolving. Experimental results based on Multi-SWE-bench reveal three key findings: (1) Limited generalization across languages: While existing LLMs perform well on Python issues, their ability to generalize across other languages remains limited; (2) Performance aligned with human-annotated difficulty: LLM-based agents' performance closely aligns with human-assigned difficulty, with resolved rates notably decreasing as issue complexity rises; and (3) Performance drop on cross-file issues: The performance of current methods significantly deteriorates when handling cross-file issues. These findings highlight the limitations of current LLMs and underscore the need for more robust models capable of handling a broader range of programming languages and complex issue scenarios.
Repo2Run: Automated Building Executable Environment for Code Repository at Scale
Scaling up executable code data is significant for improving language models' software engineering capability. The intricate nature of the process makes it labor-intensive, time-consuming, and expert-knowledge-dependent to build a large number of executable code repositories, limiting the scalability of existing work based on running tests. The primary bottleneck lies in the automated building of test environments for different repositories, which is an essential yet underexplored task. To mitigate the gap, we introduce Repo2Run, the first LLM-based agent aiming at automating the building of executable test environments for any repositories at scale. Specifically, given a code repository, Repo2Run iteratively builds the Docker image, runs unit tests based on the feedback of the building, and synthesizes the Dockerfile until the entire pipeline is executed successfully. The resulting Dockerfile can then be used to create Docker container environments for running code and tests. We created a benchmark containing 420 Python repositories with unit tests for evaluation. The results illustrate that Repo2Run achieves an 86.0%
EffiBench-X: A Multi-Language Benchmark for Measuring Efficiency of LLM-Generated Code
Existing code generation benchmarks primarily evaluate functional correctness, with limited attention to code efficiency, and they are often restricted to a single language such as Python. To address this gap, we introduce EffiBench X, the first large scale multi language benchmark specifically designed for robust efficiency evaluation of LLM generated code. EffiBench X supports Python, C++, Java, JavaScript, Ruby, and Go, and comprises competitive programming tasks paired with human expert solutions as efficiency baselines. Evaluating state of the art LLMs on EffiBench X reveals that while models frequently generate functionally correct code, they consistently underperform human experts in efficiency. Even the most efficient LLM generated solutions (e.g., Qwen3 32B) achieve only around 62% of human efficiency on average, with significant language specific variation: models tend to perform better in Python, Ruby, and JavaScript than in Java, C++, and Go (e.g., DeepSeek R1's Python code is markedly more efficient than its Java code). These findings highlight the need for research into optimization oriented methods to improve the efficiency of LLM generated code across diverse languages.
Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving
The task of issue resolving aims to modify a codebase to generate a patch that addresses a given issue. However, most existing benchmarks focus almost exclusively on Python, making them insufficient for evaluating Large Language Models (LLMs) across different programming languages. To bridge this gap, we introduce a multilingual issue-resolving benchmark, called Multi-SWE-bench, covering 8 languages of Python, Java, TypeScript, JavaScript, Go, Rust, C, and C++. In particular, this benchmark includes a total of 2,132 high-quality instances, carefully curated by 68 expert annotators, ensuring a reliable and accurate evaluation of LLMs on the issue-resolving task. Based on human-annotated results, the issues are further classified into three difficulty levels. We evaluate a series of state-of-the-art models on Multi-SWE-bench, utilizing both procedural and agent-based frameworks for issue resolving. Our experiments reveal three key findings: (1) Limited generalization across languages: While existing LLMs perform well on Python issues, their ability to generalize across other languages remains limited; (2) Performance aligned with human-annotated difficulty: LLM-based agents' performance closely aligns with human-assigned difficulty, with resolution rates decreasing as issue complexity rises; and (3) Performance drop on cross-file issues: The performance of current methods significantly deteriorates when handling cross-file issues. These findings highlight the limitations of current LLMs and underscore the need for more robust models capable of handling a broader range of programming languages and complex issue scenarios.
Scientists sacrifice delicious opossums to fight Florida's invasive pythons
Environment Conservation Land Scientists sacrifice delicious opossums to fight Florida's invasive pythons More information Adding us as a Preferred Source in Google by using this link indicates that you would like to see more of our content in Google News results. Tracking them during digestion may help curb the snake population. Breakthroughs, discoveries, and DIY tips sent six days a week. Some of Florida's opossums may soon start dying for a noble cause. A few select marsupials fitted with tracking collars may begin to lead scientists to invasive Burmese pythons () slithering through the Everglades.
Automatic differentiation in ML: Where we are and where we should be going
Bart van Merrienboer, Olivier Breuleux, Arnaud Bergeron, Pascal Lamblin
We review the current state of automatic differentiation (AD) for array programming in machine learning (ML), including the different approaches such as operator overloading (OO) and source transformation (ST) used for AD, graph-based intermediate representations for programs, and source languages. Based on these insights, we introduce a new graph-based intermediate representation (IR) which specifically aims to efficiently support fully-general AD for array programming. Unlike existing dataflow programming representations in ML frameworks, our IR naturally supports function calls, higher-order functions and recursion, making ML models easier to implement. The ability to represent closures allows us to perform AD using ST without a tape, making the resulting derivative (adjoint) program amenable to ahead-of-time optimization using tools from functional language compilers, and enabling higher-order derivatives. Lastly, we introduce a proof of concept compiler toolchain called Myia which uses a subset of Python as a front end.
Tangent: Automatic differentiation using source-code transformation for dynamically typed array programming
Bart van Merrienboer, Dan Moldovan, Alexander Wiltschko
The need to efficiently calculate first-and higher-order derivatives of increasingly complex models expressed in Python has stressed or exceeded the capabilities of available tools. In this work, we explore techniques from the field of automatic differentiation (AD) that can give researchers expressive power, performance and strong usability. These include source-code transformation (SCT), flexible gradient surgery, efficient in-place array operations, and higher-order derivatives. We implement and demonstrate these ideas in the Tangent software library for Python, the first AD framework for a dynamic language that uses SCT.
Tangent: Automatic differentiation using source-code transformation for dynamically typed array programming
The need to efficiently calculate first-and higher-order derivatives of increasingly complex models expressed in Python has stressed or exceeded the capabilities of available tools. In this work, we explore techniques from the field of automatic differentiation (AD) that can give researchers expressive power, performance and strong usability. These include source-code transformation (SCT), flexible gradient surgery, efficient in-place array operations, and higher-order derivatives. We implement and demonstrate these ideas in the Tangent software library for Python, the first AD framework for a dynamic language that uses SCT.
Longest snake ever measured is over 23.5 feet long
Environment Animals Wildlife Endangered Species Longest snake ever measured is over 23.5 feet long Nicknamed the'Baroness,' this python is longer than two great white sharks. The Baroness may be as much as 10 percent longer than initially measured. Breakthroughs, discoveries, and DIY tips sent six days a week. A snake in southwest Indonesia has shattered the Guinness World Record for the longest serpent ever spotted in the wild. Nicknamed "Ibu Baron" (the Baroness), the giant female reticulated python () discovered in late 2025 measures 23-feet-and-8-inches from head to tail--about the same length as a regulation soccer goal.