MLZero: A Multi-Agent System for End-to-end Machine Learning Automation
Fang, Haoyang, Han, Boran, Erickson, Nick, Zhang, Xiyuan, Zhou, Su, Dagar, Anirudh, Zhang, Jiani, Turkmen, Ali Caner, Hu, Cuixiong, Rangwala, Huzefa, Wu, Ying Nian, Wang, Bernie, Karypis, George
Existing AutoML systems have advanced the automation of machine learning (ML); however, they still require substantial manual configuration and expert input, particularly when handling multimodal data. We introduce MLZero, a novel multi-agent framework powered by Large Language Models (LLMs) that enables end-to-end ML automation across diverse data modalities with minimal human intervention. A cognitive perception module is first employed, transforming raw multimodal inputs into perceptual context that effectively guides the subsequent workflow. To address key limitations of LLMs, such as hallucinated code generation and outdated API knowledge, we enhance the iterative code generation process with semantic and episodic memory. MLZero demonstrates superior performance on MLE-Bench Lite, outperforming all competitors in both success rate and solution quality, securing six gold medals. Additionally, when evaluated on our Multimodal AutoML Agent Benchmark, which includes 25 more challenging tasks spanning diverse data modalities, MLZero outperforms the competing methods by a large margin with a success rate of 0.92 (+263.6%) and an average rank of 2.28. Our approach maintains its robust effectiveness even with a compact 8B LLM, outperforming full-size systems from existing solutions.
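The iterative loop the abstract describes, generating code, executing it, and feeding failures back through episodic memory, can be sketched roughly as follows. This is a hypothetical illustration, not MLZero's implementation: the generator is a stub standing in for an LLM call, and all names are invented.

```python
import traceback

def run_candidate(code, env):
    """Execute generated code, returning (success, error_message)."""
    try:
        exec(code, env)
        return True, ""
    except Exception:
        return False, traceback.format_exc(limit=1)

def iterative_codegen(generate, max_iters=5):
    """Generate-execute-refine loop with episodic memory of past failures."""
    episodic_memory = []                      # past attempts and their errors
    for _ in range(max_iters):
        code = generate(episodic_memory)
        ok, err = run_candidate(code, {})
        if ok:
            return code
        episodic_memory.append((code, err))   # feed failures back to the generator
    return None

# Stub generator: first emits buggy code, then "repairs" it once a failure
# is visible in memory (a real system would prompt an LLM with that context).
def stub_generate(memory):
    return "x = 1 + 1" if memory else "x = 1 + undefined_name"

final = iterative_codegen(stub_generate)
print(final)  # "x = 1 + 1"
```

In a real agent, the episodic memory would carry execution logs across attempts while semantic memory would supply up-to-date API documentation; here both are collapsed into the list passed to the generator.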
Bridging the Language Gap: An Empirical Study of Bindings for Open Source Machine Learning Libraries Across Software Package Ecosystems
Open source machine learning (ML) libraries enable developers to integrate advanced ML functionality into their own applications. However, popular ML libraries, such as TensorFlow, are not available natively in all programming languages and software package ecosystems. Hence, developers who wish to use an ML library that is not available in their programming language or ecosystem of choice may need to resort to using a so-called binding library (or binding). Bindings provide support across programming languages and package ecosystems for reusing a host library. For example, the Keras .NET binding provides support for the Keras library in the NuGet (.NET) ecosystem even though the Keras library was written in Python. In this paper, we collect 2,436 cross-ecosystem bindings for 546 ML libraries across 13 software package ecosystems by using an approach called BindFind, which can automatically identify bindings and link them to their host libraries. Furthermore, we conduct an in-depth study of 133 cross-ecosystem bindings and their development for 40 popular open source ML libraries. Our findings reveal that the majority of ML library bindings are maintained by the community, with npm being the most popular ecosystem for these bindings. Our study also indicates that most bindings cover only a limited range of the host library's releases, often experience considerable delays in supporting new releases, and have widespread technical lag. Our findings highlight key factors to consider for developers integrating bindings for ML libraries and open avenues for researchers to further investigate bindings in software package ecosystems.
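A crude, hypothetical version of the name-matching idea behind linking a binding to its host library might look like the sketch below (BindFind's actual approach is more involved; the host list and markers here are invented for illustration):

```python
# Hypothetical heuristic: flag a package as a binding when its name embeds a
# known host library name plus a language/ecosystem marker, as in "Keras.NET".
HOSTS = {"keras", "tensorflow", "xgboost"}
MARKERS = {".net", "-net", "node-", "-js", "rb-", "go-"}

def find_host(package_name):
    """Return the guessed host library for a binding-like package name, else None."""
    name = package_name.lower()
    for host in HOSTS:
        # The remainder after removing the host name should carry an ecosystem marker.
        if host in name and any(m in name.replace(host, "") for m in MARKERS):
            return host
    return None

print(find_host("Keras.NET"))         # "keras"
print(find_host("node-tensorflow"))   # "tensorflow"
print(find_host("keras"))             # None: the host itself, not a binding
```

A heuristic like this produces false positives and misses renamed bindings, which is why linking bindings to hosts at scale is a research problem in its own right.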
A Large-Scale Study of Model Integration in ML-Enabled Software Systems
Sens, Yorick, Knopp, Henriette, Peldszus, Sven, Berger, Thorsten
The rise of machine learning (ML) and its embedding in systems has drastically changed the engineering of software-intensive systems. Traditionally, software engineering focuses on manually created artifacts such as source code and the process of creating them, as well as best practices for integrating them, i.e., software architectures. In contrast, the development of ML artifacts, i.e., ML models, comes from data science and focuses on the ML models and their training data. However, to deliver value to end users, these ML models must be embedded in traditional software, often forming complex topologies. In fact, ML-enabled software can easily incorporate many different ML models. While the challenges and practices of building ML-enabled systems have been studied to some extent, beyond isolated examples, little is known about the characteristics of real-world ML-enabled systems. Properly embedding ML models in systems so that they can be easily maintained or reused is far from trivial. We need to improve our empirical understanding of such systems, which we address by presenting the first large-scale study of real ML-enabled software systems, covering 2,928 open source systems on GitHub. We classified and analyzed them to determine their characteristics, as well as their practices for reusing ML models and related code, and the architecture of these systems. Our findings provide practitioners and researchers with insight into practices for embedding and integrating ML models, bringing data science and software engineering closer together.
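One integration practice a study like this examines is hiding the model behind a plain software interface, so the surrounding system depends on a contract rather than on a specific ML library. A minimal sketch, with all names illustrative and a trivial stand-in backend:

```python
class SentimentModel:
    """Facade around whatever ML backend is plugged in at construction time."""

    def __init__(self, predict_fn):
        self._predict = predict_fn     # swap backends without touching callers

    def score(self, text: str) -> float:
        return float(self._predict(text))

# A trivial stand-in backend; a real system would load a trained model here.
model = SentimentModel(lambda text: 1.0 if "good" in text else 0.0)
print(model.score("a good library"))   # 1.0
print(model.score("meh"))              # 0.0
```

Keeping the model behind such a seam is what makes it replaceable and reusable, which is exactly the maintainability concern the study highlights.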
What Kinds of Contracts Do ML APIs Need?
Khairunnesa, Samantha Syeda, Ahmed, Shibbir, Imtiaz, Sayem Mohammad, Rajan, Hridesh, Leavens, Gary T.
Recent work has shown that Machine Learning (ML) programs are error-prone and called for contracts for ML code. Contracts, as in the design by contract methodology, help document APIs and aid API users in writing correct code. The question is: what kinds of contracts would provide the most help to API users? We are especially interested in what kinds of contracts help API users catch errors at earlier stages in the ML pipeline. We describe an empirical study of posts on Stack Overflow of the four most often-discussed ML libraries: TensorFlow, Scikit-learn, Keras, and PyTorch. For these libraries, our study extracted 413 informal (English) API specifications. We used these specifications to understand the following questions. What are the root causes and effects behind ML contract violations? Are there common patterns of ML contract violations? When does understanding ML contracts require an advanced level of ML software expertise? Could checking contracts at the API level help detect the violations in early ML pipeline stages? Our key findings are that the most commonly needed contracts for ML APIs are either checking constraints on single arguments of an API or on the order of API calls. The software engineering community could employ existing contract mining approaches to mine these contracts to promote an increased understanding of ML APIs. We also noted a need to combine behavioral and temporal contract mining approaches. We report on categories of required ML contracts, which may help designers of contract languages.
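The two most common contract kinds the study identifies, constraints on a single argument and constraints on the order of API calls, can be illustrated with a hypothetical estimator. This is a sketch of the idea, not the paper's tooling:

```python
def requires(check, message):
    """Single-argument precondition, enforced at the API boundary."""
    def wrap(fn):
        def inner(self, arg):
            if not check(arg):
                raise ValueError(message)   # fail early, not deep in the pipeline
            return fn(self, arg)
        return inner
    return wrap

class TinyEstimator:
    def __init__(self):
        self._fitted = False

    @requires(lambda X: len(X) > 0, "X must be non-empty")
    def fit(self, X):
        self._fitted = True
        self._mean = sum(X) / len(X)
        return self

    def predict(self, X):
        # Temporal contract: fit() must be called before predict().
        if not self._fitted:
            raise RuntimeError("call fit() before predict()")
        return [self._mean for _ in X]

est = TinyEstimator()
print(est.fit([1, 2, 3]).predict([0, 0]))  # [2.0, 2.0]
```

Both checks surface the violation at the call site, which is the "catch errors at earlier pipeline stages" benefit the study argues contracts provide.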
JoinBoost: Grow Trees Over Normalized Data Using Only SQL
Huang, Zezhou, Sen, Rathijit, Liu, Jiaxiang, Wu, Eugene
Although dominant for tabular data, ML libraries that train tree models over normalized databases (e.g., LightGBM, XGBoost) require the data to be denormalized as a single table, materialized, and exported. This process is slow, does not scale, and poses security risks. In-DB ML aims to train models within DBMSes to avoid data movement and provide data governance. Rather than modify a DBMS to support In-DB ML, is it possible to offer competitive tree training performance to specialized ML libraries...with only SQL? We present JoinBoost, a Python library that rewrites tree training algorithms over normalized databases into pure SQL. It is portable to any DBMS, offers performance competitive with specialized ML libraries, and scales with the underlying DBMS capabilities. JoinBoost extends prior work from both algorithmic and systems perspectives. Algorithmically, we support factorized gradient boosting, by updating the $Y$ variable to the residual in the non-materialized join result. Although this view update problem is generally ambiguous, we identify addition-to-multiplication preserving, the key property of variance semi-ring to support rmse, the most widely used criterion. System-wise, we identify residual updates as a performance bottleneck. Such overhead can be natively minimized on columnar DBMSes by creating a new column of residual values and adding it as a projection. We validate this with two implementations on DuckDB, with no or minimal modifications to its internals for portability. Our experiment shows that JoinBoost is 3x (1.1x) faster for random forests (gradient boosting) compared to LightGBM, and over an order of magnitude faster than state-of-the-art In-DB ML systems. Further, JoinBoost scales well beyond LightGBM in terms of the number of features, DB size (TPC-DS SF=1000), and join graph complexity (galaxy schemas).
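The core idea, tree training expressed as plain SQL aggregates over a non-materialized join, can be sketched on SQLite (JoinBoost itself targets systems such as DuckDB, and its algorithms are far more general; the schema, data, and split threshold below are invented):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE sales (item_id INT, y REAL);       -- fact table
    CREATE TABLE items (item_id INT, weight REAL);  -- dimension table
    INSERT INTO sales VALUES (1, 10.0), (1, 12.0), (2, 1.0), (2, 3.0);
    INSERT INTO items VALUES (1, 5.0), (2, 1.0);
""")

# Sums and counts per side of the candidate split `weight <= 2`, computed
# entirely inside the DBMS over the join -- no table is ever materialized
# or exported to Python.
left_sum, left_cnt, total_sum, total_cnt = con.execute("""
    SELECT SUM(CASE WHEN i.weight <= 2 THEN s.y ELSE 0 END),
           SUM(CASE WHEN i.weight <= 2 THEN 1 ELSE 0 END),
           SUM(s.y), COUNT(*)
    FROM sales s JOIN items i ON s.item_id = i.item_id
""").fetchone()

right_sum, right_cnt = total_sum - left_sum, total_cnt - left_cnt
left_pred, right_pred = left_sum / left_cnt, right_sum / right_cnt
print(left_pred, right_pred)  # 2.0 11.0  (leaf predictions of a regression stump)
```

For gradient boosting, JoinBoost keeps residuals as an extra projected column on the fact table, so each iteration only rewrites aggregates rather than moving data; the sketch above shows only a single stump split.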
What Causes Exceptions in Machine Learning Applications? Mining Machine Learning-Related Stack Traces on Stack Overflow
Ghadesi, Amin, Lamothe, Maxime, Li, Heng
Machine learning (ML), including deep learning, has recently gained tremendous popularity in a wide range of applications. However, like traditional software, ML applications are not immune to the bugs that result from programming errors. Explicit programming errors usually manifest through error messages and stack traces. These stack traces describe the chain of function calls that lead to an anomalous situation, or exception. Indeed, these exceptions may cross the entire software stack (including applications and libraries). Thus, studying the patterns in stack traces can help practitioners and researchers understand the causes of exceptions in ML applications and the challenges faced by ML developers. To that end, we mine Stack Overflow (SO) and study 11,449 stack traces related to seven popular Python ML libraries. First, we observe that ML questions that contain stack traces gain more popularity than questions without stack traces; however, they are less likely to get accepted answers. Second, we observe that recurrent patterns exist in ML stack traces, even across different ML libraries, with a small portion of patterns covering many stack traces. Third, we derive five high-level categories and 25 low-level types from the stack trace patterns: most patterns are related to basic Python syntax, model training, parallelization, data transformation, and subprocess invocation. Furthermore, the patterns related to subprocess invocation, external module execution, and remote API call are among the least likely to get accepted answers on SO. Our findings provide insights for researchers, ML library providers, and ML application developers to improve the quality of ML libraries and their applications.
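The first step in this kind of mining, pulling the exception type and the chain of called functions out of a raw trace so traces can later be clustered into recurring patterns, can be sketched as follows (the trace below is invented):

```python
import re

TRACE = '''Traceback (most recent call last):
  File "train.py", line 10, in <module>
    model.fit(X, y)
  File "lib/model.py", line 42, in fit
    check_array(X)
ValueError: Expected 2D array, got 1D array instead
'''

calls = re.findall(r'in (\w+|<\w+>)', TRACE)    # chain of function calls
exc = re.search(r'^(\w+Error): ', TRACE, re.M)  # final exception type
print(calls, exc.group(1))  # ['<module>', 'fit'] ValueError
```

Normalizing traces to a (call chain, exception type) pair is what makes it possible to notice that a small set of patterns covers many traces, as the study reports.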
Top Machine Learning Model Deployment Tools For 2022
Machine learning is nothing new in the technology field. Its capacity to automate pipelines and make business processes more flexible has brought revolutionary change to numerous industries. The machine learning lifecycle governs many aspects of developing trained models and deploying them as APIs in a production environment. Model deployment has proven to be one of the most significant challenges in data science: it differs from building ML models and has a steeper learning curve for beginners. Model deployment means integrating a machine learning model, one that accepts an input and delivers an output, into an existing production environment so that it can drive useful business decisions based on data.
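In code, "deployment" boils down to putting a trained model behind an input/output boundary. A minimal sketch with a stand-in model and JSON in place of a real HTTP endpoint (all names and the decision rule are illustrative):

```python
import json

def model(features):
    """Stand-in for a trained model; a deployment would load a real artifact."""
    return sum(features) / len(features)

def handle_request(body: str) -> str:
    """Deserialize input, run the model, serialize the business decision."""
    features = json.loads(body)["features"]
    score = model(features)
    return json.dumps({"score": score, "approve": score > 0.5})

print(handle_request('{"features": [1, 2, 3]}'))
```

The deployment tools the article surveys wrap exactly this boundary with serving infrastructure: request routing, scaling, versioning, and monitoring.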
Kubeflow vs MLflow - Which MLOps tool should you use
MLOps has quickly become one of the most important components of data science, with the market expected to grow by almost $4 billion by 2025. It is already being leveraged heavily by companies like Amazon, Google, Microsoft, IBM, H2O, Domino, DataRobot, and Grid.ai for pipeline automation, monitoring, lifecycle management, and governance. More and more MLOps tools are being developed to address different parts of the workflow, with two dominating the space: Kubeflow and MLflow. Because both are open source, Kubeflow and MLflow are each chosen by leading tech companies. However, their capabilities and offerings are quite different: for example, Kubeflow is pipeline focused, while MLflow is experimentation based.
The 6 Python Machine Learning Tools Every Data Scientist Should Know About - KDnuggets
Machine learning is rapidly evolving and has become a crucial focus of the software development industry. The infusion of artificial intelligence with machine learning has been a game-changer, and more and more businesses are investing in wide-scale research and implementation in this domain. Machine learning provides enormous advantages: it can quickly identify patterns and trends, and it makes automation a practical reality.
11 Automatic Machine Learning Frameworks in 2022
Machine learning is used in almost every sector, across industries such as agriculture, finance, healthcare, and marketing. AutoML frameworks are a very important part of machine learning: an automatic machine learning framework is an interface that lets developers, machine learning engineers, and data scientists build and deploy machine learning models efficiently. Such a framework can help a business scale its operations, maintain an efficient ML lifecycle, and accelerate ML development, while also allowing non-experts to build models.
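At its core, the loop these frameworks automate is: fit several candidate models, score each on held-out data, and keep the best. A toy sketch of that search loop (the candidate "models" are trivial stand-ins, not real library estimators):

```python
def fit_mean(xs, ys):
    """Baseline: always predict the training mean."""
    m = sum(ys) / len(ys)
    return lambda x: m

def fit_linear(xs, ys):
    """Least-squares line through the origin: w = sum(x*y) / sum(x*x)."""
    w = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
    return lambda x: w * x

def automl(candidates, train, valid):
    """Fit every candidate, return the one with lowest validation MSE."""
    xs, ys = train
    vx, vy = valid
    def mse(model):
        return sum((model(x) - y) ** 2 for x, y in zip(vx, vy)) / len(vx)
    return min((fit(xs, ys) for fit in candidates), key=mse)

best = automl([fit_mean, fit_linear], ([1, 2, 3], [2, 4, 6]), ([4], [8]))
print(best(5))  # the linear candidate wins, so this prints 10.0
```

Real AutoML frameworks add hyperparameter search, cross-validation, ensembling, and time budgets on top of this loop, but the candidate-evaluate-select skeleton is the same.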