Data scientists who work within the R environment can now partake of MLflow, the open source project that Databricks released earlier this year to help manage workflows associated with machine learning development and production lifecycles. In June, Databricks co-founder and CTO Matei Zaharia unveiled MLflow as a way to automate much of the work that data scientists do when building, testing, and deploying machine learning models. The open source software was designed to fill in the gaps between the various tools, frameworks, and processes used when building machine learning systems, including tracking code, packaging models, and deploying them into production. According to Databricks, MLflow allows users to package their code as reproducible runs and to execute and compare hundreds of parallel experiments on any hardware or software platform, including on-premises and cloud-based environments. Assistance with hyperparameter tuning is also provided.
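To make the idea of recording and comparing runs concrete, here is a minimal sketch in plain Python. It is a toy in-memory tracker, not MLflow's API: the `Run` class, the `run_grid` and `best_run` helpers, and the `toy_validation_loss` scoring function are all hypothetical names invented for illustration, and the "training" step is a stand-in formula rather than a real model.

```python
from dataclasses import dataclass, field
from itertools import product


@dataclass
class Run:
    """One experiment run: the parameters used and the metrics produced."""
    params: dict
    metrics: dict = field(default_factory=dict)


def toy_validation_loss(learning_rate: float, depth: int) -> float:
    """Hypothetical stand-in for training a model and scoring it.

    The loss is smallest at learning_rate=0.1 and depth=4.
    """
    return (learning_rate - 0.1) ** 2 + 0.01 * (depth - 4) ** 2


def run_grid(grid: dict) -> list:
    """Record one Run for every combination of hyperparameters in the grid."""
    names = list(grid)
    runs = []
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        run = Run(params=params)
        run.metrics["val_loss"] = toy_validation_loss(**params)
        runs.append(run)
    return runs


def best_run(runs: list, metric: str) -> Run:
    """Compare the recorded runs and return the one with the lowest metric."""
    return min(runs, key=lambda run: run.metrics[metric])


# Sweep a small hyperparameter grid, then pick the best recorded run.
runs = run_grid({"learning_rate": [0.01, 0.1, 1.0], "depth": [2, 4, 8]})
best = best_run(runs, "val_loss")
print(len(runs), best.params)
```

A real tracking tool would additionally persist each run, attach the code version and artifacts, and let many such runs execute in parallel; the sketch only shows the core pattern of logging parameters and metrics per run and then comparing them.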
MLflow, the open source machine learning operations (MLOps) platform created by Databricks, is becoming a Linux Foundation project. The move was announced by Matei Zaharia, co-founder of Databricks and creator of both MLflow and Apache Spark, at the company's Spark AI Summit virtual event today. In a pre-briefing with ZDNet earlier in the week, Zaharia provided an update on MLflow's momentum, details on the new features, and the reasoning for moving management of the open source project from Databricks to the Linux Foundation. Momentum-wise, Zaharia said MLflow has been experiencing a 4x year-over-year growth rate. On the Databricks platform alone (including both the Amazon Web Services and Microsoft Azure offerings of the service), Zaharia said that more than 1M experiment runs are executed on MLflow, and more than 100,000 ML models are added to its model registry, *each week*.
On July 9th, our team hosted a live webinar--Scalable End-to-End Deep Learning using TensorFlow and Databricks--with Brooke Wenig, Data Science Solutions Consultant at Databricks, and Sid Murching, Software Engineer at Databricks. In this webinar, we walked you through how to use TensorFlow and Horovod (an open-source library from Uber to simplify distributed model training) on the Databricks Unified Analytics Platform to build a more effective recommendation system at scale. If you missed the webinar, you can view it now as well as download the slides here. If you'd like free access to the Databricks Unified Analytics Platform and want to try our notebooks on it, you can access a free trial here. Toward the end, we held a Q&A, and below are all the questions and their answers.
Microsoft has been serious about helping data scientists track and manage their machine learning experiments for some time now. For example, the company's Azure Machine Learning (Azure ML) cloud service has supported the logging of experiments, including iterative runs with varying algorithms, hyperparameter values, or both. While Azure ML has had its own framework for such experiment monitoring and tracking, at last year's Spark AI Summit, its partner Databricks launched the open source MLflow project for handling similar tasks. MLflow is designed to work from almost any environment, including the command line, notebooks and more, and its popularity has grown impressively over the last year, ostensibly as a result of that open orientation. Microsoft and Databricks are close partners, and MLflow is natively supported in Azure Databricks.
Databricks, the company behind the commercial development of Apache Spark, is placing its machine learning lifecycle project MLflow under the stewardship of the Linux Foundation. MLflow provides a programmatic way to deal with all the pieces of a machine learning project through all its phases -- construction, training, fine-tuning, deployment, management, and revision. It tracks and manages the datasets, model instances, model parameters, and algorithms used in machine learning projects, so they can be versioned, stored in a central repository, and repackaged easily for reuse by other data scientists. MLflow's source is already available under the Apache 2.0 license, so this isn't about open sourcing a previously proprietary project. Projects for managing entire machine learning pipelines have taken shape over the past couple of years, providing single overarching tools for governing what is typically a sprawling and complex process involving multiple moving parts.