Prescient are the entrepreneurs who predicted data would become the new oil, like Ali Ghodsi, Andy Konwinski, Ion Stoica, Matei Zaharia, Patrick Wendell, Reynold Xin, and Scott Shenker. They're the cofounders of Databricks, a San Francisco-based company that provides a suite of enterprise-focused scalable data science and data engineering tools. Since 2013, the year Databricks opened for business, it's had no trouble attracting customers. But this week, the company's uninterrupted march toward market domination kicked into high gear. Databricks this morning announced that it's closed a $400 million Series F fundraising round led by Andreessen Horowitz with participation from Microsoft, Alkeon Capital Management, BlackRock, Coatue Management, Dragoneer Investment Group, Geodesic, Green Bay Ventures, New Enterprise Associates, T. Rowe Price, and Tiger Global Management.
Databricks, a leader in unified analytics founded by the original creators of Apache Spark, and RStudio today announced a new release of MLflow, an open source multi-cloud framework for the machine learning lifecycle, now with R integration. This new integration adds to features that have already been released, making MLflow the most comprehensive open source machine learning platform, with support for multiple programming languages, integrations with popular machine learning libraries, and support for multiple clouds. Before MLflow, the industry did not have a standard process or end-to-end infrastructure to develop and productionize machine learning applications in a simple and consistent way. With MLflow, organizations can package their code as reproducible runs, execute and compare hundreds of parallel experiments, and leverage any hardware or software platform for training, tuning, hyperparameter search, and more. Additionally, organizations can deploy and manage models in production on a variety of clouds and serving platforms.
Data scientists who work within the R environment can now use MLflow, the open source project that Databricks released earlier this year to help manage workflows associated with machine learning development and production lifecycles. In June, Databricks co-founder and CTO Matei Zaharia unveiled MLflow as a way to automate much of the work that data scientists do when building, testing, and deploying machine learning models. The open source software was designed to fill in the gaps between the various tools, frameworks, and processes used when building machine learning systems, including tracking code, packaging models, and deploying them into production. According to Databricks, MLflow allows users to package their code as reproducible runs and to execute and compare hundreds of parallel experiments on any hardware or software platform, including on-premises and cloud-based environments. Assistance with hyperparameter tuning is also provided.
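To make the idea of recording and comparing experiment runs concrete, here is a minimal sketch in plain Python. It is not MLflow's API: the `Run` class, `train_and_score`, `run_grid`, and `best_run` are hypothetical names invented for illustration, and the scoring function is a made-up stand-in for real model training. It only shows the general pattern of logging parameters and a metric per run, then picking the best run.

```python
import itertools
from dataclasses import dataclass, field


@dataclass
class Run:
    """One recorded experiment: the parameters used and the metrics produced."""
    params: dict
    metrics: dict = field(default_factory=dict)


def train_and_score(learning_rate: float, depth: int) -> float:
    """Hypothetical stand-in for training a model.

    Returns a made-up validation error that is lowest at
    learning_rate=0.1 and depth=4.
    """
    return abs(learning_rate - 0.1) + abs(depth - 4) * 0.05


def run_grid(grid: dict) -> list:
    """Try every parameter combination in the grid and record each as a Run."""
    names = sorted(grid)
    runs = []
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        run = Run(params=params)
        run.metrics["val_error"] = train_and_score(**params)
        runs.append(run)
    return runs


def best_run(runs: list, metric: str = "val_error") -> Run:
    """Compare recorded runs and return the one with the lowest metric value."""
    return min(runs, key=lambda run: run.metrics[metric])


if __name__ == "__main__":
    grid = {"learning_rate": [0.01, 0.1, 0.5], "depth": [2, 4, 8]}
    runs = run_grid(grid)
    best = best_run(runs)
    print(len(runs), "runs recorded; best parameters:", best.params)
```

A real tracking tool would also persist each run along with the code version and artifacts, so that results can be reproduced and shared across a team; this sketch keeps everything in memory.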
It shouldn't be surprising given the media spotlight on artificial intelligence, but AI will be all over the keynote and session schedule for this year's Spark Summit. The irony, of course, is that while Spark has become known as a workhorse for data engineering workloads, its original claim to fame was that it put machine learning on the same engine as SQL, streaming, and graph. But Spark has also had its share of impedance mismatch issues, such as making R and Python programs first-class citizens, or adapting to more compute-intensive processing of AI models. Of course, that hasn't stopped adventurous souls from breaking new ground. Hold those thoughts for a moment.
Databricks today unveiled MLflow, a new open source project that aims to provide some standardization to the complex processes that data scientists oversee during the course of building, testing, and deploying machine learning models. "Everybody who has done machine learning knows that the machine learning development lifecycle is very complex," Apache Spark creator and Databricks CTO Matei Zaharia said during his keynote address at Databricks' Spark and AI Summit in San Francisco. "There are a lot of issues that come up that you don't have in normal software development lifecycle." The vast volumes of data, together with the abundance of machine learning frameworks, the large scale of production systems, and the distributed nature of data science and engineering teams, combine to create a huge number of variables to control in the machine learning DevOps lifecycle -- and that is even before the tuning. "They have all these tuning parameters that you have to change and explore to get a good model," Zaharia said.