XGBoost is a popular machine learning library for gradient boosting, designed for training ensembles of decision trees, including random forests. For information about installing XGBoost on Databricks Runtime, or installing a custom version on Databricks Runtime ML, see these instructions. You can train XGBoost models on a single machine or in a distributed fashion.
Databricks today unveiled MLflow, a new open source project that aims to bring some standardization to the complex processes that data scientists oversee while building, testing, and deploying machine learning models.
"Everybody who has done machine learning knows that the machine learning development lifecycle is very complex," Apache Spark creator and Databricks CTO Matei Zaharia said during his keynote address at Databricks' Spark and AI Summit in San Francisco. "There are a lot of issues that come up that you don't have in normal software development lifecycle."
The vast volumes of data, the abundance of machine learning frameworks, the large scale of production systems, and the distributed nature of data science and engineering teams combine to create a huge number of variables to control in the machine learning DevOps lifecycle, and that is even before tuning begins. "They have all these tuning parameters that you have to change and explore to get a good model," Zaharia said.