Dask and Pandas and XGBoost: Playing nicely between distributed systems
Editor's note: For an introduction to Dask, consider reading Introducing Dask for Parallel Programming: An Interview with Project Lead Developer. To read more about the most recent release, see Dask Release 0.14.1. XGBoost is a well-loved library for gradient boosted trees, a popular class of machine learning algorithms. This post covers distributing Pandas DataFrames with Dask and then handing them over to distributed XGBoost for training. More generally, it discusses the value of launching multiple distributed systems in the same shared-memory processes and smoothly handing data back and forth between them.