Editor's note: For an introduction to Dask, consider reading Introducing Dask for Parallel Programming: An Interview with Project Lead Developer. To read more about the most recent release, see Dask Release 0.14.1. This post talks about distributing Pandas Dataframes with Dask and then handing them over to distributed XGBoost for training. More generally it discusses the value of launching multiple distributed systems in the same shared-memory processes and smoothly handing data back and forth between them. XGBoost is a well-loved library for a popular class of machine learning algorithms, gradient boosted trees.
Nvidia, together with partners like IBM, HPE, Oracle, Databricks and others, is launching a new open-source platform for data science and machine learning today. Rapids, as the company is calling it, is all about making it easier for large businesses to use the power of GPUs to quickly analyze massive amounts of data and then use that to build machine learning models. "Businesses are increasingly data-driven," Nvidia's VP of Accelerated Computing Ian Buck told me. "They sense the market and the environment and the behavior and operations of their business through the data they've collected. We've just come through a decade of big data and the output of that data is using analytics and AI. But most of it is still using traditional machine learning to recognize complex patterns, detect changes and make predictions that directly impact their bottom line."
We will now have a look at some simple cases for creating arrays using Dask. In this example, I had 11 values in the array and used a chunk size of 5. Dask split the array into three chunks: the first and second blocks have 5 values each, and the third has 1. Dask arrays support most of the NumPy functions. For instance, you can use .sum().
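The chunking described above can be sketched as follows. This is a minimal illustration, not the original code: the actual array values are not shown in the text, so `np.arange(11)` is used as a stand-in, and building the array with `da.from_array` is an assumption about how it was created.

```python
import numpy as np
import dask.array as da

# Stand-in data: 11 values (0 through 10). The original values are not shown.
values = np.arange(11)

# Split the array into chunks of at most 5 elements each.
arr = da.from_array(values, chunks=5)

# Chunk sizes along each axis: two blocks of 5 and one block of 1.
print(arr.chunks)     # ((5, 5, 1),)
print(arr.numblocks)  # (3,)

# Dask is lazy: .sum() only builds a task graph, .compute() runs it.
total = arr.sum().compute()
print(total)          # 55
```

Calling `.sum()` on its own returns another lazy Dask array; the result is only evaluated when `.compute()` is called.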
NEWSBYTE IBM has announced a new partnership with AI and GPU hardware giant Nvidia, bringing the latter's Rapids open source data science toolkit into IBM's data science platform for on-premise, hybrid, and multi-cloud environments. Rapids will bring GPU acceleration capabilities to IBM's offerings, taking advantage of an ecosystem that includes the Web-based big data platform, Anaconda (an open source distribution of the Python and R programming languages for data science and machine learning), Apache Arrow, Pandas, and scikit-learn. Rapids is also supported by open-source contributors, including BlazingDB, Graphistry, NERSC, PyData, INRIA, and Ursa Labs. IBM's Power 9 with PowerAI environment will be among those benefiting from the tie-up. It will use Rapids to expand the options available to data scientists with new open-source machine learning and analytics libraries.
No matter the industry, data science has become a universal toolkit for businesses. Data analytics and machine learning give organizations insights and answers that shape their day-to-day actions and future plans. Being data-driven has become essential to lead any industry. While the world's data doubles each year, CPU computing has hit a brick wall with the end of Moore's law. For this reason, scientific computing and deep learning have turned to NVIDIA GPU acceleration.