Demonstrated experience in PySpark is one of the most desirable competencies that employers are looking for when building data science teams, because it enables these teams to own live data products. While I've previously blogged about PySpark, Parallelization, and UDFs, I wanted to provide a proper overview of this topic as a book chapter. I'm sharing this complete chapter, because I want to encourage the adoption of PySpark as a tool for data scientists. All code examples from this post are available here, and all prerequisites are covered in the sample chapters here. You might want to grab some snacks before diving in! Spark is a general-purpose computing framework that can scale to massive data volumes. It builds upon prior big data tools such as Hadoop and MapReduce, while providing significant improvements in the expressivity of the languages it supports. One of the core components of Spark is resilient distributed datasets (RDD), which enable clusters of machines to perform workloads in a coordinated, and fault-tolerant process. In more recent versions of Spark, the Dataframe API provides an abstraction on top of RDDs that resembles the same data structure in R and Pandas. PySpark is the Python interface to Spark, and it provides an API for working with large-scale datasets in a distributed computing environment. PySpark is an extremely valuable tool for data scientists, because it can streamline the process for translating prototype models into production-grade model workflows. At Zynga, our data science team owns a number of production-grade systems that provide useful signals to our game and marketing teams. By using PySpark, we've been able to reduce the amount of support we need from engineering teams to scale up models from concept to production.
As the release of Spark 2.0 finally came, the machine learning library of Spark has been changed from the mllib to ml. One of the biggest change in the new ml library is the introduction of so-called machine learning pipeline. It provides a high level abstraction of the machine learning flow and greatly simplified the creation of machine learning process. In this tutorial, we will walk through the steps on how to create a machine learning pipeline and also explain what is under the hood in the pipeline. In this tutorial, we will demonstrate the process to create a pipeline in Spark to predict airline flight delay.
Let's go through an example of Credit Risk for Bank Loans: Decision trees create a model that predicts the class or label based on several input features. Decision trees work by evaluating an expression containing a feature at every node and selecting a branch to the next node based on the answer. A possible decision tree for predicting Credit Risk is shown below. The feature questions are the nodes, and the answers "yes" or "no" are the branches in the tree to the child nodes. Our data is from the German Credit Data Set which classifies people described by a set of attributes as good or bad credit risks.