PySpark for Data Science Workflows
Demonstrated experience in PySpark is one of the most desirable competencies that employers are looking for when building data science teams, because it enables these teams to own live data products. While I've previously blogged about PySpark, Parallelization, and UDFs, I wanted to provide a proper overview of this topic as a book chapter. I'm sharing this complete chapter, because I want to encourage the adoption of PySpark as a tool for data scientists. All code examples from this post are available here, and all prerequisites are covered in the sample chapters here. You might want to grab some snacks before diving in! Spark is a general-purpose computing framework that can scale to massive data volumes. It builds upon prior big data tools such as Hadoop and MapReduce, while providing significant improvements in the expressivity of the languages it supports. One of the core components of Spark is resilient distributed datasets (RDD), which enable clusters of machines to perform workloads in a coordinated, and fault-tolerant process. In more recent versions of Spark, the Dataframe API provides an abstraction on top of RDDs that resembles the same data structure in R and Pandas. PySpark is the Python interface to Spark, and it provides an API for working with large-scale datasets in a distributed computing environment. PySpark is an extremely valuable tool for data scientists, because it can streamline the process for translating prototype models into production-grade model workflows. At Zynga, our data science team owns a number of production-grade systems that provide useful signals to our game and marketing teams. By using PySpark, we've been able to reduce the amount of support we need from engineering teams to scale up models from concept to production.
Dec-9-2019, 06:03:51 GMT