PySpark for Data Science Workflows

Dec-9-2019, 06:03:51 GMT–#artificialintelligence

Demonstrated experience in PySpark is one of the most desirable competencies that employers look for when building data science teams, because it enables these teams to own live data products. While I've previously blogged about PySpark, Parallelization, and UDFs, I wanted to provide a proper overview of this topic as a book chapter. I'm sharing this complete chapter because I want to encourage the adoption of PySpark as a tool for data scientists. All code examples from this post are available here, and all prerequisites are covered in the sample chapters here. You might want to grab some snacks before diving in!

Spark is a general-purpose computing framework that can scale to massive data volumes. It builds upon prior big data tools such as Hadoop and MapReduce, while providing significant improvements in the expressivity of the languages it supports. One of the core components of Spark is the resilient distributed dataset (RDD), which enables clusters of machines to perform workloads in a coordinated and fault-tolerant manner. In more recent versions of Spark, the DataFrame API provides an abstraction on top of RDDs that resembles the data frame structures in R and Pandas.

PySpark is the Python interface to Spark, and it provides an API for working with large-scale datasets in a distributed computing environment. PySpark is an extremely valuable tool for data scientists, because it can streamline the process of translating prototype models into production-grade model workflows. At Zynga, our data science team owns a number of production-grade systems that provide useful signals to our game and marketing teams. By using PySpark, we've been able to reduce the amount of support we need from engineering teams to scale up models from concept to production.


