Prediction at Scale with scikit-learn and PySpark Pandas UDFs

#artificialintelligence 

A common predictive modeling scenario, at least at Civis, is having a small or medium amount of labeled data to estimate a model from (e.g., 10,000 records), but a much larger unlabeled dataset to make predictions about. In this scenario, one might want to train a model on a laptop or single server with scikit-learn for ease of use and flexibility, but then apply that model to the large unlabeled dataset more quickly by distributing the computation with PySpark. Using PySpark for distributed prediction might also make sense if your ETL task is already implemented with (or would benefit from being implemented with) PySpark, which is wonderful for data transformations and ETL. PySpark has functionality to pickle python objects, including functions, and have them applied to data that is distributed across processes, machines, etc. Also, it has a pandas-like syntax but separates the definition of the computation from its execution, similar to TensorFlow.

Duplicate Docs Excel Report

Title
None found

Similar Docs  Excel Report  more

TitleSimilaritySource
None found