Prediction at Scale with scikit-learn and PySpark Pandas UDFs

Oct-28-2018, 09:25:21 GMT–#artificialintelligence

A common predictive modeling scenario, at least at Civis, is having a small or medium amount of labeled data to estimate a model from (e.g., 10,000 records), but a much larger unlabeled dataset to make predictions about. In this scenario, one might want to train a model on a laptop or single server with scikit-learn for ease of use and flexibility, but then apply that model to the large unlabeled dataset more quickly by distributing the computation with PySpark. Using PySpark for distributed prediction might also make sense if your ETL task is already implemented with (or would benefit from being implemented with) PySpark, which is wonderful for data transformations and ETL. PySpark has functionality to pickle python objects, including functions, and have them applied to data that is distributed across processes, machines, etc. Also, it has a pandas-like syntax but separates the definition of the computation from its execution, similar to TensorFlow.

artificial intelligence, machine learning, udf, (15 more...)

#artificialintelligence

Oct-28-2018, 09:25:21 GMT

News Web Page

Add feedback

Technology:
- Information Technology > Artificial Intelligence > Machine Learning (1.00)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found