Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimiza…

#artificialintelligence 

Contributes Intel Apache Spark* Spark Users
*Other names and brands may be claimed as the property of others

3. Sparse data is almost everywhere
• Data Source:
  – Movie ratings
  – Purchase history
• Feature engineering:
  – NLP: CountVectorizer, HashingTF
  – Categorical: OneHotEncoder
  – Image, video
[Figure: Purchase History; axes: customers, products]

4. Sparse data support in MLlib
new DenseVector( values Array(1.0,
• Supporting sparse data since v1.0
  – Load / Save, Sparse Vector, LIBSVM
  – Sparse vector support is one of the primary review focuses.

KMeans
• Pick initial cluster centers
  – Random
  – KMeans||
• Iterative training
  – Point clustering: find the nearest center for each point
  – Re-compute the center of each cluster (average)

MLlib iteration
2. Compute a sum table for each partition of data
    val sum = new Array[Vector](k)
    for (each point in the partition) {
      val bestCenter = traverse()
      sum(bestCenter) += point
    }
[Diagram: Training dataset; Executor 1, Executor 2, Executor 3; Sums: 16G; Centers: 16G]

14. Analysis: Data
• Are the cluster centers dense?
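The per-partition sum table in the MLlib iteration step can be illustrated with a minimal plain-Scala sketch. It is not MLlib's implementation: `SparsePoint`, `SumTableSketch`, `nearestCenter` and `partitionSums` are hypothetical names introduced here, the centers and sums are assumed to be stored as dense arrays, and each point's indices are assumed to be sorted in ascending order. The sketch only shows that adding a sparse point to a dense sum touches the point's non-zero entries.

```scala
// Hypothetical sparse point type (not MLlib's SparseVector): only non-zero
// entries are stored. Indices are assumed to be sorted in ascending order.
final case class SparsePoint(size: Int, indices: Array[Int], values: Array[Double])

object SumTableSketch {
  // Squared Euclidean distance between a dense center and a sparse point.
  def squaredDistance(center: Array[Double], p: SparsePoint): Double = {
    var dist = 0.0
    var j = 0 // cursor into the point's non-zero entries
    var i = 0
    while (i < center.length) {
      val v =
        if (j < p.indices.length && p.indices(j) == i) { val x = p.values(j); j += 1; x }
        else 0.0
      val d = center(i) - v
      dist += d * d
      i += 1
    }
    dist
  }

  // Index of the nearest center (plays the role of traverse() in the slide).
  def nearestCenter(centers: Array[Array[Double]], p: SparsePoint): Int =
    centers.indices.minBy(c => squaredDistance(centers(c), p))

  // Builds the sum table (and per-cluster counts) for one partition of points.
  def partitionSums(
      points: Seq[SparsePoint],
      centers: Array[Array[Double]]): (Array[Array[Double]], Array[Long]) = {
    val dim = centers.head.length
    val sums = Array.fill(centers.length)(new Array[Double](dim))
    val counts = new Array[Long](centers.length)
    for (p <- points) {
      val best = nearestCenter(centers, p)
      // Only the non-zero entries of the point are added to the dense sum.
      var j = 0
      while (j < p.indices.length) {
        sums(best)(p.indices(j)) += p.values(j)
        j += 1
      }
      counts(best) += 1
    }
    (sums, counts)
  }
}
```

Dividing each row of the sum table by its count gives the averaged center described under "Iterative training". In a distributed run, the per-partition tables would first be combined across executors before that division.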
