 mllib


Enriching the Machine Learning Workloads in BigBench

arXiv.org Artificial Intelligence

In the era of Big Data and the growing support for Machine Learning, Deep Learning and Artificial Intelligence algorithms in current software systems, there is an urgent need for standardized application benchmarks that stress test and evaluate these new technologies. Relying on the standardized BigBench (TPCx-BB) benchmark, this work enriches the improved BigBench V2 with three new workloads and expands the coverage of machine learning algorithms. Our workloads utilize multiple algorithms and compare different implementations of the same algorithm across several popular libraries like MLlib, SystemML, Scikit-learn and Pandas, demonstrating the relevance and usability of our benchmark extension.


Movie Recommendations with Spark Collaborative Filtering - KDnuggets

#artificialintelligence

Collaborative filtering (CF) based on the alternating least squares (ALS) technique is another algorithm used to generate recommendations. It produces automatic predictions (filtering) about the interests of a user by collecting preferences from many other users (collaborating). The underlying assumption of the CF approach is that if person A has the same opinion as person B on one issue, A is more likely to share B's opinion on a different issue than a randomly chosen person would be. This algorithm gained a lot of traction in the data science community after it was used by the winning team of the Netflix Prize. The CF algorithm has also been implemented in Spark MLlib with the aim of fast execution on very large datasets.
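
To make the alternating least squares idea concrete, here is a minimal pure-Python sketch restricted to a single latent factor per user and item. It is an illustration under stated assumptions, not Spark MLlib's implementation: the function names `als_rank1` and `predict`, the regularization value and the toy ratings are all hypothetical. With one factor, each least-squares subproblem has a closed-form scalar solution, so no linear algebra library is needed.

```python
def als_rank1(ratings, n_users, n_items, reg=0.01, iters=50):
    """Rank-1 ALS sketch. `ratings` maps (user, item) -> observed rating."""
    u = [1.0] * n_users  # one latent factor per user (hypothetical setup)
    v = [1.0] * n_items  # one latent factor per item
    for _ in range(iters):
        # Hold item factors fixed; each user's regularized least-squares
        # problem then has a closed-form scalar solution.
        for i in range(n_users):
            seen = [(j, r) for (a, j), r in ratings.items() if a == i]
            num = sum(r * v[j] for j, r in seen)
            den = reg + sum(v[j] ** 2 for j, _r in seen)
            u[i] = num / den
        # Hold user factors fixed and update every item the same way.
        for j in range(n_items):
            seen = [(i, r) for (i, b), r in ratings.items() if b == j]
            num = sum(r * u[i] for i, r in seen)
            den = reg + sum(u[i] ** 2 for i, _r in seen)
            v[j] = num / den
    return u, v


def predict(u, v, user, item):
    """Predicted rating is the product of the user and item factors."""
    return u[user] * v[item]


# Toy data: user 1 rates items exactly twice as high as user 0,
# and user 1's rating for item 2 is missing.
toy_ratings = {
    (0, 0): 1.0, (0, 1): 2.0, (0, 2): 3.0,
    (1, 0): 2.0, (1, 1): 4.0,
}
user_factors, item_factors = als_rank1(toy_ratings, n_users=2, n_items=3)
missing = predict(user_factors, item_factors, user=1, item=2)
```

Because user 1 agrees with user 0 up to a constant scale on the items both rated, the sketch fills in the missing rating from user 0's opinion of item 2, which is the "A shares B's opinion" assumption in miniature. A production system would use several latent factors and a distributed solver.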



Spark MLlib on AWS Glue

#artificialintelligence

AWS pushes SageMaker as its machine learning platform. However, Spark's MLlib is a comprehensive library that runs distributed ML natively on AWS Glue -- and provides a viable alternative to AWS's primary ML platform. One of the big benefits of SageMaker is that it easily supports experimentation via its Jupyter Notebooks. But operationalising your SageMaker ML can be difficult, particularly if you need to include ETL processing at the start of your pipeline. In this situation, Apache Spark's MLlib running on AWS Glue can be a good option -- by its very nature, it is immediately operationalised, integrated with ETL pre-processing and ready to be used in production for an end-to-end machine learning pipeline.


Machine learning with PySpark

#artificialintelligence

In this article, I am going to share some of the machine learning work I have done in Spark using PySpark. Machine learning is one of the hot applications of artificial intelligence (AI). AI is a much bigger ecosystem with many amazing applications. Machine learning, in simple terms, is the ability of a machine to learn automatically and improve from experience without being explicitly programmed. The learning process starts with observation of data, then finds patterns in the data and makes better decisions based on what it has learned.


14 open source tools to make the most of machine learning

#artificialintelligence

Spam filtering, face recognition, recommendation engines -- when you have a large data set on which you'd like to perform predictive analysis or pattern recognition, machine learning is the way to go. The proliferation of free open source software has made machine learning easier to implement both on single machines and at scale, and in most popular programming languages. These open source tools include libraries for the likes of Python, R, C, Java, Scala, Clojure, JavaScript, and Go. Apache Mahout provides a way to build environments for hosting machine learning applications that can be scaled quickly and efficiently to meet demand. Mahout works mainly with another well-known Apache project, Spark, and was originally devised to work with Hadoop for the sake of running distributed applications, but has been extended to work with other distributed back ends like Flink and H2O. Mahout uses a domain specific language in Scala.


Machine_Learning_with_Spark

#artificialintelligence

This is a comprehensive tutorial on using the Spark distributed machine learning framework to build a scalable ML data pipeline. I will cover the basic machine learning algorithms implemented in the Spark MLlib library, and throughout this tutorial I will use PySpark in a Python environment. Machine learning is becoming popular for solving real-world problems in almost every business domain. It helps solve problems using data that is often unstructured, noisy, and huge in size. With the increase in data sizes and the variety of data sources, solving machine learning problems using standard techniques poses a big challenge.


Distributing the Singular Value Decomposition with Apache Spark

#artificialintelligence

The Singular Value Decomposition (SVD) is one of the cornerstones of linear algebra and has widespread application in many real-world modeling situations. Problems such as recommender systems, linear systems, least squares, and many others can be solved using the SVD. It is frequently used in statistics where it is related to principal component analysis (PCA) and to correspondence analysis, and in signal processing and pattern recognition. Another usage is latent semantic indexing in natural language processing. Decades ago, before the rise of distributed computing, computer scientists developed the single-core ARPACK package for computing the eigenvalue decomposition of a matrix.
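
The link between the SVD and an eigenvalue decomposition can be sketched briefly: the singular values of a matrix A are the square roots of the eigenvalues of AᵀA. The pure-Python example below uses plain power iteration on AᵀA to recover only the largest singular triplet. This is an assumption-laden illustration: power iteration is a simpler stand-in, not the method ARPACK or Spark uses, and the function name `top_singular_triplet` is hypothetical. It also assumes A is nonzero and its largest singular value is unique.

```python
import math


def top_singular_triplet(A, iters=200):
    """Power-iteration sketch for the largest singular triplet of A.

    A is a list of rows. Returns (sigma, u, v) with A v ~= sigma * u.
    """
    n_rows, n_cols = len(A), len(A[0])
    v = [1.0 / math.sqrt(n_cols)] * n_cols  # arbitrary unit starting vector
    for _ in range(iters):
        # Apply A^T A to v using two matrix-vector products, so the
        # Gram matrix A^T A never has to be formed explicitly.
        Av = [sum(a * x for a, x in zip(row, v)) for row in A]
        w = [sum(A[i][j] * Av[i] for i in range(n_rows)) for j in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # sigma is the length of A v; u is the normalized image of v.
    Av = [sum(a * x for a, x in zip(row, v)) for row in A]
    sigma = math.sqrt(sum(x * x for x in Av))
    u = [x / sigma for x in Av]
    return sigma, u, v


sigma, u, v = top_singular_triplet([[3.0, 0.0], [0.0, 4.0]])
```

The only operations on A are matrix-vector products, which is the property that makes this family of iterative methods a natural fit for a distributed matrix.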


The 3 Biggest Mistakes on Learning Data Science

#artificialintelligence

I've discussed parts of what I'm going to mention here in other articles, but now I want to give a few pointers on what data science is not and how not to learn it. So let's start with the basics. Data science is not just knowing some programming languages, math and statistics, and having "domain knowledge". We've created a new field, or something like that. There are a lot of things to say and study in this field.