Abstract: Gaussian processes (GPs) are flexible models with state-of-the-art performance on many impactful applications. However, computational constraints with standard inference procedures have limited exact GPs to problems with fewer than about ten thousand training points, necessitating approximations for larger datasets. In this paper, we develop a scalable approach for exact GPs that leverages multi-GPU parallelization and methods like linear conjugate gradients, accessing the kernel matrix only through matrix multiplication. By partitioning and distributing kernel matrix multiplies, we demonstrate that an exact GP can be trained on over a million points in 3 days using 8 GPUs and can compute predictive means and variances in under a second using 1 GPU at test time. Moreover, we perform the first-ever comparison of exact GPs against state-of-the-art scalable approximations on large-scale regression datasets with $10^4$ to $10^6$ data points, showing dramatic performance improvements.
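The core idea, accessing the kernel matrix only through matrix-vector products, can be sketched with a plain conjugate gradients solver. Everything below (the toy RBF kernel, the sizes, the noise level, and the tolerances) is illustrative and not the paper's implementation:

```python
import numpy as np

def conjugate_gradients(matvec, b, tol=1e-8, max_iter=200):
    """Solve A x = b for symmetric positive-definite A,
    touching A only through the matvec callable."""
    x = np.zeros_like(b)
    r = b - matvec(x)          # initial residual
    p = r.copy()               # initial search direction
    rs = r @ r
    for _ in range(max_iter):
        Ap = matvec(p)
        step = rs / (p @ Ap)
        x += step * p
        r -= step * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

# Toy RBF kernel matrix standing in for the GP kernel (illustrative only).
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 2))
sq = ((X[:, None] - X[None]) ** 2).sum(-1)
K = np.exp(-0.5 * sq) + 0.1 * np.eye(50)   # kernel + noise term
y = rng.standard_normal(50)

# Solve K coef = y using only products with K.
coef = conjugate_gradients(lambda v: K @ v, y)
```

In the paper's setting, the matvec itself is what gets partitioned and distributed across GPUs; here a single dense product stands in for it.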
Google Assistant can draw on voice command, as demonstrated at the 2018 Google I/O conference, with the help of machine learning techniques. Artificial intelligence systems powered by machine learning have been making headlines with applications as varied as making restaurant reservations by phone, sorting cucumbers, and distinguishing chihuahuas from muffins. Media buzz aside, many fast-growing startups are taking advantage of machine learning (ML) techniques such as neural networks and support vector machines to learn from data, make predictions, improve products, and enhance business decisions. Unfortunately, "machine learning theater" (companies pretending to use the technology to make their products seem more sophisticated and command a higher valuation) is also on the rise. Undeniably, ML is transforming businesses and industries, though some are more likely to benefit than others.
At HomeAway, we use Apache Kafka as the backbone of our streaming architecture. We also like to deploy machine learning models to make real-time predictions on our data streams. Confluent KSQL provides an easy-to-use, interactive SQL interface for stream processing on Kafka. Below we show how to build a model in Python and use it in KSQL to make predictions on a stream of data in Kafka. We use the Predictive Model Markup Language (PMML) so that the model can be trained with the Python library scikit-learn but served for inference in Java-based KSQL.
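As a rough sketch of the training side, here is a scikit-learn pipeline fitted on toy data. The features, labels, and the `sklearn2pmml` export step (shown only in comments) are assumptions for illustration; the original post's actual pipeline may differ:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy stand-in for features extracted from the Kafka stream.
rng = np.random.default_rng(42)
X = rng.standard_normal((200, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # toy binary label

pipeline = Pipeline([("scale", StandardScaler()),
                     ("clf", LogisticRegression())])
pipeline.fit(X, y)

# Export step (requires the sklearn2pmml package and a Java runtime;
# shown as an illustration of the PMML hand-off, not run here):
# from sklearn2pmml import sklearn2pmml, make_pmml_pipeline
# sklearn2pmml(make_pmml_pipeline(pipeline), "model.pmml")

preds = pipeline.predict(X)
```

The resulting PMML file is what a Java-based scorer, such as a KSQL user-defined function, would load to evaluate the model on the stream.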
The seed for this article was planted when Anant was struck by a headline on his Twitter feed: "You don't need ML/AI." He had observed something similar while working through data and analytics requirements for Google Cloud's Apigee team: not that machine learning (ML) or artificial intelligence (AI) is never needed, but that good database queries can frequently accomplish the job, and that when AI is legitimately needed, its role is often to improve database design and operations, not to replace them. The two of us got the chance to compile our thinking a bit more as Anant was preparing for a talk at VLDB 2018, a premier database conference. The slides of his talk are here. In this post, we elaborate on some of our observations on the topic.
Today's post is based on a project I recently did at work. I was excited to implement it and write it up as a blog post, since it gave me a chance to do some data engineering and build something genuinely valuable for my team. Not too long ago, I discovered that we had a relatively large amount of user log data, relating to one of our data products, stored on our systems. Remember that a blockchain is an immutable, sequential chain of records called blocks. Blocks can contain transactions, files, or really any data you like.
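The chain-of-records idea can be sketched with a minimal block structure in which each block stores the hash of its predecessor. This is a simplified illustration (no proof of work, no consensus), and the field names are my own:

```python
import hashlib
import json
import time

def make_block(index, data, previous_hash):
    """Build a block linked to its predecessor by hash."""
    block = {
        "index": index,
        "timestamp": time.time(),
        "data": data,                    # transactions, files, or any payload
        "previous_hash": previous_hash,
    }
    # Hash the block contents deterministically; this hash is what the
    # NEXT block will store, making earlier records tamper-evident.
    payload = json.dumps(
        {k: block[k] for k in ("index", "timestamp", "data", "previous_hash")},
        sort_keys=True,
    ).encode()
    block["hash"] = hashlib.sha256(payload).hexdigest()
    return block

genesis = make_block(0, "genesis", "0")
second = make_block(1, {"user": "alice", "event": "login"}, genesis["hash"])
```

Changing any field of `genesis` would change its hash and break the link stored in `second`, which is what makes the chain immutable in practice.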
Logistic regression was once the most popular machine learning algorithm, but the advent of more accurate classification algorithms such as support vector machines, random forests, and neural networks has led some machine learning engineers to view it as obsolete. Though it may have been overshadowed by more advanced methods, its simplicity makes it an ideal introduction to the study of machine learning. Like many classification algorithms, logistic regression learns a decision boundary between two classes. The goal of training is to place this boundary so that as many examples as possible fall on the correct side, maximizing prediction accuracy. Training benefits from a suitable model specification and well-tuned hyperparameters, but the data themselves play the most significant role in determining prediction accuracy.
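As an illustration of training placing that boundary, here is a minimal logistic regression fitted by gradient descent on toy, linearly separable data. All names, sizes, and hyperparameters below are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic(X, y, lr=0.1, epochs=500):
    """Fit weights and bias by gradient descent on the log-loss."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = sigmoid(X @ w + b)              # predicted probabilities
        w -= lr * (X.T @ (p - y)) / len(y)  # gradient step on weights
        b -= lr * (p - y).mean()            # gradient step on bias
    return w, b

# Toy data with a known linear boundary.
rng = np.random.default_rng(1)
X = rng.standard_normal((300, 2))
y = (X @ np.array([2.0, -1.0]) > 0).astype(float)

w, b = train_logistic(X, y)
acc = ((sigmoid(X @ w + b) > 0.5) == y).mean()
```

Because the toy labels really are linearly separable, gradient descent drives the learned boundary toward the true one and accuracy approaches 1.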
M3 is a deep learning system that infers demographic attributes directly from social media profiles; no further data is needed. This web demo showcases M3 on Twitter profiles, but M3 works on any similar profile data, in 32 languages. To learn more, please see our open-source Python library m3inference or read our Web Conference (WWW) 2019 paper for details. The paper also includes fully interpretable multilevel regression methods that use the inferred demographic attributes to estimate inclusion probabilities and correct for sampling biases on social media platforms. This web demo was created by Scott Hale and Graham McNeill.
In machine learning, classification problems with high-dimensional data are particularly challenging. Sometimes very simple problems become extremely complex because of this "curse of dimensionality". In this article, we will see how accuracy and performance vary across different classifiers. We will also see how, when we don't have the freedom to choose a classifier independently, we can use feature engineering to make a poor classifier perform well. For this article, we will use the "EEG Brainwave Dataset" from Kaggle.
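A sketch of such a comparison might look like the following, using synthetic high-dimensional data in place of the EEG dataset (which is not bundled here); the classifiers, sizes, and seeds are illustrative choices, not the article's actual experiment:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic high-dimensional data: 100 features, only 10 informative.
X, y = make_classification(n_samples=500, n_features=100,
                           n_informative=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Fit each classifier on the same split and record held-out accuracy.
scores = {}
for name, clf in [("logreg", LogisticRegression(max_iter=1000)),
                  ("forest", RandomForestClassifier(random_state=0))]:
    clf.fit(X_tr, y_tr)
    scores[name] = clf.score(X_te, y_te)
```

With many uninformative features, the gap between classifiers (and the payoff from feature engineering that removes noise dimensions) becomes visible in exactly these held-out scores.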
As a teacher of data science (the Data Science for Internet of Things course at the University of Oxford), I am always fascinated by the cross-connections between concepts. To recap, logistic regression is a binary classification method. It can be modelled as a function that takes in any number of inputs and constrains the output to be between 0 and 1. This means we can think of logistic regression as a one-layer neural network. I hope you find this analysis useful as well.
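That "any number of inputs, output constrained to between 0 and 1" function is exactly a single sigmoid neuron: a weighted sum followed by the sigmoid squashing. A minimal sketch (the weights and inputs below are arbitrary):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_neuron(x, w, b):
    """One 'neuron': a weighted sum of the inputs, squashed to (0, 1)."""
    return sigmoid(np.dot(w, x) + b)

# Three inputs in, one probability-like value out.
w = np.array([0.5, -1.2, 3.0])
b = 0.1
out = logistic_neuron(np.array([10.0, -4.0, 2.0]), w, b)
```

Stack many such neurons in layers and you have a neural network; keep just the one, and you have logistic regression.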
But there are other tools that also claim to make machine learning easier and to speed model development, and I wondered how they compare. So this week I am taking a look at Amazon SageMaker (SageMaker) and how it compares to Studio. What I found is that SageMaker takes a significantly different approach to model building than Studio. The vendors of each tool would both claim to offer a fully managed service that covers the entire machine learning workflow to build, train, and deploy machine learning models quickly.