A common need when you are analyzing real-world datasets is determining which data points stand out as being different from all the others. Such data points are known as anomalies. This article was originally published on Medium by Davis David. In this article, you will learn a couple of machine learning-based approaches for anomaly detection; part two then shows how to apply one of these approaches to a specific use case: credit fraud detection.
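The article's specific approaches come later; as a minimal, library-free sketch of the idea, a robust median-based rule (the "modified z-score") can flag points that sit far from the rest of the data:

```python
import statistics

def mad_anomalies(values, threshold=3.5):
    """Flag points whose modified z-score exceeds the threshold.

    Uses the median and median absolute deviation (MAD) rather than the
    mean and standard deviation, so a single extreme point cannot mask
    itself by inflating the spread estimate.
    """
    med = statistics.median(values)
    mad = statistics.median(abs(x - med) for x in values)
    if mad == 0:
        return []  # no spread at all: nothing can be called anomalous
    return [x for x in values if 0.6745 * abs(x - med) / mad > threshold]

data = [10, 11, 9, 10, 12, 10, 11, 95]
print(mad_anomalies(data))  # flags 95, which sits far from the cluster
```

The machine-learning approaches discussed in the article generalize this idea to many dimensions and learned notions of "normal."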
Regardless of the metric you decide to optimize, it's necessary to establish a baseline measure of performance. This baseline provides a point of comparison that enables you to track your progress. It also allows you to judge the rate of return you'll get by increasing the complexity of your modeling solution. Suppose you work for a real estate firm and are asked to build a model to predict the price of a house. You decide to optimize for RMSE and build a linear regression model with features including the square footage of the house, the number of bedrooms and bathrooms, and other information.
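A mean-value predictor is a common choice of baseline for regression. The toy prices below are hypothetical, but they show how the comparison gives you a rate-of-return measure for added model complexity:

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between actual and predicted values."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

# Toy house prices in $1000s, with hypothetical model predictions.
actual = [200, 250, 300, 350, 400]
model_preds = [210, 240, 310, 340, 390]

# Baseline: always predict the mean of the observed prices.
baseline = sum(actual) / len(actual)
baseline_preds = [baseline] * len(actual)

print(f"baseline RMSE: {rmse(actual, baseline_preds):.1f}")
print(f"model RMSE:    {rmse(actual, model_preds):.1f}")
```

If the linear regression barely beats the mean predictor, the features are not adding much; the gap between the two numbers is your point of comparison.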
Many of the following statistical tests are rarely discussed in textbooks or in college classes, much less in data camps. Yet they help answer a lot of different and interesting questions. I used most of them without even computing the underlying distribution under the null hypothesis, instead using simulations to check whether my assumptions were plausible. In short, my approach to statistical testing is model-free and data-driven. Some are easy to implement even in Excel.
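A permutation test is one example of this simulation-based, model-free style: rather than deriving the null distribution analytically, you shuffle the group labels and count how often the observed effect arises by chance. A minimal sketch with made-up samples:

```python
import random

def permutation_test(a, b, n_iter=10000, seed=0):
    """Two-sample permutation test on the difference of means.

    Returns the fraction of label shufflings whose mean difference is at
    least as extreme as the observed one: an empirical two-sided p-value,
    with no distributional assumptions.
    """
    rng = random.Random(seed)
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = list(a) + list(b)
    count = 0
    for _ in range(n_iter):
        rng.shuffle(pooled)
        perm_a, perm_b = pooled[:len(a)], pooled[len(a):]
        diff = abs(sum(perm_a) / len(perm_a) - sum(perm_b) / len(perm_b))
        if diff >= observed:
            count += 1
    return count / n_iter

# Two samples with clearly different means: expect a small p-value.
p = permutation_test([5.1, 5.3, 5.5, 5.2], [6.8, 7.0, 6.9, 7.1])
print(p)
```

The same shuffle-and-count pattern adapts to medians, correlations, or any other statistic, which is what makes the approach so flexible.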
The purpose of this retrospective study is to measure machine learning models' ability to predict glaucoma drainage device failure based on demographic information and preoperative measurements. The medical records of sixty-two patients were used. Potential predictors included the patient's race, age, sex, preoperative intraocular pressure, preoperative visual acuity, number of intraocular pressure-lowering medications, and number and type of previous ophthalmic surgeries. Failure was defined as final intraocular pressure greater than 18 mm Hg, reduction in intraocular pressure less than 20% from baseline, or need for reoperation unrelated to normal implant maintenance. Five classifiers were compared: logistic regression, artificial neural network, random forest, decision tree, and support vector machine.
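The study's composite failure definition can be stated precisely as a small predicate. This is only an illustrative sketch; the variable names and boolean encoding are assumptions, not the study's actual code:

```python
def is_failure(final_iop, baseline_iop, needed_reoperation):
    """Composite failure criterion from the study:
    final intraocular pressure greater than 18 mm Hg, OR
    reduction in IOP less than 20% from baseline, OR
    reoperation unrelated to normal implant maintenance.
    Pressures are in mm Hg."""
    reduction = (baseline_iop - final_iop) / baseline_iop
    return final_iop > 18 or reduction < 0.20 or needed_reoperation

print(is_failure(final_iop=15, baseline_iop=30, needed_reoperation=False))  # False: success
print(is_failure(final_iop=20, baseline_iop=30, needed_reoperation=False))  # True: IOP > 18
```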
Abstract: Gaussian processes (GPs) are flexible models with state-of-the-art performance on many impactful applications. However, computational constraints with standard inference procedures have limited exact GPs to problems with fewer than about ten thousand training points, necessitating approximations for larger datasets. In this paper, we develop a scalable approach for exact GPs that leverages multi-GPU parallelization and methods like linear conjugate gradients, accessing the kernel matrix only through matrix multiplication. By partitioning and distributing kernel matrix multiplies, we demonstrate that an exact GP can be trained on over a million points in 3 days using 8 GPUs and can compute predictive means and variances in under a second using 1 GPU at test time. Moreover, we perform the first-ever comparison of exact GPs against state-of-the-art scalable approximations on large-scale regression datasets with $10^4$ to $10^6$ data points, showing dramatic performance improvements.
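The method of conjugate gradients touches the matrix only through matrix-vector products, which is the access pattern that makes partitioned, distributed kernel multiplies possible. A minimal dense-matrix sketch of the solver (the paper's implementation is, of course, GPU-partitioned and far more sophisticated):

```python
def matvec(A, v):
    """Matrix-vector product: the only way the solver touches A."""
    return [sum(a_ij * v_j for a_ij, v_j in zip(row, v)) for row in A]

def conjugate_gradients(A, b, tol=1e-10, max_iter=100):
    """Solve A x = b for a symmetric positive-definite A."""
    n = len(b)
    x = [0.0] * n
    r = list(b)            # residual b - A x (x starts at zero)
    p = list(r)            # initial search direction
    rs_old = sum(ri * ri for ri in r)
    for _ in range(max_iter):
        Ap = matvec(A, p)
        alpha = rs_old / sum(pi * api for pi, api in zip(p, Ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = sum(ri * ri for ri in r)
        if rs_new < tol:
            break
        # New direction, conjugate to the previous ones.
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x

# Tiny symmetric positive-definite stand-in for a kernel matrix.
K = [[4.0, 1.0], [1.0, 3.0]]
y = [1.0, 2.0]
print(conjugate_gradients(K, y))  # solves K x = y
```

Because `matvec` is the only point of contact with the matrix, it can be swapped for a routine that computes kernel blocks on the fly across several GPUs without materializing the full matrix.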
Google Assistant can act on voice commands, as demonstrated at the Google I/O conference in 2018, with the help of machine learning techniques. Artificial intelligence systems powered by machine learning have been creating headlines with applications as varied as making restaurant reservations by phone, sorting cucumbers, and distinguishing chihuahuas from muffins. Media buzz aside, many fast-growing startups are taking advantage of machine learning (ML) techniques like neural networks and support vector machines to learn from data, make predictions, improve products, and enhance business decisions. Unfortunately "machine learning theater" – companies pretending to use the technology to make their products seem more sophisticated for a higher valuation – is also on the rise. Undeniably, ML is transforming businesses and industries, with some more likely to benefit than others.
At HomeAway, we use Apache Kafka as the backbone for our streaming architecture. We also like to deploy machine learning models to make real-time predictions on our data streams. Confluent KSQL provides an easy-to-use, interactive SQL interface for performing stream processing on Kafka. Below we show how to build a model in Python and use the model in KSQL to make predictions on a stream of data in Kafka. We use the Predictive Model Markup Language (PMML) so that the model can be trained with the Python library scikit-learn but used for inference in Java-based KSQL.
The seed for this article was planted when Anant was struck by a headline on his Twitter feed: "You don't need ML/AI." He had observed something similar in working through data and analytics requirements for Google Cloud's Apigee team -- not that machine learning (ML) or artificial intelligence (AI) is not needed, but that good database queries can frequently accomplish the job, and that when AI is legitimately needed, its role is often to improve the database design and operations, not to replace them. The two of us got the chance to compile our thinking a bit more as Anant was preparing for a talk at VLDB 2018, a premier database conference. The slides of his talk are here. In this post, we elaborate on some of our observations on the topic.
Today's post is based on a project I recently did at work. I was really excited to implement it and to write it up as a blog post, as it gave me a chance to do some data engineering and also do something that was quite valuable for my team. Not too long ago, I discovered that we had a relatively large amount of user log data relating to one of our data products stored on our systems. Remember that a blockchain is an immutable, sequential chain of records called blocks. They can contain transactions, files, or any data you like, really.
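That definition of a blockchain can be sketched in a few lines: each block stores the hash of its predecessor, so altering any block invalidates every block after it. This is a minimal illustration, not a production design:

```python
import hashlib
import json
import time

def hash_block(block):
    """SHA-256 over the block's JSON, with sorted keys for determinism."""
    return hashlib.sha256(json.dumps(block, sort_keys=True).encode()).hexdigest()

def new_block(chain, data):
    """Append a block that links back to the hash of the previous block."""
    block = {
        "index": len(chain),
        "timestamp": time.time(),
        "data": data,
        "previous_hash": hash_block(chain[-1]) if chain else "0",
    }
    chain.append(block)
    return block

chain = []
new_block(chain, {"from": "alice", "to": "bob", "amount": 5})
new_block(chain, {"from": "bob", "to": "carol", "amount": 2})

# Each block commits to its predecessor, making the chain tamper-evident:
# changing chain[0] would change its hash and break this link.
print(chain[1]["previous_hash"] == hash_block(chain[0]))
```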
Logistic regression was once the most popular machine learning algorithm, but the advent of more accurate classification algorithms such as support vector machines, random forests, and neural networks has led some machine learning engineers to view logistic regression as obsolete. Though it may have been overshadowed by more advanced methods, its simplicity makes it the ideal algorithm to use as an introduction to the study of machine learning. Like most machine learning classifiers, logistic regression draws a decision boundary between the two classes of binary labels. The purpose of the training process is to place this boundary so that as many labels as possible fall on the correct side, maximizing the accuracy of predictions. The training process requires a correct model architecture and fine-tuned hyperparameters, but the data play the most significant role in determining prediction accuracy.
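A minimal from-scratch sketch makes these ideas concrete: a sigmoid squashes a weighted sum into a probability, and gradient descent nudges the weights until the boundary separates the labels. The one-dimensional toy data below is assumed purely for illustration:

```python
import math

def sigmoid(z):
    """Squash a real number into a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(X, y, lr=0.1, epochs=1000):
    """Fit weights and bias by stochastic gradient descent on the log-loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            pred = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = pred - yi  # gradient of log-loss w.r.t. the weighted sum
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b

# Toy 1-D data: the label flips from 0 to 1 as the feature grows.
X = [[1.0], [2.0], [3.0], [4.0]]
y = [0, 0, 1, 1]
w, b = train_logistic(X, y)

def predict(x):
    """Threshold the predicted probability at 0.5 to pick a class."""
    return 1 if sigmoid(w[0] * x + b) >= 0.5 else 0

print([predict(v) for v in [1.0, 2.0, 3.0, 4.0]])
```

The learned boundary is the point where `w[0] * x + b` crosses zero; training moves it between the two label groups, which is the "placing the boundary" described above.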