Statistical Learning


A Plethora of Original, Not Well-Known Statistical Tests

#artificialintelligence

Many of the following statistical tests are rarely discussed in textbooks or in college classes, much less in data camps. Yet they help answer a lot of different and interesting questions. I used most of them without even computing the underlying distribution under the null hypothesis, but instead, using simulations to check whether my assumptions were plausible or not. In short, my approach to statistical testing is model-free, data-driven. Some are easy to implement even in Excel.
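
The simulation-based approach described above can be illustrated with a small sketch. It assumes a two-sample comparison of means, which the excerpt does not specify; the function name `permutation_p_value` and its parameters are hypothetical. Rather than deriving the null distribution analytically, it reshuffles the pooled data many times and counts how often a difference at least as large as the observed one appears.

```python
import random


def permutation_p_value(a, b, n_sim=5000, seed=0):
    """Estimate a two-sided p-value for a difference in means by simulation."""
    rng = random.Random(seed)
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_sim):
        # Under the null hypothesis the group labels are exchangeable,
        # so any reshuffling of the pooled data is equally likely.
        rng.shuffle(pooled)
        sim_a = pooled[:len(a)]
        sim_b = pooled[len(a):]
        diff = abs(sum(sim_a) / len(sim_a) - sum(sim_b) / len(sim_b))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_sim + 1)


if __name__ == "__main__":
    # Hypothetical samples, for illustration only.
    group_a = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
    group_b = [100, 101, 102, 103, 104, 105, 106, 107, 108, 109]
    print(permutation_p_value(group_a, group_b))
```

The same loop translates to a spreadsheet: one column of shuffled labels per simulation, one cell per simulated statistic.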


Predicting failures of Molteno and Baerveldt glaucoma drainage devices using machine learning models

#artificialintelligence

The purpose of this retrospective study is to measure machine learning models' ability to predict glaucoma drainage device failure based on demographic information and preoperative measurements. The medical records of sixty-two patients were used. Potential predictors included the patient's race, age, sex, preoperative intraocular pressure, preoperative visual acuity, number of intraocular pressure-lowering medications, and number and type of previous ophthalmic surgeries. Failure was defined as final intraocular pressure greater than 18 mm Hg, reduction in intraocular pressure less than 20% from baseline, or need for reoperation unrelated to normal implant maintenance. Five classifiers were compared: logistic regression, artificial neural network, random forest, decision tree, and support vector machine.
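
The failure definition above is a three-part rule, and it can be written out as a short function. This is a sketch only: the function and parameter names are hypothetical, pressures are assumed to be in mm Hg, and the reoperation flag is assumed to already exclude normal implant maintenance. It does not reproduce the study's classifiers.

```python
def is_failure(baseline_iop, final_iop, reoperation_needed):
    """Apply the stated failure criteria to one patient record."""
    # Fractional reduction in intraocular pressure relative to baseline.
    reduction = (baseline_iop - final_iop) / baseline_iop
    return bool(
        final_iop > 18            # final pressure above 18 mm Hg
        or reduction < 0.20       # less than a 20% reduction from baseline
        or reoperation_needed     # reoperation unrelated to maintenance
    )


if __name__ == "__main__":
    print(is_failure(30, 15, False))  # False: 15 mm Hg, 50% reduction
    print(is_failure(30, 20, False))  # True: final pressure above 18 mm Hg
    print(is_failure(20, 17, False))  # True: only a 15% reduction
```

A label computed this way would serve as the binary target that the five classifiers are trained to predict.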


r/MachineLearning - [R] [1903.08114] Exact Gaussian Processes on a Million Data Points

#artificialintelligence

Abstract: Gaussian processes (GPs) are flexible models with state-of-the-art performance on many impactful applications. However, computational constraints with standard inference procedures have limited exact GPs to problems with fewer than about ten thousand training points, necessitating approximations for larger datasets. In this paper, we develop a scalable approach for exact GPs that leverages multi-GPU parallelization and methods like linear conjugate gradients, accessing the kernel matrix only through matrix multiplication. By partitioning and distributing kernel matrix multiplies, we demonstrate that an exact GP can be trained on over a million points in 3 days using 8 GPUs and can compute predictive means and variances in under a second using 1 GPU at test time. Moreover, we perform the first-ever comparison of exact GPs against state-of-the-art scalable approximations on large-scale regression datasets with $10^4$-$10^6$ data points, showing dramatic performance improvements.
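
To make "accessing the kernel matrix only through matrix multiplication" concrete, here is a minimal single-machine sketch of linear conjugate gradients in plain Python. It is not the paper's multi-GPU implementation; the names `conjugate_gradient` and `matvec` are hypothetical. The solver never sees the matrix itself, only a function that multiplies it by a vector, which is what allows the multiply to be partitioned and distributed.

```python
def conjugate_gradient(matvec, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for symmetric positive-definite A, given only v -> A v."""
    x = [0.0] * len(b)
    r = list(b)  # residual b - A x, starting from x = 0
    p = list(r)  # current search direction
    rs_old = sum(ri * ri for ri in r)
    for _ in range(max_iter):
        if rs_old ** 0.5 < tol:
            break
        Ap = matvec(p)  # the only place the matrix is touched
        alpha = rs_old / sum(pi * api for pi, api in zip(p, Ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = sum(ri * ri for ri in r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


if __name__ == "__main__":
    # Toy 2x2 system standing in for (kernel matrix + noise) @ alpha = y.
    A = [[4.0, 1.0], [1.0, 3.0]]

    def matvec(v):
        return [sum(a_ij * v_j for a_ij, v_j in zip(row, v)) for row in A]

    print(conjugate_gradient(matvec, [1.0, 2.0]))
```

In a GP setting, `matvec` would compute kernel rows on the fly instead of storing the full matrix, so memory use stays linear in the number of training points.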


A Futuristic Reality: Harnessing The Power Of The Three Layers Of Machine Learning

#artificialintelligence

Google Assistant can draw on voice command, as seen here at the Google I/O conference in 2018, with the help of machine learning techniques. Artificial intelligence systems powered by machine learning have been creating headlines with applications as varied as making restaurant reservations by phone, sorting cucumbers, and distinguishing chihuahuas from muffins. Media buzz aside, many fast-growing startups are taking advantage of machine learning (ML) techniques like neural networks and support vector machines to learn from data, make predictions, improve products, and enhance business decisions. Unfortunately "machine learning theater" – companies pretending to use the technology to make theirs seem more sophisticated for a higher valuation – is also on the rise. Undeniably, ML is transforming businesses and industries, with some more likely to benefit than others.


ML in KSQL

#artificialintelligence

At HomeAway, we use Apache Kafka as the backbone for our streaming architecture. We also like to deploy machine learning models to make real-time predictions on our data streams. Confluent KSQL provides an easy-to-use, interactive SQL interface for performing stream processing on Kafka. Below we show how to build a model in Python and use the model in KSQL to make predictions based on a stream of data in Kafka. We use Predictive Model Markup Language (PMML) so that we can train the model with the Python library Scikit-learn but perform model inference in Java-based KSQL.


SQL vs. Machine Learning vs. Machine Learning Applied to SQL

#artificialintelligence

The seed for this article was planted when Anant was struck by a headline on his Twitter feed: "You don't need ML/AI." He had observed something similar in working through data and analytics requirements for Google Cloud's Apigee team -- not that machine learning (ML) or artificial intelligence (AI) is not needed, but that good database queries can frequently accomplish the job, and that when AI is legitimately needed, its role is often to improve the database design and operations, not to replace them. The two of us got the chance to compile our thinking a bit more as Anant was preparing for a talk at VLDB 2018, a premier database conference. The slides of his talk are here. In this post, we elaborate on some of our observations on the topic.


10 Amazing Articles On Python Programming And Machine Learning Week 3

#artificialintelligence

Today's post is based on a project I recently did in work. I was really excited to implement it and to write it up as a blog post as it gave me a chance to do some data engineering and also do something that was quite valuable for my team. Not too long ago, I discovered that we had a relatively large amount of user log data relating to one of our data products stored on our systems. Remember that a blockchain is an immutable, sequential chain of records called Blocks. They can contain transactions, files or any data you like, really.


Logistic Regression with Python

#artificialintelligence

Logistic regression was once the most popular machine learning algorithm, but the advent of more accurate algorithms for classification such as support vector machines, random forest, and neural networks has induced some machine learning engineers to view logistic regression as obsolete. Though it may have been overshadowed by more advanced methods, its simplicity makes it the ideal algorithm to use as an introduction to the study of machine learning. Like most machine learning algorithms, logistic regression creates a boundary edge between binary labels. The purpose of a training process is to place this edge in such a way that most of the labels are divided so as to maximize the accuracy of predictions. The training process requires correct model architecture and fine-tuned hyperparameters, whereas data play the most significant role in determining the prediction accuracy.
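
The idea of placing a boundary between binary labels can be sketched without any library. The example below is an illustration under stated assumptions, not the article's code: it uses a single feature, batch gradient descent on log loss, and hypothetical names (`fit_logistic`, `predict`) and toy data.

```python
import math


def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, lr=0.1, epochs=2000):
    """Fit weight w and bias b for one feature by gradient descent on log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Prediction errors drive both gradients.
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def predict(w, b, x):
    """Label a point by which side of the boundary w * x + b = 0 it falls on."""
    return 1 if sigmoid(w * x + b) >= 0.5 else 0


if __name__ == "__main__":
    # Hypothetical toy data: negative inputs are class 0, positive are class 1.
    xs = [-3.0, -2.0, -1.0, 1.0, 2.0, 3.0]
    ys = [0, 0, 0, 1, 1, 1]
    w, b = fit_logistic(xs, ys)
    print(predict(w, b, -2.0), predict(w, b, 2.0))
```

Here the learning rate `lr` and the number of `epochs` are the hyperparameters; with several features, `w` becomes a vector and the boundary becomes a line or hyperplane.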


M3 Multimodal, Multiattribute, Multilingual Demo

#artificialintelligence

M3 is a deep learning system that infers demographic attributes directly from social media profiles--no further data is needed. This web demo showcases M3 on Twitter profiles, but M3 works on any similar profile data, in 32 languages. To learn more, please see our open-source Python library m3inference or read our Web Conference (WWW) 2019 paper for details. The paper also includes fully interpretable multilevel regression methods that estimate inclusion probabilities using the inferred demographic attributes to correct for sampling biases on social media platforms. This web demo was created by Scott Hale and Graham McNeill.


Multi-Class classification with Sci-kit learn & XGBoost: A case study using Brainwave data

#artificialintelligence

In machine learning, classification problems with high-dimensional data are really challenging. Sometimes, very simple problems become extremely complex due to this 'curse of dimensionality' problem. In this article, we will see how accuracy and performance vary across different classifiers. We will also see how, when we don't have the freedom to choose a classifier independently, we can do feature engineering to make a poor classifier perform well. For this article, we will use the "EEG Brainwave Dataset" from Kaggle.