 Performance Analysis


Data Science (Machine Learning) 101

#artificialintelligence

Data Science, or Machine Learning, is a scary topic. It's hard to know where to get started. It's hard to even find a good definition of what it does and what you have to do. As I've given a few ad hoc presentations on Machine Learning (and though they focused on implementing it with Azure, the basics are applicable to other platforms), I thought I'd take my random notes and present them as a primer. You don't need to be a rocket scientist to get started, but having a basic understanding of Linear Algebra will be helpful.


Variational Inference for On-line Anomaly Detection in High-Dimensional Time Series

arXiv.org Machine Learning

Approximate variational inference has been shown to be a powerful tool for modeling unknown complex probability distributions. Recent advances in the field allow us to learn probabilistic models of sequences that actively exploit spatial and temporal structure. We apply a Stochastic Recurrent Network (STORN) to learn robot time series data. Our evaluation demonstrates that we can robustly detect anomalies both off- and on-line.


Online Optimization Methods for the Quantification Problem

arXiv.org Machine Learning

The estimation of class prevalence, i.e., the fraction of a population that belongs to a certain class, is a very useful tool in data analytics and learning, and finds applications in many domains such as sentiment analysis, epidemiology, etc. For example, in sentiment analysis, the objective is often not to estimate whether a specific text conveys a positive or a negative sentiment, but rather estimate the overall distribution of positive and negative sentiments during an event window. A popular way of performing the above task, often dubbed quantification, is to use supervised learning to train a prevalence estimator from labeled data. Contemporary literature cites several performance measures used to measure the success of such prevalence estimators. In this paper we propose the first online stochastic algorithms for directly optimizing these quantification-specific performance measures. We also provide algorithms that optimize hybrid performance measures that seek to balance quantification and classification performance. Our algorithms present a significant advancement in the theory of multivariate optimization and we show, by a rigorous theoretical analysis, that they exhibit optimal convergence. We also report extensive experiments on benchmark and real data sets which demonstrate that our methods significantly outperform existing optimization techniques used for these performance measures.
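
To make the idea of quantification concrete, here is a minimal sketch of two textbook baselines, classify-and-count and adjusted count. These are not the online algorithms the paper proposes; the function names, the toy predictions, and the error rates (tpr=0.8, fpr=0.1) are illustrative assumptions.

```python
def classify_and_count(predictions):
    """Estimate prevalence as the fraction of items predicted positive."""
    return sum(predictions) / len(predictions)


def adjusted_count(predictions, tpr, fpr):
    """Correct the raw estimate using the classifier's assumed error rates.

    Solves  observed = tpr * p + fpr * (1 - p)  for the true prevalence p.
    """
    if tpr == fpr:
        raise ValueError("tpr and fpr must differ for the correction to be defined")
    observed = classify_and_count(predictions)
    p = (observed - fpr) / (tpr - fpr)
    return min(1.0, max(0.0, p))  # clip to a valid proportion


# Hypothetical toy data: 1 = predicted positive sentiment, 0 = predicted negative.
predictions = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
raw = classify_and_count(predictions)
corrected = adjusted_count(predictions, tpr=0.8, fpr=0.1)
print(f"classify-and-count: {raw:.3f}, adjusted count: {corrected:.3f}")
```

The point of the sketch is that the target is the class distribution as a whole, so a classifier with known error rates can yield a better prevalence estimate than its raw predicted-positive fraction.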


Tuning-Free Heterogeneity Pursuit in Massive Networks

arXiv.org Machine Learning

Heterogeneity is often natural in many contemporary applications involving massive data. While posing new challenges to effective learning, it can play a crucial role in powering meaningful scientific discoveries through the understanding of important differences among subpopulations of interest. In this paper, we exploit multiple networks with Gaussian graphs to encode the connectivity patterns of a large number of features on the subpopulations. To uncover the heterogeneity of these structures across subpopulations, we suggest a new framework of tuning-free heterogeneity pursuit (THP) via large-scale inference, where the number of networks is allowed to diverge. In particular, two new tests, the chi-based test and the linear functional-based test, are introduced and their asymptotic null distributions are established. Under mild regularity conditions, we establish that both tests are optimal in achieving the testable region boundary and the sample size requirement for the latter test is minimal. Both theoretical guarantees and the tuning-free feature stem from efficient multiple-network estimation by our newly suggested approach of heterogeneous group square-root Lasso (HGSL) for high-dimensional multi-response regression with heterogeneous noises. To solve this convex program, we further introduce a tuning-free algorithm that is scalable and enjoys provable convergence to the global optimum. Both computational and theoretical advantages of our procedure are elucidated through simulation and real data examples.


What happens when your search engine is first to know you have cancer

Washington Post - Technology News

This week researchers demonstrated that by analyzing a person's Web searches they could in some cases predict an upcoming diagnosis of pancreatic cancer. Unlike traditional medical professionals, they have the advantage of access to a trove of data that Microsoft collects through its search engine, Bing. The Microsoft researchers identified Web users who had recently searched for queries indicating they have pancreatic cancer, such as "I was told I have pancreatic cancer, what to expect," and then looked back months earlier to examine patterns in the symptoms that the users searched for. This included phrases such as "dark or tarry stool," "abdominal swelling," "dark urine" and "yellowing skin." From this analysis they identified trends in the queries of users who were soon to be diagnosed with pancreatic cancer, picking out 5 to 15 percent of cases with low false-positive rates.


How web search data might help diagnose serious illness earlier - Next at Microsoft

@machinelearnbot

Early diagnosis is key to gaining the upper hand against a wide range of diseases. Now Microsoft researchers are suggesting that records of the topics that people search for on the Internet could one day prove as useful as an X-ray or MRI in detecting some illnesses before it's too late. The potential of using engagement with search engines to predict an eventual diagnosis, and possibly buy critical time for a medical response, is demonstrated in a new study by Microsoft researchers Eric Horvitz and Ryen White, along with former Microsoft intern and Columbia University doctoral candidate John Paparrizos. In a paper published Tuesday in the Journal of Oncology Practice, the trio detailed how they used anonymized Bing search logs to identify people whose queries provided strong evidence that they had recently been diagnosed with pancreatic cancer, a particularly deadly and fast-spreading cancer that is frequently caught too late to cure. Then they retroactively analyzed searches for symptoms of the disease over many months prior to identify patterns of queries most likely to signal an eventual diagnosis.


US Patent Application for Face Detection Using Machine Learning Patent Application (Application #20160140436 issued May 19, 2016) - Justia Patents Search

#artificialintelligence

This invention relates generally to image processing and, more particularly, to object detection using machine learning. Face detection systems perform image processing on digital images or video frames to automatically identify people. In one approach, face detection systems classify images into positive images that contain faces and negative images without any faces. Face detection systems may train a neural network for detecting faces and separating the faces from backgrounds. By separating faces from backgrounds, face detection systems may determine whether images contain faces. A good face detection system should have a low rate of false positive detection (i.e., erroneously detecting a negative image as a positive image) and a high rate of true positive detection. Face detection remains challenging because the number of positive images and negative images available for training typically are not balanced. For example, there may be many more negative images than positive images, and the neural network may be trained in a biased manner with too many negative images. As a result, the neural network trained with the imbalanced number of positive and negative samples may suffer from low accuracy in face detection with a high false positive detection rate or a low true positive detection rate. Face detection also remains challenging because facial appearance may be irregular with large variance. For example, faces may be deformed because of subjects having varying poses or expressions. In addition, faces may be deformed by external settings such as lighting conditions, occlusions, etc. As a result, the neural network may fail to distinguish faces from backgrounds and cause a high false positive detection rate. Thus, there is a need for good approaches to accurate face detection and detection of other objects.
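
The two rates named above, and the effect of class imbalance on them, can be illustrated with a short sketch. It is not taken from the patent application; the function name and the toy labels are hypothetical, and the "detector" is a deliberately trivial one that never reports a face.

```python
def detection_rates(labels, predictions):
    """Return (true_positive_rate, false_positive_rate) for binary labels."""
    tp = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 1)
    positives = sum(labels)
    negatives = len(labels) - positives
    tpr = tp / positives if positives else 0.0
    fpr = fp / negatives if negatives else 0.0
    return tpr, fpr


# Hypothetical imbalanced set: 2 images with faces (1), 8 without (0).
labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

# A trivial detector that always answers "no face".
always_negative = [0] * len(labels)

accuracy = sum(y == p for y, p in zip(labels, always_negative)) / len(labels)
tpr, fpr = detection_rates(labels, always_negative)
print(f"accuracy={accuracy:.2f}, TPR={tpr:.2f}, FPR={fpr:.2f}")
```

On this toy set the trivial detector looks accurate overall while finding no faces at all, which is why the true positive and false positive rates are reported separately when the classes are imbalanced.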


The preoccupation with test error in applied machine learning

#artificialintelligence

"Predictive accuracy on test sets is the criterion for how good the model is." The quote above may be one of the most important observations, from one of the most important papers, in data science. So forgive me because I am not worthy, but I propose a reinterpretation of this philosophy for the commercial practice of applied machine learning in 2016. The technology exists now, be it purchased or built in-house, to directly measure the monetary value that a machine learning model is generating. This monetary value should be the criterion for selecting and deploying a commercial machine learning model, not its performance on old, static test data sets. In the worst cases, I've seen organizations choose models purely based on hype, or the shiny appeal of novelty (often buttressed by a blog post or whitepaper with impressive test data performance).


Expected Similarity Estimation for Large-Scale Batch and Streaming Anomaly Detection

arXiv.org Artificial Intelligence

We present a novel algorithm for anomaly detection on very large datasets and data streams. The method, named EXPected Similarity Estimation (EXPoSE), is kernel-based and able to efficiently compute the similarity between new data points and the distribution of regular data. The estimator is formulated as an inner product with a reproducing kernel Hilbert space embedding and makes no assumption about the type or shape of the underlying data distribution. We show that offline (batch) learning with EXPoSE can be done in linear time and online (incremental) learning takes constant time per instance and model update. Furthermore, EXPoSE can make predictions in constant time, while it requires only constant memory. In addition, we propose different methodologies for concept drift adaptation on evolving data streams. On several real datasets we demonstrate that our approach can compete with state of the art algorithms for anomaly detection while being an order of magnitude faster than most other approaches.
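
The abstract's central idea, scoring a point by its inner product with the mean embedding of the regular data, can be sketched as follows. This is not the authors' implementation: it assumes a Gaussian kernel approximated with random Fourier features, and the class name, parameters, and synthetic data are hypothetical. With a fixed number of features, each update and each score costs the same amount of work and memory no matter how many points have been seen.

```python
import math
import random


class ExpectedSimilaritySketch:
    """Running mean embedding built from random Fourier features."""

    def __init__(self, dim, n_features=200, bandwidth=1.0, seed=0):
        rng = random.Random(seed)
        # Random projections w ~ N(0, 1/bandwidth^2) and phases b ~ U[0, 2*pi)
        # approximate a Gaussian kernel (an assumption of this sketch).
        self.weights = [[rng.gauss(0.0, 1.0 / bandwidth) for _ in range(dim)]
                        for _ in range(n_features)]
        self.phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n_features)]
        self.scale = math.sqrt(2.0 / n_features)
        self.mean = [0.0] * n_features
        self.count = 0

    def _features(self, x):
        return [self.scale * math.cos(sum(wi * xi for wi, xi in zip(w, x)) + b)
                for w, b in zip(self.weights, self.phases)]

    def update(self, x):
        """Fold one regular observation into the running mean embedding."""
        self.count += 1
        phi = self._features(x)
        self.mean = [m + (p - m) / self.count for m, p in zip(self.mean, phi)]

    def score(self, x):
        """Inner product with the mean embedding; lower means more anomalous."""
        return sum(p * m for p, m in zip(self._features(x), self.mean))


# Synthetic "regular" data: 2-D standard normal points.
rng = random.Random(1)
model = ExpectedSimilaritySketch(dim=2)
for _ in range(500):
    model.update([rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)])

typical_score = model.score([0.0, 0.0])
outlier_score = model.score([8.0, 8.0])
print(f"typical: {typical_score:.3f}, outlier: {outlier_score:.3f}")
```

A point near the bulk of the regular data should receive a clearly higher score than a point far from it; where to put the threshold between the two is left open here.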


Bootstrap and cross-validation for evaluating modelling strategies

#artificialintelligence

I've been re-reading Frank Harrell's Regression Modelling Strategies, a must-read for anyone who ever fits a regression model, although be prepared - depending on your background, you might get 30 pages in and suddenly become convinced you've been doing nearly everything wrong before, which can be disturbing. I wanted to evaluate three simple modelling strategies in dealing with data with many variables. Using data with 54 variables on 1,785 area units from New Zealand's 2013 census, I'm looking to predict median income on the basis of the other 53 variables. The features are all continuous and are variables like "mean number of bedrooms", "proportion of individuals with no religion" and "proportion of individuals who are smokers". None of these is exactly what I would use for real, but they serve the purpose of setting up a competition of strategies that I can test with a variety of model validation techniques.
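
As a rough illustration of the two validation techniques named in the title, here is a minimal sketch using a one-variable least-squares fit on synthetic data. It is not the post's code and does not use the census data; the function names are hypothetical, and the bootstrap shown is the simplest variant (fit on a resample, score on the original data), which may differ from the variant the post uses.

```python
import math
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b * mx, b


def rmse(model, xs, ys):
    a, b = model
    return math.sqrt(sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys)) / len(xs))


def cross_validated_rmse(xs, ys, k, rng):
    """Average held-out RMSE over k folds."""
    idx = list(range(len(xs)))
    rng.shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    scores = []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        model = fit_line([xs[i] for i in train], [ys[i] for i in train])
        scores.append(rmse(model, [xs[i] for i in fold], [ys[i] for i in fold]))
    return sum(scores) / k


def bootstrap_rmse(xs, ys, n_resamples, rng):
    """Fit on each resample drawn with replacement, score on the original data."""
    n = len(xs)
    scores = []
    for _ in range(n_resamples):
        sample = [rng.randrange(n) for _ in range(n)]
        model = fit_line([xs[i] for i in sample], [ys[i] for i in sample])
        scores.append(rmse(model, xs, ys))
    return sum(scores) / n_resamples


# Synthetic data: y = 2 + 3x plus unit-variance Gaussian noise.
rng = random.Random(42)
xs = [rng.uniform(0.0, 10.0) for _ in range(200)]
ys = [2.0 + 3.0 * x + rng.gauss(0.0, 1.0) for x in xs]

cv_estimate = cross_validated_rmse(xs, ys, k=5, rng=rng)
bootstrap_estimate = bootstrap_rmse(xs, ys, n_resamples=100, rng=rng)
print(f"5-fold CV RMSE: {cv_estimate:.3f}, bootstrap RMSE: {bootstrap_estimate:.3f}")
```

Both estimates should land near the noise level of 1 on this toy problem. Because the simple bootstrap scores each model on data that overlaps its training resample, it tends to be somewhat optimistic compared with held-out folds.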