Accuracy
Dissimilarity-based Sparse Subset Selection
Elhamifar, Ehsan, Sapiro, Guillermo, Sastry, S. Shankar
Finding an informative subset of a large collection of data points or models is at the center of many problems in computer vision, recommender systems, bio/health informatics, as well as image and natural language processing. Given pairwise dissimilarities between the elements of a 'source set' and a 'target set', we consider the problem of finding a subset of the source set, called representatives or exemplars, that can efficiently describe the target set. We formulate the problem as a row-sparsity regularized trace minimization problem. Since the proposed formulation is, in general, NP-hard, we consider a convex relaxation. The solution of our optimization finds representatives and the assignment of each element of the target set to each representative, hence obtaining a clustering. We analyze the solution of our proposed optimization as a function of the regularization parameter. We show that when the two sets jointly partition into multiple groups, our algorithm finds representatives from all groups and reveals clustering of the sets. In addition, we show that the proposed framework can effectively deal with outliers. Our algorithm works with arbitrary dissimilarities, which can be asymmetric or violate the triangle inequality. To efficiently implement our algorithm, we consider an Alternating Direction Method of Multipliers (ADMM) framework, which results in quadratic complexity in the problem size. We show that the ADMM implementation allows the algorithm to be parallelized, hence further reducing the computational time. Finally, by experiments on real-world datasets, we show that our proposed algorithm improves the state of the art on the two problems of scene categorization using representative images and time-series modeling and segmentation using representative models.
Predicting Flights Delay Using Supervised Learning, Logistic Regression
In this post, we'll use a supervised machine learning technique called logistic regression to predict delayed flights. But before we proceed, I'd like to offer condolences to the families of the victims of the Germanwings tragedy. Note: This is a common data set in the machine learning community for testing algorithms and models, given that it is publicly available and sizable. In this blog, we will look at a small sample snapshot (2201 flights in January 2004). In another post, we can explore using Big Data technologies such as Hadoop MapReduce or Spark machine learning libraries to do large scale predictive analytics and data mining.
Multilevel Weighted Support Vector Machine for Classification on Healthcare Data with Missing Values
Razzaghi, Talayeh, Roderick, Oleg, Safro, Ilya, Marko, Nicholas
This work is motivated by the needs of predictive analytics on healthcare data as represented by Electronic Medical Records. Such data is invariably problematic: noisy, with missing entries, and with imbalance in the classes of interest, leading to serious bias in predictive modeling. Since standard data mining methods often produce poor performance measures, we argue for development of specialized techniques of data-preprocessing and classification. In this paper, we propose a new method to simultaneously classify large datasets and reduce the effects of missing values. It is based on a multilevel framework of the cost-sensitive SVM and the expectation-maximization imputation method for missing values, which relies on iterated regression analyses. We compare classification results of multilevel SVM-based algorithms on public benchmark datasets with imbalanced classes and missing values as well as real data in health applications, and show that our multilevel SVM-based method produces fast, more accurate, and more robust classification results.
Fanguard: Catching Star Wars surprises and other spoilers with Machine Learning
Ruth Toner: Data Scientist, Twitch; Insight Fellow 2016; Physics Postdoc, Harvard University. Ruth Toner was a Fellow in our most recent Data Science session in Silicon Valley. She's since joined the Community team at Twitch as a Data Scientist. In this post she describes Fanguard, the tool she built at Insight to protect Tumblr readers from spoilers for blockbuster movies and popular TV shows. Before attending Insight Data Science, I spent eight years of my life in the field of particle physics. Like many postdocs and grad students, when I wasn't trying to discover the basic laws of matter (i.e., debugging my code), I spent a lot of time surfing the Internet.
A U-statistic Approach to Hypothesis Testing for Structure Discovery in Undirected Graphical Models
Bounliphone, Wacha, Blaschko, Matthew
Structure discovery in graphical models is the determination of the topology of a graph that encodes conditional independence properties of the joint distribution of all variables in the model. For some class of probability distributions, an edge between two variables is present if and only if the corresponding entry in the precision matrix is non-zero. For a finite sample estimate of the precision matrix, entries close to zero may be due to low sample effects, or due to an actual association between variables; these two cases are not readily distinguishable. Fisher provided a hypothesis test based on a parametric approximation to the distribution of an entry in the precision matrix of a Gaussian distribution, but this may not provide valid upper bounds on $p$-values for non-Gaussian distributions. Many related works on this topic consider potentially restrictive distributional or sparsity assumptions that may not apply to a data sample of interest, and direct estimation of the uncertainty of an estimate of the precision matrix for general distributions remains challenging. Consequently, we make use of results for $U$-statistics and apply them to the covariance matrix. By probabilistically bounding the distortion of the covariance matrix, we can apply Weyl's theorem to bound the distortion of the precision matrix, yielding a conservative, but sound test threshold for a much wider class of distributions than considered in previous works. The resulting test enables one to answer with statistical significance whether an edge is present in the graph, and convergence results are known for a wide range of distributions. The computational complexity is linear in the sample size, enabling the application of the test to large data samples for which computation time becomes a limiting factor.
We experimentally validate the correctness and scalability of the test on multivariate distributions for which the distributional assumptions of competing tests result in underestimates of the false positive ratio. By contrast, the proposed test remains sound, promising to be a useful tool for hypothesis testing for diverse real-world problems.
Kaggle Ensembling Guide
Model ensembling is a very powerful technique to increase accuracy on a variety of ML tasks. In this article I will share my ensembling approaches for Kaggle Competitions. For the first part we look at creating ensembles from submission files. The second part will look at creating ensembles through stacked generalization/blending. I answer why ensembling reduces the generalization error. Finally I show different methods of ensembling, together with their results and code to try it out for yourself. "This is how you win ML competitions: you take other people's work and ensemble them together." The most basic and convenient way to ensemble is to ensemble Kaggle submission CSV files. You only need the predictions on the test set for these methods -- no need to retrain a model. This makes it a quick way to ensemble already existing model predictions, ideal when teaming up. Let's see why model ensembling reduces error rate and why it works better to ensemble low-correlated model predictions. During space missions it is very important that all signals are correctly relayed. A coding solution was found in error correcting codes. The simplest error correcting code is a repetition-code: Relay the signal multiple times in equally sized chunks and have a majority vote. Signal corruption is a very rare occurrence and often occurs in small bursts. So it figures that it is even rarer to have a corrupted majority vote. As long as the corruption is not completely unpredictable (has a 50% chance of occurring) then signals can be repaired. Suppose we have a test set of 10 samples. The ground truth is all positive ("1"):
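A minimal sketch of the majority-vote idea on such a 10-sample test set follows. The three prediction vectors (`model_a`, `model_b`, `model_c`) and the helper names are hypothetical, chosen only so that each model is 70% accurate and makes its mistakes in partly different places; they are not taken from the article.

```python
from collections import Counter

def majority_vote(predictions):
    """Combine several models' label lists by taking the most common label per sample."""
    combined = []
    for votes in zip(*predictions):
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined

def accuracy(y_true, y_pred):
    """Fraction of samples where the prediction matches the ground truth."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Ground truth: all 10 samples are positive ("1").
truth = [1] * 10

# Hypothetical predictions, each correct on 7 of 10 samples.
model_a = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
model_b = [0, 1, 1, 1, 1, 1, 1, 1, 0, 0]
model_c = [1, 0, 0, 1, 1, 1, 1, 1, 1, 0]

ensemble = majority_vote([model_a, model_b, model_c])

for name, preds in [("a", model_a), ("b", model_b), ("c", model_c), ("vote", ensemble)]:
    print(name, accuracy(truth, preds))
```

With these assumed predictions, the vote corrects a sample whenever at most one of the three models gets it wrong, which is why models with less correlated errors tend to benefit more from ensembling.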
A bit on the F1 score floor
At Strata Hadoop World "R Day" Tutorial, Tuesday, March 29 2016, San Jose, California we spent some time on classifier measures derived from the so-called "confusion matrix." We repeated our usual admonition to not use "accuracy" as a project goal (business people tend to ask for it as it is the word they are most familiar with, but it usually isn't what they really want). And we worked through the usual bestiary of other metrics (precision, recall, sensitivity, specificity, AUC, balanced accuracy, and many more). We surveyed over a dozen common measures the data scientist is expected to know. While this may seem complicated, this is much better than the traditions used when trying to estimate inter-observer or tagger agreement (where there are around 100 measures, many of which combine effect size and significance, and where it requires significant research to understand which measures are monotone related to each other; see: Warrens, M. (2008).
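To make the confusion-matrix measures concrete, here is a small sketch that derives accuracy, precision, recall, and F1 from the four confusion-matrix counts. The labels and the two trivial classifiers (always negative, always positive) are made-up illustrations, and returning 0.0 for an undefined ratio is a convention assumed here, not something stated in the post.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels encoded as 0/1."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, fp, fn, tn

def safe_div(num, den):
    """Assumed convention: an undefined ratio (zero denominator) is reported as 0.0."""
    return num / den if den else 0.0

def metrics(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    return {
        "accuracy": safe_div(tp + tn, tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": safe_div(2 * precision * recall, precision + recall),
    }

# Hypothetical imbalanced data: 2 positives out of 10.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
always_negative = metrics(y_true, [0] * 10)
always_positive = metrics(y_true, [1] * 10)

print(always_negative)
print(always_positive)
```

On this toy data the always-negative classifier reaches high accuracy while finding none of the positives, which illustrates why accuracy alone can be a misleading goal on imbalanced classes.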
A Comparison between Deep Neural Nets and Kernel Acoustic Models for Speech Recognition
We study large-scale kernel methods for acoustic modeling and compare to DNNs on performance metrics related to both acoustic modeling and recognition. Measuring perplexity and frame-level classification accuracy, kernel-based acoustic models are as effective as their DNN counterparts. However, on token-error-rates DNN models can be significantly better. We have discovered that this might be attributed to DNN's unique strength in reducing both the perplexity and the entropy of the predicted posterior probabilities. Motivated by our findings, we propose a new technique, entropy regularized perplexity, for model selection. This technique can noticeably improve the recognition performance of both types of models, and reduces the gap between them.
How to assess quality and correctness of classification models? Part 4 - ROC Curve
In this fourth part of the tutorial we will discuss the ROC curve. The ROC curve is one of the methods for visualizing classification quality, which shows the dependency between TPR (True Positive Rate) and FPR (False Positive Rate). The more convex the curve, the better the classifier. In the example below, the "green" classifier is better in area 1, and the "red" classifier is better in area 2. An AUC of 1 means a perfect classifier; an AUC of 0.5 is obtained for purely random classifiers. AUC < 0.5 means the classifier performs wor
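As a rough illustration of how the curve and its AUC can be computed, the sketch below sweeps the decision threshold over the predicted scores, records one (FPR, TPR) point per threshold, and integrates with the trapezoidal rule. The function names and the example labels and scores are assumptions made for this illustration and do not come from the tutorial.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) points obtained by lowering the threshold over all distinct scores."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if s >= thr and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= thr and t == 0)
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the piecewise-linear curve through the points (trapezoidal rule)."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Hypothetical labels and classifier scores (higher score = more likely positive).
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
print(auc(roc_points(y_true, scores)))

# A classifier that ranks every positive above every negative.
perfect = auc(roc_points([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.1]))
print(perfect)
```

The sketch assumes both classes are present in `y_true`; otherwise the rates are undefined.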
The Naive Bayes Classifier explained
Reading the academic literature, Text Analytics seems difficult. However, applying it in practice has shown us that Text Classification is much easier than it looks. Most of the Classifiers consist of only a few lines of code. In this three-part blog series we will examine three well-known Classifiers: the Naive Bayes, Maximum Entropy and Support Vector Machines. From the introductory blog we know that the Naive Bayes Classifier is based on the bag-of-words model. With the bag-of-words model we check which words of the text document appear in a positive-words-list or a negative-words-list.
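The word-list check described above can be sketched in a few lines. This is only the bag-of-words lookup step, not a full Naive Bayes classifier (it uses no word probabilities or priors), and the two word lists are tiny hypothetical examples rather than the lists used in the blog series.

```python
# Hypothetical word lists, for illustration only.
POSITIVE_WORDS = {"good", "great", "excellent", "love"}
NEGATIVE_WORDS = {"bad", "poor", "terrible", "hate"}

def bag_of_words_counts(text):
    """Count how many tokens of the text appear in each word list."""
    tokens = [tok.strip(".,!?;:") for tok in text.lower().split()]
    pos = sum(tok in POSITIVE_WORDS for tok in tokens)
    neg = sum(tok in NEGATIVE_WORDS for tok in tokens)
    return pos, neg

def classify(text):
    """Label the text by whichever list matches more tokens; ties are 'neutral'."""
    pos, neg = bag_of_words_counts(text)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"

print(classify("A great movie with excellent acting"))
print(classify("Terrible plot and poor pacing."))
print(classify("It was a movie."))
```

Word order is ignored entirely, which is the defining simplification of the bag-of-words model.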