Accuracy
How web search data might help diagnose serious illness earlier - Next at Microsoft
Early diagnosis is key to gaining the upper hand against a wide range of diseases. Now Microsoft researchers are suggesting that records of the topics that people search for on the Internet could one day prove as useful as an X-ray or MRI in detecting some illnesses before it's too late. The potential of using engagement with search engines to predict an eventual diagnosis – and possibly buy critical time for a medical response – is demonstrated in a new study by Microsoft researchers Eric Horvitz and Ryen White, along with former Microsoft intern and Columbia University doctoral candidate John Paparrizos. In a paper published Tuesday in the Journal of Oncology Practice, the trio detailed how they used anonymized Bing search logs to identify people whose queries provided strong evidence that they had recently been diagnosed with pancreatic cancer – a particularly deadly and fast-spreading cancer that is frequently caught too late to cure. Then they retroactively analyzed searches for symptoms of the disease over many months prior to identify patterns of queries most likely to signal an eventual diagnosis.
US Patent Application for Face Detection Using Machine Learning Patent Application (Application #20160140436 issued May 19, 2016) - Justia Patents Search
This invention relates generally to image processing and, more particularly, to object detection using machine learning. Face detection systems perform image processing on digital images or video frames to automatically identify people. In one approach, face detection systems classify images into positive images that contain faces and negative images without any faces. Face detection systems may train a neural network for detecting faces and separating the faces from backgrounds. By separating faces from backgrounds, face detection systems may determine whether images contain faces. A good face detection system should have a low rate of false positive detection (i.e., erroneously detecting a negative image as a positive image) and a high rate of true positive detection (i.e., correctly detecting a positive image as a positive image). Face detection remains challenging because the numbers of positive images and negative images available for training typically are not balanced. For example, there may be many more negative images than positive images, and the neural network may be trained in a biased manner with too many negative images. As a result, a neural network trained with an imbalanced number of positive and negative samples may suffer from low accuracy in face detection, with a high false positive detection rate or a low true positive detection rate. Face detection also remains challenging because facial appearance may be irregular with large variance. For example, faces may be deformed because of subjects having varying poses or expressions. In addition, faces may be deformed by external settings such as lighting conditions, occlusions, etc. As a result, the neural network may fail to distinguish faces from backgrounds and cause a high false positive detection rate. Thus, there is a need for good approaches to accurate face detection and detection of other objects.
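The two rates described in this excerpt can be illustrated with a minimal sketch. The function name `detection_rates` and the toy labels are hypothetical and are not taken from the patent application; the code only applies the standard definitions of true positive rate and false positive rate.

```python
def detection_rates(y_true, y_pred):
    """Return (true_positive_rate, false_positive_rate) for binary labels.

    Labels are 1 for a positive image (contains a face) and 0 for a negative one.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    # Guard against empty classes so the sketch never divides by zero.
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr, fpr


# Hypothetical, imbalanced toy data: 2 positive images and 8 negative images.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]

tpr, fpr = detection_rates(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(tpr, fpr, accuracy)
```

On this toy data the overall accuracy is 0.8 even though only half of the faces are found, which is one reason the two rates are usually reported separately when classes are imbalanced.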
Expected Similarity Estimation for Large-Scale Batch and Streaming Anomaly Detection
Schneider, Markus, Ertel, Wolfgang, Ramos, Fabio
We present a novel algorithm for anomaly detection on very large datasets and data streams. The method, named EXPected Similarity Estimation (EXPoSE), is kernel-based and able to efficiently compute the similarity between new data points and the distribution of regular data. The estimator is formulated as an inner product with a reproducing kernel Hilbert space embedding and makes no assumption about the type or shape of the underlying data distribution. We show that offline (batch) learning with EXPoSE can be done in linear time and online (incremental) learning takes constant time per instance and model update. Furthermore, EXPoSE can make predictions in constant time, while it requires only constant memory. In addition, we propose different methodologies for concept drift adaptation on evolving data streams. On several real datasets we demonstrate that our approach can compete with state-of-the-art algorithms for anomaly detection while being an order of magnitude faster than most other approaches.
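The idea of scoring a point by an inner product with a mean embedding can be sketched as follows. This is not the authors' implementation: it assumes a Gaussian kernel approximated with random Fourier features, and the class name `ExposeSketch`, its parameters, and the toy data are all hypothetical. Under that assumption the model is a single fixed-length vector, so each update and each prediction costs the same regardless of how many points have been seen.

```python
import math
import random


class ExposeSketch:
    """Similarity score as an inner product with a running mean of feature vectors."""

    def __init__(self, dim, n_features=500, bandwidth=1.0, seed=0):
        rng = random.Random(seed)
        # Random Fourier features for a Gaussian kernel (an assumed feature map).
        self.weights = [
            [rng.gauss(0.0, 1.0 / bandwidth) for _ in range(dim)]
            for _ in range(n_features)
        ]
        self.offsets = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n_features)]
        self.scale = math.sqrt(2.0 / n_features)
        self.mean = [0.0] * n_features  # running mean embedding
        self.count = 0

    def _features(self, x):
        return [
            self.scale * math.cos(sum(w_i * x_i for w_i, x_i in zip(w, x)) + b)
            for w, b in zip(self.weights, self.offsets)
        ]

    def update(self, x):
        """Incremental mean update; the cost does not depend on self.count."""
        self.count += 1
        phi = self._features(x)
        self.mean = [m + (p - m) / self.count for m, p in zip(self.mean, phi)]

    def score(self, z):
        """Higher values mean z is more similar to the data seen so far."""
        return sum(p * m for p, m in zip(self._features(z), self.mean))


# Hypothetical stream: 300 two-dimensional points drawn from a standard normal.
data_rng = random.Random(1)
model = ExposeSketch(dim=2)
for _ in range(300):
    model.update([data_rng.gauss(0.0, 1.0), data_rng.gauss(0.0, 1.0)])

typical_score = model.score([0.0, 0.0])
outlier_score = model.score([8.0, 8.0])
print(typical_score, outlier_score)
```

A point near the bulk of the data should receive a clearly higher score than a point far away from it; where to place the anomaly threshold is left open here.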
Trend Filtering on Graphs
Wang, Yu-Xiang, Sharpnack, James, Smola, Alex, Tibshirani, Ryan J.
We introduce a family of adaptive estimators on graphs, based on penalizing the $\ell_1$ norm of discrete graph differences. This generalizes the idea of trend filtering [Kim et al. (2009), Tibshirani (2014)], used for univariate nonparametric regression, to graphs. Analogous to the univariate case, graph trend filtering exhibits a level of local adaptivity unmatched by the usual $\ell_2$-based graph smoothers. It is also defined by a convex minimization problem that is readily solved (e.g., by fast ADMM or Newton algorithms). We demonstrate the merits of graph trend filtering through examples and theory.
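The penalty in this abstract can be made concrete for the simplest case. The sketch below assumes first-order graph differences (one difference per edge) and the usual squared-error loss; the function names, the chain graph, and the value of `lam` are hypothetical choices for illustration, and no solver such as ADMM is implemented.

```python
def graph_difference_penalty(edges, beta):
    """l1 norm of first-order graph differences: sum of |beta_i - beta_j| over edges."""
    return sum(abs(beta[i] - beta[j]) for i, j in edges)


def objective(y, beta, edges, lam):
    """Squared-error fit plus lam times the l1 graph-difference penalty."""
    fit = 0.5 * sum((y_i - b_i) ** 2 for y_i, b_i in zip(y, beta))
    return fit + lam * graph_difference_penalty(edges, beta)


# Hypothetical chain graph 0-1-2-3-4-5 with a jump between nodes 2 and 3.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
y = [0.1, -0.1, 0.0, 1.1, 0.9, 1.0]
piecewise_constant = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]

penalty_noisy = graph_difference_penalty(edges, y)
penalty_flat = graph_difference_penalty(edges, piecewise_constant)
objective_noisy = objective(y, y, edges, lam=1.0)
objective_flat = objective(y, piecewise_constant, edges, lam=1.0)
print(penalty_noisy, penalty_flat, objective_noisy, objective_flat)
```

With `lam=1.0` the piecewise-constant candidate has the lower objective: it pays a small fitting cost but is charged only once, for the single jump, which is the kind of locally adaptive behavior the abstract attributes to the $\ell_1$ penalty.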
A Sharp Bound on the Computation-Accuracy Tradeoff for Majority Voting Ensembles
When random forests are used for binary classification, an ensemble of $t=1,2,\dots$ randomized classifiers is generated, and the predictions of the classifiers are aggregated by majority vote. Due to the randomness in the algorithm, there is a natural tradeoff between statistical performance and computational cost. On one hand, as $t$ increases, the (random) prediction error of the ensemble tends to decrease and stabilize. On the other hand, larger ensembles require greater computational cost for training and making new predictions. The present work offers a new approach for quantifying this tradeoff: Given a fixed training set $\mathcal{D}$, let the random variables $\text{Err}_{t,0}$ and $\text{Err}_{t,1}$ denote the class-wise prediction error rates of a randomly generated ensemble of size $t$. As $t\to\infty$, we provide a general bound on the "algorithmic variance", $\text{var}(\text{Err}_{t,l}|\mathcal{D})\leq \frac{f_l(1/2)^2}{4t}+o(\frac{1}{t})$, where $l\in\{0,1\}$, and $f_l$ is a density function that arises from the ensemble method. Conceptually, this result is somewhat surprising, because $\text{var}(\text{Err}_{t,l}|\mathcal{D})$ describes how $\text{Err}_{t,l}$ varies over repeated runs of the algorithm, and yet, the formula leads to a method for bounding $\text{var}(\text{Err}_{t,l}|\mathcal{D})$ with a single ensemble. The bound is also sharp in the sense that it is attained by an explicit family of randomized classifiers. With regard to the task of estimating $f_l(1/2)$, the presence of the ensemble leads to a unique twist on the classical setup of non-parametric density estimation --- wherein the effects of sample size and computational cost are intertwined. In particular, we propose an estimator for $f_l(1/2)$, and derive an upper bound on its MSE that matches "standard optimal non-parametric rates" when $t$ is sufficiently large.
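The leading term of the bound is simple enough to evaluate directly. The sketch below ignores the $o(1/t)$ remainder, and the value used for $f_l(1/2)$ is a made-up number for illustration; in practice it would have to be estimated, which is the task the abstract goes on to discuss.

```python
def leading_variance_bound(f_half, t):
    """Leading term f_l(1/2)^2 / (4 t) of the bound; the o(1/t) remainder is ignored."""
    return f_half ** 2 / (4.0 * t)


# Hypothetical density value f_l(1/2) = 2.0, for illustration only.
f_half = 2.0
for t in (50, 100, 200, 400):
    bound = leading_variance_bound(f_half, t)
    # The square root bounds the standard deviation of the class-wise error rate.
    print(t, bound, bound ** 0.5)
```

Because the leading term scales as $1/t$, doubling the ensemble size halves the bound on the variance while roughly doubling the computational cost, which is the tradeoff the abstract quantifies.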
Metrics To Evaluate Machine Learning Algorithms in Python - Machine Learning Mastery
The metrics that you choose to evaluate your machine learning algorithms are very important. Choice of metrics influences how the performance of machine learning algorithms is measured and compared. They influence how you weight the importance of different characteristics in the results and your ultimate choice of algorithm. In this post you will discover how to select and use different machine learning performance metrics in Python with scikit-learn.
Machine Learning Has Transformed Many Aspects Of Everyday Life
For example, it is important to understand how the business will use the model's results. Typically, scores are combined with a single threshold to convert them into a decision procedure (e.g., fast-tracking applications with scores lower than a certain level, which are assumed to be low risk). To do this, a balance between the true positives (applications the model correctly classifies as high risk), the false positives (applications the model scores as high risk that are not) and the false negatives (applications the model scores as low risk that were in fact high risk) is essential. I suggest using ROC curves, including the AUC (area under the curve), as a proxy measure for tuning scoring procedures until a good trade-off is found.
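The threshold sweep behind an ROC curve can be sketched in a few lines. The helper names and the six toy applications below are hypothetical; the code assumes that a higher score means higher risk, that label 1 marks a truly high-risk application, and that both classes are present.

```python
def roc_points(y_true, scores):
    """Return (false_positive_rate, true_positive_rate) pairs, one per threshold.

    Assumes y_true contains at least one 1 and at least one 0.
    """
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]
    # Lowering the threshold flags more applications as high risk.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical scores for six applications (1 = truly high risk).
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2]

curve = roc_points(y_true, scores)
area = auc(curve)
print(curve)
print(area)
```

Each point on the curve corresponds to one candidate threshold, so reading along the curve shows how many false positives must be accepted to catch more of the truly high-risk applications.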
Singular ridge regression with homoscedastic residuals: generalization error with estimated parameters
Grigoryeva, Lyudmila, Ortega, Juan-Pablo
This paper characterizes the conditional distribution properties of the finite sample ridge regression estimator and uses that result to evaluate total regression and generalization errors that incorporate the inaccuracies committed at the time of parameter estimation. The paper provides explicit formulas for those errors. Unlike other classical references in this setup, our results take place in a fully singular setup that does not assume the existence of a solution for the non-regularized regression problem. In exchange, we invoke a conditional homoscedasticity hypothesis on the regularized regression residuals that is crucial in our developments.
A New Approach to Building the Interindustry Input--Output Table
We present a new approach to estimating the interdependence of industries in an economy by applying data science solutions. By exploiting interfirm buyer--seller network data, we show that the problem of estimating the interdependence of industries is similar to the problem of uncovering the latent block structure in network science literature. To estimate the underlying structure with greater accuracy, we propose an extension of the sparse block model that incorporates node textual information and an unbounded number of industries and interactions among them. The latter task is accomplished by extending the well-known Chinese restaurant process to two dimensions. Inference is based on collapsed Gibbs sampling, and the model is evaluated on both synthetic and real-world datasets. We show that the proposed model improves predictive accuracy and provides a satisfactory solution to the motivating problem. We also discuss issues that affect the future performance of this approach.
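For readers unfamiliar with the Chinese restaurant process mentioned here, the following is a minimal sketch of the standard one-dimensional version only. It does not implement the two-dimensional extension, the sparse block model, or the Gibbs sampler from the paper; the function name and the parameter values are hypothetical.

```python
import random


def chinese_restaurant_process(n_customers, alpha, seed=0):
    """Sample a partition from the standard (one-dimensional) CRP.

    A new customer joins existing table k with probability size_k / (n + alpha)
    and opens a new table with probability alpha / (n + alpha), where n is the
    number of customers already seated.
    """
    rng = random.Random(seed)
    table_sizes = []
    assignments = []
    for _ in range(n_customers):
        # Unnormalized weights: one per existing table, plus alpha for a new table.
        weights = table_sizes + [alpha]
        table = rng.choices(range(len(weights)), weights=weights)[0]
        if table == len(table_sizes):
            table_sizes.append(1)
        else:
            table_sizes[table] += 1
        assignments.append(table)
    return assignments, table_sizes


assignments, table_sizes = chinese_restaurant_process(n_customers=50, alpha=1.0)
print(len(table_sizes), table_sizes)
```

The number of tables is not fixed in advance and grows with the data, which is what allows a model built on this process to accommodate an unbounded number of groups.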
District Data Labs - Visual Diagnostics for More Informed Machine Learning: Part 3
Note: Before starting Part 3, be sure to read Part 1 and Part 2! In this final installment of Visual Diagnostics for More Informed Machine Learning, we'll close the loop on visualization tools for navigating the different phases of the machine learning workflow. Recall that we are framing the workflow in terms of the 'model selection triple' -- this includes analyzing and selecting features, experimenting with different model forms, and evaluating and tuning fitted models. So far, we've covered methods for visual feature analysis in Part 1 and methods for model family and form exploration in Part 2. This post will cover evaluation and tuning, so we'll begin with two questions: You've probably heard other machine learning practitioners talking about their F1 scores or their R-Squared value. Generally speaking, we do tend to rely on numeric scores to tell us when our models are performing well or poorly. There are a number of measures we can use to evaluate our fitted models.
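The two scores named in this excerpt can be computed by hand, as in the sketch below. The function names and the toy data are hypothetical and are not taken from the post; the code follows the standard definitions of F1 (harmonic mean of precision and recall) and R-squared (one minus the ratio of residual to total sum of squares).

```python
def f1_score(y_true, y_pred):
    """F1 for binary labels, where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def r_squared(y_true, y_pred):
    """Coefficient of determination; assumes y_true is not constant."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical classifier output and regression output.
f1 = f1_score([1, 1, 1, 0, 0], [1, 1, 0, 1, 0])
r2 = r_squared([1.0, 2.0, 3.0, 4.0], [1.5, 2.0, 3.0, 3.5])
print(f1, r2)
```

F1 applies to classifiers and R-squared to regressors, so the two numbers are not comparable with each other; each is only meaningful against other models on the same task.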