Accuracy
An Empirical Study into Annotator Agreement, Ground Truth Estimation, and Algorithm Evaluation
Lampert, Thomas A., Stumpf, André, Gançarski, Pierre
Although agreement between annotators has been studied in the past from a statistical viewpoint, little work has attempted to quantify the extent to which this phenomenon affects the evaluation of computer vision (CV) object detection algorithms. Many researchers utilise ground truth (GT) in experiments and more often than not this GT is derived from one annotator's opinion. How does the difference in opinion affect an algorithm's evaluation? Four examples of typical CV problems are chosen, and a methodology is applied to each to quantify the inter-annotator variance and to offer insight into the mechanisms behind agreement and the use of GT. It is found that when detecting linear objects annotator agreement is very low. The agreement in object position, linear or otherwise, can be partially explained through basic image properties. Automatic object detectors are compared to annotator agreement and it is found that a clear relationship exists. Several methods for calculating GTs from a number of annotations are applied and the resulting differences in the performance of the object detectors are quantified. It is found that the rank of a detector is highly dependent upon the method used to form the GT. It is also found that although the STAPLE and LSML GT estimation methods appear to represent the mean of the performance measured using the individual annotations, when there are few annotations, or there is a large variance in them, these estimates tend to degrade. Furthermore, one of the most commonly adopted annotation combination methods--consensus voting--accentuates more obvious features, which results in an overestimation of the algorithm's performance. Finally, it is concluded that in some datasets it may not be possible to state with any confidence that one algorithm outperforms another when evaluating upon one GT and a method for calculating confidence bounds is discussed.
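The effect of consensus voting on a combined ground truth can be illustrated with a small sketch. It assumes binary per-pixel annotations stored as flat lists and a simple vote-count threshold; the function name `consensus_vote` and the toy data are hypothetical, and the paper's exact voting formulation is not given in this abstract.

```python
from typing import List


def consensus_vote(annotations: List[List[int]], min_votes: int) -> List[int]:
    """Mark a pixel as foreground when at least `min_votes` annotators marked it.

    Assumption: every annotation is a flat list of 0/1 labels of equal length.
    """
    if not annotations:
        raise ValueError("at least one annotation is required")
    length = len(annotations[0])
    if any(len(a) != length for a in annotations):
        raise ValueError("all annotations must have the same length")
    # zip(*annotations) walks the annotations pixel by pixel.
    return [1 if sum(votes) >= min_votes else 0 for votes in zip(*annotations)]


# Toy data: three annotators labelling the same six pixels.
annotators = [
    [1, 1, 1, 0, 0, 0],
    [1, 1, 0, 0, 1, 0],
    [1, 0, 0, 0, 0, 0],
]

any_vote = consensus_vote(annotators, min_votes=1)
majority = consensus_vote(annotators, min_votes=2)
unanimous = consensus_vote(annotators, min_votes=3)
print(any_vote, majority, unanimous)
```

Raising `min_votes` keeps only the pixels most annotators agree on, which is consistent with the abstract's remark that consensus voting accentuates the more obvious features.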
Algorithm learns to identify anomalous activity online with high degree of accuracy - The Tartan
At the IEEE International Conference on Big Data Security in New York City this month, researchers from MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL) and the machine learning start-up PatternEx presented a paper about their new security system, which combines machine learning approaches with input from human security experts. This system, called AI2 (named by merging "artificial intelligence" and "analyst intuition"), has an 85 percent success rate in identifying threats and a false positive rate of 4.4 percent over a raw data set of 3.6 billion log lines. According to the paper, the three major challenges faced by the security industry are a lack of labelled examples on which to train learning models, the constant evolution of attackers' methods, and limited reliance on security analysts to determine each threat's risk factor. In fact, stand-alone analyst-driven approaches are limited in their effectiveness because attackers learn the behavior used by such systems to predict possible threats, and then work around that learned behavior in order to bypass security systems. Furthermore, approaches based only on machine learning can be inefficient because they require human investigation every time they come across an anomaly.
Learning Concept Graphs from Online Educational Data
Liu, Hanxiao, Ma, Wanli, Yang, Yiming, Carbonell, Jaime
This paper addresses an open challenge in educational data mining, i.e., the problem of automatically mapping online courses from different providers (universities, MOOCs, etc.) onto a universal space of concepts, and predicting latent prerequisite dependencies (directed links) among both concepts and courses. We propose a novel approach for inference within and across course-level and concept-level directed graphs. In the training phase, our system projects partially observed course-level prerequisite links onto directed concept-level links; in the testing phase, the induced concept-level links are used to infer the unknown course-level prerequisite links. Whereas courses may be specific to one institution, concepts are shared across different providers. The bi-directional mappings enable our system to perform interlingua-style transfer learning, e.g. treating the concept graph as the interlingua and transferring the prerequisite relations across universities via the interlingua. Experiments on our newly collected datasets of courses from MIT, Caltech, Princeton and CMU show promising results.
Efficient AUC Optimization for Information Ranking Applications
Adequate evaluation of an information retrieval system to estimate future performance is a crucial task. Area under the ROC curve (AUC) is widely used to evaluate the generalization of a retrieval system. However, the objective function optimized in many retrieval systems is the error rate and not the AUC value. This paper provides an efficient and effective non-linear approach to optimize AUC using additive regression trees, with a special emphasis on the use of multi-class AUC (MAUC) because multiple relevance levels are widely used in many ranking applications. Compared to a conventional linear approach, the performance of the non-linear approach is comparable on binary-relevance benchmark datasets and is better on multi-relevance benchmark datasets.
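For readers unfamiliar with the metric, binary AUC equals the fraction of (positive, negative) pairs in which the positive item receives the higher score, with ties counted as one half. The sketch below computes it directly from that definition. It illustrates the metric only, not the paper's tree-based optimization method, and the function name `pairwise_auc` and the toy scores are hypothetical.

```python
from typing import Sequence


def pairwise_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Binary AUC as the share of positive/negative pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy ranking: five documents with binary relevance labels.
scores = [0.9, 0.8, 0.7, 0.6, 0.4]
labels = [1, 0, 1, 1, 0]
auc_value = pairwise_auc(scores, labels)
print(round(auc_value, 4))
```

In this toy case four of the six positive/negative pairs are ordered correctly. Because the value depends only on the ordering of scores, a model trained to minimise error rate at a fixed threshold is not necessarily the one with the best AUC.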
My thoughts on big data and data science: no, it's not hype
Each time a credit card is swiped or processed online, an analytic algorithm is used to detect if it's fraudulent or not (and the answer must come in less than 3 seconds most of the time, with a low false negative rate). Each time you do a Google search, an analytic engine determines which search results to show you, and which ads to display. Each time someone posts something on Facebook, an analytic algorithm is run to determine if it must be rejected (promotion, spam, porn etc) or not. Each Tweet posted is analyzed by analytic algorithms (designed by various companies) to detect new viral trends (for journalists), or disease spread, intelligence leaks or many other things. Each time you browse Amazon, the customized content delivered to you is analytically "calculated" to optimize Amazon's revenue.
Machine learning and social engineering attacks
In my last post I promised to use some real-world use cases from the recent Verizon Data Breach Digest report to illustrate potential ways that machine learning can be used to detect or prevent similar incidents. For my first example, I've chosen the case of a manufacturer whose designs for an innovative new model of heavy construction equipment were stolen following a social engineering attack. They were tipped off when a primary competitor, located on another continent, introduced a new piece of equipment that looked like an exact copy of a model recently developed by the victim company. To paraphrase the Verizon report, it went like this. The threat actors identified an employee who they suspected would have access to the new product design they were after -- the chief design engineer.
Developing an ICU scoring system with interaction terms using a genetic algorithm
Gan, Chee Chun, Learmonth, Gerard
ICU mortality scoring systems attempt to predict patient mortality using predictive models with various clinical predictors. Examples of such systems are APACHE, SAPS and MPM. However, most such scoring systems do not actively look for and include interaction terms, despite physicians intuitively taking such interactions into account when making a diagnosis. One barrier to including such terms in predictive models is the difficulty of using most variable selection methods in high-dimensional datasets. A genetic algorithm framework for variable selection with logistic regression models is used to search for two-way interaction terms in a clinical dataset of adult ICU patients, with separate models being built for each category of diagnosis upon admittance to the ICU. The models had good discrimination across all categories, with a weighted average AUC of 0.84 (>0.90 for several categories), and the genetic algorithm was able to find several significant interaction terms, which may be able to provide greater insight into mortality prediction for health practitioners. The GA-selected models performed better than stepwise selection and random forest models, and the framework provides greater flexibility in variable selection because it can optimize over any modeler-defined model performance metric instead of a specific variable importance metric.
Datacratic MLDB
By using machine learning algorithms, we are increasingly able to use computers to perform intellectual tasks at a level approaching that of humans. Given that computers cost less than employees, many people are afraid that humans will therefore necessarily lose their jobs to computers. Contrary to this belief, in this article I show that even when a computer can perform a task more economically than a human, careful analysis suggests that humans and computers working together can sometimes yield even better business outcomes than simply replacing one with the other. Specifically, I show how a classifier with a reject option can increase worker productivity for certain types of tasks, and I show how to construct and tune such a classifier from a simple scoring function by using two thresholds. I begin with a parable featuring the same characters as the one from Part 1 of this Machine Learning Meets Economics series.
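A classifier with a reject option built from a scoring function and two thresholds can be sketched as follows. This is a minimal illustration, assuming scores in [0, 1] where higher means more likely positive; the function name `classify_with_reject`, the label strings, and the threshold values are hypothetical and are not taken from the article.

```python
from typing import List


def classify_with_reject(score: float, low: float, high: float) -> str:
    """Decide automatically when the score is confident, otherwise defer.

    Scores at or above `high` are labelled positive, scores at or below
    `low` are labelled negative, and everything in between is deferred
    to a human worker (the reject option).
    """
    if low > high:
        raise ValueError("low threshold must not exceed high threshold")
    if score >= high:
        return "positive"
    if score <= low:
        return "negative"
    return "defer"


def route(scores: List[float], low: float, high: float) -> List[str]:
    """Apply the two-threshold rule to a batch of scores."""
    return [classify_with_reject(s, low, high) for s in scores]


# Hypothetical thresholds; in practice they would be tuned against the
# cost of a machine error versus the cost of a human review.
decisions = route([0.05, 0.40, 0.95], low=0.2, high=0.8)
print(decisions)
```

Widening the gap between `low` and `high` sends more items to human review and lowers the automated error rate; narrowing it does the reverse.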
How this AI-human partnership takes cybersecurity to a new level
In the ongoing battle against cyber attacks, a man-machine collaboration could offer a new path to security. To keep up with cyber threats, the cybersecurity industry has turned to assistance from unsupervised artificial intelligence systems that operate independently from human analysts. But the Computer Science and Artificial Intelligence Laboratory (CSAIL) at the Massachusetts Institute of Technology in Cambridge, Mass., in partnership with the machine-learning startup PatternEx, is offering a fresh approach. Their new program, AI2, draws on what humans and machines each do best: It allows human analysts to build upon the large scale pattern recognition and learning capabilities of artificial intelligence. The industry standard right now is unsupervised machine learning, CSAIL research scientist Kalyan Veeramachaneni, who helped develop the program, says in a phone interview with The Christian Science Monitor.
Variational inference for rare variant detection in deep, heterogeneous next-generation sequencing data
The detection of rare variants is important for understanding the genetic heterogeneity in mixed samples. Recently, next-generation sequencing (NGS) technologies have enabled the identification of single nucleotide variants (SNVs) in mixed samples with high resolution. Yet, the noise inherent in the biological processes involved in next-generation sequencing necessitates the use of statistical methods to identify true rare variants. We propose a novel Bayesian statistical model and a variational expectation-maximization (EM) algorithm to estimate non-reference allele frequency (NRAF) and identify SNVs in heterogeneous cell populations. We demonstrate that our variational EM algorithm has sensitivity and specificity comparable to those of a Markov Chain Monte Carlo (MCMC) sampling inference algorithm, and is more computationally efficient on tests of low-coverage ($27\times$ and $298\times$) data. Furthermore, we show that our model with a variational EM inference algorithm has higher specificity than many state-of-the-art algorithms. In an analysis of a directed evolution longitudinal yeast data set, we are able to identify a time-series trend in non-reference allele frequency and detect novel variants that have not yet been reported. Our model also detects the emergence of a beneficial variant earlier than was previously shown, and a pair of concomitant variants.