Performance Analysis
Hidden Stratification Causes Clinically Meaningful Failures in Machine Learning for Medical Imaging
Oakden-Rayner, Luke, Dunnmon, Jared, Carneiro, Gustavo, Ré, Christopher
Machine learning models for medical image analysis often suffer from poor performance on important subsets of a population that are not identified during training or testing. For example, overall performance of a cancer detection model may be high, but the model still consistently misses a rare but aggressive cancer subtype. We refer to this problem as hidden stratification, and observe that it results from incompletely describing the meaningful variation in a dataset. While hidden stratification can substantially reduce the clinical efficacy of machine learning models, its effects remain difficult to measure. In this work, we assess the utility of several possible techniques for measuring and describing hidden stratification effects, and characterize these effects both on multiple medical imaging datasets and via synthetic experiments on the well-characterised CIFAR-100 benchmark dataset. We find evidence that hidden stratification can occur in unidentified imaging subsets with low prevalence, low label quality, subtle distinguishing features, or spurious correlates, and that it can result in relative performance differences of over 20% on clinically important subsets. Finally, we explore the clinical implications of our findings, and suggest that evaluation of hidden stratification should be a critical component of any machine learning deployment in medical imaging.
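As a rough illustration of why aggregate metrics can mask subset failures, the sketch below compares overall accuracy with accuracy on each labelled subset. The function name `accuracy_by_subset` and the toy data are hypothetical and are not taken from the paper; the sketch also assumes subset labels are already available, which is exactly what hidden stratification lacks in practice.

```python
from collections import defaultdict


def accuracy_by_subset(y_true, y_pred, subset):
    """Return overall accuracy and a dict of accuracy per subset label."""
    hits = defaultdict(int)
    counts = defaultdict(int)
    for t, p, s in zip(y_true, y_pred, subset):
        counts[s] += 1
        hits[s] += int(t == p)
    overall = sum(hits.values()) / sum(counts.values())
    per_subset = {s: hits[s] / counts[s] for s in counts}
    return overall, per_subset


# Hypothetical toy data: 18 cases of a common subtype, 2 of a rare subtype,
# all truly positive. The model misses every rare case.
y_true = [1] * 20
y_pred = [1] * 18 + [0] * 2
subset = ["common"] * 18 + ["rare"] * 2

overall, per_subset = accuracy_by_subset(y_true, y_pred, subset)
print(overall, per_subset)
```

In this toy case the overall accuracy looks high while every rare-subtype case is missed, which is the pattern the abstract describes.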
Crowdsourcing via Pairwise Co-occurrences: Identifiability and Algorithms
Ibrahim, Shahana, Fu, Xiao, Kargas, Nikos, Huang, Kejun
The data deluge comes with high demands for data labeling. Crowdsourcing (or, more generally, ensemble learning) techniques aim to produce accurate labels by integrating noisy, non-expert labeling from annotators. The classic Dawid-Skene estimator and its accompanying expectation maximization (EM) algorithm have been widely used, but their theoretical properties are not fully understood. Tensor methods were proposed to guarantee identification of the Dawid-Skene model, but the sample complexity is a hurdle for applying such approaches, since the tensor methods hinge on the availability of third-order statistics that are hard to reliably estimate given limited data. In this paper, we propose a framework using pairwise co-occurrences of the annotator responses, which naturally admits lower sample complexity. We show that the approach can identify the Dawid-Skene model under realistic conditions. We propose an algebraic algorithm reminiscent of convex geometry-based structured matrix factorization to solve the model identification problem efficiently, and an identifiability-enhanced algorithm for handling more challenging and critical scenarios. Experiments show that the proposed algorithms outperform the state-of-the-art algorithms under a variety of scenarios.
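The pairwise co-occurrence statistic mentioned above can be estimated directly from two annotators' responses. The sketch below only estimates that second-order statistic; it does not implement the paper's identification algorithms. The function name, the use of `None` for items an annotator skipped, and the toy labels are all assumptions made for illustration.

```python
def pairwise_cooccurrence(labels_a, labels_b, num_classes):
    """Estimate the joint frequency of (label from A, label from B).

    Entry [i][j] is the fraction of co-labelled items for which
    annotator A answered class i and annotator B answered class j.
    Items that either annotator skipped (None) are ignored.
    """
    counts = [[0] * num_classes for _ in range(num_classes)]
    n = 0
    for a, b in zip(labels_a, labels_b):
        if a is None or b is None:
            continue
        counts[a][b] += 1
        n += 1
    if n == 0:
        raise ValueError("the two annotators share no labelled items")
    return [[c / n for c in row] for row in counts]


# Hypothetical responses from two annotators on five items, two classes.
labels_a = [0, 1, 1, None, 0]
labels_b = [0, 1, 0, 1, None]
R = pairwise_cooccurrence(labels_a, labels_b, num_classes=2)
print(R)
```

Only three items are labelled by both annotators here, so each observed pair contributes one third of the mass. The abstract contrasts such pairwise statistics with the third-order statistics that tensor methods rely on, which are hard to estimate reliably from limited data.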
Annotated Guidelines and Building Reference Corpus for Myanmar-English Word Alignment
A reference corpus for word alignment is an important resource for developing and evaluating word alignment methods. For the Myanmar-English language pair, there is no reference corpus with which to evaluate word alignment tasks. Therefore, we created guidelines for Myanmar-English word alignment annotation between the two languages over contrastive learning and built a Myanmar-English reference corpus consisting of verified alignments from Myanmar ALT of the Asian Language Treebank (ALT). This reference corpus contains the confidence labels sure (S) and possible (P) for word alignments, which are used to evaluate word alignment tasks. We discuss the most common linking ambiguities in order to define consistent and systematic instructions for aligning words manually. We evaluated annotator agreement using our reference corpus in terms of alignment error rate (AER) in word alignment tasks, and we discuss word relationships in terms of BLEU scores. A bilingual corpus aligned at the level of sentences or words is a precious resource for developing machine translation systems. Word alignment is a fundamental step in extracting translation information from a bilingual corpus; it determines which words and phrases are translations of each other in the original and translated sentences. In most translation systems, translational correspondences are rather complex, particularly for a language pair such as Myanmar and English, whose word orders differ.
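For readers unfamiliar with the metric, the sketch below computes AER from sure (S) and possible (P) reference links using the commonly cited definition, AER = 1 - (|A ∩ S| + |A ∩ P|) / (|A| + |S|), where A is the set of predicted links and S is treated as a subset of P. That this is the exact variant used by the authors is an assumption, and the link sets below are made up for illustration.

```python
def alignment_error_rate(sure, possible, predicted):
    """Compute AER for one set of predicted word-alignment links.

    Each link is a (source_index, target_index) pair. Sure links are
    folded into the possible set, following the usual convention.
    """
    sure = set(sure)
    predicted = set(predicted)
    possible = set(possible) | sure
    denominator = len(predicted) + len(sure)
    if denominator == 0:
        raise ValueError("AER is undefined when both A and S are empty")
    matched = len(predicted & sure) + len(predicted & possible)
    return 1.0 - matched / denominator


# Hypothetical links for a single sentence pair.
sure_links = {(0, 0), (1, 1)}
possible_links = {(2, 1)}
predicted_links = {(0, 0), (2, 1), (3, 3)}

aer = alignment_error_rate(sure_links, possible_links, predicted_links)
print(round(aer, 3))
```

Lower values are better: a prediction that recovers every sure link and proposes nothing outside the possible set scores an AER of zero.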
Offline identification of surgical deviations in laparoscopic rectopexy
Huaulmé, Arnaud, Voros, Sandrine, Reche, Fabian, Faucheron, Jean-Luc, Moreau-Gaudry, Alexandre, Jannin, Pierre
Objective: A median of 14.4% of patients undergo at least one adverse event during surgery, and a third of these events are preventable. The occurrence of adverse events forces surgeons to implement corrective strategies and, thus, deviate from the standard surgical process. Therefore, it is clear that the automatic identification of adverse events is a major challenge for patient safety. In this paper, we have proposed a method enabling us to identify such deviations. We have focused on identifying surgeons' deviations from standard surgical processes due to surgical events rather than anatomic specificities. This is particularly challenging, given the high variability in typical surgical procedure workflows. Methods: We have introduced a new approach designed to automatically detect and distinguish surgical process deviations based on multi-dimensional non-linear temporal scaling with a hidden semi-Markov model using manual annotation of surgical processes. The approach was then evaluated using cross-validation. Results: The best results have over 90% accuracy. Recall and precision were above 70%. We have provided a detailed analysis of the incorrectly detected observations. Conclusion: Multi-dimensional non-linear temporal scaling with a hidden semi-Markov model provides promising results for detecting deviations. Our error analysis of the incorrectly detected observations offers different leads for further improving our method. Significance: Our method demonstrated the feasibility of automatically detecting surgical deviations, which could be implemented for both skill analysis and developing situation awareness-based computer-assisted surgical systems.
Online Semi-Supervised Concept Drift Detection with Density Estimation
Tan, Chang How, Lee, Vincent CS, Salehi, Mahsa
Concept drift is formally defined as the change in the joint distribution of a set of input variables X and a target variable y. The two types of drift that are extensively studied are real drift and virtual drift, where the former is a change in the posterior probabilities p(y|X) while the latter is a change in the distribution of X that does not affect the posterior probabilities. Many approaches to concept drift detection either assume full availability of the data labels y or handle only virtual drift. In a streaming environment, the assumption of full availability of the data labels y is questionable. On the other hand, approaches that deal with virtual drift fail to address real drift. Rather than improving the state-of-the-art methods, this paper presents a semi-supervised framework to deal with the challenges above. The objective of the proposed framework is to learn from a streaming environment with limited data labels y and detect real drift concurrently. This paper proposes a novel concept drift detection method utilizing the densities of posterior probabilities in partially labeled streaming environments. Experimental results on both synthetic and real-world datasets show that our proposed semi-supervised framework enables the detection of concept drift in such environments while achieving prediction performance comparable to the state-of-the-art methods.
Model-Agnostic Linear Competitors -- When Interpretable Models Compete and Collaborate with Black-Box Models
Rafique, Hassan, Wang, Tong, Lin, Qihang
Driven by an increasing need for model interpretability, interpretable models have become strong competitors for black-box models in many real applications. In this paper, we propose a novel type of model where interpretable models compete and collaborate with black-box models. We present the Model-Agnostic Linear Competitors (MALC) for partially interpretable classification. MALC is a hybrid model that uses linear models to locally substitute any black-box model, capturing subspaces that are most likely to be in a class while leaving the rest of the data to the black-box. MALC brings together the interpretable power of linear models and good predictive performance of a black-box model. We formulate the training of a MALC model as a convex optimization. The predictive accuracy and transparency (defined as the percentage of data captured by the linear models) balance through a carefully designed objective function and the optimization problem is solved with the accelerated proximal gradient method. Experiments show that MALC can effectively trade prediction accuracy for transparency and provide an efficient frontier that spans the entire spectrum of transparency.
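The hybrid idea can be pictured as a routing rule: a linear model claims the points it is confident about, and everything else falls through to the black-box. The sketch below is a hypothetical illustration of that routing and of transparency as the fraction of points captured by the linear models. The decision rule (a positive score from either of two linear models), the parameter values, and the stand-in black-box are assumptions, not the paper's formulation or its trained models.

```python
def linear_score(model, x):
    """Score x with a linear model given as (weights, bias)."""
    weights, bias = model
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def hybrid_predict(x, pos_model, neg_model, black_box):
    """Return (label, captured_by_linear_model) for one point."""
    if linear_score(pos_model, x) > 0:
        return 1, True
    if linear_score(neg_model, x) > 0:
        return -1, True
    return black_box(x), False


def transparency(points, pos_model, neg_model, black_box):
    """Fraction of points whose label came from a linear model."""
    captured = [hybrid_predict(x, pos_model, neg_model, black_box)[1] for x in points]
    return sum(captured) / len(captured)


# Hypothetical models: the linear pair claims points with |x[0]| > 2,
# and a stand-in "black-box" decides the rest from the sign of x[1].
pos_model = ([1.0, 0.0], -2.0)
neg_model = ([-1.0, 0.0], -2.0)
black_box = lambda x: 1 if x[1] > 0 else -1

points = [[3.0, 0.0], [-3.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
labels = [hybrid_predict(x, pos_model, neg_model, black_box)[0] for x in points]
share = transparency(points, pos_model, neg_model, black_box)
print(labels, share)
```

Widening the regions claimed by the linear models raises transparency, possibly at some cost in accuracy, which is the trade-off the abstract says the objective function is designed to balance.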
No Free Lunch But A Cheaper Supper: A General Framework for Streaming Anomaly Detection
Calikus, Ece, Nowaczyk, Slawomir, Sant'Anna, Anita, Dikmen, Onur
In recent years, there has been increased research interest in detecting anomalies in temporal streaming data. A variety of algorithms have been developed in the data mining community, which can be divided into two categories (i.e., general and ad hoc). In most cases, general approaches assume the one-size-fits-all solution model where a single anomaly detector can detect all anomalies in any domain. To date, there exists no single general method that has been shown to outperform the others across different anomaly types, use cases and datasets. On the other hand, ad hoc approaches that are designed for a specific application lack flexibility. Adapting an existing algorithm is not straightforward if the specific constraints or requirements for the existing task change. In this paper, we propose SAFARI, a general framework formulated by abstracting and unifying the fundamental tasks in streaming anomaly detection, which provides a flexible and extensible anomaly detection procedure. SAFARI helps to facilitate more elaborate algorithm comparisons by allowing us to isolate the effects of shared and unique characteristics of different algorithms on detection performance. Using SAFARI, we have implemented various anomaly detectors and identified a research gap that motivates us to propose a novel learning strategy in this work. We conducted an extensive evaluation study of 20 detectors that are composed using SAFARI and compared their performances using real-world benchmark datasets with different properties. The results indicate that there is no single superior detector that works well for every case, proving our hypothesis that "there is no free lunch" in the streaming anomaly detection world. Finally, we discuss the benefits and drawbacks of each method in-depth and draw a set of conclusions to guide future users of SAFARI.
CyberSecurity: Machine Learning Artificial Intelligence Actionable Intelligence
The goal of artificial intelligence is to enable the development of computers to do things normally done by people -- in particular, things associated with people acting intelligently. In the case of cybersecurity, its most practical application has been automating human-intensive tasks to keep pace with attackers. Progressive organizations have begun using artificial intelligence in cybersecurity applications to defend against attackers. However, on its own, artificial intelligence is best designed to identify "what is wrong." In the face of a breach, today's enterprise needs to know not only "what is wrong" but also "why it's wrong" and "how to fix it."
Using theoretical ROC curves for analysing machine learning binary classifiers
Omar, Luma, Ivrissimtzis, Ioannis
Most binary classifiers work by processing the input to produce a scalar response and comparing it to a threshold value. The various measures of classifier performance assume, explicitly or implicitly, probability distributions $P_s$ and $P_n$ of the response belonging to either class, probability distributions for the cost of each type of misclassification, and compute a performance score from the expected cost. In machine learning, classifier responses are obtained experimentally and performance scores are computed directly from them, without any assumptions on $P_s$ and $P_n$. Here, we argue that the omitted step of estimating theoretical distributions for $P_s$ and $P_n$ can be useful. In a biometric security example, we fit beta distributions to the responses of two classifiers, one based on logistic regression and one on ANNs, and use them to establish a categorisation into a small number of classes with different extremal behaviours at the ends of the ROC curves.
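One simple way to estimate a theoretical distribution for classifier responses is a method-of-moments beta fit, sketched below with the standard library only. The abstract does not say which fitting procedure the authors used, so the choice of method of moments, the function name, and the sample responses are assumptions. Responses are assumed to lie strictly between 0 and 1.

```python
from statistics import fmean, pvariance


def fit_beta_moments(samples):
    """Fit Beta(alpha, beta) to samples in (0, 1) by matching mean and variance."""
    mean = fmean(samples)
    var = pvariance(samples, mean)
    if var <= 0 or var >= mean * (1 - mean):
        raise ValueError("sample moments are not compatible with a beta distribution")
    common = mean * (1 - mean) / var - 1
    return mean * common, (1 - mean) * common


# Hypothetical classifier responses for one class.
responses = [0.2, 0.4, 0.6, 0.8]
alpha, beta = fit_beta_moments(responses)
print(alpha, beta)
```

Fitting one beta distribution to each class's responses gives estimates of $P_s$ and $P_n$. A theoretical ROC curve could then be traced by evaluating both survival functions over a range of thresholds, for example with `scipy.stats.beta.sf`, which is not used here to keep the sketch dependency-free.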
Transfer Learning Robustness in Multi-Class Categorization by Fine-Tuning Pre-Trained Contextualized Language Models
Liu, Xinyi, Wangperawong, Artit
This study compares the effectiveness and robustness of multi-class categorization of Amazon product data using transfer learning on pre-trained contextualized language models. Specifically, we fine-tuned BERT and XLNet, two bidirectional models that have achieved state-of-the-art performance on many natural language tasks and benchmarks, including text classification. While existing classification studies and benchmarks focus on binary targets, with the exception of ordinal ranking tasks, here we examine the robustness of such models as the number of classes grows from 1 to 20. Our experiments demonstrate an approximately linear decrease in performance metrics (i.e., precision, recall, $F_1$ score, and accuracy) with the number of class labels. BERT consistently outperforms XLNet using identical hyperparameters on the entire range of class label quantities for categorizing products based on their textual descriptions. BERT is also more affordable than XLNet in terms of the computational cost (i.e., time and memory) required for training. In all cases studied, the performance degradation rates were estimated to be 1% per additional class label.