Collaborating Authors

Hidden Stratification Causes Clinically Meaningful Failures in Machine Learning for Medical Imaging Machine Learning

Machine learning models for medical image analysis often suffer from poor performance on important subsets of a population that are not identified during training or testing. For example, overall performance of a cancer detection model may be high, but the model still consistently misses a rare but aggressive cancer subtype. We refer to this problem as hidden stratification, and observe that it results from incompletely describing the meaningful variation in a dataset. While hidden stratification can substantially reduce the clinical efficacy of machine learning models, its effects remain difficult to measure. In this work, we assess the utility of several possible techniques for measuring and describing hidden stratification effects, and characterize these effects both on multiple medical imaging datasets and via synthetic experiments on the well-characterised CIFAR-100 benchmark dataset. We find evidence that hidden stratification can occur in unidentified imaging subsets with low prevalence, low label quality, subtle distinguishing features, or spurious correlates, and that it can result in relative performance differences of over 20% on clinically important subsets. Finally, we explore the clinical implications of our findings, and suggest that evaluation of hidden stratification should be a critical component of any machine learning deployment in medical imaging.

Quantifying Pulmonary Edema on Chest Radiographs


See article by Horng et al in this issue. William F. Auffermann, MD, PhD, is an associate professor of radiology and imaging sciences at the University of Utah School of Medicine. Dr Auffermann is a cardiothoracic radiologist and is ABPM board certified in clinical informatics. His research interests include imaging informatics, clinical informatics, applications of AI in radiology, medical image perception, and perceptual training. Recent research projects include image annotation for AI using eye tracking, human factors engineering, and developing simulation-based perceptual training methods to facilitate radiology education.

An Adversarial Approach for the Robust Classification of Pneumonia from Chest Radiographs Machine Learning

While deep learning has shown promise in the domain of disease classification from medical images, models based on state-of-the-art convolutional neural network architectures often exhibit performance loss due to dataset shift. Models trained using data from one hospital system achieve high predictive performance when tested on data from the same hospital, but perform significantly worse when they are tested in different hospital systems. Furthermore, even within a given hospital system, deep learning models have been shown to depend on hospital- and patient-level confounders rather than meaningful pathology to make classifications. In order for these models to be safely deployed, we would like to ensure that they do not use confounding variables to make their classification, and that they will work well even when tested on images from hospitals that were not included in the training data. We attempt to address this problem in the context of pneumonia classification from chest radiographs. We propose an approach based on adversarial optimization, which allows us to learn more robust models that do not depend on confounders. Specifically, we demonstrate improved out-of-hospital generalization performance of a pneumonia classifier by training a model that is invariant to the view position of chest radiographs (anterior-posterior vs. posterior-anterior). Our approach leads to better predictive performance on external hospital data than both a standard baseline and previously proposed methods to handle confounding, and also suggests a method for identifying models that may rely on confounders. Code available at

Deep metric learning for multi-labelled radiographs Machine Learning

Many radiological studies can reveal the presence of several co-existing abnormalities, each one represented by a distinct visual pattern. In this article we address the problem of learning a distance metric for plain radiographs that captures a notion of "radiological similarity": two chest radiographs are considered to be similar if they share similar abnormalities. Deep convolutional neural networks (DCNs) are used to learn a low-dimensional embedding for the radiographs that is equipped with the desired metric. Two loss functions are proposed to deal with multi-labelled images and potentially noisy labels. We report on a large-scale study involving over 745,000 chest radiographs whose labels were automatically extracted from free-text radiological reports through a natural language processing system. Using 4,500 validated exams, we demonstrate that the methodology performs satisfactorily on clustering and image retrieval tasks. Remarkably, the learned metric separates normal exams from those having radiological abnormalities.

Robust Classification from Noisy Labels: Integrating Additional Knowledge for Chest Radiography Abnormality Assessment Artificial Intelligence

Chest radiography is the most common radiographic examination performed in daily clinical practice for the detection of various heart and lung abnormalities. The large amount of data to be read and reported, with more than 100 studies per day for a single radiologist, poses a challenge in consistently maintaining high interpretation accuracy. The introduction of large-scale public datasets has led to a series of novel systems for automated abnormality classification. However, the labels of these datasets were obtained using natural language processed medical reports, yielding a large degree of label noise that can impact the performance. In this study, we propose novel training strategies that handle label noise from such suboptimal data. Prior label probabilities were measured on a subset of training data re-read by 4 board-certified radiologists and were used during training to increase the robustness of the training model to the label noise. Furthermore, we exploit the high comorbidity of abnormalities observed in chest radiography and incorporate this information to further reduce the impact of label noise. Additionally, anatomical knowledge is incorporated by training the system to predict lung and heart segmentation, as well as spatial knowledge labels. To deal with multiple datasets and images derived from various scanners that apply different post-processing techniques, we introduce a novel image normalization strategy. Experiments were performed on an extensive collection of 297,541 chest radiographs from 86,876 patients, leading to a state-of-the-art performance level for 17 abnormalities from 2 datasets. With an average AUC score of 0.880 across all abnormalities, our proposed training strategies can be used to significantly improve performance scores.