Goto

Collaborating Authors

 reference annotation


USE-Evaluator: Performance Metrics for Medical Image Segmentation Models with Uncertain, Small or Empty Reference Annotations

Ostmeier, Sophie, Axelrod, Brian, Bertels, Jeroen, Isensee, Fabian, Lansberg, Maarten G., Christensen, Soren, Albers, Gregory W., Li, Li-Jia, Heit, Jeremy J.

arXiv.org Artificial Intelligence

Performance metrics for medical image segmentation models are used to measure the agreement between the reference annotation and the predicted segmentation. Usually, overlap metrics, such as the Dice, are used as a metric to evaluate the performance of these models in order for results to be comparable. However, there is a mismatch between the distributions of cases and difficulty level of segmentation tasks in public data sets compared to clinical practice. Common metrics fail to measure the impact of this mismatch, especially for clinical data sets that include low signal pathologies, a difficult segmentation task, and uncertain, small, or empty reference annotations. This limitation may result in ineffective research of machine learning practitioners in designing and optimizing models. Dimensions of evaluating clinical value include consideration of the uncertainty of reference annotations, independence from reference annotation volume size, and evaluation of classification of empty reference annotations. We study how uncertain, small, and empty reference annotations influence the value of metrics for medical image segmentation on an in-house data set regardless of the model. We examine metrics behavior on the predictions of a standard deep learning framework in order to identify metrics with clinical value. We compare to a public benchmark data set (BraTS 2019) with a high-signal pathology and certain, larger, and no empty reference annotations. We may show machine learning practitioners, how uncertain, small, or empty reference annotations require a rethinking of the evaluation and optimizing procedures. The evaluation code was released to encourage further analysis of this topic. https://github.com/SophieOstmeier/UncertainSmallEmpty.git


Automatic Pulmonary Lobe Segmentation Using Deep Learning

Tang, Hao, Zhang, Chupeng, Xie, Xiaohui

arXiv.org Artificial Intelligence

Pulmonary lobe segmentation is an important task for pulmonary disease related Computer Aided Diagnosis systems (CADs). Classical methods for lobe segmentation rely on successful detection of fissures and other anatomical information such as the location of blood vessels and airways. With the success of deep learning in recent years, Deep Convolutional Neural Network (DCNN) has been widely applied to analyze medical images like Computed Tomography (CT) and Magnetic Resonance Imaging (MRI), which, however, requires a large number of ground truth annotations. In this work, we release our manually labeled 50 CT scans which are randomly chosen from the LUNA16 dataset and explore the use of deep learning on this task. We propose pre-processing CT image by cropping region that is covered by the convex hull of the lungs in order to mitigate the influence of noise from outside the lungs. Moreover, we design a hybrid loss function with dice loss to tackle extreme class imbalance issue and focal loss to force model to focus on voxels that are hard to be discriminated. To validate the robustness and performance of our proposed framework trained with a small number of training examples, we further tested our model on CT scans from an independent dataset. Experimental results show the robustness of the proposed approach, which consistently improves performance across different datasets by a maximum of $5.87\%$ as compared to a baseline model.


Objective evaluation metrics for automatic classification of EEG events

Ziyabari, Saeedeh, Shah, Vinit, Golmohammadi, Meysam, Obeid, Iyad, Picone, Joseph

arXiv.org Machine Learning

The evaluation of machine learning algorithms in biomedical fields for applications involving sequential data lacks standardization. Common quantitative scalar evaluation metrics such as sensitivity and specificity can often be misleading depending on the requirements of the application. Evaluation metrics must ultimately reflect the needs of users yet be sufficiently sensitive to guide algorithm development. Feedback from critical care clinicians who use automated event detection software in clinical applications has been overwhelmingly emphatic that a low false alarm rate, typically measured in units of the number of errors per 24 hours, is the single most important criterion for user acceptance. Though using a single metric is not often as insightful as examining performance over a range of operating conditions, there is a need for a single scalar figure of merit. In this paper, we discuss the deficiencies of existing metrics for a seizure detection task and propose several new metrics that offer a more balanced view of performance. We demonstrate these metrics on a seizure detection task based on the TUH EEG Corpus. We show that two promising metrics are a measure based on a concept borrowed from the spoken term detection literature, Actual Term-Weighted Value, and a new metric, Time-Aligned Event Scoring (TAES), that accounts for the temporal alignment of the hypothesis to the reference annotation. We also demonstrate that state of the art technology based on deep learning, though impressive in its performance, still needs significant improvement before it will meet very strict user acceptance guidelines.