Classification of radiology reports by modality and anatomy: A comparative study

Marina Bendersky, Joy Wu, Tanveer Syeda-Mahmood

arXiv.org Machine Learning 

Abstract--Data labeling is currently a time-consuming task that often requires expert knowledge. In research settings, the availability of correctly labeled data is crucial to ensure that model predictions are accurate and useful. We propose relatively simple machine learning-based models that achieve high performance metrics in the binary and multiclass classification of radiology reports. We compare the performance of these algorithms to that of a data-driven approach based on NLP, and find that the logistic regression classifier outperforms all other models in both the binary and multiclass classification tasks. We then use the binary logistic regression classifier to predict chest X-ray (CXR)/non-chest X-ray (non-CXR) labels in reports from different datasets that were unseen during the training of any of the models. Even on these unseen report collections, the binary logistic regression classifier achieves average precision values above 0.9. Based on the regression coefficient values, we also identify frequent tokens in CXR and non-CXR reports that are features with possibly high predictive power.

I. INTRODUCTION

Large data collections, which can comprise text, images, or even video, are becoming more easily available to researchers, clinicians, and the general public. It is quite often necessary, as a critical initial step, to mine input data before proceeding to further research or analysis. In a research setting, careful and accurate data labeling can be a tedious and time-consuming task that often requires manual input and expert knowledge. Moreover, the same dataset might need to be relabeled multiple times, not only when it is used for different research purposes but also when the data is mislabeled.
Mislabeled data [1] produces at least two new problems in itself: first, the mislabeled data needs to be identified and differentiated from correctly labeled data [1, 2], and second, the mislabeled data should be corrected or removed from the dataset (if possible).
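
As a rough illustration of the binary CXR/non-CXR classification and the coefficient inspection summarized in the abstract, the following sketch trains a small logistic regression model on bag-of-words token counts. It is not the authors' implementation: the toy report snippets, the whitespace tokenizer, the plain stochastic gradient descent training loop, and all function names (`tokenize`, `vectorize`, `fit`, `predict_proba`) are assumptions made only for this example.

```python
import math
from collections import Counter

# Invented toy snippets (NOT from the paper's datasets); label 1 = CXR, 0 = non-CXR.
REPORTS = [
    ("pa and lateral chest radiograph shows clear lungs", 1),
    ("portable chest xray no pleural effusion or pneumothorax", 1),
    ("chest radiograph cardiac silhouette is normal lungs clear", 1),
    ("ct abdomen with contrast shows normal liver and spleen", 0),
    ("mri brain without contrast no acute infarct", 0),
    ("ultrasound abdomen gallbladder is normal no stones", 0),
]


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def vectorize(text, vocab):
    """Sparse bag-of-words counts {feature index: count}; unknown tokens are dropped."""
    counts = Counter(tokenize(text))
    return {vocab[tok]: n for tok, n in counts.items() if tok in vocab}


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(x, w, b):
    """Estimated probability that a vectorized report is a CXR report."""
    return sigmoid(b + sum(w[i] * n for i, n in x.items()))


def fit(reports, lr=0.1, epochs=200):
    """Fit weights and bias by plain stochastic gradient descent on the log loss."""
    texts = [text for text, _ in reports]
    labels = [label for _, label in reports]
    tokens = sorted({tok for text in texts for tok in tokenize(text)})
    vocab = {tok: i for i, tok in enumerate(tokens)}
    X = [vectorize(text, vocab) for text in texts]
    w, b = [0.0] * len(vocab), 0.0
    for _ in range(epochs):
        for x, label in zip(X, labels):
            err = predict_proba(x, w, b) - label  # d(log loss)/d(logit)
            for i, n in x.items():
                w[i] -= lr * err * n
            b -= lr * err
    return vocab, w, b


vocab, w, b = fit(REPORTS)

# Rank tokens by learned coefficient: lowest first (non-CXR-like), highest last (CXR-like).
ranked = sorted(vocab, key=lambda tok: w[vocab[tok]])
print("most non-CXR-like tokens:", ranked[:3])
print("most CXR-like tokens:", ranked[-3:])

# Score two short, made-up reports that were not part of the training snippets.
p_cxr = predict_proba(vectorize("portable chest radiograph", vocab), w, b)
p_other = predict_proba(vectorize("ct abdomen liver spleen", vocab), w, b)
print("P(CXR):", round(p_cxr, 3), round(p_other, 3))
```

In this sketch, a token with a large positive coefficient pushes a report toward the CXR label and a large negative coefficient pushes it toward non-CXR, which is the sense in which coefficient values can point to tokens with possibly high predictive power. With a corpus this small the ranking is only illustrative.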
