Can Modern NLP Systems Reliably Annotate Chest Radiography Exams? A Pre-Purchase Evaluation and Comparative Study of Solutions from AWS, Google, Azure, John Snow Labs, and Open-Source Models on an Independent Pediatric Dataset
Shruti Hegde, Mabon Manoj Ninan, Jonathan R. Dillman, Shireen Hayatghaibi, Lynn Babcock, Elanchezhian Somasundaram
arXiv.org Artificial Intelligence
Abstract

Purpose: General-purpose clinical natural language processing (NLP) tools are increasingly used for the automatic labeling of clinical reports to support various clinical, research, and quality improvement applications. However, independent performance evaluations for specific tasks, such as labeling pediatric chest radiograph reports, remain scarce. This study aims to compare four leading commercial clinical NLP systems for entity extraction and assertion detection of clinically relevant findings in pediatric chest radiograph reports. In addition, the study evaluates two dedicated chest radiograph report labelers, CheXpert and CheXbert, to provide a comprehensive performance comparison of the systems in extracting disease labels defined by CheXpert.

Methods: A total of 95,008 pediatric chest radiograph (CXR) reports were obtained from a large academic pediatric hospital for this IRB-waived study. Clinically relevant terms were extracted using four general-purpose clinical NLP systems: Amazon Comprehend Medical (AWS), Google Healthcare NLP (GC), Azure Clinical NLP (AZ), and SparkNLP (SP) from John Snow Labs. After standardization, entities and their assertion statuses (positive, negative, uncertain) from the Findings and Impression sections were analyzed using descriptive statistics, paired t-tests, and Chi-square tests. Entities from the Impression sections were mapped to 12 disease categories plus a No Findings category using a regular expression algorithm. In parallel, CheXpert and CheXbert processed the same reports to extract the same 13 categories (12 disease categories and a No Findings category).
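The regex-based mapping step described above could be sketched as follows. The category names follow the CheXpert label set, but the patterns and the `map_entity` helper are illustrative assumptions, not the study's actual rules:

```python
import re

# Illustrative sketch of the regex mapping step: each entity string
# extracted from the Impression section is matched against category
# patterns. Only a few categories are shown; the patterns are
# simplified examples, not the study's actual expressions.
CATEGORY_PATTERNS = {
    "Pneumonia":        re.compile(r"\bpneumonia\b", re.I),
    "Atelectasis":      re.compile(r"\batelecta\w*", re.I),
    "Pleural Effusion": re.compile(r"\b(pleural\s+)?effusion\b", re.I),
    "Cardiomegaly":     re.compile(r"\bcardiomegaly\b|enlarged\s+(cardiac|heart)", re.I),
}

def map_entity(entity: str) -> str:
    """Return the first matching disease category, else 'No Findings'."""
    for category, pattern in CATEGORY_PATTERNS.items():
        if pattern.search(entity):
            return category
    return "No Findings"

print(map_entity("bibasilar atelectasis"))         # → Atelectasis
print(map_entity("small right pleural effusion"))  # → Pleural Effusion
print(map_entity("normal chest radiograph"))       # → No Findings
```

A first-match dictionary scan like this keeps the mapping auditable: each label traces back to exactly one pattern, and unmatched entities fall through to No Findings.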
Outputs from all six models were compared using Fleiss' Kappa across the assertion categories.
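Fleiss' Kappa for agreement among more than two raters can be computed directly from a subjects-by-categories count table. The implementation and the toy data below are a minimal sketch for illustration, not the study's actual computation:

```python
def fleiss_kappa(table):
    """Fleiss' kappa for a subjects-by-categories count table.

    table[i][j] = number of raters assigning subject i to category j;
    every row must sum to the same number of raters n.
    """
    N = len(table)        # number of subjects (here: reports)
    n = sum(table[0])     # raters per subject (here: models)
    k = len(table[0])     # number of categories

    # Per-subject observed agreement P_i, then its mean P_bar
    P_i = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in table]
    P_bar = sum(P_i) / N

    # Marginal category proportions p_j and chance agreement P_e
    p_j = [sum(row[j] for row in table) / (N * n) for j in range(k)]
    P_e = sum(p * p for p in p_j)

    return (P_bar - P_e) / (1 - P_e)

# Hypothetical example: 4 reports rated by 6 models into
# (positive, negative, uncertain) -- the counts are invented.
ratings = [
    [6, 0, 0],   # all six models agree: positive
    [0, 6, 0],   # all six agree: negative
    [3, 2, 1],   # split decision
    [1, 4, 1],   # mostly negative
]
print(round(fleiss_kappa(ratings), 3))  # → 0.415
```

Kappa corrects the raw agreement rate for the agreement expected by chance from the marginal label frequencies, which is why it is preferred over simple percent agreement when comparing many labelers.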
May-30-2025