Can Modern NLP Systems Reliably Annotate Chest Radiography Exams? A Pre-Purchase Evaluation and Comparative Study of Solutions from AWS, Google, Azure, John Snow Labs, and Open-Source Models on an Independent Pediatric Dataset
Shruti Hegde, Mabon Manoj Ninan, Jonathan R. Dillman, Shireen Hayatghaibi, Lynn Babcock, Elanchezhian Somasundaram
arXiv.org Artificial Intelligence
Abstract

Purpose: General-purpose clinical natural language processing (NLP) tools are increasingly used for the automatic labeling of clinical reports to support various clinical, research, and quality improvement applications. However, independent performance evaluations for specific tasks, such as labeling pediatric chest radiograph reports, remain scarce. This study aims to compare four leading commercial clinical NLP systems for entity extraction and assertion detection of clinically relevant findings in pediatric chest radiograph reports. In addition, the study evaluates two dedicated chest radiograph report labelers, CheXpert and CheXbert, to provide a comprehensive performance comparison of the systems in extracting disease labels defined by CheXpert.

Methods: A total of 95,008 pediatric chest radiograph (CXR) reports were obtained from a large academic pediatric hospital for this IRB-waived study. Clinically relevant terms were extracted using four general-purpose clinical NLP systems: Amazon Comprehend Medical (AWS), Google Healthcare NLP (GC), Azure Clinical NLP (AZ), and SparkNLP (SP) from John Snow Labs. After standardization, entities and their assertion statuses (positive, negative, uncertain) from the Findings and Impression sections were analyzed using descriptive statistics, paired t-tests, and Chi-square tests. Entities from the Impression sections were mapped to 12 disease categories plus a No Findings category using a regular expression algorithm. In parallel, CheXpert and CheXbert processed the same reports to extract the same 13 categories (12 disease categories and a No Findings category).
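The regex-based mapping step described above could be sketched as follows. The category names follow the CheXpert label set, but the patterns and the `map_entity` helper are illustrative assumptions, not the study's actual rules:

```python
import re

# Illustrative sketch of the regex mapping step: each entity string
# extracted from the Impression section is matched against category
# patterns. Only a few categories are shown; the patterns are
# simplified examples, not the study's actual expressions.
CATEGORY_PATTERNS = {
    "Pneumonia":        re.compile(r"\bpneumonia\b", re.I),
    "Atelectasis":      re.compile(r"\batelecta\w*", re.I),
    "Pleural Effusion": re.compile(r"\b(pleural\s+)?effusion\b", re.I),
    "Cardiomegaly":     re.compile(r"\bcardiomegaly\b|enlarged\s+(cardiac|heart)", re.I),
}

def map_entity(entity: str) -> str:
    """Return the first matching disease category, else 'No Findings'."""
    for category, pattern in CATEGORY_PATTERNS.items():
        if pattern.search(entity):
            return category
    return "No Findings"

print(map_entity("bibasilar atelectasis"))         # → Atelectasis
print(map_entity("small right pleural effusion"))  # → Pleural Effusion
print(map_entity("normal chest radiograph"))       # → No Findings
```

A first-match dictionary scan like this keeps the mapping auditable: each label traces back to exactly one pattern, and unmatched entities fall through to No Findings.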
Outputs from all six models were compared using Fleiss' Kappa across the assertion categories.
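Fleiss' Kappa for agreement among more than two raters can be computed directly from a subjects-by-categories count table. The implementation and the toy data below are a minimal sketch for illustration, not the study's actual computation:

```python
def fleiss_kappa(table):
    """Fleiss' kappa for a subjects-by-categories count table.

    table[i][j] = number of raters assigning subject i to category j;
    every row must sum to the same number of raters n.
    """
    N = len(table)        # number of subjects (here: reports)
    n = sum(table[0])     # raters per subject (here: models)
    k = len(table[0])     # number of categories

    # Per-subject observed agreement P_i, then its mean P_bar
    P_i = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in table]
    P_bar = sum(P_i) / N

    # Marginal category proportions p_j and chance agreement P_e
    p_j = [sum(row[j] for row in table) / (N * n) for j in range(k)]
    P_e = sum(p * p for p in p_j)

    return (P_bar - P_e) / (1 - P_e)

# Hypothetical example: 4 reports rated by 6 models into
# (positive, negative, uncertain) -- the counts are invented.
ratings = [
    [6, 0, 0],   # all six models agree: positive
    [0, 6, 0],   # all six agree: negative
    [3, 2, 1],   # split decision
    [1, 4, 1],   # mostly negative
]
print(round(fleiss_kappa(ratings), 3))  # → 0.415
```

Kappa corrects the raw agreement rate for the agreement expected by chance from the marginal label frequencies, which is why it is preferred over simple percent agreement when comparing many labelers.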
May-30-2025