Evaluating Hierarchical Clinical Document Classification Using Reasoning-Based LLMs

Akram Mustafa, Usman Naseem, Mostafa Rahimi Azghadi

arXiv.org Artificial Intelligence 

Background: Clinical coding, particularly the classification of hierarchical ICD-10 codes from unstructured discharge summaries, is essential for healthcare operations but remains a labor-intensive and error-prone task. Automated approaches using Large Language Models (LLMs) offer the potential to augment or replace human coders, yet their reliability and reasoning capabilities, which are needed to ensure accurate, explainable code assignments, are not well understood. Objective: This study aims to benchmark a diverse set of LLMs, both reasoning and non-reasoning models, on their ability to classify hierarchical ICD-10 codes from discharge summaries, and to evaluate the effect of structured reasoning on model performance. Methods: Using the MIMIC-IV dataset, the study selected 1,500 discharge summaries labeled with the top 10 most frequent ICD-10 codes, balancing dataset size against the high computational and financial cost of using LLMs. We first preprocessed the data to extract clinically relevant tokens before feeding it to the LLMs; specifically, we used cTAKES, a clinical NLP tool, to identify medical concepts. Each summary was encoded and submitted to 11 LLMs using a standardized, structured prompt simulating a clinical coder. Models were evaluated using the F1 score across three ICD-10 hierarchy levels, for both primary-diagnosis and all-diagnoses classification tasks. Results: Reasoning models on average outperformed non-reasoning models, and the Gemini 2.5 Pro model demonstrated the highest performance across tasks.
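As a rough illustration of the level-wise evaluation the abstract describes, the Python sketch below truncates ICD-10 codes to three levels of specificity and computes a micro-averaged F1 over per-document code sets at each level. The level cut-offs (3-character category, 4-character subcategory, full code) and all function names here are illustrative assumptions, not the authors' implementation or the paper's exact hierarchy scheme.

from typing import List, Set

def truncate(code: str, level: int) -> str:
    # Map a full ICD-10 code (e.g. 'I10', 'E11.9') to a coarser level.
    # The cut-offs are assumed for illustration only.
    plain = code.replace(".", "")
    cutoffs = {1: 3, 2: 4, 3: len(plain)}
    return plain[: cutoffs[level]]

def micro_f1(gold: List[Set[str]], pred: List[Set[str]], level: int) -> float:
    # Micro-averaged F1 over per-document code sets at one hierarchy level.
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        g_l = {truncate(c, level) for c in g}
        p_l = {truncate(c, level) for c in p}
        tp += len(g_l & p_l)
        fp += len(p_l - g_l)
        fn += len(g_l - p_l)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Example: two discharge summaries, gold vs. model-predicted code sets.
# 'J44.1' vs 'J44.9' match at the category level but not at finer levels,
# which is exactly the leniency a hierarchy-aware score captures.
gold = [{"E11.9", "I10"}, {"J44.1"}]
pred = [{"E11.9"}, {"J44.9"}]
for level in (1, 2, 3):
    print(f"Level {level} micro-F1: {micro_f1(gold, pred, level):.2f}")

Scoring set overlaps at progressively coarser code prefixes is one common way to realize "F1 across three ICD-10 levels"; the paper may define the levels or the averaging differently.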
