A Multi-faceted Analysis of Cognitive Abilities: Evaluating Prompt Methods with Large Language Models on the CONSORT Checklist

Jeon, Sohyeon; Lee, Hyung-Chul

arXiv.org Artificial Intelligence 

Despite the rapid expansion of Large Language Models (LLMs) in healthcare, robust and explainable evaluation of their ability to assess clinical trial reporting according to CONSORT standards remains an open challenge. In particular, uncertainty calibration and the metacognitive reliability of LLM reasoning are poorly understood and underexplored in medical automation. This study applies a behavioral and metacognitive analytic approach using an expert-validated dataset, systematically comparing two representative LLMs -- one general and one domain-specialized -- across three prompt strategies. We analyze both cognitive adaptation and calibration error using two metrics: Expected Calibration Error (ECE) and a baseline-normalized Relative Calibration Error (RCE) that enables reliable cross-model comparison. Our results reveal pronounced miscalibration and overconfidence in both models, especially under clinical role-playing conditions, with calibration error persisting above clinically relevant thresholds. These findings underscore the need for improved calibration, transparent code, and strategic prompt engineering for the development of reliable and explainable medical AI.
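
The abstract does not give the exact formulation of ECE or of the baseline normalization; the following is a minimal sketch of how these two metrics are conventionally computed, assuming equal-width confidence bins for ECE and a simple ratio to a baseline prompt's ECE for the RCE. The bin count and the normalization scheme are assumptions for illustration only.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Standard ECE: bin predictions by confidence and take the
    sample-weighted mean gap between bin accuracy and bin confidence."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        # include the left edge only for the first bin
        mask = (confidences > lo) & (confidences <= hi) | ((lo == 0.0) & (confidences == 0.0))
        if not mask.any():
            continue
        gap = abs(correct[mask].mean() - confidences[mask].mean())
        ece += mask.mean() * gap  # weight by fraction of samples in bin
    return ece

def relative_calibration_error(ece_model, ece_baseline):
    """Hypothetical baseline normalization: a model/prompt's ECE expressed
    relative to a reference (baseline) condition's ECE."""
    return ece_model / ece_baseline
```

Under this reading, RCE values above 1.0 would indicate that a prompt strategy is worse calibrated than the baseline condition, which is what makes cross-model comparison meaningful even when the models' raw ECEs differ in scale.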