Self-Reported Confidence of Large Language Models in Gastroenterology: Analysis of Commercial, Open-Source, and Quantized Models

Naderi, Nariman, Safavi-Naini, Seyed Amir Ahmad, Savage, Thomas, Atf, Zahra, Lewis, Peter, Nadkarni, Girish, Soroush, Ali

Mar-24-2025–arXiv.org Artificial Intelligence

This study evaluated self-reported response certainty across several large language models (GPT, Claude, Llama, Phi, Mistral, Gemini, Gemma, and Qwen) using 300 gastroenterology board-style questions. The highest-performing models (GPT-o1 preview, GPT-4o, and Claude-3.5-Sonnet) achieved Brier scores of 0.15-0.2 and AUROC of 0.6. Although newer models demonstrated improved performance, all exhibited a consistent tendency towards overconfidence. Uncertainty estimation presents a significant challenge to the safe use of LLMs in healthcare.

Keywords: Large Language Models; Confidence Elicitation; Artificial Intelligence; Gastroenterology; Uncertainty Quantification
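For readers unfamiliar with the two metrics quoted in the abstract, the following is a minimal sketch of how a Brier score and an AUROC can be computed from self-reported confidences and answer correctness. It is not the authors' evaluation code: the function names and the six-question sample data are hypothetical, and confidences are assumed to be expressed on a 0-1 scale. A lower Brier score indicates better calibration, and an AUROC of 0.5 means the confidence carries no information about whether the answer is correct.

```python
from typing import Sequence


def brier_score(confidences: Sequence[float], correct: Sequence[int]) -> float:
    """Mean squared gap between stated confidence (0-1) and outcome (1 = correct)."""
    if len(confidences) != len(correct) or len(confidences) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum((p - y) ** 2 for p, y in zip(confidences, correct))
    return total / len(confidences)


def auroc(confidences: Sequence[float], correct: Sequence[int]) -> float:
    """Probability that a correct answer received higher confidence than an
    incorrect one, counting ties as half (pairwise definition of AUROC)."""
    pos = [p for p, y in zip(confidences, correct) if y == 1]
    neg = [p for p, y in zip(confidences, correct) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect answer")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical data for illustration only (not taken from the study):
# stated confidence per question, and whether the answer was correct.
confidences = [0.9, 0.8, 0.9, 0.7, 0.95, 0.6]
correct = [1, 0, 1, 1, 0, 0]

print("Brier score:", round(brier_score(confidences, correct), 3))
print("AUROC:", round(auroc(confidences, correct), 3))
```

In this made-up sample the model states high confidence on several wrong answers, which raises the Brier score and pulls the AUROC toward 0.5, the pattern the abstract describes as overconfidence.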

confidence score, large language model, machine learning

Country:
- North America
  - United States
    - Pennsylvania > Allegheny County
      - Pittsburgh (0.04)
    - New York > New York County
      - New York City (0.14)
  - Canada > Ontario
    - Durham Region > Oshawa (0.04)
- Asia
  - Singapore (0.04)
  - Indonesia > Bali (0.04)
  - Thailand > Bangkok
    - Bangkok (0.04)

Genre:
- Research Report
  - Experimental Study (0.68)
  - New Finding (0.46)

Industry:
- Health & Medicine > Therapeutic Area > Gastroenterology (1.00)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Large Language Model (1.00)
  - Machine Learning
    - Performance Analysis > Accuracy (1.00)
    - Neural Networks > Deep Learning (1.00)
