Toward the Autonomous AI Doctor: Quantitative Benchmarking of an Autonomous Agentic AI Versus Board-Certified Clinicians in a Real World Setting

Hayat, Hashim, Kudrautsau, Maksim, Makarov, Evgeniy, Melnichenko, Vlad, Tsykunou, Tim, Varaksin, Piotr, Pavelle, Matt, Oskowitz, Adam Z.

Aug-1-2025–arXiv.org Artificial Intelligence

The CSS was accompanied by a natural language explanation of the scores. The LLM judge role used GPT-4.0 by OpenAI. Evaluation by Human Experts Each encounter pair in which the top diagnosis of AI and clinician did not match was evaluated by a board-certified physician with access to medical reference material. Blinding the physician to the origin of the documentation proved impractical, as the AI-based notes were highly consistent and thus easily recognized within a few pairs. The physician was asked to determine the cause of the disagreement between the documents, whether AI or the physician was more likely to be correct, whether it was not possible to determine which diagnosis was more appropriate, and whether the diagnoses did, in fact, match. Similarity and Style Metrics To evaluate how similar-or different the AI-generated (Doctronic) and clinician-generated SOAP notes were, we followed a two-step process. First, we assessed surface-level textual similarity using three standard statistical metrics: (1) TF IDF cosine similarity, which transforms each note into a weighted term-frequency vector and measures the cosine of the angle between them to capture word-frequency alignment; (2) the Jaccard index, which is the ratio of the intersection to the union of lowercased token sets, ranging from 0 (no overlap) to 1 (identical token sets); and (3) the Levenshtein ratio, a normalized edit-distance score based on character-level insertions, deletions, and substitutions that quantifies textual similarity on a 0-1 scale. These analyses demonstrated only minimal alignment in phrasing, formatting, and vocabulary. Then, to probe contextual and semantic similarity, we generated embeddings for each note using OpenAI's text embedding 3 small model and two versions of Biobert,

large language model, machine learning, natural language, (22 more...)

arXiv.org Artificial Intelligence

Aug-1-2025

arXiv.org PDF

Add feedback

Country:
- North America > United States > California > San Francisco County > San Francisco (0.28)

Genre:
- Research Report
  - New Finding (1.00)
  - Experimental Study (1.00)

Industry:
- Health & Medicine
  - Pharmaceuticals & Biotechnology (1.00)
  - Health Care Providers & Services (1.00)
  - Diagnostic Medicine (1.00)
  - Consumer Health (1.00)
  - Epidemiology (0.94)
  - Health Care Technology > Telehealth (0.69)
  - Therapeutic Area
    - Infections and Infectious Diseases (1.00)
    - Rheumatology (1.00)
    - Gastroenterology (1.00)
    - Cardiology/Vascular Diseases (1.00)
    - Immunology (1.00)
    - Neurology (1.00)
    - Pulmonary/Respiratory Diseases (1.00)
    - Psychiatry/Psychology (1.00)
    - Musculoskeletal (0.93)
    - Otolaryngology (0.68)
    - Urology (0.68)
    - Nephrology (0.68)
    - Internal Medicine (0.68)

Technology:
- Information Technology > Artificial Intelligence
  - Representation & Reasoning > Diagnosis (1.00)
  - Natural Language > Large Language Model (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning > Generative AI (0.44)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found