Toward the Autonomous AI Doctor: Quantitative Benchmarking of an Autonomous Agentic AI Versus Board-Certified Clinicians in a Real World Setting
Hayat, Hashim; Kudrautsau, Maksim; Makarov, Evgeniy; Melnichenko, Vlad; Tsykunou, Tim; Varaksin, Piotr; Pavelle, Matt; Oskowitz, Adam Z.
arXiv.org Artificial Intelligence
The CSS was accompanied by a natural language explanation of the scores. The LLM judge role used GPT-4.0 by OpenAI.

Evaluation by Human Experts

Each encounter pair in which the top diagnoses of the AI and the clinician did not match was evaluated by a board-certified physician with access to medical reference material. Blinding the physician to the origin of the documentation proved impractical, as the AI-based notes were highly consistent and thus easily recognized within a few pairs. The physician was asked to determine the cause of the disagreement between the documents, whether the AI or the clinician was more likely to be correct, whether it was not possible to determine which diagnosis was more appropriate, and whether the diagnoses did, in fact, match.

Similarity and Style Metrics

To evaluate how similar or different the AI-generated (Doctronic) and clinician-generated SOAP notes were, we followed a two-step process. First, we assessed surface-level textual similarity using three standard statistical metrics: (1) TF-IDF cosine similarity, which transforms each note into a weighted term-frequency vector and measures the cosine of the angle between the two vectors to capture word-frequency alignment; (2) the Jaccard index, which is the ratio of the intersection to the union of lowercased token sets, ranging from 0 (no overlap) to 1 (identical token sets); and (3) the Levenshtein ratio, a normalized edit-distance score based on character-level insertions, deletions, and substitutions that quantifies textual similarity on a 0-1 scale. These analyses demonstrated only minimal alignment in phrasing, formatting, and vocabulary.

Then, to probe contextual and semantic similarity, we generated embeddings for each note using OpenAI's text-embedding-3-small model and two versions of BioBERT,
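The three surface-level metrics described above can be illustrated with a minimal, standard-library-only sketch. This is not the authors' code: the tokenizer, the smoothed IDF formula (computed over just the two notes being compared), and the normalization of the Levenshtein ratio as 1 - distance / max(length) are all assumptions chosen for illustration, and the function names are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters (assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_cosine(note_a, note_b):
    """Cosine similarity of TF-IDF vectors built from the two notes only."""
    docs = [Counter(tokenize(note_a)), Counter(tokenize(note_b))]
    vocab = set(docs[0]) | set(docs[1])
    n = len(docs)
    # Smoothed IDF (assumption): ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n) / (1 + sum(t in d for d in docs))) + 1 for t in vocab}
    vecs = [{t: count * idf[t] for t, count in d.items()} for d in docs]
    dot = sum(w * vecs[1].get(t, 0.0) for t, w in vecs[0].items())
    norms = [math.sqrt(sum(w * w for w in v.values())) for v in vecs]
    if norms[0] == 0.0 or norms[1] == 0.0:
        return 0.0
    return dot / (norms[0] * norms[1])


def jaccard(note_a, note_b):
    """Intersection over union of the lowercased token sets."""
    a, b = set(tokenize(note_a)), set(tokenize(note_b))
    if not a and not b:
        return 1.0  # two empty notes are treated as identical
    return len(a & b) / len(a | b)


def levenshtein_distance(s, t):
    """Character-level edit distance (insert, delete, substitute; each costs 1)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def levenshtein_ratio(s, t):
    """Normalize the edit distance to a 0-1 similarity (assumed normalization)."""
    if not s and not t:
        return 1.0
    return 1.0 - levenshtein_distance(s, t) / max(len(s), len(t))
```

Libraries that offer a Levenshtein ratio may normalize differently (for example, by the summed length of both strings), so absolute values from this sketch are not directly comparable with scores reported elsewhere.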
Aug-1-2025
- Genre:
  - Research Report
    - Experimental Study (1.00)
    - New Finding (1.00)
- Industry:
  - Health & Medicine
    - Consumer Health (1.00)
    - Diagnostic Medicine (1.00)
    - Epidemiology (0.94)
    - Health Care Providers & Services (1.00)
    - Health Care Technology > Telehealth (0.69)
    - Pharmaceuticals & Biotechnology (1.00)
    - Therapeutic Area
      - Psychiatry/Psychology (1.00)
      - Internal Medicine (0.68)
      - Pulmonary/Respiratory Diseases (1.00)
      - Nephrology (0.68)
      - Neurology (1.00)
      - Otolaryngology (0.68)
      - Musculoskeletal (0.93)
      - Immunology (1.00)
      - Cardiology/Vascular Diseases (1.00)
      - Urology (0.68)
      - Gastroenterology (1.00)
      - Rheumatology (1.00)
      - Infections and Infectious Diseases (1.00)