Human-AI collectives produce the most accurate differential diagnoses
Zöller, N., Berger, J., Lin, I., Fu, N., Komarneni, J., Barabucci, G., Laskowski, K., Shia, V., Harack, B., Chu, E. A., Trianni, V., Kurvers, R. H. J. M., Herzog, S. M.
arXiv.org Artificial Intelligence
Artificial intelligence systems, particularly large language models (LLMs), are increasingly being employed in high-stakes decisions that impact both individuals and society at large, often without adequate safeguards to ensure safety, quality, and equity. Yet LLMs hallucinate [1-4], lack common sense [5], and are biased [6, 7]--shortcomings that may reflect LLMs' inherent limitations and thus may not be remedied by more sophisticated architectures, more data, or more human feedback. Relying solely on LLMs for complex, high-stakes decisions is therefore problematic. Here we present a hybrid collective intelligence system that mitigates these risks by leveraging the complementary strengths of human experience and the vast information processed by LLMs. We show that hybrid collectives of physicians and LLMs outperform both single physicians and physician collectives, as well as single LLMs and LLM ensembles. This result holds across a range of medical specialties and professional experience, and can be attributed to humans' and LLMs' complementary contributions, which lead to different kinds of errors. Our approach highlights the potential for collective human and machine intelligence to improve accuracy in complex, open-ended domains [8] like medical diagnostics.

Diagnostic errors are among the most pressing issues in medical practice [9-11], causing an estimated 795,000 deaths and permanent disabilities in the United States alone each year [12]. Reducing diagnostic errors--without incurring substantially higher costs--is essential to improve patient outcomes worldwide. This challenge has motivated a recent surge in diagnostic technologies exploiting artificial intelligence (AI) to interpret medical records, tests, and images [13, 14]. Deep learning approaches in medical imaging have shown great promise.
Notable examples include mammography interpretation, cardiac function assessment, and lung cancer screening, some of which have progressed beyond the testing phase and entered clinical practice [15-17]. Recent years have also witnessed the rise of AI foundation models, especially LLMs, which show remarkable abilities to process natural language, providing accurate answers to questions in almost any domain, including medicine [18-21]. However, a recent meta-analysis [22] found that physicians often outperform LLMs, and that LLMs vary widely in performance, including across medical specialties.
Jun-21-2024