correct diagnosis
Adjust for Trust: Mitigating Trust-Induced Inappropriate Reliance on AI Assistance
Srinivasan, Tejas, Thomason, Jesse
Trust biases how users rely on AI recommendations in AI-assisted decision-making tasks, with low and high levels of trust resulting in increased under- and over-reliance, respectively. We propose that AI assistants should adapt their behavior through trust-adaptive interventions to mitigate such inappropriate reliance. For instance, when user trust is low, providing an explanation can elicit more careful consideration of the assistant's advice by the user. In two decision-making scenarios -- laypeople answering science questions and doctors making medical diagnoses -- we find that providing supporting and counter-explanations during moments of low and high trust, respectively, yields up to 38% reduction in inappropriate reliance and 20% improvement in decision accuracy. We are similarly able to reduce over-reliance by adaptively inserting forced pauses to promote deliberation. Our results highlight how AI adaptation to user trust facilitates appropriate reliance, presenting exciting avenues for improving human-AI collaboration.
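As a rough illustration of the trust-adaptive idea in the abstract above, the sketch below maps an estimated trust score to an intervention: a supporting explanation when trust is low, a counter-explanation when trust is high. The score range, the two thresholds, and all names (`Intervention`, `choose_intervention`, `LOW_TRUST`, `HIGH_TRUST`) are assumptions made for this example and are not taken from the paper.

```python
from enum import Enum


class Intervention(Enum):
    NONE = "none"
    SUPPORTING_EXPLANATION = "supporting_explanation"
    COUNTER_EXPLANATION = "counter_explanation"


# Hypothetical thresholds on a trust estimate in [0, 1]; not from the paper.
LOW_TRUST = 0.3
HIGH_TRUST = 0.7


def choose_intervention(trust: float) -> Intervention:
    """Pick an intervention from an estimated user-trust score in [0, 1]."""
    if not 0.0 <= trust <= 1.0:
        raise ValueError("trust must be in [0, 1]")
    if trust < LOW_TRUST:
        # Low trust risks under-reliance: argue for the assistant's advice.
        return Intervention.SUPPORTING_EXPLANATION
    if trust > HIGH_TRUST:
        # High trust risks over-reliance: argue against the advice.
        return Intervention.COUNTER_EXPLANATION
    return Intervention.NONE


if __name__ == "__main__":
    for score in (0.1, 0.5, 0.9):
        print(score, choose_intervention(score).value)
```

How trust is estimated is left open here; the sketch only covers the step of choosing an intervention once a trust estimate exists.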
Adaptive Reasoning and Acting in Medical Language Agents
Dutta, Abhishek, Hsiao, Yen-Che
This paper presents an innovative large language model (LLM) agent framework for enhancing diagnostic accuracy in simulated clinical environments using the AgentClinic benchmark. The proposed automatic correction enables doctor agents to iteratively refine their reasoning and actions following incorrect diagnoses, fostering improved decision-making over time. Experiments show that the adaptive LLM-based doctor agents achieve correct diagnoses through dynamic interactions with simulated patients. The evaluations highlight the capacity of autonomous agents to adapt and improve in complex medical scenarios. Future enhancements will focus on refining the algorithm and expanding its applicability across a wider range of tasks and different large language models.
Human-AI collectives produce the most accurate differential diagnoses
Zöller, N., Berger, J., Lin, I., Fu, N., Komarneni, J., Barabucci, G., Laskowski, K., Shia, V., Harack, B., Chu, E. A., Trianni, V., Kurvers, R. H. J. M., Herzog, S. M.
Artificial intelligence systems, particularly large language models (LLMs), are increasingly being employed in high-stakes decisions that impact both individuals and society at large, often without adequate safeguards to ensure safety, quality, and equity. Yet LLMs hallucinate [1-4], lack common sense [5], and are biased [6, 7]--shortcomings that may reflect LLMs' inherent limitations and thus may not be remedied by more sophisticated architectures, more data, or more human feedback. Relying solely on LLMs for complex, high-stakes decisions is therefore problematic. Here we present a hybrid collective intelligence system that mitigates these risks by leveraging the complementary strengths of human experience and the vast information processed by LLMs. We show that hybrid collectives of physicians and LLMs outperform both single physicians and physician collectives, as well as single LLMs and LLM ensembles. This result holds across a range of medical specialties and professional experience, and can be attributed to humans' and LLMs' complementary contributions that lead to different kinds of errors. Our approach highlights the potential for collective human and machine intelligence to improve accuracy in complex, open-ended domains [8] like medical diagnostics. Diagnostic errors are among the most pressing issues in medical practice [9-11], causing an estimated 795,000 deaths and permanent disabilities in the United States alone each year [12]. Reducing diagnostic errors--without incurring substantially higher costs--is essential to improve patient outcomes worldwide. This challenge has motivated a recent surge in diagnostic technologies exploiting artificial intelligence (AI) to interpret medical records, tests, and images [13, 14]. Deep learning approaches in medical imaging have shown great promise. 
Notable examples include mammography interpretation, cardiac function assessment, and lung cancer screening, some of which have progressed beyond the testing phase and entered clinical practice [15-17]. Recent years have also witnessed the rise of AI foundation models, especially LLMs, which show remarkable abilities to process natural language, providing accurate answers to questions in almost any domain, including medicine [18-21]. However, a recent meta-analysis [22] found that physicians often outperform LLMs, and that LLMs differ vastly in performance, also between medical specialties.
Towards Accurate Differential Diagnosis with Large Language Models
McDuff, Daniel, Schaekermann, Mike, Tu, Tao, Palepu, Anil, Wang, Amy, Garrison, Jake, Singhal, Karan, Sharma, Yash, Azizi, Shekoofeh, Kulkarni, Kavita, Hou, Le, Cheng, Yong, Liu, Yun, Mahdavi, S Sara, Prakash, Sushant, Pathak, Anupam, Semturs, Christopher, Patel, Shwetak, Webster, Dale R, Dominowska, Ewa, Gottweis, Juraj, Barral, Joelle, Chou, Katherine, Corrado, Greg S, Matias, Yossi, Sunshine, Jake, Karthikesalingam, Alan, Natarajan, Vivek
An accurate differential diagnosis (DDx) is a cornerstone of medical care, often reached through an iterative process of interpretation that combines clinical history, physical examination, investigations and procedures. Interactive interfaces powered by Large Language Models (LLMs) present new opportunities to both assist and automate aspects of this process. In this study, we introduce an LLM optimized for diagnostic reasoning, and evaluate its ability to generate a DDx alone or as an aid to clinicians. 20 clinicians evaluated 302 challenging, real-world medical cases sourced from the New England Journal of Medicine (NEJM) case reports. Each case report was read by two clinicians, who were randomized to one of two assistive conditions: either assistance from search engines and standard medical resources, or LLM assistance in addition to these tools. All clinicians provided a baseline, unassisted DDx prior to using the respective assistive tools. Our LLM for DDx exhibited standalone performance that exceeded that of unassisted clinicians (top-10 accuracy 59.1% vs 33.6%, [p = 0.04]). Comparing the two assisted study arms, the DDx quality score was higher for clinicians assisted by our LLM (top-10 accuracy 51.7%) compared to clinicians without its assistance (36.1%) (McNemar's Test: 45.7, p < 0.01) and clinicians with search (44.4%) (4.75, p = 0.03). Further, clinicians assisted by our LLM arrived at more comprehensive differential lists than those without its assistance. Our study suggests that our LLM for DDx has potential to improve clinicians' diagnostic reasoning and accuracy in challenging cases, meriting further real-world evaluation for its ability to empower physicians and widen patients' access to specialist-level expertise.
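The abstract above reports McNemar's test for its paired comparisons. As a minimal sketch of how that statistic is computed from paired correct/incorrect outcomes, the function below uses the uncorrected form, (b - c)^2 / (b + c), with a chi-square p-value on one degree of freedom. Whether the study applied a continuity correction is not stated in the abstract, so the uncorrected form is an assumption, and the function name and toy data are hypothetical.

```python
import math


def mcnemar(condition_a: list[bool], condition_b: list[bool]) -> tuple[float, float]:
    """Return (statistic, p_value) for paired correct/incorrect outcomes.

    Uses the uncorrected McNemar statistic and a chi-square
    distribution with one degree of freedom.
    """
    if len(condition_a) != len(condition_b):
        raise ValueError("paired outcome lists must have equal length")
    # Discordant pairs: correct under one condition but not the other.
    b = sum(1 for x, y in zip(condition_a, condition_b) if x and not y)
    c = sum(1 for x, y in zip(condition_a, condition_b) if not x and y)
    if b + c == 0:
        return 0.0, 1.0  # no discordant pairs, no evidence of a difference
    statistic = (b - c) ** 2 / (b + c)
    # Survival function of chi-square with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


if __name__ == "__main__":
    # Toy data: 2 cases only A got right, 10 only B got right, 8 both right.
    a = [True] * 2 + [False] * 10 + [True] * 8
    b = [False] * 2 + [True] * 10 + [True] * 8
    print(mcnemar(a, b))
```

Only the discordant pairs contribute to the statistic; cases that both conditions got right (or both got wrong) do not affect it.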
The Case Records of ChatGPT: Language Models and Complex Clinical Questions
Poterucha, Timothy, Elias, Pierre, Haggerty, Christopher M.
Background: Artificial intelligence language models have shown promise in various applications, including assisting with clinical decision-making as demonstrated by strong performance of large language models on medical licensure exams. However, their ability to solve complex, open-ended cases, which may be representative of clinical practice, remains unexplored. Methods: In this study, the accuracy of large language AI models GPT4 and GPT3.5 in diagnosing complex clinical cases was investigated using published Case Records of the Massachusetts General Hospital. A total of 50 cases requiring a diagnosis and diagnostic test published from January 1, 2022 to April 16, 2022 were identified. For each case, models were given a prompt requesting the top three specific diagnoses and associated diagnostic tests, followed by case text, labs, and figure legends. Model outputs were assessed in comparison to the final clinical diagnosis and whether the model-predicted test would result in a correct diagnosis. Results: GPT4 and GPT3.5 accurately provided the correct diagnosis in 26% and 22% of cases in one attempt, and 46% and 42% within three attempts, respectively. GPT4 and GPT3.5 provided a correct essential diagnostic test in 28% and 24% of cases in one attempt, and 44% and 50% within three attempts, respectively. No significant differences were found between the two models, and multiple trials with identical prompts using the GPT3.5 model provided similar results. Conclusions: In summary, these models demonstrate potential usefulness in generating differential diagnoses but remain limited in their ability to provide a single unifying diagnosis in complex, open-ended cases. Future research should focus on evaluating model performance in larger datasets of open-ended clinical challenges and exploring potential human-AI collaboration strategies to enhance clinical decision-making.
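The "one attempt" and "within three attempts" figures above are top-k accuracies. The sketch below shows one way such a figure could be computed from ranked model outputs. It matches diagnoses by case-insensitive exact string comparison, which is an assumption made for this example; the abstract does not say how matches were judged, and the function name and toy data are hypothetical.

```python
def top_k_accuracy(ranked_predictions: list[list[str]], truths: list[str], k: int) -> float:
    """Fraction of cases whose true diagnosis is among the first k predictions."""
    if len(ranked_predictions) != len(truths):
        raise ValueError("need one prediction list per case")
    if not truths:
        raise ValueError("need at least one case")
    if k < 1:
        raise ValueError("k must be at least 1")
    hits = 0
    for predictions, truth in zip(ranked_predictions, truths):
        # Assumption: exact, case-insensitive match stands in for expert judgment.
        top_k = {p.strip().lower() for p in predictions[:k]}
        if truth.strip().lower() in top_k:
            hits += 1
    return hits / len(truths)


if __name__ == "__main__":
    # Toy data, not from the study.
    predictions = [
        ["sarcoidosis", "tuberculosis", "lymphoma"],
        ["migraine", "tension headache", "sinusitis"],
    ]
    truths = ["Tuberculosis", "meningitis"]
    print(top_k_accuracy(predictions, truths, k=1))
    print(top_k_accuracy(predictions, truths, k=3))
```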
Interview with Rose Nakasi: using machine learning and smartphones to help diagnose malaria
Rose Nakasi and her colleagues have developed a machine-learning method to detect malaria parasites in blood samples. We spoke to Rose about the motivation for this project, the progress so far, and what they are planning next. The problem that we are trying to solve concerns the microscopy of malaria diagnosis. The motivation for this research is that malaria is one of the most highly endemic diseases in sub-Saharan Africa, Uganda included. The major problem is that the gold-standard confirmatory test for diagnosis is by use of a microscope, and in our setting, we have a shortage of skilled lab microscopists that are able to carry out the correct diagnosis of the disease.
Stop Saying You "Could Never Do Science"
When I tell people that I'm majoring in molecular biology, I usually get a response that's something like this: "I could never do biology." Or worse: "I could never do science." This seems to be a common response that people in the sciences get when they talk about how they spend their time in school or at work. My friend who majors in psychology, my editor who has a physics degree, my high school mentor with a doctorate in neuroscience--they all tell me that they get some version of the "that would be too hard for me!" response when they share their credentials. It feels like I might be gearing up for years and years of being on the receiving end of the "wow, that's too hard for me" response. Every time I hear it I want to yell, No! Stop! Have some faith in yourself!
Detecting Diabetic Nephropathy with AI - Vanderbilt Discover
Diabetic nephropathy (DN) is defined by elevated urine albumin excretion or reduced glomerular filtration rate (GFR), or both. While DN may be diagnosed clinically, pathology is often needed to confirm the diagnosis and establish the severity of the injury. "In addition to making the correct diagnosis of diabetic nephropathy, we want to be able to assess the severity of the injury." Pathologists usually classify DN based on a visual assessment of glomerular pathology using immunofluorescence microscopy and electron microscopy. Although diagnostic guidelines have been well established, scoring of severity of the lesions may vary among pathologists.
UCLA Jonsson Comprehensive Cancer Center : Latest News
UCLA researchers have developed an artificial intelligence system that could help pathologists read biopsies more accurately and better detect and diagnose breast cancer. The new system, described in a study published today in JAMA Network Open, helps interpret medical images used to diagnose breast cancer that can be difficult for the human eye to classify, and it does so nearly as accurately as, or better than, experienced pathologists. "It is critical to get a correct diagnosis from the beginning so that we can guide patients to the most effective treatments," said Dr. Joann Elmore, the study's senior author and a professor of medicine at the David Geffen School of Medicine at UCLA. A 2015 study led by Elmore found that pathologists often disagree on the interpretation of breast biopsies, which are performed on millions of women each year. That earlier research revealed that diagnostic errors occurred in about one out of every six women who had ductal carcinoma in situ (a noninvasive type of breast cancer), and that incorrect diagnoses were given in about half of the biopsy cases of breast atypia (abnormal cells that are associated with a higher risk for breast cancer).
Artificial intelligence could yield more accurate breast cancer diagnoses: System can interpret images that are challenging for doctors to classify
The new system, described in a study published in JAMA Network Open, helps interpret medical images used to diagnose breast cancer that can be difficult for the human eye to classify, and it does so nearly as accurately as, or better than, experienced pathologists. "It is critical to get a correct diagnosis from the beginning so that we can guide patients to the most effective treatments," said Dr. Joann Elmore, the study's senior author and a professor of medicine at the David Geffen School of Medicine at UCLA. A 2015 study led by Elmore found that pathologists often disagree on the interpretation of breast biopsies, which are performed on millions of women each year. That earlier research revealed that diagnostic errors occurred in about one out of every six women who had ductal carcinoma in situ (a noninvasive type of breast cancer), and that incorrect diagnoses were given in about half of the biopsy cases of breast atypia (abnormal cells that are associated with a higher risk for breast cancer). "Medical images of breast biopsies contain a great deal of complex data and interpreting them can be very subjective," said Elmore, who is also a researcher at the UCLA Jonsson Comprehensive Cancer Center.