Advancing Conversational Diagnostic AI with Multimodal Reasoning

Saab, Khaled, Freyberg, Jan, Park, Chunjong, Strother, Tim, Cheng, Yong, Weng, Wei-Hung, Barrett, David G. T., Stutz, David, Tomasev, Nenad, Palepu, Anil, Liévin, Valentin, Sharma, Yash, Ruparel, Roma, Ahmed, Abdullah, Vedadi, Elahe, Kanada, Kimberly, Hughes, Cian, Liu, Yun, Brown, Geoff, Gao, Yang, Li, Sean, Mahdavi, S. Sara, Manyika, James, Chou, Katherine, Matias, Yossi, Hassidim, Avinatan, Webster, Dale R., Kohli, Pushmeet, Eslami, S. M. Ali, Barral, Joëlle, Rodman, Adam, Natarajan, Vivek, Schaekermann, Mike, Tu, Tao, Karthikesalingam, Alan, Tanno, Ryutaro

arXiv.org Artificial Intelligence

Large Language Models (LLMs) have demonstrated great potential for conducting diagnostic conversations, but evaluation has been largely limited to language-only interactions, deviating from the real-world requirements of remote care delivery. Instant messaging platforms permit clinicians and patients to upload and discuss multimodal medical artifacts seamlessly during medical consultations, but the ability of LLMs to reason over such data while preserving other attributes of competent diagnostic conversation remains unknown. Here we advance the conversational diagnosis and management performance of the Articulate Medical Intelligence Explorer (AMIE) through a new capability to gather and interpret multimodal data, and to reason about these data precisely during consultations. Leveraging Gemini 2.0 Flash, our system implements a state-aware dialogue framework, where conversation flow is dynamically controlled by intermediate model outputs reflecting patient states and evolving diagnoses. Follow-up questions are strategically directed by uncertainty in such patient states, leading to a more structured multimodal history-taking process that emulates experienced clinicians. We compared AMIE to primary care physicians (PCPs) in a randomized, blinded, OSCE-style study of chat-based consultations with patient actors. We constructed 105 evaluation scenarios using artifacts like smartphone skin photos, ECGs, and PDFs of clinical documents across diverse conditions and demographics. Our rubric assessed multimodal capabilities and other clinically meaningful axes like history-taking, diagnostic accuracy, management reasoning, communication, and empathy. Specialist evaluation showed AMIE to be superior to PCPs on 7/9 multimodal and 29/32 non-multimodal axes (including diagnostic accuracy). The results show clear progress in multimodal conversational diagnostic AI, but real-world translation needs further research.
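The state-aware dialogue framework described above, where intermediate patient-state estimates and their uncertainty drive the next follow-up question, can be sketched in miniature. All states, questions, thresholds, and uncertainty scores below are hypothetical placeholders for illustration, not AMIE internals.

```python
# Hypothetical patient-state estimates: each attribute carries the model's
# current value and a score for how uncertain the model still is about it.
patient_state = {
    "symptom_duration": {"value": None, "uncertainty": 1.0},
    "skin_photo": {"value": None, "uncertainty": 0.9},
    "medication_history": {"value": "none", "uncertainty": 0.2},
}

# Illustrative follow-up questions keyed by the attribute they resolve.
FOLLOW_UPS = {
    "symptom_duration": "How long have you had these symptoms?",
    "skin_photo": "Could you share a photo of the affected area?",
    "medication_history": "Are you currently taking any medications?",
}

def next_question(state, threshold=0.5):
    """Direct the next question at the most uncertain patient-state attribute."""
    target = max(state, key=lambda k: state[k]["uncertainty"])
    if state[target]["uncertainty"] < threshold:
        return None  # all attributes sufficiently resolved; move to diagnosis
    return FOLLOW_UPS[target]

print(next_question(patient_state))  # asks about symptom duration first
```

Each patient answer would update the corresponding value and lower its uncertainty, so the controller naturally works through the history in order of informational need rather than a fixed script.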


Towards Conversational AI for Disease Management

Palepu, Anil, Liévin, Valentin, Weng, Wei-Hung, Saab, Khaled, Stutz, David, Cheng, Yong, Kulkarni, Kavita, Mahdavi, S. Sara, Barral, Joëlle, Webster, Dale R., Chou, Katherine, Hassidim, Avinatan, Matias, Yossi, Manyika, James, Tanno, Ryutaro, Natarajan, Vivek, Rodman, Adam, Tu, Tao, Karthikesalingam, Alan, Schaekermann, Mike

arXiv.org Artificial Intelligence

While large language models (LLMs) have shown promise in diagnostic dialogue, their capabilities for effective management reasoning - including disease progression, therapeutic response, and safe medication prescription - remain under-explored. We advance the previously demonstrated diagnostic capabilities of the Articulate Medical Intelligence Explorer (AMIE) through a new LLM-based agentic system optimised for clinical management and dialogue, incorporating reasoning over the evolution of disease and multiple patient visit encounters, response to therapy, and professional competence in medication prescription. To ground its reasoning in authoritative clinical knowledge, AMIE leverages Gemini's long-context capabilities, combining in-context retrieval with structured reasoning to align its output with relevant and up-to-date clinical practice guidelines and drug formularies. In a randomized, blinded virtual Objective Structured Clinical Examination (OSCE) study, AMIE was compared to 21 primary care physicians (PCPs) across 100 multi-visit case scenarios designed to reflect UK NICE Guidance and BMJ Best Practice guidelines. AMIE was non-inferior to PCPs in management reasoning as assessed by specialist physicians and scored better in both preciseness of treatments and investigations, and in its alignment with and grounding of management plans in clinical guidelines. To benchmark medication reasoning, we developed RxQA, a multiple-choice question benchmark derived from two national drug formularies (US, UK) and validated by board-certified pharmacists. While AMIE and PCPs both benefited from the ability to access external drug information, AMIE outperformed PCPs on higher difficulty questions. While further research would be needed before real-world translation, AMIE's strong performance across evaluations marks a significant step towards conversational AI as a tool in disease management.


Exploring Large Language Models for Specialist-level Oncology Care

Palepu, Anil, Dhillon, Vikram, Niravath, Polly, Weng, Wei-Hung, Prasad, Preethi, Saab, Khaled, Tanno, Ryutaro, Cheng, Yong, Mai, Hanh, Burns, Ethan, Ajmal, Zainub, Kulkarni, Kavita, Mansfield, Philip, Webster, Dale, Barral, Joelle, Gottweis, Juraj, Schaekermann, Mike, Mahdavi, S. Sara, Natarajan, Vivek, Karthikesalingam, Alan, Tu, Tao

arXiv.org Artificial Intelligence

Large language models (LLMs) have shown remarkable progress in encoding clinical knowledge and responding to complex medical queries with appropriate clinical reasoning. However, their applicability in subspecialist or complex medical settings remains underexplored. In this work, we probe the performance of AMIE, a research conversational diagnostic AI system, in the subspecialist domain of breast oncology care without specific fine-tuning to this challenging domain. To perform this evaluation, we curated a set of 50 synthetic breast cancer vignettes representing a range of treatment-naive and treatment-refractory cases and mirroring the key information available to a multidisciplinary tumor board for decision-making (openly released with this work). We developed a detailed clinical rubric for evaluating management plans, including axes such as the quality of case summarization, safety of the proposed care plan, and recommendations for chemotherapy, radiotherapy, surgery and hormonal therapy. To improve performance, we enhanced AMIE with the inference-time ability to perform web search retrieval to gather relevant and up-to-date clinical knowledge, and to refine its responses with a multi-stage self-critique pipeline. We compare the response quality of AMIE with internal medicine trainees, oncology fellows, and general oncology attendings under both automated and specialist clinician evaluations. In our evaluations, AMIE outperformed trainees and fellows, demonstrating the potential of the system in this challenging and important domain. We further demonstrate, through qualitative examples, how systems such as AMIE might facilitate conversational interactions to assist clinicians in their decision making. However, AMIE's performance was overall inferior to attending oncologists, suggesting that further research is needed prior to consideration of prospective uses.


Linking Model Intervention to Causal Interpretation in Model Explanation

Cheng, Debo, Xu, Ziqi, Li, Jiuyong, Liu, Lin, Yu, Kui, Le, Thuc Duy, Liu, Jixue

arXiv.org Artificial Intelligence

Intervention intuition is often used in model explanation, where the intervention effect of a feature on the outcome is quantified by the difference in a model's prediction when the feature value is changed from its current value to a baseline value. Such a model intervention effect of a feature is inherently associational. In this paper, we study the conditions under which an intuitive model intervention effect has a causal interpretation, i.e., when it indicates whether a feature is a direct cause of the outcome. This work links the model intervention effect to the causal interpretation of a model. Such an interpretation capability is important since it indicates whether a machine learning model is trustworthy to domain experts. The conditions also reveal the limitations of using a model intervention effect for causal interpretation in an environment with unobserved features. Experiments on semi-synthetic datasets have been conducted to validate the theorems and show the potential for using the model intervention effect for model interpretation.
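The model intervention effect the abstract defines, the change in a model's prediction when one feature is moved from its current value to a baseline, can be computed in a few lines. This is a minimal sketch on hypothetical data with a simple linear model; the variable names, baseline choice, and data-generating process are illustrative, not from the paper.

```python
import numpy as np

# Hypothetical data: y depends on features 0 and 1 but not on feature 2.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Fit a simple linear model f(x) = x @ w by least squares.
w, *_ = np.linalg.lstsq(X, y, rcond=None)

def predict(x):
    return x @ w

def model_intervention_effect(x, feature, baseline=0.0):
    """f(x) minus f(x with x[feature] set to the baseline value)."""
    x_base = x.copy()
    x_base[feature] = baseline
    return predict(x) - predict(x_base)

x = X[0]
print(model_intervention_effect(x, feature=0))  # large: feature 0 drives y
print(model_intervention_effect(x, feature=2))  # near zero: feature 2 does not
```

As the abstract notes, this quantity is inherently associational: with correlated or unobserved features, a large effect need not mean the feature is a direct cause of the outcome, which is exactly the gap the paper's conditions address.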


Towards Democratization of Subspeciality Medical Expertise

O'Sullivan, Jack W., Palepu, Anil, Saab, Khaled, Weng, Wei-Hung, Cheng, Yong, Chu, Emily, Desai, Yaanik, Elezaby, Aly, Kim, Daniel Seung, Lan, Roy, Tang, Wilson, Tapaskar, Natalie, Parikh, Victoria, Jain, Sneha S., Kulkarni, Kavita, Mansfield, Philip, Webster, Dale, Gottweis, Juraj, Barral, Joelle, Schaekermann, Mike, Tanno, Ryutaro, Mahdavi, S. Sara, Natarajan, Vivek, Karthikesalingam, Alan, Ashley, Euan, Tu, Tao

arXiv.org Artificial Intelligence

The scarcity of subspecialist medical expertise, particularly in rare, complex and life-threatening diseases, poses a significant challenge for healthcare delivery. This issue is particularly acute in cardiology, where timely, accurate management determines outcomes. We explored the potential of AMIE (Articulate Medical Intelligence Explorer), a large language model (LLM)-based experimental AI system optimized for diagnostic dialogue, to potentially augment and support clinical decision-making in this challenging context. We curated a real-world dataset of 204 complex cases from a subspecialist cardiology practice, including results for electrocardiograms, echocardiograms, cardiac MRI, genetic tests, and cardiopulmonary stress tests. We developed a ten-domain evaluation rubric used by subspecialists to evaluate the quality of diagnosis and clinical management plans produced by general cardiologists or AMIE, the latter enhanced with web-search and self-critique capabilities. AMIE was rated superior to general cardiologists for 5 of the 10 domains (with preference ranging from 9% to 20%), and equivalent for the rest. Access to AMIE's response improved cardiologists' overall response quality in 63.7% of cases while lowering quality in just 3.4%. Cardiologists' responses with access to AMIE were superior to cardiologist responses without access to AMIE for all 10 domains. Qualitative examination suggests AMIE and general cardiologists could complement each other, with AMIE being thorough and sensitive while general cardiologists were concise and specific. Overall, our results suggest that specialized medical LLMs have the potential to augment general cardiologists' capabilities by bridging gaps in subspecialty expertise, though further research and validation are essential for wide clinical utility.


Next-gen tech outsmarts doctors with more accurate diagnoses and better bedside manner: study

FOX News

A Google artificial intelligence system gave patients more accurate diagnoses and provided better bedside manner than traditional doctors, a recent study by the tech giant found. Actors portraying patients, unaware whether they were texting with real doctors or Google's Articulate Medical Intelligence Explorer (AMIE), overall preferred how the AI handled their medical conditions, according to the study, which was published Jan. 11 on the scholarly distribution site arXiv. A panel of doctors, meanwhile, also found AMIE to be more accurate at diagnosing the patients than actual physicians. "To our knowledge, this is the first time that a conversational AI system has ever been designed optimally for diagnostic dialogue and taking the clinical history," Alan Karthikesalingam, a clinical research scientist at Google Health in London and a co-author of the study, told the scientific journal Nature on Friday.


Towards Conversational Diagnostic AI

Tu, Tao, Palepu, Anil, Schaekermann, Mike, Saab, Khaled, Freyberg, Jan, Tanno, Ryutaro, Wang, Amy, Li, Brenna, Amin, Mohamed, Tomasev, Nenad, Azizi, Shekoofeh, Singhal, Karan, Cheng, Yong, Hou, Le, Webson, Albert, Kulkarni, Kavita, Mahdavi, S Sara, Semturs, Christopher, Gottweis, Juraj, Barral, Joelle, Chou, Katherine, Corrado, Greg S, Matias, Yossi, Karthikesalingam, Alan, Natarajan, Vivek

arXiv.org Artificial Intelligence

At the heart of medicine lies the physician-patient dialogue, where skillful history-taking paves the way for accurate diagnosis, effective management, and enduring trust. Artificial Intelligence (AI) systems capable of diagnostic dialogue could increase accessibility, consistency, and quality of care. However, approximating clinicians' expertise is an outstanding grand challenge. Here, we introduce AMIE (Articulate Medical Intelligence Explorer), a Large Language Model (LLM)-based AI system optimized for diagnostic dialogue. AMIE uses a novel self-play based simulated environment with automated feedback mechanisms for scaling learning across diverse disease conditions, specialties, and contexts. We designed a framework for evaluating clinically meaningful axes of performance including history-taking, diagnostic accuracy, management reasoning, communication skills, and empathy. We compared AMIE's performance to that of primary care physicians (PCPs) in a randomized, double-blind crossover study of text-based consultations with validated patient actors in the style of an Objective Structured Clinical Examination (OSCE). The study included 149 case scenarios from clinical providers in Canada, the UK, and India, 20 PCPs for comparison with AMIE, and evaluations by specialist physicians and patient actors. AMIE demonstrated greater diagnostic accuracy and superior performance on 28 of 32 axes according to specialist physicians and 24 of 26 axes according to patient actors. Our research has several limitations and should be interpreted with appropriate caution. Clinicians were limited to unfamiliar synchronous text-chat, which permits large-scale LLM-patient interactions but is not representative of usual clinical practice. While further research is required before AMIE could be translated to real-world settings, the results represent a milestone towards conversational diagnostic AI.