Ong, Hanley
A foundation model for human-AI collaboration in medical literature mining
Wang, Zifeng, Cao, Lang, Jin, Qiao, Chan, Joey, Wan, Nicholas, Afzali, Behdad, Cho, Hyun-Jin, Choi, Chang-In, Emamverdi, Mehdi, Gill, Manjot K., Kim, Sun-Hyung, Li, Yijia, Liu, Yi, Ong, Hanley, Rousseau, Justin, Sheikh, Irfan, Wei, Jenny J., Xu, Ziyang, Zallek, Christopher M., Kim, Kyungsang, Peng, Yifan, Lu, Zhiyong, Sun, Jimeng
Systematic literature review is essential for evidence-based medicine, requiring comprehensive analysis of clinical trial publications. However, the application of artificial intelligence (AI) models for medical literature mining has been limited by insufficient training and evaluation across broad therapeutic areas and diverse tasks. Here, we present LEADS, an AI foundation model for study search, screening, and data extraction from medical literature. The model is trained on 633,759 instruction data points in LEADSInstruct, curated from 21,335 systematic reviews, 453,625 clinical trial publications, and 27,015 clinical trial registries. We show that LEADS demonstrates consistent improvements over four cutting-edge generic large language models (LLMs) on six tasks. Furthermore, LEADS enhances expert workflows by providing supportive references in response to expert requests, streamlining processes while maintaining high-quality results. A study with 16 clinicians and medical researchers from 14 different institutions revealed that experts collaborating with LEADS achieved a recall of 0.81 in study selection, compared with 0.77 for experts working alone, with a time savings of 22.6%. In data extraction tasks, experts using LEADS achieved an accuracy of 0.85 versus 0.80 without LEADS, alongside a 26.9% time savings. These findings highlight the potential of specialized medical literature foundation models to outperform generic models, delivering significant quality and efficiency benefits when integrated into expert workflows for medical literature mining.
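To make the instruction-tuning setup described in the abstract more concrete, the sketch below shows what a single instruction data point for the study screening task might look like. This is purely illustrative: the field names, prompt wording, and example content are assumptions, not the actual LEADSInstruct schema.

```python
# Hypothetical sketch of one instruction-tuning record for study screening.
# Field names and wording are assumptions, not the actual LEADSInstruct format.
import json

screening_example = {
    "task": "study_screening",
    "instruction": (
        "Decide whether the following clinical trial publication meets the "
        "eligibility criteria of the systematic review. "
        "Answer 'include' or 'exclude'."
    ),
    "input": {
        "review_criteria": (
            "Randomized controlled trials of statins for primary prevention "
            "of cardiovascular disease in adults."
        ),
        "title": "Rosuvastatin to prevent vascular events in men and women ...",
        "abstract": "BACKGROUND: Increased levels of C-reactive protein ...",
    },
    "output": "include",
}

# Pretty-print the record as it might appear in a JSONL training file.
print(json.dumps(screening_example, indent=2))
```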
Enhancing disease detection in radiology reports through fine-tuning lightweight LLM on weak labels
Wei, Yishu, Wang, Xindi, Ong, Hanley, Zhou, Yiliang, Flanders, Adam, Shih, George, Peng, Yifan
Despite significant progress in applying large language models (LLMs) to the medical domain, several limitations still prevent their practical application. Among these are constraints on model size and the lack of cohort-specific labeled datasets. In this work, we investigated the potential of improving a lightweight LLM, such as Llama 3.1-8B, through fine-tuning on datasets with synthetic labels. Two tasks were jointly trained by combining their respective instruction datasets. When the quality of the task-specific synthetic labels is relatively high (e.g., generated by GPT-4o), Llama 3.1-8B achieves satisfactory performance on the open-ended disease detection task, with a micro F1 score of 0.91. Conversely, when the quality of the task-relevant synthetic labels is relatively low (e.g., from the MIMIC-CXR dataset), fine-tuned Llama 3.1-8B is able to surpass its noisy teacher labels (micro F1 score of 0.67 vs. 0.63) when calibrated against curated labels, indicating the model's strong underlying capability. These findings demonstrate the potential of fine-tuning LLMs with synthetic labels, offering a promising direction for future research on LLM specialization in the medical domain.
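A minimal sketch of the joint fine-tuning setup described above, assuming two JSONL instruction datasets with a pre-rendered "text" field (one derived from GPT-4o synthetic labels, one from MIMIC-CXR labels) and the Hugging Face trl/peft stack. The file paths, hyperparameters, and LoRA configuration are placeholders, not the paper's actual training configuration.

```python
# Sketch: joint instruction fine-tuning of Llama 3.1-8B on two combined
# synthetic-label datasets. Paths and hyperparameters are placeholders.
from datasets import load_dataset, concatenate_datasets
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# Combine the two task-specific instruction datasets for joint training.
ds_open = load_dataset("json", data_files="open_ended_detection.jsonl", split="train")
ds_struct = load_dataset("json", data_files="structured_detection.jsonl", split="train")
train_ds = concatenate_datasets([ds_open, ds_struct]).shuffle(seed=42)

# Parameter-efficient fine-tuning keeps the 8B model trainable on modest hardware.
peft_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM")

trainer = SFTTrainer(
    model="meta-llama/Llama-3.1-8B-Instruct",  # lightweight LLM from the abstract
    train_dataset=train_ds,
    peft_config=peft_config,
    args=SFTConfig(
        output_dir="llama31-8b-radiology-sft",
        dataset_text_field="text",
        max_seq_length=2048,
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        num_train_epochs=1,
        learning_rate=2e-4,
    ),
)
trainer.train()
```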
Evaluating GPT-4 with Vision on Detection of Radiological Findings on Chest Radiographs
Zhou, Yiliang, Ong, Hanley, Kennedy, Patrick, Wu, Carol, Kazam, Jacob, Hentel, Keith, Flanders, Adam, Shih, George, Peng, Yifan
Background Generating radiologic findings from chest radiographs is pivotal in medical image analysis. The emergence of OpenAI's generative pretrained transformer, GPT-4 with vision (GPT-4V) [1], has opened new perspectives on the potential for automated image-text pair generation. However, the application of GPT-4V to real-world chest radiography has yet to be thoroughly examined. Purpose To investigate GPT-4V's capability to generate radiologic findings from real-world chest radiographs. Materials and Methods In this retrospective study, 100 chest radiographs with free-text radiology reports were annotated by a cohort of radiologists (two attending physicians and three residents) to establish a reference standard.
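As a rough illustration of the evaluation setup, the sketch below shows how a vision-capable GPT-4 model can be prompted with a chest radiograph to generate findings via the OpenAI Python SDK (v1+). The prompt wording, file name, and model identifier are illustrative assumptions, not the study's exact protocol.

```python
# Sketch: prompting a vision-capable GPT-4 model with a chest radiograph.
# Prompt text, image path, and model name are illustrative placeholders.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Encode a local radiograph as a base64 data URL for the vision input.
with open("chest_radiograph.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o",  # a current vision-capable model; the paper evaluated GPT-4V
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Describe the radiologic findings on this chest radiograph."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }
    ],
)

# The generated findings would then be compared against the radiologist reference standard.
print(response.choices[0].message.content)
```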