Cui, Hejie
CLIMB: Data Foundations for Large Scale Multimodal Clinical Foundation Models
Dai, Wei, Chen, Peilin, Lu, Malinda, Li, Daniel, Wei, Haowen, Cui, Hejie, Liang, Paul Pu
Recent advances in clinical AI have enabled remarkable progress across many clinical domains. However, existing benchmarks and models are primarily limited to a small set of modalities and tasks, which hinders the development of large-scale multimodal methods that can make holistic assessments of patient health and well-being. To bridge this gap, we introduce Clinical Large-Scale Integrative Multimodal Benchmark (CLIMB), a comprehensive clinical benchmark unifying diverse clinical data across imaging, language, temporal, and graph modalities. CLIMB comprises 4.51 million patient samples totaling 19.01 terabytes distributed across 2D imaging, 3D video, time series, graphs, and multimodal data. Through extensive empirical evaluation, we demonstrate that multitask pretraining significantly improves performance on understudied domains, achieving up to 29% improvement in ultrasound and 23% in ECG analysis over single-task learning. Pretraining on CLIMB also effectively improves models' generalization capability to new tasks, and strong unimodal encoder performance translates well to multimodal performance when paired with task-appropriate fusion strategies. Our findings provide a foundation for new architecture designs and pretraining strategies to advance clinical AI research. Code is released at https://github.com/DDVD233/climb.
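To make the multitask pretraining finding concrete, below is a minimal sketch (not the CLIMB implementation) of a shared-encoder training loop that alternates batches from several clinical tasks; the `SharedEncoder` module, the per-task heads, and the loader/label setup are illustrative placeholders.

```python
# Minimal sketch of multitask pretraining with a shared encoder and
# per-task heads. All module and dataset names are illustrative
# placeholders, not the actual CLIMB code.
import torch
import torch.nn as nn

class SharedEncoder(nn.Module):
    """Toy encoder standing in for a modality-specific backbone."""
    def __init__(self, in_dim=256, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())

    def forward(self, x):
        return self.net(x)

def multitask_pretrain(task_loaders, num_classes, epochs=1, in_dim=256):
    """task_loaders: dict mapping task name -> iterable of (x, y) batches.
    num_classes: dict mapping task name -> number of output classes."""
    encoder = SharedEncoder(in_dim)
    heads = nn.ModuleDict({t: nn.Linear(128, c) for t, c in num_classes.items()})
    opt = torch.optim.AdamW(list(encoder.parameters()) + list(heads.parameters()), lr=1e-4)
    loss_fn = nn.CrossEntropyLoss()

    for _ in range(epochs):
        for task, loader in task_loaders.items():  # round-robin over tasks
            for x, y in loader:
                opt.zero_grad()
                loss = loss_fn(heads[task](encoder(x)), y)
                loss.backward()
                opt.step()
    return encoder, heads
```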
TIMER: Temporal Instruction Modeling and Evaluation for Longitudinal Clinical Records
Cui, Hejie, Unell, Alyssa, Chen, Bowen, Fries, Jason Alan, Alsentzer, Emily, Koyejo, Sanmi, Shah, Nigam
Large language models (LLMs) have emerged as promising tools for assisting in medical tasks, yet processing Electronic Health Records (EHRs) presents unique challenges due to their longitudinal nature. While LLMs' capabilities to perform medical tasks continue to improve, their ability to reason over temporal dependencies across multiple patient visits and time frames remains unexplored. We introduce TIMER (Temporal Instruction Modeling and Evaluation for Longitudinal Clinical Records), a framework that incorporates instruction-response pairs grounded to different parts of a patient's record as a critical dimension in both instruction evaluation and tuning for longitudinal clinical records. We develop TIMER-Bench, the first time-aware benchmark that evaluates temporal reasoning capabilities over longitudinal EHRs, as well as TIMER-Instruct, an instruction-tuning methodology for LLMs to learn reasoning over time.

Tasks such as chronic disease management, multi-visit care planning, and patient history synthesis require clinicians to understand complex relationships between different record entries and how past events influence current and future clinical decisions (Wornow et al., 2024). The cognitive demands of processing such lengthy documentation are significant. While biomedical LLMs have shown promising results on well-structured tasks like answering USMLE questions and medical knowledge retrieval (Singhal et al., 2023; Lu et al., 2024; Lucas et al., 2024), recent evaluations reveal their significant limitations in processing longitudinal patient information and in making clinical decisions over time (Hager et al., 2024; Bedi et al., 2024). The gap between isolated question-answering performance and temporal reasoning ability impacts the practical utility of LLMs in healthcare. While some prior work has explored the temporal understanding abilities of general LLMs (Wang & Zhao, 2024; Fatemi et al., 2024; Herel et al., 2024), how these capabilities scale to longer contexts remains understudied, particularly in healthcare, where longitudinal reasoning is important.
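As a rough illustration of grounding instruction-response pairs to different time spans of a longitudinal record, the sketch below selects visits inside explicit grounding windows before generating each pair; the record schema and the `generate_pair` callable are hypothetical placeholders, not the TIMER pipeline.

```python
# Sketch of time-grounded instruction-response construction. The VisitNote
# schema and generate_pair() are hypothetical placeholders.
from dataclasses import dataclass
from datetime import date
from typing import List

@dataclass
class VisitNote:
    visit_date: date
    text: str

def select_span(record: List[VisitNote], start: date, end: date) -> List[VisitNote]:
    """Keep only the visits that fall inside the grounding window."""
    return [v for v in record if start <= v.visit_date <= end]

def build_time_grounded_examples(record, windows, generate_pair):
    """windows: (start, end) date pairs spread across the record so that
    instructions are anchored to early, middle, and recent visits rather
    than only the most recent ones. generate_pair is any callable (e.g. an
    LLM call) that turns the selected notes into an (instruction, response)."""
    examples = []
    for start, end in windows:
        span = select_span(record, start, end)
        if not span:
            continue
        instruction, response = generate_pair(span)
        examples.append({
            "grounding_window": (start.isoformat(), end.isoformat()),
            "instruction": instruction,
            "response": response,
        })
    return examples
```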
Recent Advances, Applications and Open Challenges in Machine Learning for Health: Reflections from Research Roundtables at ML4H 2024 Symposium
Adibi, Amin, Cao, Xu, Ji, Zongliang, Kaur, Jivat Neet, Chen, Winston, Healey, Elizabeth, Nuwagira, Brighton, Ye, Wenqian, Woollard, Geoffrey, Xu, Maxwell A, Cui, Hejie, Xi, Johnny, Chang, Trenton, Bikia, Vasiliki, Zhang, Nicole, Noori, Ayush, Xia, Yuan, Hossain, Md. Belal, Frank, Hanna A., Peluso, Alina, Pu, Yuan, Shen, Shannon Zejiang, Wu, John, Fallahpour, Adibvafa, Mahbub, Sazan, Duncan, Ross, Zhang, Yuwei, Cao, Yurui, Xu, Zuheng, Craig, Michael, Krishnan, Rahul G., Beheshti, Rahmatollah, Rehg, James M., Karim, Mohammad Ehsanul, Coffee, Megan, Celi, Leo Anthony, Fries, Jason Alan, Sadatsafavi, Mohsen, Shung, Dennis, McWeeney, Shannon, Dafflon, Jessica, Jabbour, Sarah
The fourth Machine Learning for Health (ML4H) symposium was held in person on December 15th and 16th, 2024, in the traditional, ancestral, and unceded territories of the Musqueam, Squamish, and Tsleil-Waututh Nations in Vancouver, British Columbia, Canada. The symposium included research roundtable sessions to foster discussions between participants and senior researchers on timely and relevant topics for the ML4H community. The organization of the research roundtables at the conference involved 13 senior and 27 junior chairs across 13 tables. Each roundtable session included an invited senior chair (with substantial experience in the field), junior chairs (responsible for facilitating the discussion), and attendees from diverse backgrounds with an interest in the session's topic.
Biomedical Visual Instruction Tuning with Clinician Preference Alignment
Cui, Hejie, Mao, Lingjun, Liang, Xin, Zhang, Jieyu, Ren, Hui, Li, Quanzheng, Li, Xiang, Yang, Carl
Recent advancements in multimodal foundation models have showcased impressive capabilities in understanding and reasoning with visual and textual information. Adapting these foundation models trained for general usage to specialized domains like biomedicine requires large-scale domain-specific instruction datasets. While existing works have explored curating such datasets automatically, the resulting datasets are not explicitly aligned with domain expertise. In this work, we propose a data-centric framework, Biomedical Visual Instruction Tuning with Clinician Preference Alignment (BioMed-VITAL), that incorporates clinician preferences into both stages of generating and selecting instruction data for tuning biomedical multimodal foundation models. First, during the generation stage, we prompt the GPT-4V generator with a diverse set of clinician-selected demonstrations to generate preference-aligned data candidates. Then, during the selection stage, we train a separate selection model that explicitly distills clinician and policy-guided model preferences into a rating function to select high-quality data for medical instruction tuning. Results show that the model tuned with the instruction-following data from our method demonstrates a significant improvement in open visual chat (a relative gain of 18.5%) and medical VQA (win rate up to 81.73%). Our instruction-following data and models are available at BioMed-VITAL.github.io.
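The selection stage described above can be pictured as scoring candidate instruction samples with a learned rating function and keeping the highest-rated ones. The sketch below shows that step only; the rating model architecture, embedding dimension, and keep ratio are assumptions for illustration, not the BioMed-VITAL selector.

```python
# Sketch of preference-guided data selection: score candidates with a
# learned rating function and keep the top fraction for tuning.
import torch
import torch.nn as nn

class RatingModel(nn.Module):
    """Maps a fixed-size embedding of an instruction sample to a quality score."""
    def __init__(self, emb_dim=768):
        super().__init__()
        self.scorer = nn.Sequential(nn.Linear(emb_dim, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, emb):
        return self.scorer(emb).squeeze(-1)

def select_top_samples(samples, embeddings, rater, keep_ratio=0.5):
    """samples: list of candidate instruction-response dicts.
    embeddings: (N, emb_dim) tensor of their embeddings."""
    with torch.no_grad():
        scores = rater(embeddings)
    k = max(1, int(len(samples) * keep_ratio))
    top_idx = torch.topk(scores, k).indices.tolist()
    return [samples[i] for i in top_idx]
```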
TACCO: Task-guided Co-clustering of Clinical Concepts and Patient Visits for Disease Subtyping based on EHR Data
Zhang, Ziyang, Cui, Hejie, Xu, Ran, Xie, Yuzhang, Ho, Joyce C., Yang, Carl
The growing availability of well-organized Electronic Health Record (EHR) data has enabled the development of various machine learning models for disease risk prediction. However, existing risk prediction methods overlook the heterogeneity of complex diseases, failing to model potential disease subtypes together with their corresponding patient visits and clinical concept subgroups. In this work, we introduce TACCO, a novel framework that jointly discovers clusters of clinical concepts and patient visits based on a hypergraph modeling of EHR data. Specifically, we develop a novel self-supervised co-clustering framework that can be guided by the risk prediction task for a specific disease. Furthermore, we enhance the hypergraph model of EHR data with textual embeddings and enforce alignment between the clusters of clinical concepts and patient visits through a contrastive objective. Comprehensive experiments on the public MIMIC-III dataset and the Emory internal CRADLE dataset, over the downstream clinical tasks of phenotype classification and cardiovascular risk prediction, demonstrate an average 31.25% performance improvement over traditional ML baselines and a 5.26% improvement on top of the vanilla hypergraph model without our co-clustering mechanism. In-depth model analysis, clustering results analysis, and clinical case studies further validate the improved utility and insightful interpretations delivered by TACCO. Code is available at https://github.com/PericlesHat/TACCO.
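One way to picture the contrastive alignment between concept clusters and visit clusters is shown below: soft cluster assignments are computed against shared prototypes, cluster centroids are formed from each view, and an InfoNCE-style loss pulls matching centroids together. The shapes, temperature, and prototype setup are illustrative assumptions, not the TACCO code.

```python
# Sketch of contrastive alignment between clinical-concept clusters and
# patient-visit clusters via soft assignments to shared prototypes.
import torch
import torch.nn.functional as F

def soft_assign(embeddings, prototypes, temperature=0.1):
    """Soft cluster assignment: similarity of each item to each prototype."""
    sims = embeddings @ prototypes.t()                 # (N, K)
    return F.softmax(sims / temperature, dim=-1)

def cluster_alignment_loss(concept_emb, visit_emb, prototypes, temperature=0.1):
    """Pull the concept-side and visit-side centroids of the same cluster
    together while pushing different clusters apart."""
    a_c = soft_assign(concept_emb, prototypes, temperature)       # (Nc, K)
    a_v = soft_assign(visit_emb, prototypes, temperature)         # (Nv, K)
    concept_centroids = F.normalize(a_c.t() @ concept_emb, dim=-1)  # (K, D)
    visit_centroids = F.normalize(a_v.t() @ visit_emb, dim=-1)      # (K, D)
    logits = concept_centroids @ visit_centroids.t() / temperature  # (K, K)
    targets = torch.arange(logits.size(0))
    return F.cross_entropy(logits, targets)
```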
LLMs-based Few-Shot Disease Predictions using EHR: A Novel Approach Combining Predictive Agent Reasoning and Critical Agent Instruction
Cui, Hejie, Shen, Zhuocheng, Zhang, Jieyu, Shao, Hui, Qin, Lianhui, Ho, Joyce C., Yang, Carl
Electronic health records (EHRs) contain valuable patient data for health-related prediction tasks, such as disease prediction. Traditional approaches rely on supervised learning methods that require large labeled datasets, which can be expensive and challenging to obtain. In this study, we investigate the feasibility of applying Large Language Models (LLMs) to convert structured patient visit data (e.g., diagnoses, labs, prescriptions) into natural language narratives. We evaluate the zero-shot and few-shot performance of LLMs using various EHR-prediction-oriented prompting strategies. Furthermore, we propose a novel approach that utilizes LLM agents with different roles: a predictor agent that makes predictions and generates reasoning processes, and a critic agent that analyzes incorrect predictions and provides guidance for improving the reasoning of the predictor agent. Our results demonstrate that, with the proposed approach, LLMs can achieve decent few-shot performance compared to traditional supervised learning methods in EHR-based disease predictions, suggesting their potential for health-oriented applications.
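The predictor/critic interaction can be sketched as a simple loop in which critic feedback on mislabeled few-shot examples is accumulated into guidance for the predictor. The `call_llm` function, prompt wording, and the crude correctness check below are placeholders, not the paper's exact prompting setup.

```python
# Sketch of a predictor/critic agent loop for few-shot EHR prediction.
# call_llm() is a placeholder for any chat-completion call.
def predictor(call_llm, patient_narrative, guidance=""):
    prompt = (
        "You are a clinical prediction assistant.\n"
        f"{guidance}\n"
        f"Patient record:\n{patient_narrative}\n"
        "Predict the risk (yes/no) and explain your reasoning."
    )
    return call_llm(prompt)

def critic(call_llm, patient_narrative, prediction, label):
    prompt = (
        "The following prediction was incorrect.\n"
        f"Patient record:\n{patient_narrative}\n"
        f"Prediction and reasoning:\n{prediction}\n"
        f"Correct answer: {label}\n"
        "Explain what went wrong and give concise guidance for future predictions."
    )
    return call_llm(prompt)

def refine_guidance(call_llm, labeled_examples, rounds=2):
    """Accumulate critic feedback from mispredicted few-shot examples."""
    guidance = ""
    for _ in range(rounds):
        for narrative, label in labeled_examples:
            pred = predictor(call_llm, narrative, guidance)
            if label.lower() not in pred.lower():   # crude correctness check
                guidance += "\n" + critic(call_llm, narrative, pred, label)
    return guidance
```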
Multimodal Fusion of EHR in Structures and Semantics: Integrating Clinical Records and Notes with Hypergraph and LLM
Cui, Hejie, Fang, Xinyu, Xu, Ran, Kan, Xuan, Ho, Joyce C., Yang, Carl
Electronic Health Records (EHRs) have become increasingly popular for supporting clinical decision-making and healthcare in recent decades. EHRs usually contain heterogeneous information, such as structured data in tabular form and unstructured data in textual notes. Different types of information in EHRs can complement each other and provide a more complete picture of a patient's health status. While there has been extensive research on representation learning for structured EHR data, the fusion of different types of EHR data (multimodal fusion) is not well studied, largely because of the complex medical coding systems involved and the noise and redundancy present in written notes. In this work, we propose a new framework called MINGLE, which effectively integrates both structures and semantics in EHR data. Our framework uses a two-level infusion strategy to combine medical concept semantics and clinical note semantics into hypergraph neural networks, which learn the complex interactions between different types of data to generate visit representations for downstream prediction. Experimental results on two EHR datasets, the public MIMIC-III and the private CRADLE, show that MINGLE effectively improves predictive performance by 11.83% relatively, enhancing semantic integration as well as multimodal fusion of structured and textual EHR data.
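The general idea of carrying text semantics through a visit-concept hypergraph can be sketched as below: concept nodes start from text embeddings, one hypergraph convolution step aggregates them into visit representations and back, and note embeddings are blended in at the visit level. The shapes, the mean-style aggregation, and the blending weight are illustrative assumptions, not the MINGLE architecture.

```python
# Sketch of semantic infusion into a visit-concept hypergraph, followed by
# one hypergraph message-passing step. Not the MINGLE implementation.
import torch

def hypergraph_conv(concept_feats, incidence):
    """concept_feats: (num_concepts, D) text embeddings of medical concepts.
    incidence: (num_concepts, num_visits) binary matrix, 1 if a concept
    appears in a visit. Returns updated concept and visit representations."""
    deg_e = incidence.sum(0, keepdim=True).clamp(min=1)   # concepts per visit
    deg_v = incidence.sum(1, keepdim=True).clamp(min=1)   # visits per concept
    visit_repr = (incidence.t() @ concept_feats) / deg_e.t()   # concepts -> visits
    concept_repr = (incidence @ visit_repr) / deg_v            # visits -> concepts
    return concept_repr, visit_repr

def infuse_note_semantics(visit_repr, note_emb, alpha=0.5):
    """Second-level infusion: blend visit representations with clinical-note
    embeddings of the same visits before downstream prediction."""
    return alpha * visit_repr + (1 - alpha) * note_emb
```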
Microstructures and Accuracy of Graph Recall by Large Language Models
Wang, Yanbang, Cui, Hejie, Kleinberg, Jon
Graph data is crucial for many applications, and much of it exists as relations described in text. As a result, being able to accurately recall and encode a graph described in earlier text is a basic yet pivotal ability that LLMs need to demonstrate if they are to perform reasoning tasks that involve graph-structured information. Human performance at graph recall has been studied by cognitive scientists for decades and has been found to often exhibit certain structural patterns of bias that align with human handling of social relationships. To date, however, we know little about how LLMs behave in analogous graph recall tasks: do their recalled graphs also exhibit certain biased patterns, and if so, how do they compare with humans and affect other graph reasoning tasks? In this work, we perform the first systematic study of graph recall by LLMs, investigating the accuracy and biased microstructures (local structural patterns) in their recall. We find that LLMs not only often underperform in graph recall, but also tend to favor more triangles and alternating 2-paths. Moreover, we find that more advanced LLMs exhibit a striking dependence on the domain that a real-world graph comes from -- they yield the best recall accuracy when the graph is narrated in a language style consistent with its original domain.
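A simple way to quantify such microstructure bias is to compare motif counts between the original graph and the graph parsed from the LLM's recall. The sketch below counts triangles and, as a simplified stand-in for the alternating 2-path statistic, open two-paths; parsing the recalled edge list out of the model's answer is outside this sketch.

```python
# Sketch of comparing motif counts between an original graph and an
# LLM-recalled graph. Uses networkx.
import networkx as nx

def microstructure_counts(edges):
    g = nx.Graph()
    g.add_edges_from(edges)
    triangles = sum(nx.triangles(g).values()) // 3
    # Open 2-paths (a-b-c with a and c not adjacent): all 2-paths centered at
    # each node, minus the ones closed into triangles.
    two_paths = sum(d * (d - 1) // 2 for _, d in g.degree()) - 3 * triangles
    return {"triangles": triangles, "open_2_paths": two_paths}

def recall_bias(original_edges, recalled_edges):
    """Positive values mean the recalled graph over-represents that motif."""
    orig = microstructure_counts(original_edges)
    rec = microstructure_counts(recalled_edges)
    return {k: rec[k] - orig[k] for k in orig}
```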
Knowledge-Infused Prompting: Assessing and Advancing Clinical Text Data Generation with Large Language Models
Xu, Ran, Cui, Hejie, Yu, Yue, Kan, Xuan, Shi, Wenqi, Zhuang, Yuchen, Jin, Wei, Ho, Joyce, Yang, Carl
Clinical natural language processing requires methods that can address domain-specific challenges, such as complex medical terminology and clinical contexts. Recently, large language models (LLMs) have shown promise in this domain. Yet their direct deployment can raise privacy issues and is constrained by computational resources. To address this challenge, we delve into synthetic clinical text generation using LLMs for clinical NLP tasks. Our method involves clinical knowledge extraction and context-informed LLM prompting: both clinical topics and writing styles are drawn from external domain-specific knowledge graphs and LLMs to guide data generation.

Clinical Natural Language Processing (NLP) has emerged as a distinct subfield encompassing the extraction, analysis, and interpretation of medical data from unstructured clinical text (Wornow et al., 2023). Despite its significance, unique challenges arise in methodology development for clinical NLP. For example, clinical texts are often dense with abbreviations and specialized medical terminologies that can be perplexing to standard NLP models (Cui et al., 2022; Lee et al., 2023). These challenges motivate the design of specialized approaches for adapting LLMs to clinical settings, which both address terminology complexities and improve models through clinical data finetuning (Tu et al., 2023; Liu et al., 2023a). Despite the strong capacity of general LLMs, directly applying them to infer over clinical text data is often undesirable in practice. First, these LLMs often have billions of parameters, which translate into significant computational resources even for inference, leading to increased infrastructure costs and long inference times. Furthermore, the sensitive patient information contained in clinical text naturally raises privacy and regulatory compliance concerns (Meskó & Topol, 2023; Keeling, 2023).
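The knowledge-infused prompting idea of drawing topics and writing styles from external sources can be sketched roughly as below: sample a clinical entity and its neighbors from a knowledge graph, pick a writing style, and ask an LLM to generate a synthetic note. The knowledge-graph format, style list, prompt wording, and `call_llm` are placeholders, not the paper's pipeline.

```python
# Sketch of knowledge-infused prompt construction for synthetic clinical
# text generation. The KG structure and call_llm() are placeholders.
import random

def sample_topic(knowledge_graph):
    """knowledge_graph: dict mapping a clinical entity to related entities."""
    entity = random.choice(list(knowledge_graph))
    related = random.sample(knowledge_graph[entity],
                            k=min(3, len(knowledge_graph[entity])))
    return entity, related

def build_prompt(entity, related, style):
    return (
        f"Write a de-identified clinical text in the style of a {style}.\n"
        f"Primary topic: {entity}.\n"
        f"Mention related concepts where natural: {', '.join(related)}.\n"
        "Use realistic clinical terminology and abbreviations."
    )

def generate_synthetic_note(knowledge_graph, styles, call_llm):
    entity, related = sample_topic(knowledge_graph)
    style = random.choice(styles)
    return call_llm(build_prompt(entity, related, style))
```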
Open Visual Knowledge Extraction via Relation-Oriented Multimodality Model Prompting
Cui, Hejie, Fang, Xinyu, Zhang, Zihan, Xu, Ran, Kan, Xuan, Liu, Xin, Yu, Yue, Li, Manling, Song, Yangqiu, Yang, Carl
Images contain rich relational knowledge that can help machines understand the world. Existing methods for visual knowledge extraction often rely on a pre-defined format (e.g., subject-verb-object tuples) or vocabulary (e.g., relation types), restricting the expressiveness of the extracted knowledge. In this work, we present a first exploration of a new paradigm of open visual knowledge extraction. To achieve this, we present OpenVik, which consists of an open relational region detector that detects regions potentially containing relational knowledge, and a visual knowledge generator that generates format-free knowledge by prompting a large multimodality model with the detected region of interest. We also explore two data enhancement techniques for diversifying the generated format-free visual knowledge. Extensive knowledge quality evaluations highlight the correctness and uniqueness of the open visual knowledge extracted by OpenVik. Moreover, integrating our extracted knowledge across various visual reasoning applications shows consistent improvements, indicating the real-world applicability of OpenVik.
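The detect-then-prompt flow can be sketched as below: detect candidate relational regions, then prompt a multimodal model with each cropped region to produce free-form knowledge. The `detect_regions` and `caption_region` callables, the prompt text, and the PIL-style crop are illustrative assumptions, not the OpenVik implementation.

```python
# Sketch of two-stage open visual knowledge extraction: region detection
# followed by region-conditioned prompting of a multimodality model.
def extract_open_knowledge(image, detect_regions, caption_region, max_regions=10):
    """Returns a list of free-form knowledge strings, one per detected region."""
    knowledge = []
    for box in detect_regions(image)[:max_regions]:
        crop = image.crop(box)  # assumes a PIL-style image with .crop((l, t, r, b))
        prompt = "Describe the relationship between the objects in this region."
        knowledge.append(caption_region(crop, prompt))
    return knowledge

def deduplicate(knowledge):
    """Simple diversity filter: drop exact duplicates while preserving order."""
    seen, unique = set(), []
    for k in knowledge:
        if k not in seen:
            seen.add(k)
            unique.append(k)
    return unique
```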