Multimodal Health Risk Prediction System for Chronic Diseases via Vision-Language Fusion and Large Language Models
Dingxin Lu, Shurui Wu, Xinyi Huang
arXiv.org Artificial Intelligence
With the rising global burden of chronic diseases and the increasingly multimodal, heterogeneous nature of clinical data (medical imaging, free-text records, wearable sensor streams, etc.), there is an urgent need for a unified multimodal AI framework that can proactively predict individual health risks. We propose VL-RiskFormer, a hierarchically stacked vision-language multimodal Transformer with a large language model (LLM) inference head embedded in its top layer. The system builds on the dual-stream architecture of existing vision-language models (e.g., PaLM-E, LLaVA) with three key innovations: (i) cross-modal contrastive pre-training with fine-grained alignment of radiological images, fundus photographs, and wearable-device photographs to their corresponding clinical narratives, using momentum-updated encoders and a debiased InfoNCE loss; (ii) a temporal fusion block that integrates irregular visit sequences into the causal Transformer decoder through adaptive time-interval positional encoding; and (iii) a disease-ontology graph adapter that injects ICD-10 codes layer-wise into the visual and textual channels and infers comorbidity patterns via a graph attention mechanism. On the MIMIC-IV longitudinal cohort, VL-RiskFormer achieved an average AUROC of 0.90 with an expected calibration error of 2.7 percent.
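The first innovation pairs momentum-updated encoders with a debiased InfoNCE objective. A minimal PyTorch sketch of that combination follows, assuming MoCo-style EMA updates and the debiased contrastive estimator of Chuang et al. (2020); the function names, hyperparameters (m, tau, tau_plus), and batch layout are illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch, not the authors' code: MoCo-style momentum update plus
# a debiased InfoNCE loss (Chuang et al., 2020) over paired image/text batches.
import math
import torch

def momentum_update(online: torch.nn.Module, target: torch.nn.Module, m: float = 0.999):
    """EMA update of the momentum (target) encoder from the online encoder."""
    for p_o, p_t in zip(online.parameters(), target.parameters()):
        p_t.data.mul_(m).add_(p_o.data, alpha=1.0 - m)

def debiased_infonce(img_emb: torch.Tensor, txt_emb: torch.Tensor,
                     tau: float = 0.07, tau_plus: float = 0.1) -> torch.Tensor:
    """img_emb, txt_emb: (N, D) L2-normalized embeddings; row i of each side
    forms the positive pair, all other rows serve as negatives. tau_plus is
    the assumed prior that a sampled 'negative' is secretly a positive."""
    logits = img_emb @ txt_emb.t() / tau               # (N, N) similarities
    N = logits.size(0)
    pos = torch.exp(logits.diag())                     # positive-pair scores
    off_diag = ~torch.eye(N, dtype=torch.bool, device=logits.device)
    neg = torch.exp(logits)[off_diag].view(N, N - 1)   # negative scores
    n = N - 1
    # Debiased estimator: subtract the expected false-negative contamination,
    # clamped at its theoretical minimum n * e^{-1/tau}.
    ng = torch.clamp((neg.sum(dim=1) - n * tau_plus * pos) / (1.0 - tau_plus),
                     min=n * math.exp(-1.0 / tau))
    return -torch.log(pos / (pos + ng)).mean()
```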
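The temporal fusion block hinges on encoding irregular inter-visit gaps rather than integer positions. The sketch below is one plausible reading, assuming sinusoidal features of the elapsed time passed through a learned linear map; the module name, the choice of day units, and the additive combination with visit embeddings are assumptions.

```python
# A hypothetical stand-in for the adaptive time-interval positional encoding:
# sinusoidal features of the elapsed days between visits, mapped through a
# learned linear layer so the model can rescale the time axis.
import math
import torch
import torch.nn as nn

class TimeIntervalEncoding(nn.Module):
    def __init__(self, d_model: int, max_period: float = 10000.0):
        super().__init__()
        assert d_model % 2 == 0, "even d_model assumed for sin/cos halves"
        self.d_model = d_model
        self.max_period = max_period
        self.adapt = nn.Linear(d_model, d_model)  # the 'adaptive' part

    def forward(self, delta_t: torch.Tensor) -> torch.Tensor:
        # delta_t: (B, T) days elapsed since each patient's previous visit.
        half = self.d_model // 2
        freqs = torch.exp(-math.log(self.max_period)
                          * torch.arange(half, device=delta_t.device) / half)
        angles = delta_t.unsqueeze(-1) * freqs                 # (B, T, half)
        pe = torch.cat([angles.sin(), angles.cos()], dim=-1)   # (B, T, d_model)
        return self.adapt(pe)

# Usage: add to visit embeddings before the causal Transformer decoder.
# visits: (B, T, d_model) embeddings; gaps: (B, T) inter-visit days
# h = visits + TimeIntervalEncoding(visits.size(-1))(gaps)
```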
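For the ontology adapter, the abstract names only a graph attention mechanism over ICD-10 codes. A single-head graph attention layer in the style of Velickovic et al. (2018), written against a dense adjacency matrix for brevity, sketches the comorbidity-inference step; the class name, dense formulation, and self-loop handling are illustrative.

```python
# An illustrative single-head graph attention layer (GAT-style) over an
# ICD-10 comorbidity graph, using a dense adjacency matrix for brevity.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ICDGraphAttention(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.W = nn.Linear(dim, dim, bias=False)    # shared node projection
        self.a = nn.Linear(2 * dim, 1, bias=False)  # attention scorer

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # x: (N, dim) ICD-10 code embeddings; adj: (N, N) 0/1 comorbidity edges.
        h = self.W(x)
        N = h.size(0)
        adj = adj + torch.eye(N, device=adj.device)  # self-loops: no empty rows
        # e_ij = LeakyReLU(a([h_i || h_j])) for every ordered pair (i, j).
        hi = h.unsqueeze(1).expand(N, N, -1)
        hj = h.unsqueeze(0).expand(N, N, -1)
        e = F.leaky_relu(self.a(torch.cat([hi, hj], dim=-1)).squeeze(-1))
        e = e.masked_fill(adj == 0, float('-inf'))   # attend only along edges
        alpha = torch.softmax(e, dim=-1)             # per-node attention weights
        return alpha @ h                             # aggregated neighbor info
```

In the paper's architecture this aggregation would feed the visual and textual streams layer-wise; it is isolated here only to show the attention step.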
Sep-24-2025