 phenotype



Attentive State-Space Modeling of Disease Progression

Ahmed M. Alaa, Mihaela van der Schaar

Neural Information Processing Systems

Models of disease progression are instrumental for predicting patient outcomes and understanding disease dynamics. Existing models provide the patient with pragmatic (supervised) predictions of risk, but do not provide the clinician with intelligible (unsupervised) representations of disease pathology.



Solving Min-Max Optimization with Hidden Structure via Gradient Descent Ascent

Neural Information Processing Systems

Out of all the local Nash equilibria of HCC games, there exists a special subclass, the vectors (θ, φ) that implement the von Neumann solution of the convex-concave game.



C-kNN-LSH: A Nearest-Neighbor Algorithm for Sequential Counterfactual Inference

Wang, Jing, Shen, Jie, Xie, Qiaomin, Weiss, Jeremy C

arXiv.org Machine Learning

Estimating causal effects from longitudinal trajectories is central to understanding the progression of complex conditions, such as comorbidities and long COVID recovery, and to optimizing clinical decision-making. We introduce C-kNN-LSH, a nearest-neighbor framework for sequential causal inference designed to handle such high-dimensional, confounded situations. By utilizing locality-sensitive hashing, we efficiently identify "clinical twins" with similar covariate histories, enabling local estimation of conditional treatment effects across evolving disease states. To mitigate bias from irregular sampling and shifting patient recovery profiles, we integrate the neighborhood estimator with a doubly-robust correction. Theoretical analysis guarantees our estimator is consistent and second-order robust to nuisance error. Evaluated on a real-world Long COVID cohort with 13,511 participants, C-kNN-LSH demonstrates superior performance in capturing recovery heterogeneity and estimating policy values compared to existing baselines.


A variational Bayes latent class approach for EHR-based patient phenotyping in R

Buckley, Brian, O'Hagan, Adrian, Galligan, Marie

arXiv.org Machine Learning

As regulatory agencies increasingly recognise real-world evidence as a complement to traditional clinical trial data, interest has grown in applying Bayesian methods across both interventional and observational research (Boulanger and Carlin (2021)). A central objective in many clinical investigations is the delineation of patient subgroups that exhibit comparable disease-related characteristics (He, Belouali, Patricoski, Lehmann, Ball, Anagnostou, Kreimeyer, and Botsis (2023)). Electronic Health Records (EHR) have become an important resource for such phenotypic analyses (Hripcsak and Albers (2013)). Bayesian approaches to patient phenotyping in clinical observational studies have been limited by the computational challenges associated with applying the Markov Chain Monte Carlo (MCMC) approach to real-world data. Hubbard, Huang, Harton, Oganisian, Choi, Utidjian, Eneli, Bailey, and Chen (2019) proposed a Bayes latent class model that could be used in a general context for observational studies that use EHR data. They consider the common clinical context where gold-standard phenotype information, such as genetic and laboratory data, is not fully available. A general model of this form has high potential applicability for use in clinical decision support across disease areas for both primary and secondary clinical databases. Latent Class Analysis (LCA) is widely used when we want to identify patient phenotypes or subgroups given multivariate data (Lanza and Rhoades (2013)). A challenge in clinical LCA is the prevalence of mixed data, where we may have combinations of continuous, nominal, ordinal and count data.


RELEAP: Reinforcement-Enhanced Label-Efficient Active Phenotyping for Electronic Health Records

Yang, Yang, Pollak, Kathryn I., Chakraborty, Bibhas, Liu, Molei, Zhou, Doudou, Hong, Chuan

arXiv.org Artificial Intelligence

Objective: Electronic health record (EHR) phenotyping often relies on noisy proxy labels, which undermine the reliability of downstream risk prediction. Active learning can reduce annotation costs, but most approaches rely on fixed heuristics and do not ensure that phenotype refinement improves prediction performance. Our goal was to develop a framework that directly uses downstream prediction performance as feedback to guide phenotype correction and sample selection under constrained labeling budgets. Materials and Methods: We propose Reinforcement-Enhanced Label-Efficient Active Phenotyping (RELEAP), a reinforcement learning-based active learning framework. RELEAP adaptively integrates multiple querying strategies and, unlike prior methods, updates its policy based on feedback from downstream models. We evaluated RELEAP on a de-identified Duke University Health System (DUHS) cohort (2014-2024) for incident lung cancer risk prediction, using logistic regression and penalized Cox survival models. Performance was benchmarked against noisy-label baselines and single-strategy active learning. Results: RELEAP consistently outperformed all baselines. Logistic AUC increased from 0.774 to 0.805 and survival C-index from 0.718 to 0.752. Using downstream performance as feedback, RELEAP produced smoother and more stable gains than heuristic methods under the same labeling budget. Discussion: By linking phenotype refinement to prediction outcomes, RELEAP learns which samples most improve downstream discrimination and calibration, offering a more principled alternative to fixed active learning rules. Conclusion: RELEAP optimizes phenotype correction through downstream feedback, offering a scalable, label-efficient paradigm that reduces manual chart review and enhances the reliability of EHR-based risk prediction.


TRIDENT: A Trimodal Cascade Generative Framework for Drug and RNA-Conditioned Cellular Morphology Synthesis

Peng, Rui, Liu, Ziru, Ye, Lingyuan, Lu, Yuxing, Shi, Boxin, Wang, Jinzhuo

arXiv.org Artificial Intelligence

Accurately modeling the relationship between perturbations, transcriptional responses, and phenotypic changes is essential for building an AI Virtual Cell (AIVC). However, existing methods, typically constrained to modeling direct associations such as Perturbation → RNA or Perturbation → Morphology, overlook the crucial causal link from RNA to morphology. To bridge this gap, we propose TRIDENT, a cascade generative framework that synthesizes realistic cellular morphology by conditioning on both the perturbation and the corresponding gene expression profile. To train and evaluate this task, we construct MorphoGene, a new dataset pairing L1000 gene expression with Cell Painting images for 98 compounds. TRIDENT significantly outperforms state-of-the-art approaches, achieving up to 7-fold improvement with strong generalization to unseen compounds. In a case study on docetaxel, we validate that RNA-guided synthesis accurately produces the corresponding phenotype. An ablation study further confirms that this RNA conditioning is essential for the model's high fidelity. By explicitly modeling transcriptome-phenome mapping, TRIDENT provides a powerful in silico tool and moves us closer to a predictive virtual cell.


A Specialized Large Language Model for Clinical Reasoning and Diagnosis in Rare Diseases

Yang, Tao, Huang, Dandan, Lin, Yunting, Wu, Pengfei, Wu, Zhikun, Ma, Gangyuan, Lu, Yulan, Dong, Xinran, Li, Dingpeng, Ge, Junshuang, Zhang, Zhiyan, Huang, Xuanzhao, Nong, Wenyan, Zhou, Yao, Tang, Hui, Yang, Hongxi, Zhang, Shijie, Li, Juan, Cao, Xiaojun, Yang, Lin, Gao, Xia, Xu, Kaishou, Gu, Xiaoqiong, Zhang, Wen, Xia, Huimin, Liu, Li, Zhou, Wenhao, Li, Mulin Jun

arXiv.org Artificial Intelligence

We assemble a large, domain-specialized clinical corpus and a clinician-validated reasoning set, and develop RareSeek-R1 via staged instruction tuning, chain-of-thought learning, and graph-grounded retrieval. Across multicenter EHR narratives and public benchmarks, RareSeek-R1 attains state-of-the-art accuracy, robust generalization, and stability under noisy or overlapping phenotypes. Augmented retrieval yields the largest gains when narratives pair with prioritized variants by resolving ambiguity and aligning candidates to mechanisms. Human studies show performance on par with experienced physicians and consistent gains in assistive use. Notably, transparent reasoning highlights decisive non-phenotypic evidence (median 23.1%, such as imaging, interventions, functional tests) underpinning many correct diagnoses. This work advances a narrative-first, knowledge-integrated reasoning paradigm that shortens the diagnostic odyssey and enables auditable, clinically translatable decision support.